<p>Put Yourself Out There (http://acbecker.github.io/)</p>
<h1>Postgres Table Design</h1>
<p>Andrew Becker, 2014-10-14T18:41:00-07:00</p>
<hr />
<p>Today I worked with Dan Halperin and Brandon Holt on Postgres table
design for our <a href="https://github.com/uwescience/kbmod">KBMOD</a> project.
We are using the <a href="http://postgis.net/">PostGIS</a> package to provide
support for spatial overlap queries (on the celestial sphere instead
of the terrestrial one). The big challenge is how to optimally store
(and to query) information on 1e6 images that contain a total of 1e12
pixels. The ultimate goal is to intersect these data with proposed
orbits of Solar System objects to understand if there is evidence for
them in the time series of images.</p>
<p>We worked towards comparing two table designs. The first one allows us
to intersect an orbit trajectory with a given image, which acts as a sort
of coarse database index, allowing us to only take a detailed look at the
pixels within the images that intersect the orbit:</p>
<div class="highlight"><pre><span class="c1">-- IMAGES table:</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">fields</span> <span class="p">(</span>
<span class="n">fieldId</span> <span class="nb">BIGINT</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="n">run</span> <span class="nb">INTEGER</span><span class="p">,</span>
<span class="n">camcol</span> <span class="nb">SMALLINT</span><span class="p">,</span>
<span class="n">field</span> <span class="nb">INTEGER</span><span class="p">,</span>
<span class="n">filter</span> <span class="nb">VARCHAR</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span>
<span class="n">bbox</span> <span class="n">GEOMETRY</span><span class="p">(</span><span class="n">POLYGON</span><span class="p">,</span><span class="mi">3786</span><span class="p">),</span>
<span class="n">tmid</span> <span class="k">TIMESTAMP</span> <span class="k">WITH</span> <span class="n">TIME</span> <span class="k">ZONE</span><span class="p">,</span>
<span class="n">trange</span> <span class="n">TSTZRANGE</span>
<span class="p">);</span>
<span class="c1">-- PIXELS table:</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pixels</span> <span class="p">(</span>
<span class="n">pixelId</span> <span class="n">BIGSERIAL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="n">fieldId</span> <span class="nb">BIGINT</span> <span class="k">REFERENCES</span> <span class="n">fields</span><span class="p">(</span><span class="n">fieldId</span><span class="p">),</span>
<span class="n">ra</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">decl</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">fval</span> <span class="nb">REAL</span><span class="p">,</span>
<span class="n">radec</span> <span class="n">GEOMETRY</span><span class="p">(</span><span class="n">POINT</span><span class="p">,</span><span class="mi">3786</span><span class="p">),</span>
<span class="n">mask</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
<span class="c1">-- EXAMPLE QUERY:</span>
<span class="k">SELECT</span> <span class="n">p</span><span class="p">.</span><span class="n">pixelId</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">ra</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">decl</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">fval</span><span class="p">,</span> <span class="n">ST_DISTANCE</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">radec</span><span class="p">)</span> <span class="k">AS</span> <span class="n">dist</span> <span class="k">FROM</span>
<span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="o">-</span><span class="mi">42</span><span class="p">.</span><span class="mi">8471955</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7336945</span><span class="p">),</span><span class="mi">3786</span><span class="p">)</span> <span class="k">as</span> <span class="n">traj</span><span class="p">,</span>
<span class="n">pixels</span> <span class="k">as</span> <span class="n">p</span><span class="p">,</span>
<span class="n">fields</span> <span class="k">as</span> <span class="n">f</span>
<span class="k">WHERE</span>
<span class="k">TIMESTAMP</span> <span class="k">WITH</span> <span class="n">TIME</span> <span class="k">ZONE</span> <span class="s1">'2006-10-21 03:11:44.69136z'</span> <span class="o"><@</span> <span class="n">f</span><span class="p">.</span><span class="n">trange</span>
<span class="k">AND</span>
<span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">bbox</span><span class="p">)</span>
<span class="k">AND</span>
<span class="n">f</span><span class="p">.</span><span class="n">fieldId</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">fieldId</span>
<span class="k">ORDER</span> <span class="k">BY</span> <span class="n">dist</span>
<span class="k">LIMIT</span> <span class="mi">10</span><span class="p">;</span>
</pre></div>
<p>A second way is to use a real database index to tell us which pixels
to look at, i.e. to intersect each trajectory with the pixels table, as
below. We think that first intersecting with the fields table, and
then intersecting with the pixels in that field, <em>should</em> provide
optimal performance. If the database index is up to the task...</p>
<div class="highlight"><pre><span class="c1">-- PIXELS table:</span>
<span class="k">CREATE</span> <span class="k">TABLE</span> <span class="n">pixels</span> <span class="p">(</span>
<span class="n">pixelId</span> <span class="n">BIGSERIAL</span> <span class="k">PRIMARY</span> <span class="k">KEY</span><span class="p">,</span>
<span class="n">fieldId</span> <span class="nb">BIGINT</span> <span class="k">REFERENCES</span> <span class="n">fields</span><span class="p">(</span><span class="n">fieldId</span><span class="p">),</span>
<span class="n">ll_r</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">ll_d</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">lr_r</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">lr_d</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">ur_r</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">ur_d</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">ul_r</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">ul_d</span> <span class="n">DOUBLE</span> <span class="k">PRECISION</span><span class="p">,</span>
<span class="n">bbox</span> <span class="n">GEOMETRY</span><span class="p">(</span><span class="n">POLYGON</span><span class="p">,</span><span class="mi">3786</span><span class="p">),</span>
<span class="n">flux</span> <span class="nb">REAL</span><span class="p">,</span>
<span class="n">mask</span> <span class="nb">INTEGER</span>
<span class="p">);</span>
<span class="c1">-- EXAMPLE QUERY:</span>
<span class="k">SELECT</span> <span class="n">p</span><span class="p">.</span><span class="n">pixelId</span><span class="p">,</span> <span class="n">ST_AsText</span><span class="p">(</span><span class="n">ST_Centroid</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">bbox</span><span class="p">)),</span> <span class="n">p</span><span class="p">.</span><span class="n">flux</span> <span class="k">FROM</span>
<span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="o">-</span><span class="mi">42</span><span class="p">.</span><span class="mi">8471955</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7336945</span><span class="p">),</span><span class="mi">3786</span><span class="p">)</span> <span class="k">as</span> <span class="n">traj</span><span class="p">,</span>
<span class="n">pixels</span> <span class="k">as</span> <span class="n">p</span><span class="p">,</span>
<span class="n">fields</span> <span class="k">as</span> <span class="n">f</span>
<span class="k">WHERE</span>
<span class="k">TIMESTAMP</span> <span class="k">WITH</span> <span class="n">TIME</span> <span class="k">ZONE</span> <span class="s1">'2006-10-28 02:55:13.932192z'</span> <span class="o"><@</span> <span class="n">f</span><span class="p">.</span><span class="n">trange</span>
<span class="k">AND</span>
<span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">bbox</span><span class="p">)</span>
<span class="k">AND</span>
<span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">bbox</span><span class="p">)</span>
<span class="k">AND</span>
<span class="n">f</span><span class="p">.</span><span class="n">fieldId</span> <span class="o">=</span> <span class="n">p</span><span class="p">.</span><span class="n">fieldId</span><span class="p">;</span>
<span class="c1">-- EXAMPLE QUERY THAT USES A SUBQUERY TO FIRST FILTER ON FIELD</span>
<span class="k">SELECT</span> <span class="n">p</span><span class="p">.</span><span class="n">pixelId</span><span class="p">,</span> <span class="n">ST_AsText</span><span class="p">(</span><span class="n">ST_Centroid</span><span class="p">(</span><span class="n">p</span><span class="p">.</span><span class="n">bbox</span><span class="p">)),</span> <span class="n">p</span><span class="p">.</span><span class="n">flux</span> <span class="k">FROM</span>
<span class="n">ST_SetSRID</span><span class="p">(</span><span class="n">ST_MakePoint</span><span class="p">(</span><span class="o">-</span><span class="mi">42</span><span class="p">.</span><span class="mi">8471955</span><span class="p">,</span> <span class="mi">0</span><span class="p">.</span><span class="mi">7336945</span><span class="p">),</span><span class="mi">3786</span><span class="p">)</span> <span class="k">as</span> <span class="n">traj</span><span class="p">,</span>
<span class="n">pixels</span> <span class="k">as</span> <span class="n">p</span>
<span class="k">WHERE</span>
<span class="n">p</span><span class="p">.</span><span class="n">fieldId</span> <span class="k">IN</span> <span class="p">(</span>
<span class="k">SELECT</span> <span class="n">f</span><span class="p">.</span><span class="n">fieldId</span> <span class="k">FROM</span> <span class="n">fields</span> <span class="k">as</span> <span class="n">f</span>
<span class="k">WHERE</span>
<span class="k">TIMESTAMP</span> <span class="k">WITH</span> <span class="n">TIME</span> <span class="k">ZONE</span> <span class="s1">'2006-10-28 02:55:13.932192z'</span> <span class="o"><@</span> <span class="n">f</span><span class="p">.</span><span class="n">trange</span>
<span class="k">AND</span>
<span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">f</span><span class="p">.</span><span class="n">bbox</span><span class="p">)</span>
<span class="p">)</span>
<span class="k">AND</span>
<span class="n">ST_INTERSECTS</span><span class="p">(</span><span class="n">traj</span><span class="p">,</span> <span class="n">p</span><span class="p">.</span><span class="n">bbox</span><span class="p">);</span>
</pre></div>
<p>Note that we use a subquery in the second example, which should force
the database to restrict the scan over pixels to those that land in
the fields that overlap the trajectory.</p>
<p>My goal is to generate a limited batch of data for both types of pixel
tables (100 images of 1e6 pixels), so we can compare query
performance.</p>
<h1>Data Driven Discovery</h1>
<p>Andrew Becker, 2014-02-24T12:50:00-08:00</p>
<hr />
<p>I've just submitted a proposal to the Moore Foundation's Data Driven
Discovery
<a href="http://www.moore.org/programs/science/data-driven-discovery">Initiative</a>.
This is the first of two rounds of the proposal process. If selected,
I would be awarded $1.5M over 5 years, which would be my biggest
individual grant yet. Read on for the core proposal idea...</p>
<hr />
<p>Astrophysics has been revolutionized by the development of science-grade digital imaging devices and by Moore’s progression of computing power. The former has enabled wide field imaging systems on large aperture telescopes, and the latter ensures their data can be gathered in volume and yet still analyzed in real-time. These advances have led to astronomical surveys characterized by successively more extreme combinations of depth, cadence, and sky coverage, challenging astronomers to analyze ever larger data streams. The next generation of surveys is designed to concurrently address many of the outstanding issues in astrophysics, with answers hidden within multi-dimensional archives of data. Their core science goals include separating the novel from the prosaic, using time-series of information and modeling and classification of lightcurve features. However, <strong>the success of these surveys is not guaranteed</strong> without advances in the tools to model and interpret their data.</p>
<p><strong>Major Accomplishments</strong>: As faculty at the University of Washington, I have served as PI, Co-I and project manager of multiple successful grants from NASA and the NSF. I have been a strong advocate for and active participant in the open sharing of data and algorithms. From system administrator to PI, my immersion in time-domain astronomy has provided a detailed understanding of the mechanics behind large scale data acquisition and analysis. This work has helped me attain recognition as a world leader in time-domain science, with my software contributing to many of the novel astrophysical discoveries of the past decade, and in use by many of the current survey projects. This success is reflected in my citation record (in the top <a href="http://archive.sciencewatch.com/inter/aut/2010/10-may/10mayBeck/">1%</a> in the field of space sciences) and in my H-index of <a href="http://scholar.google.com/citations?hl=en&user=OxatP8YAAAAJ">57</a>. And while many things have changed in the field since I first started, one thing has remained constant: the Universe always reveals itself through its variability.</p>
<p>I have been actively involved in the field of time-domain astrophysics since joining the MACHO project in 1995. MACHO utilized one of the first large-format CCD mosaic cameras – the largest in astronomy at the time – to help usher in the era of "big data" in our field. Our project accumulated 5 TB of raw images and a 200 GB database of derived data <em>before the year 2000</em>. My work involved filtering these data in real-time, searching through the lightcurves of $10^7$ stars, each with $10^3$ observations, for rare events that numbered $10^1$. The main challenge then was data storage and access, which we solved using a room-sized "mass store" carousel that included a robotic arm to retrieve data tapes. The challenges I've faced in each successive survey reflect the progress of technology through the early 21st century: understanding how to implement and use RAIDed disk arrays for high capacity redundant storage; how to distribute processes over multi-node compute clusters while maintaining input/output load balance; how to arrange data and manage memory for large retrospective analyses; how to implement algorithms to take advantage of parallel processing environments; and how to design hierarchical models that reflect structure in the data and allow for complex and insightful analyses. The structure and scale of data have always been significant challenges in my research, overcome with creativity and not a little bit of compromise in the pursuit of progress. Moore's law tends to solve practical issues of capability, but creates new ones of scale. Surmounting this requires reimagined approaches and algorithms that are scalable, tightly integrated with the compute environment, and allow for a significant level of human interaction with, and feedback between, data and model.</p>
<p>At Bell Laboratories, I led the first astrophysical time-domain survey that detected and classified all manner of variability discovered, releasing this information to the community in real-time. This is the paradigm, writ exponentially larger, of the survey I've now spent 8 years working to bring to fruition, the Large Synoptic Survey Telescope (LSST). I have contributed directly to coding of its core algorithmic framework, to implementing novel algorithms within this framework, analyzing $10^{9.5}$ row precursor databases for quality assessment, generating interactive visualizations of our results, and illuminating the interface between the data and the science questions that will be asked of it. Our team has generated over 350k source lines of code in C++ and Python, all in an open-source environment. This project will change the world. But with LSST poised for 8 years of construction, it is prudent to step back and consider how, even with 16 years of lead time and a focus on software as a primary project risk, it might fail to live up to its full potential.</p>
<p><strong>Future Research Direction</strong>: Our world is monitored by sensors, and our lives are awash in information generated by them. These hyper-aware devices return time-series information on heart rates, room temperatures, stock transactions, self-driving cars, and variable stars in the farthest reaches of our Galaxy. In many cases, these devices, or engines that feed off them, must make autonomous decisions based upon received or derived information: to defibrillate, turn on the air conditioning, go short, perform emergency maneuvers, or redirect the largest telescope in the world to observe a once-in-our-Galaxy's-lifetime cosmological event. The interpretation of these streams has real-world consequences, thus the need for effective and practical modeling of them is paramount. This is inarguably a problem where the 21st century's deluge of data has outstripped our abilities to parse, understand, and make informed (let alone optimal) decisions off of them.</p>
<p>Such streams share common characteristics, in that they contain a (typically multi-variate) set of floating point numbers, include measurement noise, and are time-stamped. The data within these streams are generally correlated in time; if multi-variate, they are often correlated across channels. This opens the problem to rich families of models (autoregressive; state-based) that have historically been used for time-series analysis, with the crux being that modern information presents much higher dimensionality and rate than these algorithms have been previously faced with. Key challenges to using these models include being able to reduce the dimensionality of the problem to something intellectually intuitable and computationally tractable, generating a model that provides backcasting and forecasting abilities, and doing so with calibrated uncertainties. In practical terms this requires a marriage of well-tested time-series models with a modern understanding of statistical inference, confidence modeling and visualization, implemented with the flexibility to perform on a distributed, highly parallel compute infrastructure.</p>
<p>I propose here the development of a probabilistic, multi-variate time-series modeling and classification engine that yields predictions in real-time using irregularly sampled and noisy data. The compute must be scalable, and tuned to operate on distributed or GPU infrastructures. The models must be flexible and hierarchical to capture uncertainties both in the data and in the models themselves. Tools will be generated to visualize and interact with the models, and services implemented to generate real-time predictions as the models evolve under the flood of data. Astronomy provides a particularly difficult use case for this class of problems, given the irregular structure of our data. Accordingly, solutions generated in this domain are likely to be broadly applicable to other types of time-series modeling, which we will make a key consideration in our work.</p>
<p>The risk of not funding such ideas is that <strong>the process of discovery on large datasets is not guaranteed</strong>. Without this sort of innovation, gaining understanding of the structure in time-series datasets will remain beyond the reach of many domain scientists, who simply will not have the tools to undertake the requisite analyses. To make this work most accessible to them, all of our code will be developed under open source license. We will integrate and build upon extant tools (e.g. <a href="http://ipython.org/">iPython</a>) that have demonstrably altered the landscape in terms of reproducible science and collaborative coding, to foster the highest likelihood of creating an impactful project. We will operate in close collaboration with UW's <a href="http://escience.washington.edu/">eScience</a> institute to ensure that this project is undertaken in an eclectic, interdisciplinary environment. <strong>To summarize, this proposal would effectively create a center focused on time domain science within the broader UW eScience ecosystem</strong>, which includes a data science environment supported by the Moore and Sloan Foundations. This idea speaks directly to the DDD Initiative in that its advances will be cross-disciplinary and have immediate real-world applicability to extant and future data sets.</p><script type= "text/javascript">
if (!document.getElementById('mathjaxscript_pelican_#%@#$@#')) {
var mathjaxscript = document.createElement('script');
mathjaxscript.id = 'mathjaxscript_pelican_#%@#$@#';
mathjaxscript.type = 'text/javascript';
mathjaxscript.src = 'https:' == document.location.protocol
? 'https://c328740.ssl.cf1.rackcdn.com/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML'
: 'http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
mathjaxscript[(window.opera ? "innerHTML" : "text")] =
"MathJax.Hub.Config({" +
" config: ['MMLorHTML.js']," +
" TeX: { extensions: ['AMSmath.js','AMSsymbols.js','noErrors.js','noUndefined.js'], equationNumbers: { autoNumber: 'AMS' } }," +
" jax: ['input/TeX','input/MathML','output/HTML-CSS']," +
" extensions: ['tex2jax.js','mml2jax.js','MathMenu.js','MathZoom.js']," +
" displayAlign: 'center'," +
" displayIndent: '0em'," +
" showMathMenu: true," +
" tex2jax: { " +
" inlineMath: [ ['$','$'] ], " +
" displayMath: [ ['$$','$$'] ]," +
" processEscapes: true," +
" preview: 'TeX'," +
" }, " +
" 'HTML-CSS': { " +
" styles: { '.MathJax_Display, .MathJax .mo, .MathJax .mi, .MathJax .mn': {color: 'black ! important'} }" +
" } " +
"}); ";
(document.body || document.getElementsByTagName('head')[0]).appendChild(mathjaxscript);
}
</script>
<h1>Odds of Correctly Predicting the NCAA Basketball Tournament</h1>
<p>Andrew Becker, 2014-02-10T10:58:00-08:00</p>
<hr />
<p>I was briefly interviewed for a
<a href="http://online.wsj.com/news/articles/SB10001424052702304450904579367153999135482?mod=Business_newsreel_3">Wall Street Journal</a>
article on Warren Buffett's insuring of a $\$1$ billion prize for
any contestant who correctly picks all 63 games in the 2014 NCAA
basketball tournament (see also
<a href="http://www.marketwatch.com/story/yahoo-warren-buffett-and-a-1-billion-contest-2014-02-09-194491044">here</a>). Don't ask
how this happened; it involves consorting with a bunch of degenerates. As stated in the
article, "Mr. Buffett's Berkshire Hathaway would take on the risk, and earn a fee for doing so". My position
as stated in the article is that the premium paid to Berkshire Hathaway was unnecessary, since picking the outcome of all 63 games
correctly is so unlikely. A great deal for Mr. Buffett, but that is indeed his reputation. I expand on these thoughts below.
</p>
<p>Naively, the odds of picking 63 games correctly are as follows. If
the outcome of a single game was random, the odds of either team
winning (or of correctly picking the winner) would be 1 in 2, or 50%.
The odds of anyone picking the correct outcome of 2 games (team A vs
team B; and team C vs team D) are 1 in 4 (either A and C win; A and D
win; B and C win; or B and D win), or 25%. A pattern quickly emerges:
the chance of picking $N$ games correctly goes like $1/2^N$. This
generalizes even further: assuming odds $O$ of any particular choice
being correct (which is 50%, or 0.5, above), the likelihood of picking
$N$ games correctly is $O^N$. Naively assuming odds of 0.5 for each of
the 63 games in the tournament, the chance of getting them all correct is
$0.5^{63} = 1.1 \times 10^{-19}$, roughly 1-to-$9.2 \times
10^{18}$, or 9.2 billion billion. This is also approximately the
number of grains of
<a href="http://www.npr.org/blogs/krulwich/2012/09/17/161096233/which-is-greater-the-number-of-sand-grains-on-earth-or-stars-in-the-sky">sand</a>
on the entire planet Earth. <strong>Such is the power of exponential growth</strong>,
that you can double something, double it again, and by doing this 63
times you have an entire planet's worth of material. This process is not
unlike the growth in technology we currently find ourselves in the
midst of, where computing power is basically doubling every 18-24
<a href="http://en.wikipedia.org/wiki/Moore%27s_law">months</a>. A quick
computation suggests that we are around 20 doublings into this process.</p>
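<p>The naive coin-flip estimate above can be checked in a few lines. This is only a sketch: the function name <code>perfect_bracket_probability</code> is made up for illustration, and it assumes every game is independent with the same chance of being called correctly.</p>

```python
from fractions import Fraction


def perfect_bracket_probability(p_correct, n_games=63):
    """Chance of calling n_games independent games, each with chance p_correct."""
    return p_correct ** n_games


# Coin-flip assumption: every game is a 50/50 call.
naive = perfect_bracket_probability(Fraction(1, 2))
one_in = 1 / naive  # exact number of distinct 63-game brackets, i.e. 2**63

print(f"probability = {float(naive):.2e}")  # probability = 1.08e-19
print(f"1 in {float(one_in):.2e}")          # 1 in 9.22e+18
```

<p>Using <code>Fraction</code> keeps the count of possible brackets exact before it is rounded for display.</p>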
<p>So in this ideal world, $\$9.2$ billion billion should be the
monetary payout for correctly calling all 63 games of the NCAA
tournament, not a mere $\$1$ billion. To think about this another
way, each dollar in that $\$1$ billion payout should itself be worth
$\$9.2$ billion. The <a href="http://en.wikipedia.org/wiki/Vigorish">vig</a> on
this bet would be enormous, perhaps unprecedented.</p>
<p>Now, the above assumes that the outcome of each game is basically a coin
flip. But we know this to <strong>not</strong> be the case. Teams are seeded
based on (among other considerations) their perceived strength, from 1
to 16, with #1 seeds expected to proceed farther into the tournament
than #16 seeds. In addition, the first and sixteenth seeds play each
other in the first round of the tournament, so this is nowhere near a
coin flip. In the WSJ article, Mr. Buffett is quoted as saying the
odds can't be calculated, and while this might be technically true
(no-one can predict the future), they can be estimated. </p>
<h2>Estimating The Odds</h2>
<p>To understand the true odds of predicting all 63 games of an NCAA
tournament correctly, one would ideally like many realizations of all
possible scenarios, to understand what fraction of the time the one
particular scenario of interest played out. This is of course
impossible without the assistance of a Level-III
<a href="http://en.wikipedia.org/wiki/Multiverse">multiverse</a>, let alone that
there are 9.2 billion billion ways a single tournament in our Universe
could unfold. So we take one step in a data-driven look at this
problem, which is to calculate the average odds $O$ that the favored
team beats the underdog. This is somewhat of a "best case scenario",
in the sense that one could pick the next round of a bracket after
each previous one, picking only the higher seed of the two competitors
for any given game.</p>
<p>As a caveat, this is a very coarse parameterization of our ignorance of
just how the NCAA seeding rules create matchups that are able to be
predicted. An additional complication is that the seeding rules yield
different teams in a given seed each year. And each of these teams
will have in general different players and perhaps even coaches than
its previous entry. This is the system complexity that Mr. Buffett
alludes to in the WSJ article.</p>
<p>To start this investigation, I obtained
<a href="http://apps.washingtonpost.com/sports/apps/live-updating-mens-ncaa-basketball-bracket/search/?pri_school_id=&pri_conference=&pri_coach=&pri_seed_from=1&pri_seed_to=16&pri_power_conference=&pri_bid_type=&opp_school_id=&opp_conference=&opp_coach=&opp_seed_from=1&opp_seed_to=16&opp_power_conference=&opp_bid_type=&game_type=7&from=1985&to=2013&submit=">here</a>
a list of all the matchups in the NCAA basketball tournaments going
back to 1985. From here it was a trivial computation to determine
what fraction of higher-seeded teams typically won. Since there are 4
teams in each tournament with a given numerical seed, I have ignored all
competitions where equal-numbered seeds went against each other.</p>
<h2>Results</h2>
<p>I first split the data up into the 6 rounds: "First Round", "Second Round", "Sweet 16", "Elite Eight", "Final Four", "National Championship".
The fraction of games won by the higher-seeded team in each round is listed below:</p>
<div class="CSSTableGenerator" >
<table >
<tr>
<td> Value </td>
<td> </td>
<td> First Round </td>
<td> Second Round </td>
<td> Sweet 16 </td>
<td> Elite Eight </td>
<td> Final Four </td>
<td> National Championship </td>
</tr>
<tr>
<td> Total number of games </td>
<td> </td>
<td> 928 </td>
<td> 464 </td>
<td> 232 </td>
<td> 116 </td>
<td> 43 </td>
<td> 23 </td>
</tr>
<tr>
<td> Number won by higher seed </td>
<td> </td>
<td> 693 </td>
<td> 327 </td>
<td> 166 </td>
<td> 66 </td>
<td> 28 </td>
<td> 17 </td>
</tr>
<tr>
<td> Fraction won by higher seed </td>
<td> </td>
<td> 75% </td>
<td> 70% </td>
<td> 72% </td>
<td> 57% </td>
<td> 65% </td>
<td> 74% </td>
</tr>
</table>
</div>
<p>As expected, the higher ranked seed performed exceptionally well in
the first round, winning approximately 75% of their games. From the
second round on, this advantage is somewhat lessened, but there are
also fewer games in the sample. Note, this analysis does <em>not</em> take
into account e.g. a particularly outstanding lower seed that left a
trail of correlated destruction in their wake (e.g. George Mason,
2006). For the overall numbers, the higher seeds win 72% of their
games.</p>
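<p>As a quick check, the per-round and overall fractions can be recomputed from the counts in the table. The sketch below just copies those counts into a dictionary; the variable names are illustrative and not taken from the original analysis.</p>

```python
# (total games, games won by the higher seed), copied from the table above
rounds = {
    "First Round": (928, 693),
    "Second Round": (464, 327),
    "Sweet 16": (232, 166),
    "Elite Eight": (116, 66),
    "Final Four": (43, 28),
    "National Championship": (23, 17),
}

# Fraction of games won by the higher seed, round by round
fractions = {name: won / total for name, (total, won) in rounds.items()}

# Pooled fraction over every game in the sample
total_games = sum(total for total, _ in rounds.values())
total_won = sum(won for _, won in rounds.values())
overall = total_won / total_games

for name, frac in fractions.items():
    print(f"{name:22s} {frac:.0%}")
print(f"{'Overall':22s} {overall:.0%}")
```

<p>Pooling all 1806 games gives 1297 wins for the higher seed, which is where the overall figure of about 72% comes from.</p>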
<p>So how does this wrap into the bracket contest above? We proceed by
looking at a "best case" scenario, where we can pick the results of
the "Second Round" after seeing the results of the "First Round", of
the "Sweet 16" after seeing the results of the "Second Round", etc.
First, use a likelihood of 0.75 for guessing all 32 "First Round"
games correctly, 0.70 for all 16 "Second Round" games, 0.72 for 8
"Sweet 16" games, 0.57 for 4 "Elite Eight" games, 0.65 for 2 "Final
Four" games, and 0.74 for the 1 Championship Game. This yields a
1-in-1.4 billion chance that the higher ranked team wins every game in
the NCAA tournament, assuming you know the exact teams that go into
each and every game (in truth this is not how the tournament plays
out, as you have to fill out your bracket all at once, before any of
the 63 games are played). However, the above numbers potentially
suffer from the process of "overfitting" where we have subdivided the
data too finely (e.g. per round) and end up multiplying together a
bunch of noisy, uncertain values. If we instead use the overall
average odds of 0.72 for the higher seed to win, we find a more
favorable scenario, a slightly worse than 1-in-1.1 billion chance of
the desired outcome.</p>
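<p>The two best-case estimates can be reproduced with a short script. This sketch assumes that the unrounded win fractions from the table (rather than the rounded 0.75, 0.70, ... values) were used, since that appears to recover the quoted figures; the variable names are illustrative only.</p>

```python
# (games in this round of a single bracket, historical games, won by higher seed)
rounds = [
    (32, 928, 693),  # First Round
    (16, 464, 327),  # Second Round
    (8, 232, 166),   # Sweet 16
    (4, 116, 66),    # Elite Eight
    (2, 43, 28),     # Final Four
    (1, 23, 17),     # National Championship
]

# Per-round model: each round has its own chance that the higher seed wins.
per_round = 1.0
for n_games, total, won in rounds:
    per_round *= (won / total) ** n_games

# Pooled model: one overall rate applied to all 63 games.
overall_rate = sum(won for _, _, won in rounds) / sum(total for _, total, _ in rounds)
pooled = overall_rate ** 63

print(f"per-round rates: 1 in {1 / per_round:.2e}")
print(f"pooled rate:     1 in {1 / pooled:.2e}")
```

<p>Under these assumptions the per-round model comes out near 1-in-1.4 billion and the pooled model near 1-in-1.1 billion. Plugging in the rounded percentages instead shifts both numbers by roughly 10%, which illustrates how sensitive a 63-fold product is to small changes in the inputs.</p>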
<p>No matter how you slice it, given the data on hand, the likelihood of
picking a tournament correctly is minuscule. Only if you
were able to interactively pick your bracket after each round, and
only chose the favorites in each and every game, would you have a
1-in-a-billion chance to win $\$1$ billion. Which is not too
shabby. And the $\$11$ million premium being paid to insure the
money? This is about 1% of the total prize; if the odds of winning
the prize were 1 in 100, this might seem like the appropriate level to
insure the kitty. However, given the rules of the tournament, the
true likelihood of correctly guessing all 63 games is far less than
our best case scenario, which is still only a 1-in-1.1 billion chance.
Given this absolute best likelihood is worse than 1-in-1 billion, and
the value of the prize is (perhaps not coincidentally?) $\$1$ billion,
the true premium on insuring this bet should likely be less than
$\$1$.</p>
<p>Bottom line, this is a great bet for Berkshire Hathaway to make, and
unnecessary insurance for the sponsor of the competition. It is,
however, good PR for all involved.</p>
<ul>
<li>Update 1 (2014-02-10): A friend has pointed out that the contest is
being capped at 10 million entrants, whereas my analysis above is
for a single entry. This detail actually provides an
order-of-magnitude risk of 1% of this contest being won, <strong><em>assuming
the best case scenario</em></strong>. This then calls into question what is the real
risk of the contest, as designed, being won by any of 10 million entries. An
estimate using the <a href="http://en.wikipedia.org/wiki/Geometric_mean">geometric mean</a>
(which is useful in
<a href="http://www.maa.org/publications/periodicals/loci/joma/problem-solving-estimation-and-orders-of-magnitude">order of magnitude</a>
calculations) of the best case and worst case scenarios suggests that
the risk per entry would be around 100,000 billion-to-one ($\sqrt{9.2
\times 10^{18} \times 1.1 \times 10^9}$). So the chance of the contest being won by
any of the 10 million entrants is about 1-in-10 million. Meaning I
would insure that billion for a cool $\$100$. </li>
</ul>
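<p>The arithmetic in the update can be sketched as follows. The inputs are the rounded figures quoted in the post. Treating the chance that any of the entrants wins as simply the number of entrants times the per-entry chance is an approximation that assumes distinct entries and a very small per-entry probability.</p>

```python
import math

naive_one_in = 9.2e18      # coin-flip estimate (worst case) quoted in the post
best_case_one_in = 1.1e9   # pick-the-favorite, round-by-round estimate (best case)
entrants = 10_000_000      # cap on the number of entries
prize = 1_000_000_000      # prize in dollars

# Risk of the contest being won if every entry had the best-case odds
best_case_risk = entrants / best_case_one_in

# Order-of-magnitude per-entry odds: geometric mean of the two extremes
geo_mean_one_in = math.sqrt(naive_one_in * best_case_one_in)

# Chance that any entry wins, and the matching break-even premium
p_any_winner = entrants / geo_mean_one_in
fair_premium = prize * p_any_winner

print(f"best-case risk of a winner: {best_case_risk:.1%}")
print(f"geometric-mean odds: 1 in {geo_mean_one_in:.1e}")
print(f"chance any entry wins: 1 in {1 / p_any_winner:.1e}")
print(f"break-even premium: about ${fair_premium:.0f}")
```

<p>The break-even premium here is just the prize multiplied by the chance of paying it out, so it lands near the "cool" hundred dollars mentioned in the update.</p>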