Put Yourself Out There

Reports from the front on Astrophysics, Statistics, and Data Science

Data Driven Discovery


I've just submitted a proposal to the Moore Foundation's Data Driven Discovery Initiative. This is the first of two rounds of the proposal process. If selected, I would be awarded $1.5M over 5 years, which would be my biggest individual grant yet. Read on for the core proposal idea...


Astrophysics has been revolutionized by the development of science-grade digital imaging devices and by the Moore's-law progression of computing power. The former has enabled wide-field imaging systems on large-aperture telescopes, and the latter ensures their data can be gathered in volume and yet still analyzed in real time. These advances have led to astronomical surveys characterized by successively more extreme combinations of depth, cadence, and sky coverage, challenging astronomers to analyze ever larger data streams. The next generation of surveys is designed to concurrently address many of the outstanding issues in astrophysics, with answers hidden within multi-dimensional archives of data. Their core science goals include separating the novel from the prosaic using time series of information, and modeling and classifying lightcurve features. However, the success of these surveys is not guaranteed without advances in the tools to model and interpret their data.

Major Accomplishments: As faculty at the University of Washington, I have served as PI, Co-I, and project manager of multiple successful grants from NASA and the NSF. I have been a strong advocate for and active participant in the open sharing of data and algorithms. From system administrator to PI, my immersion in time-domain astronomy has provided a detailed understanding of the mechanics behind large-scale data acquisition and analysis. This work has helped me attain recognition as a world leader in time-domain science, with my software contributing to many of the novel astrophysical discoveries of the past decade and in use by many of the current survey projects. This success is reflected in my citation record (in the top 1% in the field of space sciences) and in my H-index of 57. And while many things have changed in the field since I first started, one thing has remained constant: the Universe always reveals itself through its variability.

I have been actively involved in the field of time-domain astrophysics since joining the MACHO project in 1995. MACHO utilized one of the first large-format CCD mosaic cameras – the largest in astronomy at the time – to help usher in the era of "big data" in our field. Our project accumulated 5 TB of raw images and a 200 GB database of derived data before the year 2000. My work involved filtering these data in real-time, searching through the lightcurves of $10^7$ stars, each with $10^3$ observations, for rare events that numbered $10^1$. The main challenge then was data storage and access, which we solved using a room-sized "mass store" carousel that included a robotic arm to retrieve data tapes. The challenges I've faced in each successive survey reflect the progress of technology through the early 21st century: understanding how to implement and use RAIDed disk arrays for high capacity redundant storage; how to distribute processes over multi-node compute clusters while maintaining input/output load balance; how to arrange data and manage memory for large retrospective analyses; how to implement algorithms to take advantage of parallel processing environments; and how to design hierarchical models that reflect structure in the data and allow for complex and insightful analyses. The structure and scale of data have always been significant challenges in my research, overcome with creativity and not a little bit of compromise in the pursuit of progress. Moore's law tends to solve practical issues of capability, but creates new ones of scale. Surmounting this requires reimagined approaches and algorithms that are scalable, tightly integrated with the compute environment, and allow for a significant level of human interaction with, and feedback between, data and model.

At Bell Laboratories, I led the first astrophysical time-domain survey that detected and classified all manner of variability discovered, releasing this information to the community in real time. This is the paradigm, writ exponentially larger, of the survey I've now spent 8 years working to bring to fruition, the Large Synoptic Survey Telescope (LSST). I have contributed directly to coding its core algorithmic framework, to implementing novel algorithms within this framework, to analyzing $10^{9.5}$-row precursor databases for quality assessment, to generating interactive visualizations of our results, and to illuminating the interface between the data and the science questions that will be asked of it. Our team has generated over 350k source lines of code in C++ and Python, all in an open-source environment. This project will change the world. But with LSST poised for 8 years of construction, it is prudent to step back and consider how, even with 16 years of lead time and a focus on software as a primary project risk, it might fail to live up to its full potential.

Future Research Direction: Our world is monitored by sensors, and our lives are awash in information generated by them. These hyper-aware devices return time-series information on heart rates, room temperatures, stock transactions, self-driving cars, and variable stars in the farthest reaches of our Galaxy. In many cases, these devices, or engines that feed off them, must make autonomous decisions based upon received or derived information: to defibrillate, turn on the air conditioning, go short, perform emergency maneuvers, or redirect the largest telescope in the world to observe a once-in-our-Galaxy's-lifetime cosmological event. The interpretation of these streams has real-world consequences, so the need for effective and practical modeling of them is paramount. This is inarguably a problem where the 21st century's deluge of data has outstripped our ability to parse it, understand it, and make informed (let alone optimal) decisions from it.

Such streams share common characteristics: they contain a (typically multi-variate) set of floating-point numbers, include measurement noise, and are time-stamped. The data within these streams are generally correlated in time; if multi-variate, they are often correlated across channels. This opens the problem to rich families of models (autoregressive, state-based) that have historically been used for time-series analysis, with the crux being that modern information arrives at much higher dimensionality and rate than these algorithms have previously faced. Key challenges to using these models include reducing the dimensionality of the problem to something intellectually intuitive and computationally tractable, generating a model that provides backcasting and forecasting abilities, and doing so with calibrated uncertainties. In practical terms, this requires a marriage of well-tested time-series models with a modern understanding of statistical inference, confidence modeling, and visualization, implemented with the flexibility to perform on a distributed, highly parallel compute infrastructure.
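To make the state-based idea concrete, here is a minimal sketch (my own illustration, not code from the proposal) of a Kalman filter for the simplest state-space model, a local level observed with noise. The function name, the toy data, and the fixed variances `q` and `r` are all invented for the example; the point is that the same predict/update recursion that tracks the latent signal also returns a variance, i.e. a calibrated uncertainty that can be propagated into forecasts.

```python
import numpy as np

def kalman_local_level(y, q, r, m0=0.0, p0=1.0):
    """Kalman filter for a local-level (random-walk + noise) model.

    State:       x_t = x_{t-1} + w_t,  w_t ~ N(0, q)
    Observation: y_t = x_t     + v_t,  v_t ~ N(0, r)

    Returns filtered means and variances of the latent level.
    """
    n = len(y)
    m = np.empty(n)   # filtered state means
    p = np.empty(n)   # filtered state variances
    m_pred, p_pred = m0, p0
    for t in range(n):
        # Predict: propagate the state estimate forward one step.
        if t > 0:
            m_pred, p_pred = m[t - 1], p[t - 1] + q
        # Update: fold in the noisy observation y[t].
        k = p_pred / (p_pred + r)          # Kalman gain
        m[t] = m_pred + k * (y[t] - m_pred)
        p[t] = (1.0 - k) * p_pred
    return m, p

# Toy usage: a slowly drifting signal observed with noise.
rng = np.random.default_rng(42)
truth = np.cumsum(rng.normal(0.0, 0.1, 200))   # latent random walk
obs = truth + rng.normal(0.0, 0.5, 200)        # noisy measurements
level, var = kalman_local_level(obs, q=0.1**2, r=0.5**2)

# Forecast h steps ahead: the mean stays flat while the variance grows
# by h*q, so the prediction interval widens honestly with horizon.
h = np.arange(1, 11)
forecast_mean = np.full_like(h, level[-1], dtype=float)
forecast_var = var[-1] + h * 0.1**2
```

Running the same recursion backward over the filtered estimates would provide the backcasts; the models envisioned in the proposal are multi-variate, hierarchical, and far richer, but this predict/update pattern is the piece that has to scale.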

I propose here the development of a probabilistic, multi-variate time-series modeling and classification engine that yields predictions in real time using irregularly sampled and noisy data. The computation must be scalable and tuned to operate on distributed or GPU infrastructures. The models must be flexible and hierarchical to capture uncertainties both in the data and in the models themselves. Tools will be generated to visualize and interact with the models, and services implemented to generate real-time predictions as the models evolve under the flood of data. Astronomy provides a particularly difficult use case for this class of problems, given the irregular structure of our data. Accordingly, solutions generated in this domain are likely to be broadly applicable to other types of time-series modeling, which we will make a key consideration in our work.
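As a flavor of what probabilistic modeling of irregularly sampled, noisy data looks like in practice, here is a minimal NumPy sketch (again my own illustration, not the proposed engine) of Gaussian-process regression with a damped-random-walk (Ornstein-Uhlenbeck) kernel, a model commonly applied to quasar lightcurves. The function names, epochs, error bars, and the fixed `amp` and `tau` hyperparameters are invented for the example.

```python
import numpy as np

def ou_kernel(t1, t2, amp, tau):
    """Damped-random-walk (Ornstein-Uhlenbeck) covariance between two
    sets of observation times; no regular grid is assumed."""
    return amp**2 * np.exp(-np.abs(t1[:, None] - t2[None, :]) / tau)

def gp_predict(t_obs, y_obs, y_err, t_new, amp, tau):
    """Predictive mean and standard deviation of a zero-mean GP with an
    OU kernel, conditioned on noisy, irregularly sampled observations."""
    K = ou_kernel(t_obs, t_obs, amp, tau) + np.diag(y_err**2)
    K_star = ou_kernel(t_new, t_obs, amp, tau)
    mean = K_star @ np.linalg.solve(K, y_obs)
    # Predictive variance: prior variance minus what the data explain.
    v = np.linalg.solve(K, K_star.T)
    var = amp**2 - np.sum(K_star * v.T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

# Toy lightcurve: irregular cadence, heteroscedastic errors (all invented).
rng = np.random.default_rng(0)
t_obs = np.sort(rng.uniform(0.0, 100.0, 60))   # observation epochs (days)
y_err = rng.uniform(0.05, 0.2, 60)             # per-point uncertainties (mag)
y_obs = 0.3 * np.sin(2 * np.pi * t_obs / 37.0) + rng.normal(0.0, y_err)

t_new = np.linspace(0.0, 120.0, 500)           # interpolate and forecast
mean, std = gp_predict(t_obs, y_obs, y_err, t_new, amp=0.3, tau=20.0)
```

In the proposed engine, a kernel like this would be one building block inside a hierarchical, multi-variate model, with quantities such as `amp` and `tau` inferred from the data rather than fixed by hand, and with the linear algebra restructured to run on distributed or GPU hardware.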

The risk of not funding such ideas is that the process of discovery on large datasets is not guaranteed to succeed. Without this sort of innovation, understanding the structure in time-series datasets will remain beyond the reach of many domain scientists, who simply will not have the tools to undertake the requisite analyses. To make this work most accessible to them, all of our code will be developed under an open-source license. We will integrate and build upon extant tools (e.g. IPython) that have demonstrably altered the landscape of reproducible science and collaborative coding, to foster the highest likelihood of creating an impactful project. We will operate in close collaboration with UW's eScience Institute to ensure that this project is undertaken in an eclectic, interdisciplinary environment. To summarize, this proposal would effectively create a center focused on time-domain science within the broader UW eScience ecosystem, which includes a data science environment supported by the Moore and Sloan Foundations. This idea speaks directly to the DDD Initiative in that its advances will be cross-disciplinary and have immediate real-world applicability to extant and future datasets.