AI & Statistics 2009

AISTATS*09 Invited Speakers

Bill Cleveland

William S. Cleveland

Department of Statistics, Purdue University

Data Visualization: Retaining the Information in Data

Data visualization is critical to data analysis. It retains the information in the data. If a statistical model fitted to a dataset is a poor approximation of the patterns in the data, the ensuing inferences based on the model will likely be incorrect. Visualization is the critical methodology for model checking and revision. It is not sufficient to simply optimize a numeric model selection criterion to find the best model in a class because it is possible that no model in the class fits the data.
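A standard illustration of this point (not from the talk) is Anscombe's quartet: two of its datasets yield nearly identical fitted lines and summary statistics, so any numeric selection criterion rates the linear model equally well on both, yet a residual plot immediately reveals that one of them is quadratic. The sketch below computes a numeric proxy for that visual check; the `curvature_signal` helper is hypothetical, invented here for illustration.

```python
import numpy as np

# Anscombe's quartet, sets I and II: identical x, nearly identical
# linear fits, but set II is actually quadratic in x.
x  = np.array([10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5], dtype=float)
y1 = np.array([8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68])
y2 = np.array([9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74])

def fit_line(x, y):
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)
    return slope, intercept, resid

s1, b1, r1 = fit_line(x, y1)
s2, b2, r2 = fit_line(x, y2)

# The numeric summaries are essentially indistinguishable...
print(f"slopes: {s1:.3f} vs {s2:.3f}; intercepts: {b1:.2f} vs {b2:.2f}")

# ...yet plotting residuals against x would show random scatter for
# set I and a clear parabola for set II. A numeric stand-in for that
# visual check: correlation of the residuals with x^2, after removing
# the part of x^2 explained linearly by x.
def curvature_signal(x, resid):
    x2 = x**2 - np.polyval(np.polyfit(x, x**2, 1), x)  # de-trended x^2
    return np.corrcoef(x2, resid)[0, 1]

print(f"curvature in residuals: {curvature_signal(x, r1):.2f}"
      f" vs {curvature_signal(x, r2):.2f}")
```

The fitted models are numerically equivalent, but the residual structure, the thing visualization shows at a glance, differs completely.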

If a machine learning method is to be used for a task such as classification, an understanding of the patterns in the data through visualization helps determine which classification methods are likely to work best. Using the best performer out of a number of machine learning methods, without checking patterns, is not sufficient because it is possible that all tested methods perform poorly compared with methods not tried, including new methods developed on the spot to take advantage of the patterns discovered by the data visualization.

Visualization and numeric methods of statistics and machine learning go hand in hand in the analysis of a set of data. The numeric methods exploit the immense tactical power of the computer in analyzing the data. Visualization exploits the immense strategic power of the human in analyzing the data. The combination provides the best chance to preserve the information in the data. This combination was the basis of the controversy that arose in Kasparov's matches against the chess computer Deep Blue.

The merger of numeric methods and visualization is illustrated by ed, a new method of nonparametric density estimation. There are two goals in its design: estimation that adapts to a wide range of patterns in data, and a mechanism that allows effective diagnostic checking with visualization tools to see if ed did a good job.
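The abstract does not describe ed's internals, so as a generic stand-in the sketch below uses an ordinary Gaussian kernel density estimate (scipy's `gaussian_kde`, not ed itself) to illustrate the estimate-then-diagnose loop: a numeric fitting step followed by a check that would normally be carried out visually.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# A bimodal sample: structure a misspecified parametric model would
# miss but a nonparametric estimate should capture.
data = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(1, 1.0, 500)])

kde = gaussian_kde(data)          # numeric step: fit the estimate
grid = np.linspace(-6, 6, 500)
density = kde(grid)

# Diagnostic step (normally a plot): compare the fitted CDF to the
# empirical CDF at each observation. A large discrepancy flags a poor
# fit, exactly what one would look for in a diagnostic display.
fitted_cdf = np.array([kde.integrate_box_1d(-np.inf, v)
                       for v in np.sort(data)])
empirical_cdf = np.arange(1, len(data) + 1) / len(data)
max_gap = np.max(np.abs(fitted_cdf - empirical_cdf))  # KS-style statistic
print(f"max CDF discrepancy: {max_gap:.3f}")
```

In practice the discrepancies would be plotted against the data rather than summarized by a single number; the point is that the estimator is designed so such checks are possible at all.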

Very large datasets, ubiquitous today, challenge data visualization as they do all areas involved in the analysis of data. Comprehensive visualization that preserves the information in the data requires a visualization database (VDB): many displays, some with many pages and with one or more panels per page. A single display typically results from breaking the data into subsets, and then using the same graphical method to plot each subset, one per panel. A VDB is queried on an as-needed basis with a viewer. Some displays might be studied in their entirety; for others, studying only a small fraction of the pages might suffice. On-the-fly computation without storage does not generally succeed because computation time is too large.
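A minimal sketch of the precompute-then-query idea, assuming a toy dataset split by site (all names and the file layout here are hypothetical): each subset is rendered with the same graphical method into one panel, panels are grouped into pages, and the pages are written to disk once so a viewer can later query them without touching the raw data.

```python
import json, os, tempfile
import numpy as np

rng = np.random.default_rng(1)
# Toy "large" dataset: measurements tagged by site; one panel per site.
sites = [f"site-{i:02d}" for i in range(12)]
data = {s: rng.normal(loc=i, scale=1.0, size=10_000)
        for i, s in enumerate(sites)}

vdb_dir = tempfile.mkdtemp(prefix="vdb-")

# Precompute: one record per panel, the same graphical method (here, a
# histogram) applied to every subset. The expensive pass over the data
# happens once, before viewing -- the point of storing a VDB rather
# than computing displays on the fly.
PAGE_SIZE = 4  # panels per page
for page, start in enumerate(range(0, len(sites), PAGE_SIZE)):
    panels = []
    for s in sites[start:start + PAGE_SIZE]:
        counts, edges = np.histogram(data[s], bins=30)
        panels.append({"site": s,
                       "counts": counts.tolist(),
                       "edges": edges.tolist()})
    path = os.path.join(vdb_dir, f"display-hist-page{page:03d}.json")
    with open(path, "w") as f:
        json.dump(panels, f)

# A viewer would now fetch pages on demand instead of recomputing them.
pages = sorted(os.listdir(vdb_dir))
print(f"{len(pages)} pages written to {vdb_dir}")
```

A real VDB would store rendered displays and serve many display types; the stored-pages structure is the part this sketch aims to show.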

The sizes and numbers of displays in VDBs require a rethinking of all areas involved in data visualization, including the following: methods of display design that enhance pattern perception to enable rapid page scanning; automation algorithms for basic display elements such as the aspect ratio, scales across panels, line types and widths, and symbol types and sizes; methods for subset view selection; and viewers designed for multi-panel, multi-page displays that work with different amounts of physical screen space.
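The best-known aspect-ratio automation is Cleveland's own "banking to 45 degrees." The sketch below implements only the simple median-absolute-slope variant, not the full method, and the function name is invented here: it picks a height-to-width ratio that orients the median line segment at about 45 degrees, the angle at which rate-of-change judgments are most accurate.

```python
import numpy as np

def bank_to_45(x, y):
    """Median-absolute-slope banking (simplified variant of
    Cleveland's banking principle): choose a height/width aspect
    ratio so the median segment of a line plot is at ~45 degrees."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    dx = np.diff(x) / (x.max() - x.min())   # segment runs, panel units
    dy = np.diff(y) / (y.max() - y.min())   # segment rises, panel units
    slopes = np.abs(dy[dx != 0] / dx[dx != 0])
    # The physical slope of segment i is slopes[i] * (height/width),
    # so height/width = 1/median makes the median slope 1 (45 degrees).
    return 1.0 / np.median(slopes)

# Example: a rapidly oscillating series. Banking recommends a short,
# wide panel, which is what makes the local slopes readable.
t = np.linspace(0, 20 * np.pi, 2000)
aspect = bank_to_45(t, np.sin(t))
print(f"suggested height/width: {aspect:.3f}")
```

For a straight line the rule returns an aspect ratio of 1 (a 45-degree diagonal), while steep oscillations get a ratio far below 1, matching the usual banking recommendation.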

Bio: William S. Cleveland is the Shanti S. Gupta Distinguished Professor of Statistics and Courtesy Professor of Computer Science at Purdue University. His areas of methodological research are in statistics, machine learning, and data visualization.

Cleveland has analyzed data sets ranging from very small to very large in his research in computer networking, homeland security, visual perception, environmental science, healthcare engineering, and customer opinion polling. In the course of this work he has developed many new methods that are widely used throughout engineering, science, medicine, and business.

In 2002 he was selected as a Highly Cited Researcher by the American Society for Information Science & Technology in the newly formed mathematics category. In 1996 he was chosen national Statistician of the Year by the Chicago Chapter of the American Statistical Association. He is a Fellow of the American Statistical Association, the Institute of Mathematical Statistics, the American Association for the Advancement of Science, and the International Statistical Institute.
Carlos Carvalho

Carlos M. Carvalho

Booth School of Business, University of Chicago

Handling Sparsity via the Horseshoe

In this talk, I will present a new approach to sparse-signal detection called the horseshoe estimator. The horseshoe is a close cousin of the lasso in that it arises from the same class of multivariate scale mixtures of normals, but it is a more robust alternative for handling unknown sparsity patterns. A theoretical framework is proposed for understanding why the horseshoe is a better default sparsity estimator than those that arise from powered-exponential priors. Comprehensive numerical evidence is presented to show that the difference in performance can often be large. Most importantly, I will show that the horseshoe estimator corresponds quite closely to the answers one would get if one pursued a full Bayesian model-averaging approach using a point mass at zero for noise and a continuous density for signals. Surprisingly, this correspondence holds both for the estimator itself and for the classification rule induced by a simple threshold applied to the estimator. For most of this talk I will study sparsity in the simplified context of estimating a vector of normal means. It is here that the lessons drawn from a comparison of different approaches for modeling sparsity are most readily understood, but these lessons generalize straightforwardly to more difficult problems (regression, covariance regularization, function estimation) where many of the challenges of modern statistics lie. This is joint work with Nicholas Polson and James Scott.
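A Monte Carlo sketch of where the name comes from (an illustration assuming the standard normal-means setup with unit noise and global scale tau = 1, not the authors' code): in the scale-mixture form, each mean has a half-Cauchy local scale, and the induced posterior shrinkage factor kappa follows a Beta(1/2, 1/2) density shaped like a horseshoe, piling up near 0 (leave signals alone) and near 1 (shrink noise to zero).

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000

# Horseshoe prior in scale-mixture form (tau = 1, unit noise):
#   theta_i | lambda_i ~ N(0, lambda_i^2),   lambda_i ~ C+(0, 1)
lam = np.abs(rng.standard_cauchy(n))       # half-Cauchy local scales

# For y_i ~ N(theta_i, 1), the conditional posterior mean is
#   E[theta_i | y_i, lambda_i] = (1 - kappa_i) * y_i,
# with shrinkage factor kappa_i = 1 / (1 + lambda_i^2).
kappa = 1.0 / (1.0 + lam**2)

# With tau = 1, kappa is Beta(1/2, 1/2): unbounded at both ends,
# thin in the middle -- the "horseshoe" shape.
near_zero = np.mean(kappa < 0.1)   # mass favoring "leave signal alone"
near_one  = np.mean(kappa > 0.9)   # mass favoring "shrink to zero"
middle    = np.mean((kappa > 0.45) & (kappa < 0.55))
print(f"P(kappa<0.1)={near_zero:.3f}  P(kappa>0.9)={near_one:.3f}  "
      f"P(0.45<kappa<0.55)={middle:.3f}")
```

By contrast, a lasso-type (Laplace) prior puts its shrinkage mass in the middle, which is one intuition for why it handles unknown sparsity patterns less robustly.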

Bio: Dr. Carvalho is an assistant professor of econometrics and statistics at The University of Chicago Booth School of Business. Before coming to Chicago, he was part of the Department of Statistical Sciences at Duke University, first as a Ph.D. student and then as a postdoc under the supervision of Professor Mike West. His research focuses on Bayesian statistics in complex, high-dimensional problems with applications ranging from finance to genetics. Some of his current interests include work on large-scale factor models, graphical models, Bayesian model selection, sequential Monte Carlo methods and stochastic volatility models.

Mark Hansen

Mark H. Hansen

Department of Statistics, UCLA

Words to look at, words to listen to

I will report on some recent collaborative artworks that draw on dynamic data sources. I will spend most of my time on Moveable Type, a large installation for the lobby of the New York Times building in New York City (co-created with Ben Rubin, EAR Studio). In this case, the data sources include a feed of the Times' news stories, an hourly dump of their web access and search logs (a sample, suitably anonymized), and the complete archive back to 1851. I will also spend time on a new work, Exits, which is part of the Terre Natale exhibition at the Cartier Foundation in Paris (co-created with Diller Scofidio + Renfro, Laura Kurgan and Ben Rubin). The centerpiece of our installation is a large, circular projection that tells the story of global human migration and its causes.

Bio: Mark Hansen is an Associate Professor of Statistics at UCLA, where he also holds joint appointments in the departments of Electrical Engineering and Design|Media Art. He is currently serving as Co-PI at CENS, the Center for Embedded Networked Sensing, an NSF STC devoted to research into the design and deployment of sensor networks.