The Art of Data Augmentation

Journal of Graphical and Computational Statistics

David A. van Dyk
Department of Statistics, Harvard University, Cambridge, MA 02138, U.S.A.
Xiao-Li Meng
Department of Statistics, The University of Chicago, Chicago, IL 60637, U.S.A.

The term data augmentation refers to methods for constructing iterative algorithms via the introduction of unobserved data or latent variables. For deterministic algorithms, the method was popularized in the general statistical community by the seminal paper of Dempster, Laird, and Rubin (1977) on the EM algorithm for maximizing a likelihood function or more generally a posterior density. For stochastic algorithms, the method was popularized in the statistical literature by Tanner and Wong's (1987) Data Augmentation algorithm for posterior sampling, and in the physics literature by Swendsen and Wang's (1987) algorithm for sampling from Ising and Potts models and it generalizations; in the physics literature, the method of data augmentation is referred to as the method of auxiliary variables. Data augmentation schemes were used by Tanner and Wong to make simulation feasible and simple, while auxiliary variables were adopted by Swendsen and Wang to improve the speed of iterative simulation. In general, however, constructing data augmentation schemes that result in both simple and fast algorithms is a matter of art in that successful strategies vary greatly with the observed-data models being considered. After an overview of data augmentation/auxiliary variables and some recent developments in methods for constructing such efficient data augmentation schemes, we introduce an effective general strategy which combines the ideas of marginal augmentation and conditional augmentation, together with a deterministic approximation method for selecting good augmentation schemes. We then apply this strategy to three common classes of models, i.e., multivariate t, probit regression, and mixed-effects models, to obtain efficient Markov chain Monte Carlo algorithms for posterior sampling. We provide theoretical and empirical evidence that the resulting new algorithms, while requiring similar computational efforts, can be much faster to converge than the common Gibbs samplers adopted for these models in practice.

Keywords: AUXILIIARY VARIABLES; EM ALGORITHM; GIBBS SAMPLER; HIERARCHICAL MODELS; MARKOV CHAIN MONTE CARLO; PROBIT REGRESSION; MIXED-EFFECTS MODELS; RATE OF CONVERGENCE.

Code for the two-level Gaussian model is available.

Return to David van Dyk's homepage.