Current Research in the SCIVI Lab

Overcomplete, topographically ordered map of receptive fields learned from natural image patches using our Products of Student-T (PoT) model.

*Visual object class taxonomies*

In a brief instant we can recognize dozens of objects in our visual field. Some researchers have estimated the number of object categories we know to be around 50,000. Given the computational "limitations" of our brain, we are certainly not scanning all 50,000 templates of known objects and picking the best match. One is led to believe that object categories are represented in a smart way, for instance in a tree structure, so that one can search much more efficiently. There are other advantages too: new object categories can be learned from very few examples by drawing on information from visually similar objects (so-called transfer learning or one-shot learning). For instance, a cat looks much like a dog, but is a bit smaller, less fluffy and has much sharper nails.
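As a toy illustration of why a taxonomy helps, the sketch below searches a two-level tree by comparing the query only against the children of the current node instead of against every leaf category. The categories and similarity scores are hypothetical, not output of our actual model:

```python
# Hypothetical toy taxonomy: at each node we only compare the query
# against a handful of children instead of all leaf categories.
taxonomy = {
    "animal": {"dog": {}, "cat": {}},
    "vehicle": {"car": {}, "bike": {}},
}

# Illustrative similarity scores for a query image of a cat; in a real
# system these would come from a learned visual model.
score = {"animal": 0.9, "vehicle": 0.1, "dog": 0.4, "cat": 0.8}

def tree_search(node):
    """Greedily descend the taxonomy, following the best-scoring child."""
    path = []
    while node:
        best = max(node, key=lambda c: score.get(c, 0.0))
        path.append(best)
        node = node[best]
    return path

print(tree_search(taxonomy))  # ['animal', 'cat']
```

With 50,000 leaf categories and a balanced tree, such a search touches only on the order of the tree depth times the branching factor, rather than all 50,000 templates.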

This project, in collaboration with Pietro Perona's group at Caltech, uses new statistical techniques to infer these representations from image data in an unsupervised setting. The techniques are based on the Dirichlet process and are able to induce model structure from data: the more data is presented to the system, the more complex the representation is allowed to grow. Compare this with a baby growing up: at first she will only recognize mom's face, but pretty soon dad's face is recognized too, his big nose, his red hair, etc. One could say that we are trying to "raise a robot".
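The growth behaviour of the Dirichlet process can be seen in its sequential "Chinese restaurant" view. The sketch below is a generic sampler of this process (not our inference code): each new data point joins an existing cluster with probability proportional to the cluster's size, or opens a new cluster with probability proportional to a concentration parameter, so the number of clusters keeps growing slowly as more data arrive:

```python
import random

def crp(n, alpha=1.0, seed=0):
    """Sample cluster assignments from a Chinese restaurant process,
    the sequential view of the Dirichlet process.  Point i joins an
    existing cluster with probability proportional to its size, or
    starts a new cluster with probability proportional to alpha."""
    random.seed(seed)
    counts = []                      # points per cluster
    for i in range(n):
        probs = counts + [alpha]     # unnormalized assignment weights
        r = random.random() * (i + alpha)
        acc = 0.0
        for k, p in enumerate(probs):
            acc += p
            if r < acc:
                break
        if k == len(counts):
            counts.append(1)         # new cluster: the model grows
        else:
            counts[k] += 1
    return counts

# More data typically means more clusters (growth ~ alpha * log n).
print(len(crp(10)), len(crp(1000)))
```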

Affiliated people:

Ian Porteous - PhD candidate

Evgeniy Bart - postdoctoral fellow

Pietro Perona - professor Caltech

Representative publications:

*Generalized belief propagation*

Graphical models have become a very powerful tool to design, interpret and communicate probabilistic models. There are two main classes: directed models (or Bayesian networks) and undirected models (or Markov random fields). These two species of model follow different semantics, i.e. the lists of conditional and marginal independence relations that hold in these models are usually different.

There are a number of important operations that make these models practical tools. For instance, given a model structure and parameter values, and given observed values for a subset of the random variables in the graph, can we infer the posterior probability distribution over all the other variables? This is called "inference" in an AI setting. Many other operations, such as learning the parameter values and structure of the model, or finding the most likely state values given evidence, crucially depend on inference or are very similar to it in nature.

Unfortunately, inference quickly becomes intractable when the graph has loops (in fact it is exponential in the treewidth of the graph), so approximations become necessary. We are studying a relatively new class of algorithms that go under the name "generalized belief propagation". These algorithms propagate the necessary information between neighboring clusters of nodes in the graph. It is however unclear which clusters to choose, and we have found that choosing the wrong clusters (for instance too many) can seriously degrade performance. We have found very neat connections with graph theory, but a clear picture has not yet fully emerged.
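For intuition, ordinary belief propagation on a loop-free graph already gives exact marginals by passing messages between neighboring nodes; generalized belief propagation extends this idea to messages between clusters of nodes when the graph has loops. The toy example below uses plain sum-product on a three-node chain with illustrative potentials (numbers chosen for the example, not from any real model) and checks the message-passing answer against brute-force enumeration:

```python
import itertools

# Unary and pairwise potentials for a chain x0 - x1 - x2 of binary
# variables (illustrative numbers).
phi = [[1.0, 2.0], [1.5, 0.5], [1.0, 1.0]]   # phi[i][x_i]
psi = [[2.0, 1.0], [1.0, 2.0]]               # psi[x_i][x_{i+1}], shared

def brute_marginal(i):
    """Exact marginal of variable i by summing over all 2^3 states."""
    p = [0.0, 0.0]
    for x in itertools.product([0, 1], repeat=3):
        w = phi[0][x[0]] * phi[1][x[1]] * phi[2][x[2]]
        w *= psi[x[0]][x[1]] * psi[x[1]][x[2]]
        p[x[i]] += w
    z = sum(p)
    return [v / z for v in p]

def bp_marginal_middle():
    """Sum-product marginal of x1: combine the two incoming messages."""
    m01 = [sum(phi[0][a] * psi[a][b] for a in (0, 1)) for b in (0, 1)]
    m21 = [sum(phi[2][c] * psi[b][c] for c in (0, 1)) for b in (0, 1)]
    belief = [phi[1][b] * m01[b] * m21[b] for b in (0, 1)]
    z = sum(belief)
    return [v / z for v in belief]

print(brute_marginal(1), bp_marginal_middle())  # identical on a tree
```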

Affiliated people:

Ezekiel Bhasker - PhD candidate

Yee Whye Teh - professor Gatsby Unit (UCL)

Tom Minka - researcher, Microsoft Research Cambridge

David Eppstein - professor UCI

Representative publications:

On the Choice of Regions for Generalized Belief Propagation

(Welling - UAI 04)

Structured Region Graphs: Morphing EP into GBP

(Welling, Minka, Teh - UAI 05)

*Harmoniums*

Directed models have become the dominant paradigm in machine learning. There are many good reasons for this: they are easy to sample from, there is a nice framework (expectation maximization) to learn the parameters, and it is even possible to search for the optimal structure in a fully Bayesian setting. Undirected models are much harder to handle, in particular because of the presence of a normalization constant (or partition function) that depends on the parameters and is usually intractable to compute.

Harmoniums are two-layer undirected graphical models (see picture below) where the top layer represents an array of hidden variables (or topic variables) while the bottom layer represents the observed random variables (e.g. the count values of words in documents). Efficient learning of the parameter values is possible using the "contrastive divergence" algorithm (Hinton). This type of model has proven particularly suitable for modeling text data, where we (and others) have shown that classification and retrieval performance is better than for directed counterparts such as LDA (Blei, Ng, Jordan). One particularly interesting feature that follows from the undirected nature of this model is that inferring topic representations for documents is very fast (one bottom-up pass, or equivalently one matrix multiplication), whereas directed models have to deal with issues like "explaining away", which often makes approximate algorithms the only way out.
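The speed of inference is easy to see in a toy harmonium with binary hidden units (the weights below are illustrative numbers, not learned values): because the model is undirected and bipartite, the hidden units are conditionally independent given the document, so one matrix-vector product and a pointwise nonlinearity suffice:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy harmonium: 2 hidden topic units, 3 visible word counts.
# W and b are illustrative, not learned parameters.
W = [[0.5, -0.2, 0.1],
     [-0.3, 0.4, 0.2]]
b = [0.0, -0.1]

def infer_topics(word_counts):
    """Posterior activation of each hidden unit given the document.
    No iterative inference, no explaining-away to resolve: just one
    matrix-vector product followed by a pointwise sigmoid."""
    return [sigmoid(sum(W[j][i] * word_counts[i]
                        for i in range(len(word_counts))) + b[j])
            for j in range(len(W))]

print(infer_topics([3, 0, 1]))  # two activations in (0, 1)
```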

Affiliated people:

Peter Gehler - PhD candidate, Max Planck Institute in Tuebingen

Alex Holub - PhD candidate, Caltech

Geoffrey Hinton - professor, University of Toronto

Michal Rosen-Zvi - former postdoc, UCI

Representative publications:

Exponential Family Harmoniums with an Application to Information Retrieval

(Welling, Rosen-Zvi, Hinton - NIPS 04)

The Rate Adapting Poisson Model for Information Retrieval and Object Recognition

(Gehler, Holub, Welling - ICML 06)

*Image denoising*

One successful approach to denoising is to:

1) transform the image into the wavelet domain using a wavelet pyramid,
2) build a realistic probabilistic joint model of the wavelet coefficients which captures their statistical dependencies,
3) use this model to see how the statistics of the observed (noisy) wavelet coefficients differ from what you expect under your model,
4) correct these discrepancies using a denoising algorithm (compute the maximum a posteriori estimate of the clean wavelet coefficients given your noisy estimates, assuming a known noise model),
5) invert the wavelet pyramid.
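A minimal sketch of this pipeline on a 1-D signal, using a single-level Haar transform and simple soft thresholding as a generic stand-in for the model-based MAP correction of step 4 (the signal and threshold are illustrative, not from our experiments):

```python
import math

def haar_forward(x):
    """One level of the orthonormal Haar transform (step 1)."""
    s = 1.0 / math.sqrt(2.0)
    approx = [s * (x[2*i] + x[2*i + 1]) for i in range(len(x) // 2)]
    detail = [s * (x[2*i] - x[2*i + 1]) for i in range(len(x) // 2)]
    return approx, detail

def haar_inverse(approx, detail):
    """Invert the transform (step 5)."""
    s = 1.0 / math.sqrt(2.0)
    x = []
    for a, d in zip(approx, detail):
        x += [s * (a + d), s * (a - d)]
    return x

def soft(c, t):
    """Soft threshold: shrink coefficients toward zero, a generic
    stand-in for the MAP correction under a heavy-tailed prior."""
    return math.copysign(max(abs(c) - t, 0.0), c)

noisy = [1.0, 1.1, 0.9, 1.0, 5.0, 5.1, 1.0, 0.9]
approx, detail = haar_forward(noisy)                # step 1
clean_detail = [soft(d, 0.2) for d in detail]       # steps 3-4
clean = haar_inverse(approx, clean_detail)          # step 5
print([round(v, 2) for v in clean])
```

The model-based denoiser described below replaces the fixed threshold with a correction derived from the learned joint distribution of the coefficients.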

In this project we have proposed a new model to describe the statistical regularities of wavelet coefficients, based on the "neural network" shown below. Coefficients get mapped to an energy, which serves as the (unnormalized) negative log-probability. As explained in our paper, this model accurately describes both the heavy tails of the marginal distributions and the "bowtie" dependencies between wavelet coefficients.

As an additional feature, our denoiser automatically picks the wavelet coefficients that should be modeled jointly by one expert. We have shown that our results are competitive with the current state-of-the-art.

Affiliated people:

Peter Gehler - PhD candidate, Max Planck Institute in Tuebingen

Representative publications:

Products of Edge-Perts

(Gehler, Welling - NIPS 05)

Learning Sparse Topographic Representations with Products of Student-T Distributions

(Welling, Hinton, Osindero - NIPS 02)

*Variational inference in non-parametric Bayesian models*

Nonparametric Bayesian models are excellent candidates to infer model structure, e.g. the number of clusters in a clustering problem. In particular, we study models based on the Dirichlet process (DP), such as DP mixture models for clustering, hierarchical DP models for text analysis and nested DP models for taxonomy inference. Their most important disadvantage is that the MCMC-based inference algorithms are often not efficient enough to handle large datasets, which may contain on the order of millions of data cases.
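Variational methods typically work with the (truncated) stick-breaking representation of the DP, in which mixture weights are formed by repeatedly breaking off Beta-distributed fractions of a unit stick. A minimal sketch of this generic construction (not our inference algorithm):

```python
import random

def stick_breaking(alpha, truncation, seed=0):
    """Truncated stick-breaking construction of DP mixture weights:
    pi_k = v_k * prod_{j<k} (1 - v_j)  with  v_k ~ Beta(1, alpha).
    Variational algorithms place explicit distributions over the v_k."""
    random.seed(seed)
    remaining, weights = 1.0, []
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)
        weights.append(remaining * v)
        remaining *= (1.0 - v)
    return weights

w = stick_breaking(alpha=2.0, truncation=20)
print(sum(w))  # close to 1; the truncation leaves a small remainder
```

Larger alpha spreads mass over more sticks, which is the knob controlling how many clusters the model effectively entertains.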

In this project we develop new variational inference algorithms for these models. In variational inference we maintain explicit distributions over quantities of interest (e.g. the probability that a data case is assigned to cluster k) and update them to make them as accurate as possible. Such an algorithm has no variance (except perhaps through local minima) but trades this off against a hopefully small bias in its estimates (in contrast, MCMC algorithms have no bias but considerable variance).

We currently have an implementation of a DP-based clustering algorithm that can handle millions of data cases. The algorithm starts by assigning the data cases to rectangular boxes. The boxes form a very rough first partitioning of the space in which the data live. The approximation being made is that all data cases inside a box have the same probability of being assigned to each cluster in the mixture. At first, the algorithm also entertains just a few clusters. Next, it is faced with one of the following decisions: either it splits boxes into smaller boxes or it splits clusters into smaller clusters (both are guaranteed to improve the approximation). This process is continued, alternated with updates of the variational distributions, until the algorithm runs out of computer time. The model is set up in such a way that the best model has every data case assigned to a different box and an infinite number of clusters (but note that most clusters will of course have almost no probability of any data case being assigned to them).

Affiliated people:

Kenichi Kurihara - PhD candidate, Tokyo Institute of Technology

Nikos Vlassis - professor, University of Amsterdam

Dave Newman - researcher UCI

Yee Whye Teh - professor Gatsby Unit (UCL)

Representative publications:

Accelerated Variational DP Mixture Models

(Kurihara, Welling, Vlassis - NIPS 06)

A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation

(Teh, Newman, Welling - NIPS 06)

*Bayesian Inference and structure learning in undirected graphical models*

Learning the structure of a graphical model is very important for building good models. While much progress has been made in inferring structure for directed graphical models (i.e. Bayesian networks), there has been very little progress for undirected graphical models (MRFs). The reason is the presence of the normalization constant, which depends on the parameters and is intractable to compute.

We propose to combine two approximations. First, we use the assumption that the posterior is close to a Gaussian distribution. This makes a lot of sense for fully observed MRFs because the likelihood function is known to be convex in the parameters. The second approximation is based on an algorithm called "linear response propagation" to compute the covariance of this Gaussian; this approximation is in turn based on belief propagation. If we finally approximate the log-partition function with the Bethe free energy (again belief propagation) and find good estimates of the "maximum a posteriori" (MAP) parameter values, we are all set to go.
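The Laplace idea itself can be sketched on a toy one-dimensional posterior: find the mode, then fit a Gaussian whose variance is the inverse curvature there. The objective below is purely illustrative; in our method the intractable log-partition term inside the log-posterior would be replaced by the Bethe free energy and the curvature by a linear-response estimate:

```python
import math

def neg_log_post(t):
    """Toy negative log-posterior (illustrative, not from the paper)."""
    return 0.5 * (t - 1.0) ** 2 + 0.1 * t ** 4

def map_estimate(f, t=0.0, lr=0.1, steps=2000, h=1e-5):
    """Find the mode by gradient descent with numerical gradients."""
    for _ in range(steps):
        grad = (f(t + h) - f(t - h)) / (2 * h)
        t -= lr * grad
    return t

t_map = map_estimate(neg_log_post)
# Laplace approximation: variance = inverse curvature at the mode.
h = 1e-4
curv = (neg_log_post(t_map + h) - 2 * neg_log_post(t_map)
        + neg_log_post(t_map - h)) / h**2
variance = 1.0 / curv
print(t_map, variance)
```

As the prose above notes, the quality of the whole approximation hinges on getting `t_map` (the MAP estimate) right: the Gaussian is anchored there.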

Our main finding is that this approximation is indeed very accurate for fully observed MRFs and, moreover, orders of magnitude faster than sampling-based methods. The bottleneck turns out to be obtaining a good estimate of the MAP parameter values; performance crucially depends on this estimate. Next, we will look into models with hidden variables.

Affiliated people:

Sridevi Parise - PhD candidate, UCI

Kenichi Kurihara - PhD candidate, Tokyo Institute of Technology

Representative publications:

Bayesian Random Fields - The Bethe-Laplace Approximation

(Welling, Parise - UAI 06)

Structure Learning in Markov Random Fields

(Parise, Welling - NIPS 06)