A Fast And Scalable Topic-Modeling Toolbox
Research software by Arthur Asuncion and colleagues at the University of California, Irvine.
Probabilistic topic models such as Latent Dirichlet Allocation are popular in machine learning since they can be effectively used to analyze many different types of data, such as text corpora, image databases, biological data, social networks, and collaborative filtering data. While various topic modeling software packages are already available online (such as Mark Steyvers' Matlab toolbox), this collection of software focuses on efficient and scalable inference algorithms for topic models. Improving the computational aspects of topic modeling is of growing importance due to the emergence of massive data sets and applications which require fast semantic analysis.
This toolbox includes parallel/distributed algorithms, statistical acceleration techniques, and efficient inference methods for more specialized models. Each is described below.
Note that these methods can be combined to achieve multiplicative speedups. For instance, one can combine parallel computing with fast collapsed variational inference to learn a topic model in a few seconds on corpora with thousands of documents (see, e.g., the UAI 2009 paper listed below).
Many data sets of interest are too large to be analyzed on a single computer; thus, it is of great interest to use parallel and distributed computing to learn topic models. Our approximate distributed Gibbs sampling algorithms have been shown to achieve solutions as accurate as the sequential samplers, while significantly decreasing the time and memory requirements. Here is Matlab simulation code for distributed topic modeling, as well as Message Passing Interface (MPI) code which can be run on a cluster.
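To make the idea concrete, here is a minimal single-machine sketch of approximate distributed Gibbs sampling in the spirit of the papers listed below. It is not the toolbox's Matlab or MPI code: the function names (`gibbs_sweep`, `ad_lda`), the round-robin document sharding, and the default hyperparameters are all assumptions made for illustration, and the "processors" are simulated one after another. Each simulated processor sweeps its own documents against a private copy of the word-topic counts, and a synchronization step then adds every processor's changes back into the global counts.

```python
import random

def gibbs_sweep(docs, z, n_dk, n_wk, n_k, alpha, beta, V, rng):
    """One collapsed Gibbs sweep over `docs`, updating all counts in place."""
    K = len(n_k)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # Remove the current token from the counts.
            n_dk[d][k] -= 1; n_wk[w][k] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][t] + alpha) * (n_wk[w][t] + beta)
                       / (n_k[t] + V * beta) for t in range(K)]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_wk[w][k] += 1; n_k[k] += 1

def ad_lda(docs, V, K, P, iters, alpha=0.1, beta=0.01, seed=0):
    """Simulate P processors; docs are lists of word ids in range(V)."""
    rng = random.Random(seed)
    shards = [docs[p::P] for p in range(P)]  # hypothetical round-robin split
    z = [[[rng.randrange(K) for _ in doc] for doc in shard] for shard in shards]
    n_dk = [[[0] * K for _ in shard] for shard in shards]
    n_wk = [[0] * K for _ in range(V)]       # global word-topic counts
    for p, shard in enumerate(shards):
        for d, doc in enumerate(shard):
            for i, w in enumerate(doc):
                n_dk[p][d][z[p][d][i]] += 1
                n_wk[w][z[p][d][i]] += 1
    for _ in range(iters):
        local_copies = []
        for p, shard in enumerate(shards):
            # Each processor samples against its own copy of the global counts.
            loc_wk = [row[:] for row in n_wk]
            loc_k = [sum(row[k] for row in loc_wk) for k in range(K)]
            gibbs_sweep(shard, z[p], n_dk[p], loc_wk, loc_k, alpha, beta, V, rng)
            local_copies.append(loc_wk)
        # Synchronization: add each processor's change to the global counts.
        n_wk = [[n_wk[w][k] + sum(loc[w][k] - n_wk[w][k] for loc in local_copies)
                 for k in range(K)] for w in range(V)]
    return n_wk, shards, z
```

The approximation is that, within one iteration, each processor ignores the changes the others are making. After the synchronization step the global counts again agree exactly with the topic assignments.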
References:
D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Algorithms for Topic Models." JMLR 2009.
D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Inference for Latent Dirichlet Allocation." NIPS 2007.
In some cases, a global synchronization step across all processors is not feasible. In these situations, it is possible to learn topic models in a fully asynchronous distributed environment, where processors in a network communicate in pairwise fashion. This approach also converges to high-quality solutions. Here is Matlab code for asynchronous LDA.
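The pairwise exchange can be sketched as follows. This is a simplified, sequentially simulated illustration, not the Matlab code linked above: the names `sweep` and `async_lda`, the random choice of which processor runs next, and the assumption that every processor has enough memory to cache the last counts seen from each peer are all ours. Each processor samples against its own counts plus its cached view of the others, then swaps fresh counts with one randomly chosen peer.

```python
import random

def sweep(docs, z, n_dk, n_wk, alpha, beta, V, rng):
    """One collapsed Gibbs sweep over `docs` against the counts in n_wk."""
    K = len(n_wk[0])
    n_k = [sum(row[k] for row in n_wk) for k in range(K)]
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            n_dk[d][k] -= 1; n_wk[w][k] -= 1; n_k[k] -= 1
            weights = [(n_dk[d][t] + alpha) * (n_wk[w][t] + beta)
                       / (n_k[t] + V * beta) for t in range(K)]
            k = rng.choices(range(K), weights=weights)[0]
            z[d][i] = k
            n_dk[d][k] += 1; n_wk[w][k] += 1; n_k[k] += 1

def async_lda(docs, V, K, P, rounds, alpha=0.1, beta=0.01, seed=0):
    """Simulate P >= 2 processors that only ever communicate in pairs."""
    rng = random.Random(seed)
    shards = [docs[p::P] for p in range(P)]
    z = [[[rng.randrange(K) for _ in doc] for doc in shard] for shard in shards]
    n_dk = [[[0] * K for _ in shard] for shard in shards]
    own = [[[0] * K for _ in range(V)] for _ in range(P)]  # counts from own tokens
    for p, shard in enumerate(shards):
        for d, doc in enumerate(shard):
            for i, w in enumerate(doc):
                n_dk[p][d][z[p][d][i]] += 1
                own[p][w][z[p][d][i]] += 1
    cache = [dict() for _ in range(P)]  # cache[p][g]: p's last-seen copy of own[g]
    for _ in range(rounds):
        p = rng.randrange(P)            # stand-in for "whichever processor is ready"
        others = [[sum(c[w][k] for c in cache[p].values()) for k in range(K)]
                  for w in range(V)]
        eff = [[own[p][w][k] + others[w][k] for k in range(K)] for w in range(V)]
        sweep(shards[p], z[p], n_dk[p], eff, alpha, beta, V, rng)
        own[p] = [[eff[w][k] - others[w][k] for k in range(K)] for w in range(V)]
        # Pairwise exchange: replace the stale cached copies with fresh ones.
        g = rng.choice([q for q in range(P) if q != p])
        cache[p][g] = [row[:] for row in own[g]]
        cache[g][p] = [row[:] for row in own[p]]
    return own, cache, shards, z
```

No processor ever sees exact global counts here; each one works with a possibly stale estimate that improves as it meets more peers.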
References:
A. Asuncion, P. Smyth, M. Welling. "Asynchronous Distributed Algorithms of Topic Models for Document Analysis." Statistical Methodology 2010.
A. Asuncion, P. Smyth, M. Welling. "Asynchronous Distributed Learning of Topic Models." NIPS 2008.
Each step of collapsed Gibbs sampling for LDA can be accelerated if one can cheaply estimate the normalization constant of the conditional distribution, since one can then skip many steps of computing the probabilities in a discrete distribution. This approach is known as Fast-LDA -- the details are in the KDD paper listed below. Here is a tarball containing the Fast-LDA code.
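The following sketch illustrates the general idea of sampling with a shrinking upper bound on the normalization constant. It is a simplified illustration rather than the Fast-LDA code in the tarball: the function name `fast_sample` is hypothetical, the unnormalized probability of topic k is assumed to factor as `a[k] * b[k]` with positive entries, and the remaining mass is bounded with the Cauchy-Schwarz inequality (a two-factor special case of a Hölder-type bound). The draw is still exact, because the bound only determines how early the loop can stop.

```python
import math

def fast_sample(a, b, u):
    """Return k with probability a[k]*b[k]/Z, given u drawn from Uniform[0, 1)."""
    K = len(a)
    rest_a = sum(x * x for x in a)  # squared norm of the unvisited part of a
    rest_b = sum(x * x for x in b)  # squared norm of the unvisited part of b
    cum = []                        # cum[t] = p_0 + ... + p_t
    s_prev, z_prev = 0.0, math.inf
    for k in range(K):
        s = s_prev + a[k] * b[k]
        cum.append(s)
        rest_a = max(rest_a - a[k] * a[k], 0.0)
        rest_b = max(rest_b - b[k] * b[k], 0.0)
        # Upper bound on Z: exact mass seen so far plus a bound on the rest.
        z = s if k == K - 1 else min(s + math.sqrt(rest_a * rest_b), z_prev)
        if u * z <= s:              # u is already covered: stop early
            if k == 0 or u * z > s_prev:
                return k            # u landed in the new topic's segment
            # Otherwise u landed in mass handed back to earlier topics
            # when the bound tightened from z_prev to z.
            target = (u * z_prev - s_prev) * z / (z_prev - z)
            for t in range(k):
                if cum[t] >= target:
                    return t
            return k - 1            # guard against floating-point round-off
        s_prev, z_prev = s, z
    return K - 1
```

In an LDA sampler one might take `a[k]` as the document-topic count plus its prior and `b[k]` as the smoothed word-topic ratio, and visit topics in decreasing order of `a[k]`, so that most draws terminate after the first few topics.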
References:
I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling. "Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation." KDD 2008.
Over the past decade, a variety of topic modeling inference algorithms have been developed, including collapsed Gibbs sampling, variational inference, collapsed variational inference, and MAP estimation. Interestingly, for LDA, the empirical differences between these techniques can be minimized if one optimizes the hyperparameters, suggesting that one is free to pick the most efficient algorithm. We investigated a fast collapsed variational Bayesian inference algorithm (named CVB0) that can learn topic models on moderately-sized corpora in near real-time. The C/MPI code for CVB0 is here.
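As an illustration, a CVB0-style update can be sketched in a few lines. This is a minimal pure-Python sketch under our own assumptions (the function name `cvb0`, random initialization, one variational distribution per token, fixed hyperparameters), not the C/MPI code linked above. Each token keeps a distribution `gamma` over topics, and the update has the same form as the collapsed Gibbs conditional, with expected counts in place of sampled counts.

```python
import random

def cvb0(docs, V, K, iters, alpha=0.1, beta=0.01, seed=0):
    """docs: lists of word ids in range(V). Returns gamma and expected counts."""
    rng = random.Random(seed)
    gamma = []
    for doc in docs:
        g_doc = []
        for _ in doc:
            r = [rng.random() + 1e-3 for _ in range(K)]
            total = sum(r)
            g_doc.append([x / total for x in r])
        gamma.append(g_doc)
    # Expected counts are sums of the per-token distributions.
    n_dk = [[0.0] * K for _ in docs]
    n_wk = [[0.0] * K for _ in range(V)]
    n_k = [0.0] * K
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            for k in range(K):
                g = gamma[d][i][k]
                n_dk[d][k] += g; n_wk[w][k] += g; n_k[k] += g
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = gamma[d][i]
                # Same form as the Gibbs conditional, with this token's own
                # contribution removed from each expected count.
                new = [(n_wk[w][k] - old[k] + beta)
                       / (n_k[k] - old[k] + V * beta)
                       * (n_dk[d][k] - old[k] + alpha) for k in range(K)]
                total = sum(new)
                new = [x / total for x in new]
                for k in range(K):
                    delta = new[k] - old[k]
                    n_dk[d][k] += delta; n_wk[w][k] += delta; n_k[k] += delta
                gamma[d][i] = new
    return gamma, n_dk, n_wk
```

Because the update is deterministic and involves no sampling, it tends to need fewer passes over the data than a Gibbs sampler. Tokens that share a document and word type have identical updates and can be stored once, which this sketch does not do.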
References:
A. Asuncion, M. Welling, P. Smyth, Y.W. Teh. "On Smoothing and Inference for Topic Models." UAI 2009.
One interesting extension of LDA that is relevant to social network analysis is the Relational Topic Model (RTM), which assumes that count data resides on each node of a network, and that the probability of a link between two nodes is a function of the topical similarity between the nodes. Here is a collapsed Gibbs sampling Matlab implementation of RTM (this little sampling note may also be useful).
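To illustrate what "a function of the topical similarity" can look like, here is a small sketch of one link function, the logistic form described in the Chang and Blei paper listed below. The function names and the parameters `eta` (per-topic weights) and `nu` (offset) are assumptions for illustration, and the Matlab implementation above may use a different form. Both documents are assumed to be non-empty.

```python
import math

def topic_proportions(z_doc, K):
    """Empirical topic proportions of one document from its topic assignments."""
    counts = [0] * K
    for k in z_doc:
        counts[k] += 1
    return [c / len(z_doc) for c in counts]

def link_probability(z_d, z_e, eta, nu):
    """Sigmoid of a weighted elementwise product of the two topic proportions."""
    K = len(eta)
    zb_d = topic_proportions(z_d, K)
    zb_e = topic_proportions(z_e, K)
    score = sum(eta[k] * zb_d[k] * zb_e[k] for k in range(K)) + nu
    return 1.0 / (1.0 + math.exp(-score))

# Two documents that use the same topic are more likely to be linked
# than two documents that share no topics (for positive eta).
same = link_probability([0, 0, 0], [0, 0], eta=[2.0, 2.0], nu=-1.0)
disjoint = link_probability([0, 0, 0], [1, 1], eta=[2.0, 2.0], nu=-1.0)
```

In a collapsed Gibbs sampler for such a model, a term of this kind would multiply the usual LDA conditional for each token of a document that has observed links.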
References:
J. Chang, D. Blei. "Relational Topic Models for Document Networks." AISTATS 2009.
Maintained by Arthur Asuncion. By using this software, you signify that we are not liable for anything.