A Fast And Scalable Topic-Modeling Toolbox

Research software by Arthur Asuncion and colleagues at the University of California, Irvine.


Probabilistic topic models such as Latent Dirichlet Allocation (LDA) are popular in machine learning because they can be applied effectively to many different types of data, including text corpora, image databases, biological data, social networks, and collaborative filtering data. While various topic-modeling software packages are already available online (such as Mark Steyvers' Matlab toolbox), this collection of software focuses on efficient and scalable inference algorithms for topic models. Improving the computational aspects of topic modeling is of growing importance due to the emergence of massive data sets and applications that require fast semantic analysis.

This toolbox includes parallel/distributed algorithms, statistical acceleration techniques, and efficient inference methods for more specialized models, each described in its own section below.

Note that these methods can be combined to achieve multiplicative speedups. For instance, one can combine parallel computing with fast collapsed variational inference to learn a topic model in a few seconds on corpora with thousands of documents (e.g., see our UAI paper).


Distributed collapsed Gibbs sampling for LDA:

Many data sets of interest are too large to be analyzed on a single computer, which makes parallel and distributed computing essential for learning topic models at scale. Our approximate distributed Gibbs sampling algorithms have been shown to reach solutions as accurate as those of sequential samplers while significantly reducing the time and memory requirements. Here is Matlab simulation code for distributed topic modeling, as well as Message Passing Interface (MPI) code that can be run on a cluster.
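
As a rough illustration, below is a minimal Matlab sketch of one iteration of an approximate distributed scheme in this spirit; the function and variable names are illustrative and do not match the released code's interface. Each processor Gibbs-samples its own documents against a stale copy of the global word-topic counts, and the per-processor count changes are then merged in a single synchronization step.

    function [docs, Nwk] = adlda_sweep(docs, Nwk, alpha, beta)
      % docs: cell array (one entry per processor) of structs with fields
      %       w (word ids), d (doc ids), z (topics), Ndk (doc-topic counts)
      % Nwk:  W x K global word-topic counts from the last synchronization
      W = size(Nwk, 1);
      Nwk_old = Nwk;
      delta = zeros(size(Nwk));
      for p = 1:numel(docs)              % these sweeps run in parallel in practice
        Nwk_p = Nwk_old;                 % stale local copy of the global counts
        Nk = sum(Nwk_p, 1);              % 1 x K topic totals
        loc = docs{p};
        for i = 1:numel(loc.w)
          w = loc.w(i); d = loc.d(i); k = loc.z(i);
          Nwk_p(w,k) = Nwk_p(w,k) - 1;  loc.Ndk(d,k) = loc.Ndk(d,k) - 1;
          Nk(k) = Nk(k) - 1;
          pr = (Nwk_p(w,:) + beta) .* (loc.Ndk(d,:) + alpha) ./ (Nk + W*beta);
          k = find(cumsum(pr) > rand * sum(pr), 1);   % draw a new topic
          Nwk_p(w,k) = Nwk_p(w,k) + 1;  loc.Ndk(d,k) = loc.Ndk(d,k) + 1;
          Nk(k) = Nk(k) + 1;
          loc.z(i) = k;
        end
        delta = delta + (Nwk_p - Nwk_old);   % net count change on processor p
        docs{p} = loc;
      end
      Nwk = Nwk_old + delta;             % global synchronization (reduce step)
    end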

References:
D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Inference for Latent Dirichlet Allocation." NIPS, 2007.
D. Newman, A. Asuncion, P. Smyth, M. Welling. "Distributed Algorithms for Topic Models." Journal of Machine Learning Research, 2009.


Asynchronous distributed collapsed Gibbs sampling for LDA:

In some cases, a global synchronization step across all processors is not feasible. In these situations, it is still possible to learn topic models in a fully asynchronous distributed environment, where processors in a network communicate with one another in a pairwise fashion. This approach also converges to high-quality solutions. Here is Matlab code for asynchronous LDA.
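
To make the pairwise communication concrete, here is a hedged Matlab sketch of a single gossip step, a simplification of the cache-based scheme described in the paper below; all names are illustrative. Each processor p keeps its local word-topic counts Nwk{p}, a running estimate Nother{p} of the counts held by the rest of the network, and a cache cache{p}{q} (initialized to zeros) of the counts last received from partner q.

    function [Nother, cache] = gossip_step(Nwk, Nother, cache, p, q)
      % Swap counts: each processor replaces its stale copy of the
      % partner's counts with a fresh one, so that repeated meetings
      % between the same pair are not double-counted.
      Nother{p} = Nother{p} - cache{p}{q} + Nwk{q};
      cache{p}{q} = Nwk{q};
      Nother{q} = Nother{q} - cache{q}{p} + Nwk{p};
      cache{q}{p} = Nwk{p};
      % Between meetings, processor p Gibbs-samples its own documents
      % against the combined counts Nwk{p} + Nother{p}.
    end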

References:
A. Asuncion, P. Smyth, M. Welling. "Asynchronous Distributed Learning of Topic Models." NIPS, 2008.


Fast collapsed Gibbs sampling for LDA:

Each step of collapsed Gibbs sampling for LDA can be accelerated if one can cheaply estimate the normalization constant of the conditional distribution, since the sampler can then avoid computing most of the topic probabilities for each token. This approach is known as Fast-LDA; the details are in the KDD paper listed below. Here is a tarball containing the Fast-LDA code.
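
For reference, here is the standard per-token update that Fast-LDA accelerates, as a minimal Matlab sketch with illustrative names (the counts are assumed to already exclude the token being resampled). Plain collapsed Gibbs sampling computes all K unnormalized topic probabilities before drawing a sample; Fast-LDA instead visits topics in order of decreasing probability mass and terminates the scan early, once a cheap upper bound on the remaining normalizer rules out the unvisited topics.

    function k = sample_token(w, d, Nwk, Ndk, Nk, alpha, beta, W)
      % Unnormalized conditional p(z = k | everything else) for all K topics
      pr = (Nwk(w,:) + beta) .* (Ndk(d,:) + alpha) ./ (Nk + W*beta);
      u = rand * sum(pr);            % sum(pr) is the normalizer Z that
      k = find(cumsum(pr) > u, 1);   % Fast-LDA upper-bounds to stop early
    end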

References:
I. Porteous, D. Newman, A. Ihler, A. Asuncion, P. Smyth, M. Welling. "Fast Collapsed Gibbs Sampling for Latent Dirichlet Allocation." KDD, 2008.


Fast collapsed variational inference for LDA:

Over the past decade, a variety of topic-model inference algorithms have been developed, including collapsed Gibbs sampling, variational inference, collapsed variational inference, and MAP estimation. Interestingly, for LDA, the empirical differences between these techniques can be minimized by optimizing the hyperparameters, which suggests that one is free to pick the most efficient algorithm. We investigated a fast collapsed variational Bayesian inference algorithm (named CVB0) that can learn topic models on moderately sized corpora in near real time. The C/MPI code for CVB0 is here.
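
Here is a minimal Matlab sketch of the CVB0 update for a single token, with illustrative names. Each token carries a variational distribution over topics (a 1 x K vector gam), and Nwk, Ndk, and Nk hold expected counts, i.e., sums of these vectors rather than hard assignments; the token's own contribution is subtracted before its distribution is refreshed.

    function [gam, Nwk, Ndk, Nk] = cvb0_update(gam, w, d, Nwk, Ndk, Nk, alpha, beta, W)
      % Remove this token's current contribution to the expected counts
      Nwk(w,:) = Nwk(w,:) - gam;  Ndk(d,:) = Ndk(d,:) - gam;  Nk = Nk - gam;
      % CVB0 update: the same form as the collapsed Gibbs conditional,
      % applied deterministically to expected counts
      gam = (Nwk(w,:) + beta) ./ (Nk + W*beta) .* (Ndk(d,:) + alpha);
      gam = gam / sum(gam);
      % Add the refreshed contribution back
      Nwk(w,:) = Nwk(w,:) + gam;  Ndk(d,:) = Ndk(d,:) + gam;  Nk = Nk + gam;
    end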

References:
A. Asuncion, M. Welling, P. Smyth, Y.W. Teh. "On Smoothing and Inference for Topic Models." UAI, 2009.


Efficient collapsed Gibbs sampling for the Relational Topic Model:

One interesting extension of LDA that is relevant to social network analysis is the Relational Topic Model (RTM), which assumes that count data resides on each node of a network and that the probability of a link between two nodes is a function of the topical similarity between those nodes. Here is a collapsed Gibbs sampling Matlab implementation of RTM (this short note on sampling may also be useful).
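
As a hedged illustration of the link model, here is a Matlab sketch of an RTM-style link probability under a logistic link function; eta (a 1 x K vector) and nu are learned regression parameters, and all names are illustrative rather than the released code's interface. The probability of an edge between two documents is driven by the element-wise product of their empirical topic proportions.

    function p = rtm_link_prob(Ndk_d, Ndk_dp, eta, nu)
      % Ndk_d, Ndk_dp: 1 x K topic-assignment counts for the two documents
      zbar_d  = Ndk_d  / sum(Ndk_d);    % empirical topic mixture of doc d
      zbar_dp = Ndk_dp / sum(Ndk_dp);   % empirical topic mixture of doc d'
      p = 1 / (1 + exp(-(eta * (zbar_d .* zbar_dp)' + nu)));
    end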

References:
J. Chang, D. Blei. "Relational Topic Models for Document Networks." AISTATS, 2009.


Maintained by Arthur Asuncion. This software is provided as-is; by using it, you agree that the authors are not liable for anything.