Efficient Collapsed Gibbs Sampling for Latent Dirichlet Allocation
Han Xiao, Technical University of Munich, Germany
artex.xh@gmail.com
Last update: Oct. 1, 2010
Download

What is ECGS?
Efficient Collapsed Gibbs Sampling (ECGS) is a C++ implementation for parameter estimation of the latent Dirichlet allocation (LDA) model. ECGS optimizes the original collapsed Gibbs sampler from two perspectives. First, an auxiliary multinomial is incorporated into the estimation process to avoid redundant sampling (see our paper for details). Second, several tricks and heuristics are integrated into the implementation, including binary search, a more compact data structure, and a more efficient random number generator. On three popular bag-of-words data sets (KOS, NIPS, NYTimes), ECGS demonstrates a 4-6x speedup over GibbsLDA and a 2-3x speedup over FastLDA, while retaining all the optimality guarantees of standard Gibbs sampling.
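To make the binary-search trick concrete, below is a minimal, self-contained C++ sketch (illustration only, not the ECGS code itself) of the per-token sampling step in collapsed Gibbs sampling: accumulate the unnormalized per-topic weights, draw a uniform variate, and locate the sampled topic by binary search instead of a linear scan. The function name and the way the weights are passed in are assumptions made for this example.

    // A minimal sketch of the binary-search sampling step used in collapsed
    // Gibbs sampling (illustration only, not the ECGS implementation).
    // "weights" holds the unnormalized probabilities p(z = k | rest) of the
    // current token for each topic k; the function returns the sampled topic.
    #include <algorithm>
    #include <numeric>
    #include <random>
    #include <vector>

    int sample_topic(const std::vector<double>& weights, std::mt19937& rng) {
        // Cumulative (unnormalized) weights: cumulative[k] = weights[0] + ... + weights[k].
        std::vector<double> cumulative(weights.size());
        std::partial_sum(weights.begin(), weights.end(), cumulative.begin());

        // Draw u uniformly in [0, total mass).
        std::uniform_real_distribution<double> unif(0.0, cumulative.back());
        double u = unif(rng);

        // Binary search for the first cumulative weight exceeding u,
        // instead of the linear scan used in a naive sampler.
        return static_cast<int>(std::upper_bound(cumulative.begin(),
                                                 cumulative.end(), u)
                                - cumulative.begin());
    }

The std::mt19937 generator used here is a Mersenne Twister, the same family of generator mentioned in the update log below.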

How to use it?
First, download the package and extract the code to any directory on your disk.
1. make clean
2. make all
3. ecgs -alpha <double> -beta <double> -gamma <double> -ntopics <int> -niters <int> -savestep <int> -twords <int> -dfile <string>

Try it out by downloading test.txt and typing:
./ecgs -alpha 0.1 -beta 0.1 -gamma 1 -ntopics 40 -niters 2000 -savestep 500 -twords 20 -dfile test.txt

Parameter    Type      Description
alpha        double    Hyper-parameter of LDA, often set to 50/ntopics.
beta         double    Another hyper-parameter of LDA, often set to 0.1.
gamma        double    Damping factor, often set to 1. A larger value may help retain more optimality; see our paper for details of this parameter.
ntopics      int       The number of topics.
niters       int       The number of Gibbs sampling iterations.
savestep     int       The number of iterations between saves of the LDA model to disk.
twords       int       The number of most likely words to print for each topic.
dfile        string    The input training data file.

What is the format of input and output file?
Input:
The format of a training file is described as follows:

    [M]
    [document1]
    [document2]
    ...
    [documentM]

in which the first line is the total number of documents [M]. Each line after that is one document. [documenti] is the ith document of the dataset and consists of a list of N tokens:

    [documenti] = [word_i1] [word_i2] ... [word_iN]

in which all [word_ij] (i=1..M, j=1..N) are text strings separated by blank (space) characters.
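For illustration, a tiny hypothetical training file with M = 3 documents could look like this (the words themselves are made up):

    3
    apple banana apple cherry
    banana date
    cherry cherry apple date banana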

Output:
.phi: This file contains the word-topic distributions, i.e., p(w|z). Each line is a topic and each column is a word in the vocabulary.
.theta: This file contains the topic-document distributions, i.e., p(z|d). Each line is a document and each column is a topic.
wordmap.txt: This file contains the mapping between words and their integer IDs.
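As a rough sketch of how the output can be consumed, the following C++ snippet reads a .theta file into memory, assuming the layout described above (one document per line, one whitespace-separated value per topic). The function name and the absence of error handling are illustrative assumptions, not part of ECGS.

    #include <fstream>
    #include <iterator>
    #include <sstream>
    #include <string>
    #include <vector>

    // Load the per-document topic distributions: theta[d][k] = p(topic k | document d).
    std::vector<std::vector<double>> load_theta(const std::string& path) {
        std::vector<std::vector<double>> theta;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            std::istringstream row(line);
            theta.emplace_back(std::istream_iterator<double>(row),
                               std::istream_iterator<double>());
        }
        return theta;
    }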

Update log:
2010/10/01        Fixed the broken test file.
2010/09/28        Fixed bugs when saving the models. Switched to the "Mersenne Twister" random number generator, which is 1-2x faster than the standard C/C++ rand().
2010/07/12        First release.

Citation:
@inproceedings{Xiao_Stibor_2010,
  author = {Han Xiao and Thomas Stibor},
  title = {Efficient Collapsed Gibbs Sampling For Latent Dirichlet Allocation},
  booktitle = {Asian Conference on Machine Learning (ACML)},
  year = {2010},
}