## How many clusters?

February 20, 2015

Sometimes people think that a Dirichlet Process (DP) can be used to pick the “right number of clusters”.  The following plots done by Monash Matlab whiz Mark Carman show that this has to be done very carefully.

Given $N$ samples, Mark’s first plot shows the expected number of clusters, $M$, that one would get with a DP using concentration parameter $\alpha$.  The thing to notice is that the number of clusters is moderately well determined by the concentration parameter.  In fact the mean (expected value) of $M$ is given by:

$\alpha \left( \psi(\alpha+N) - \psi(\alpha)\right)$

where $\psi(\cdot)$ is the digamma function; for details see the ArXiv report by Marcus Hutter and I, Section 5.3.  Morever, the standard deviation of $M$ is approximately the square root of the mean (for larger concentrations).

So fixing a given concentration parameter means roughly fixing the number of clusters, which increases as the sample size grows.  So with a fixed concentration parameter you cannot really “estimate” the right number of clustersroughly you are fixing the value ahead of time when setting the concentration.

Mark’s second plot shows what we do to overcome this.  We have to estimate the concentration parameter as well.  So if we put a prior distribution, an Exponential with parameter $\epsilon$, on the concentration, we now smear out the previous plots.  So now we show plots for different values of $\epsilon$.  As you can see, these plots have a much higher variance, which is what you want.  With a given $\epsilon$, you are still determining the broad range of the number of clusters, but you have a lot more latitude.

In implementation, this means we estimate the concentration $\alpha$ (usually) by sampling it.  If we use Chinese restaurant processes, there is a simple auxiliary variable sampling formula for the concentration (presented in the original HDP paper by Teh et al.).  If we use our blocked table indicator sampling, the posterior on the concentration is log concave so we can use either slice sampling or adaptive rejection sampling (ARS). The implementation is moderately simple, and it works well.  However, it does mean now your algorithm will expend more time as it has to try and find the right concentration as well.

## Wray on the history of the hierarchical Pitman-Yor process

November 15, 2014

Following on from Lancelot James’ post on the Pitman-Yor process (PYP) I thought I’d follow up with key events from the machine learning perspective, including for the Dirichlet process (DP).  My focus is the hierarchical versions, the HPYP and the HDP.

The DP started to appear in the machine learning community to allow “infinite” mixture models in the early 2000’s.  Invited talks by Michael Jordan at ACL 2005 and ICML 2005 covers the main use here.  This, however, follows similar use in the statistical community.  The problem with this is that there are many ways to set up an infinite mixture model or an approximation to one, although the DP is certainly elegant here.  So this was no real innovation for statistical machine learning.

The real innovation was the introduction of the hierarchical DP (HDP) with the JASA paper by Teh, Jordan, Beal and Blei appearing in 2006, and a related hierarchical Pitman-Yor language model at ACL 2006 by Yee Whye Teh.   There are many tutorials on this, but an early tutorial by Teh covers the range from DP to HDP, “A Tutorial on Dirichlet Processes and Hierarchical Dirichlet Processes” and starts discussing the HDP about slide 37.  It is the “Chinese restaurant franchise,” a hierarchical set of Chinese restaurants, that describes the idea.  Note, however, we never use these interpretations in practice because they require dynamic memory.

These early papers went beyond the notion of “infinite mixture model” to the notion of an “hierarchical prior”.  I cover this in my tutorial slides discussing n-grams.  I consider the hierarchical Pitman-Yor language model to be one of the most important developments in statistical machine learning and a stroke of genius.  These were mostly based on algorithms using the Chinese restaurant franchise, based on the Chinese restaurant process (CRP) which is a distribution got by marginalising the probability vectors out of the DP/PYP.

Following this was a period of enrichment as various extensions of CRPs were considered, culminating in the Graphical Pitman-Yor Process developed by Wood and Teh, 2009.  CRPs continue to be used extensively in various areas of machine learning, particularly in topic models and natural language processing.  The standard Gibbs algorithms, however, are fairly slow and require dynamic memory, so do not scale well.

Variational algorithms were developed for the DP, and consequently the PYP, using the stick-breaking representation presented by Ishwaran and James in 2001, see Lancelot James’ comments.  The first of these was by Blei and Jordan 2004, and subsequently variations for many different problems were developed.

Perhaps the most innovative use of the hierarchical Pitman-Yor process is the Adaptor Grammar by Johnson, Griffiths and Goldwater in 2006.  Todate, these have been under-utilised, most likely because of the difficulty of implementation. However, I would strongly recommend any student of the HPYPs to study this work carefully because it shows both a deep understanding of HPYPs and of grammars themselves.

Our work here started appearing in 2011, well after the main innovations.  Our contribution is to develop efficient block Gibbs samplers for the HYP and the more general GPYP that require no dynamic memory.  These are not only significantly faster, they also result in significantly superior estimates.  For our non-parametric topic model, we even did a simple multi-core implementation.

• An introduction to our methods are in Lan Du‘s thesis chapters 1 and 2.  We’re still working on a good publication but the general background theory appears in our technical report “A Bayesian View of the Poisson-Dirichlet Process,” by Buntine and Hutter.
• Our samplers are a collapsed version of the corresponding CRP samplers requiring no dynamic memory so are more efficient, require less memory, and usually converge to better estimates.
• An algorithm for the GPYP case appears in Lan Du‘s PhD thesis.
• An algorithm for splitting nodes in a HPYP is in our NAACL 2013 paper (see Du, Buntine and Johnson).

## Issues about Dirichlet Process inconsistencies

January 20, 2014

I saw the NIPS 2013 paper by Miller and Harrison on “A simple example of Dirichlet process mixture inconsistency for the number of components,” and I had some issues with it.  A Dirichlet Process is a prior that says there is an infinite number of clusters in the mixture.  But at any one time, after seeing N data and a concentration parameter of θ, it expects to see about λ = θ log(N/θ) clusters plus or minus 3*sqrt(λ) or so … for N>θ>>0.  This approximation gives the famous “grows with log(N)” formula some tutorials give for DPs.   Anyway, so I cannot really see why this makes the DP inconsistent if the true model has a finite number of clusters, which is not in the prior!  It just means the DP is true to itself.  So this apparent inconsistency does not affect a Bayesian.

This seems to be a basic confusion with the Dirichlet Process generally.  Some people think it can be used to estimate the “right number of clusters”.  Well, be careful.  I can change θ and get it to estimate a large or a small number of clusters!  We do the same with the number of topics in a non-parametric topic model.