## Some diversions

July 4, 2015

Quora‘s answers on What are good ways to insult a Bayesian statistician?   Excellent.  I’m insulted 😉

Great Dice Data: How Tech Skills Connect.  “Machine learning” is there (case sensitive) under the Data Science cluster in light green.

Data scientist payscales: “machine learning”‘ raises your expected salary but “MS SQL server” lowers it!

Cheat sheetsMachine Learning Cheat Sheet and the Probability Cheat Sheet.   Very handy!

## Dirichlet-multinomial distribution versus Pitman-Yor-multinomial

March 18, 2015

If you’re familiar with the Dirichlet distribution as the workhorse for discrete Bayesian modelling, then you should know about the Dirichlet-multinomial.  This is what happens when you combine a Dirichlet with a multinomial and integrate/marginalise out the common probability vector.

Graphically, you have a probability vector $\vec\theta$ that is the mean for a probability vector $\vec{p}$.  You then have data, a count vector $\vec{n}$, drawn using a multinomial distribution with mean $\vec{p}$. If we integrate out the vector $\vec{p}$ then we get the Dirichlet-multinomial.

The standard formulation is to have:
$\vec{p} \sim \mbox{Dirichlet}\left(\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

Manipulating this, we get the distribution for the Dirichlet-multinomial:
$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N, \mbox{\small Dirichlet-multi.}\right)$
$~~~~~~~~~~~~=~ \frac{1}{\mbox{\small Beta}(\alpha \vec\theta)}{N \choose \vec{n} } \int_{\mbox{\scriptsize simplex}} \prod_k p_{k}^{n_k+\alpha \theta_k-1} \mbox{d}\vec{p}$
$~~~~~~~~~~~~=~ {N \choose \vec{n} } \frac{\mbox{\small Beta}(\vec{n}+\alpha \vec\theta)}{\mbox{\small Beta}(\alpha \vec\theta)}$

Now we will rewrite this into the Pochhammer symbol notation we use for Pitman-Yor marginals,

 $(c|d)_{T}$ is the Pochhammer symbol (a generalised version is used, and a different notation) which is a form of rising factorial so its given by $c (c+d) (c+2d) ... (c+(T-1)d)$; can be computed using Gamma functions $(c)_{T}$ (special case) a related rising factorial given by $c (c+1) (c+2) ... (c+(T-1))$ so it corresponds to $(c|1)_{T}$

The Beta functions simplify to yield the Dirichlet-multinomial in Pochhammer symbol notation:

$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small Dirichlet-multi.}\right) ~=~ {N \choose \vec{n} } \frac{1}{(\alpha)_N} \prod_{k=1}^K (\alpha\theta_k)_{n_k}$

Now lets do the same with the Pitman-Yor process (PYP).
$\vec{p} \sim \mbox{PYP}\left(d,\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

The derivation of the combination is more detailed but is found in the Buntine Hutter Arxiv report or my tutorials. For this, you have to introduce a new latent vector $\vec{t}$ where $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{p}$-node up to the $\vec{\theta}$-node to convey information about the data $\vec{n}$.  Keep this phrase in mind.  It will be explained a bit later. These $\vec{t}$ are constrained as follows:

 constraint $t_k \le n_k$ constraint $t_k>0$ whenever $n_k>0$ total $N=\sum_k n_k$ total $T=\sum_k t_k$

With these, we get the PYP-multinomial in Pochhammer symbol notation:

$p\left( \vec{n},\vec{t}\,\middle\vert\, d,\alpha,\vec\theta,N,\mbox{\small PYP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{(\alpha|d)_T}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, d}^{n_k} \theta_k^{t_k}$

You can see the three main differences. With the PYP:

1. the terms in $\vec\theta$ appear as simple powers, just like a multinomial likelihood, so this form is readily used in a hierarchy;
2. you now have to work with the generalised Stirling numbers $\mathcal{S}_{t_k, d}^{n_k}$; and
3. you have to introduce the new latent vector $\vec{t}$.

The key thing is that $\vec\theta$ only appears in the expression $\prod_{k=1}^K \theta_k^{t_k}$ which is a multinomial likelihood.   So, as said earlier, $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{p}$-node up to the $\vec{\theta}$-node. If $\vec{t}=\vec{n}$ then all the data is passed up, and the likelihood looks like $\prod_{k=1}^K \theta_k^{n_k}$, which is what you would get if $\vec{p}=\vec{\theta}$.

If we use a Dirichlet process (DP) rather than a PYP, the only simplification is that $(\alpha|d)_T$ simplifies to $\alpha^T$.  This small change means that the above formula can be rewritten as:

$p\left( \vec{n},\vec{t}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small DP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{1}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, d}^{n_k}(\alpha\theta_k)^{t_k}$

This has quite broad implications as the $t_k$ are now independent variables!  In fact, their posterior distribution takes the form:

$p\left( t_k\,\middle\vert\, n_k,\alpha,\theta_k\right)~=~ \mathcal{S}_{t_k, 0}^{n_k} \frac{(\alpha\theta_k)^{t_k}}{ (\alpha\theta_k)_{n_k} }$

The penalty for using the PYP or DP is that you now have to work with the generalised Stirling numbers and the latent vector $\vec{t}$.  With the right software, the Stirling numbers are easy to handle.  While my students have built their own Java packages for that, I use a C library available from MLOSS.  The latent vectors $\vec{t}$ at each node require different sampling algorithms.

Now one final note, this isn’t how we implement these.  Sampling the full range of $\vec{t}$ is too slow … there are too many possibilities.  Moreover, if sampling $\vec{t}$, you ignore the remainder of the hierarchy.  For fast mixing of Markov chains, you want to sample a cluster of related variables.  The hierarchical CRP does this implicitly as it resamples and samples up and down the parent hierarchy.  So to achieve this same effect using the marginalised posteriors above we have too Booleanise (turn into a Boolean) the $\vec{t}$‘s and sample a cluster of related Booleans up and down the hierarchy.  We figure out how to do this in 2011, paper at ECML.

## How many clusters?

February 20, 2015

Sometimes people think that a Dirichlet Process (DP) can be used to pick the “right number of clusters”.  The following plots done by Monash Matlab whiz Mark Carman show that this has to be done very carefully.

Given $N$ samples, Mark’s first plot shows the expected number of clusters, $M$, that one would get with a DP using concentration parameter $\alpha$.  The thing to notice is that the number of clusters is moderately well determined by the concentration parameter.  In fact the mean (expected value) of $M$ is given by:

$\alpha \left( \psi(\alpha+N) - \psi(\alpha)\right)$

where $\psi(\cdot)$ is the digamma function; for details see the ArXiv report by Marcus Hutter and I, Section 5.3.  Morever, the standard deviation of $M$ is approximately the square root of the mean (for larger concentrations).

So fixing a given concentration parameter means roughly fixing the number of clusters, which increases as the sample size grows.  So with a fixed concentration parameter you cannot really “estimate” the right number of clustersroughly you are fixing the value ahead of time when setting the concentration.

Mark’s second plot shows what we do to overcome this.  We have to estimate the concentration parameter as well.  So if we put a prior distribution, an Exponential with parameter $\epsilon$, on the concentration, we now smear out the previous plots.  So now we show plots for different values of $\epsilon$.  As you can see, these plots have a much higher variance, which is what you want.  With a given $\epsilon$, you are still determining the broad range of the number of clusters, but you have a lot more latitude.

In implementation, this means we estimate the concentration $\alpha$ (usually) by sampling it.  If we use Chinese restaurant processes, there is a simple auxiliary variable sampling formula for the concentration (presented in the original HDP paper by Teh et al.).  If we use our blocked table indicator sampling, the posterior on the concentration is log concave so we can use either slice sampling or adaptive rejection sampling (ARS). The implementation is moderately simple, and it works well.  However, it does mean now your algorithm will expend more time as it has to try and find the right concentration as well.

## Lancelot James’ general theory for IBPs

February 9, 2015

When I visited Hong Kong last December, Lancelot James gave me a tutorial on his generalised theory of Indian Buffet Processes, which includes all sorts of interesting extensions that are starting to appear in the machine learning community.

The theory is written up in his recent Arxiv report, “Poisson Latent Feature Calculus for Generalized Indian Buffet Processes.”  I’m a big fan of Poisson processes, so its good to see the general theory written up, including a thorough posterior analysis.  In my recent tutorial (given at Aalto) I’ve included a simplified version of this with the key marginal posterior for a few well known cases.  Piyush Rai from Duke also has great summary slides as part of Lawrence Carin’s reading group at Duke.

Within this framework, you can now handle variations like the Beta-negative-Binomial Process of Mingyuan Zhou and colleagues presented at AI Stats 2012 (Arxiv version here), for which many different variations can be developed.

I’ve presented the general case as an application of improper priors because the Poisson process theory in this case is mostly just formalising their use for developing “infinite parameter vectors”.  Lancelot uses his Poisson Process Partition Calculus to develop a clean, general theory.  For me it was interesting to see that his results also apply to multivariate and auxiliary parameter vectors:

• an infinite vector of Gamma scales $\vec\beta$ created using a Gamma process can have an auxiliary vector of Gamma shapes $\vec\alpha$ sampled pointwise, so $\alpha_k$ is associated with $\beta_k$
• most of the scales are infinitesimal
• most of the shapes are not infinitesimal
• an infinite set of $K+1$-dimensional multinomial’s can be created where the first $K$ dimensions are generated using a Dirichlet style Poisson rate:
• most of the multinomials will have infinitesimal values in the first $K$ dimensions
• the Beta process corresponds to the case where $K=1$

## Latent IBP compound Dirichlet Allocation

January 30, 2015

I visited Ralf Herbrich’s new Amazon offices in Berlin on 8th January and chatted there on topic modelling.  A pleasant result for me was meeting both Cédric Archambeau and Jan Gasthaus, both having been active in my area.

With Cedric I discussed Latent IBP compound Dirichlet Allocation (in the long awaited IEEE Trans PAMI 2015 special issue on Bayesian non-parametics, PDF on Cédric’s webpage) model which combines a 3-parameter Indian Buffet Process with Dirichlets.  This was joint work with Balaji Lakshminarayanan and Guillaume Bouchard.

Their work improves on the earlier paper by Williamson, Wang, Heller, and Blei (2010) on the Focused Topic Model which struggled with using the IBP theory.  I’d originally ignored Williamson et al.s’ work because they only tested on toy data sets, despite it being such a great model.  The marginal posterior for the document-topic indicator matrix is given in Archambeau et al.s’ Equation (43) which they attribute to Teh and Görür (NIPS 2009) but its easily derived using Dirichlet marginals and Lancelot James’ general formulas for IBPs.  This is the so-called IBP compound Dirichlet. From there, its easy to derive a collapsed Gibbs sampler mirroring regular LDA.  Theory and sampling hyper-parameters for the 3-parameter IBP I describe in my Helsinki talk and coming tutorial.

This is a great paper with quality empirical work and the best results I’ve seen for non-parametric LDA (other than ours).  Their implementation isn’t performance tuned so their timing figures are not that indicative, but they ran on non-trivial data sets so its good enough.

Note for the curious, I ran our HCA code to duplicate their experimental results on the two larger data sets.  Details in the Helsinki talk.  Basically:

• They compared against prior HDP-LDA implementations so of course beat them substantially.
• Our version of HDP-LDA (without burstiness) works as well as their LIDA algorithm, their better one with IBPs on the topic and the word side, and is substantially faster.
• Our fully non-parametric LDA (DPs for documents, PYPs for words, no burstiness) beat their LIDA substantially.

So while we beat them with our superior collapsed Gibbs sampling, their results are impressive so I’m excited by the possibility of trying their methods.

## Non-reversible operators for MCMC

January 15, 2015

Been visiting University of Helsinki since Christmas and there is Jukka Corander, a Bayesian statistician who works on variants of MCMC and pseudo-likelihood, ways of scaling up statistical computation.  He showed me his 2006 Statistics and Computing paper on “Bayesian model learning based on a parallel MCMC strategy,” (PDF around if you search) and I have to say I’m amazed.  This is so important, as anyone who tries MCMC in complex spaces would know.  The reason for wanting these is:

• proposal operators for things like split-merge must propose *reasonable* alternatives and therefore this must be done with a non-trivial operator
• e.g.,  greedy search is used to build an initial split for the proposal
• developing the reverse operator for these is very hard

So Jukka’s groups result is that reversible MCMC is not necessary.  As long as the usual Metropolis-Hastings acceptance condition applies, the MCMC process converges in the long term.

Anyway, I can now build split-merge operators using MCMC without requiring crazy reversability!

## Wray on the history of the hierarchical Pitman-Yor process

November 15, 2014

Following on from Lancelot James’ post on the Pitman-Yor process (PYP) I thought I’d follow up with key events from the machine learning perspective, including for the Dirichlet process (DP).  My focus is the hierarchical versions, the HPYP and the HDP.

The DP started to appear in the machine learning community to allow “infinite” mixture models in the early 2000’s.  Invited talks by Michael Jordan at ACL 2005 and ICML 2005 covers the main use here.  This, however, follows similar use in the statistical community.  The problem with this is that there are many ways to set up an infinite mixture model or an approximation to one, although the DP is certainly elegant here.  So this was no real innovation for statistical machine learning.

The real innovation was the introduction of the hierarchical DP (HDP) with the JASA paper by Teh, Jordan, Beal and Blei appearing in 2006, and a related hierarchical Pitman-Yor language model at ACL 2006 by Yee Whye Teh.   There are many tutorials on this, but an early tutorial by Teh covers the range from DP to HDP, “A Tutorial on Dirichlet Processes and Hierarchical Dirichlet Processes” and starts discussing the HDP about slide 37.  It is the “Chinese restaurant franchise,” a hierarchical set of Chinese restaurants, that describes the idea.  Note, however, we never use these interpretations in practice because they require dynamic memory.

These early papers went beyond the notion of “infinite mixture model” to the notion of an “hierarchical prior”.  I cover this in my tutorial slides discussing n-grams.  I consider the hierarchical Pitman-Yor language model to be one of the most important developments in statistical machine learning and a stroke of genius.  These were mostly based on algorithms using the Chinese restaurant franchise, based on the Chinese restaurant process (CRP) which is a distribution got by marginalising the probability vectors out of the DP/PYP.

Following this was a period of enrichment as various extensions of CRPs were considered, culminating in the Graphical Pitman-Yor Process developed by Wood and Teh, 2009.  CRPs continue to be used extensively in various areas of machine learning, particularly in topic models and natural language processing.  The standard Gibbs algorithms, however, are fairly slow and require dynamic memory, so do not scale well.

Variational algorithms were developed for the DP, and consequently the PYP, using the stick-breaking representation presented by Ishwaran and James in 2001, see Lancelot James’ comments.  The first of these was by Blei and Jordan 2004, and subsequently variations for many different problems were developed.

Perhaps the most innovative use of the hierarchical Pitman-Yor process is the Adaptor Grammar by Johnson, Griffiths and Goldwater in 2006.  Todate, these have been under-utilised, most likely because of the difficulty of implementation. However, I would strongly recommend any student of the HPYPs to study this work carefully because it shows both a deep understanding of HPYPs and of grammars themselves.

Our work here started appearing in 2011, well after the main innovations.  Our contribution is to develop efficient block Gibbs samplers for the HYP and the more general GPYP that require no dynamic memory.  These are not only significantly faster, they also result in significantly superior estimates.  For our non-parametric topic model, we even did a simple multi-core implementation.

• An introduction to our methods are in Lan Du‘s thesis chapters 1 and 2.  We’re still working on a good publication but the general background theory appears in our technical report “A Bayesian View of the Poisson-Dirichlet Process,” by Buntine and Hutter.
• Our samplers are a collapsed version of the corresponding CRP samplers requiring no dynamic memory so are more efficient, require less memory, and usually converge to better estimates.
• An algorithm for the GPYP case appears in Lan Du‘s PhD thesis.
• An algorithm for splitting nodes in a HPYP is in our NAACL 2013 paper (see Du, Buntine and Johnson).