## Dirichlet-multinomial distribution versus Pitman-Yor-multinomial

March 18, 2015

If you’re familiar with the Dirichlet distribution as the workhorse for discrete Bayesian modelling, then you should know about the Dirichlet-multinomial.  This is what happens when you combine a Dirichlet with a multinomial and integrate/marginalise out the common probability vector.

Graphically, you have a probability vector $\vec\theta$ that is the mean for a probability vector $\vec{p}$.  You then have data, a count vector $\vec{n}$, drawn using a multinomial distribution with mean $\vec{p}$. If we integrate out the vector $\vec{p}$ then we get the Dirichlet-multinomial.

The standard formulation is to have:
$\vec{p} \sim \mbox{Dirichlet}\left(\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

Manipulating this, we get the distribution for the Dirichlet-multinomial:
$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N, \mbox{\small Dirichlet-multi.}\right)$
$~~~~~~~~~~~~=~ \frac{1}{\mbox{\small Beta}(\alpha \vec\theta)}{N \choose \vec{n} } \int_{\mbox{\scriptsize simplex}} \prod_k p_{k}^{n_k+\alpha \theta_k-1} \mbox{d}\vec{p}$
$~~~~~~~~~~~~=~ {N \choose \vec{n} } \frac{\mbox{\small Beta}(\vec{n}+\alpha \vec\theta)}{\mbox{\small Beta}(\alpha \vec\theta)}$
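As a quick numerical check of this form, here is a minimal Python sketch (the helper names `log_beta` and `dirichlet_multinomial_pmf` are my own, not from any library) that evaluates the pmf via log-Gamma functions for stability, and verifies it sums to one over all count vectors of total $N$:

```python
from math import lgamma, exp

def log_beta(xs):
    # Multivariate Beta function: Beta(x) = prod_k Gamma(x_k) / Gamma(sum_k x_k)
    return sum(lgamma(x) for x in xs) - lgamma(sum(xs))

def dirichlet_multinomial_pmf(n, alpha, theta):
    # p(n | alpha, theta, N) = (N choose n) * Beta(n + alpha*theta) / Beta(alpha*theta)
    N = sum(n)
    log_choose = lgamma(N + 1) - sum(lgamma(nk + 1) for nk in n)
    a = [alpha * tk for tk in theta]
    return exp(log_choose + log_beta([nk + ak for nk, ak in zip(n, a)]) - log_beta(a))

# Sanity check: the pmf sums to 1 over all count vectors with N = 3, K = 3.
N, alpha, theta = 3, 2.0, [0.2, 0.3, 0.5]
total = sum(dirichlet_multinomial_pmf([i, j, N - i - j], alpha, theta)
            for i in range(N + 1) for j in range(N + 1 - i))
print(abs(total - 1.0) < 1e-9)  # True
```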

Now we will rewrite this in the Pochhammer symbol notation we use for Pitman-Yor marginals.

$(c|d)_{T}$ denotes the Pochhammer symbol (a generalised version, in a non-standard notation), which is a form of rising factorial given by $(c|d)_{T} = c\,(c+d)\,(c+2d)\cdots(c+(T-1)d)$; it can be computed using Gamma functions. The special case $(c)_{T}$ is the ordinary rising factorial $c\,(c+1)\,(c+2)\cdots(c+(T-1))$, so it corresponds to $(c|1)_{T}$.
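As a sketch of how these are computed in practice (helper names are my own), the generalised Pochhammer symbol can be evaluated directly as a product, or via Gamma functions when $d>0$ using $(c|d)_T = d^T\,\Gamma(c/d+T)/\Gamma(c/d)$:

```python
from math import lgamma, exp, prod

def pochhammer(c, T, d=1.0):
    # (c|d)_T = c (c+d) (c+2d) ... (c+(T-1)d); empty product (= 1) when T = 0
    return prod(c + i * d for i in range(T))

def pochhammer_gamma(c, T, d):
    # For d > 0: (c|d)_T = d^T * Gamma(c/d + T) / Gamma(c/d)
    return d**T * exp(lgamma(c / d + T) - lgamma(c / d))

print(pochhammer(2.0, 4))         # (2)_4 = 2*3*4*5 = 120.0
print(pochhammer(1.0, 3, d=0.5))  # (1|0.5)_3 = 1*1.5*2 = 3.0
print(abs(pochhammer_gamma(1.0, 3, 0.5) - 3.0) < 1e-9)  # True
```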

The Beta functions simplify to yield the Dirichlet-multinomial in Pochhammer symbol notation:

$p\left( \vec{n}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small Dirichlet-multi.}\right) ~=~ {N \choose \vec{n} } \frac{1}{(\alpha)_N} \prod_{k=1}^K (\alpha\theta_k)_{n_k}$

Now let's do the same with the Pitman-Yor process (PYP).
$\vec{p} \sim \mbox{PYP}\left(d,\alpha,\vec\theta\right) ~~~~~~~~~~~~ \vec{n} \sim \mbox{Multinomial}\left(\vec{p},N\right)$

The derivation of the combination is more involved, but can be found in the Buntine and Hutter arXiv report or my tutorials. For this, you have to introduce a new latent vector $\vec{t}$ where $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{p}$-node to the $\vec{\theta}$-node to convey information about the data $\vec{n}$.  Keep this phrase in mind; it will be explained a bit later. These $\vec{t}$ are constrained as follows:

• constraints: $t_k \le n_k$, and $t_k>0$ whenever $n_k>0$
• totals: $N=\sum_k n_k$ and $T=\sum_k t_k$

With these, we get the PYP-multinomial in Pochhammer symbol notation:

$p\left( \vec{n},\vec{t}\,\middle\vert\, d,\alpha,\vec\theta,N,\mbox{\small PYP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{(\alpha|d)_T}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, d}^{n_k} \theta_k^{t_k}$
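The generalised Stirling numbers can be tabulated with a simple recurrence, $\mathcal{S}_{t, d}^{n+1} = \mathcal{S}_{t-1, d}^{n} + (n - td)\,\mathcal{S}_{t, d}^{n}$ with $\mathcal{S}_{0, d}^{0}=1$, under the convention used in the Buntine and Hutter report. Here is an illustrative Python sketch (function names are mine) that builds the table and checks the standard identity $(\alpha)_N = \sum_t \mathcal{S}_{t, d}^{N}\,(\alpha|d)_t$ linking them to the Pochhammer symbols:

```python
from math import prod

def stirling_table(nmax, d):
    # S[n][t] holds the generalised Stirling number S^n_{t,d};
    # recurrence: S^{n+1}_t = S^n_{t-1} + (n - t*d) * S^n_t, with S^0_0 = 1.
    S = [[0.0] * (nmax + 1) for _ in range(nmax + 1)]
    S[0][0] = 1.0
    for n in range(nmax):
        for t in range(1, n + 2):
            S[n + 1][t] = S[n][t - 1] + (n - t * d) * S[n][t]
    return S

def rising(c, T, d):
    # Generalised rising factorial (c|d)_T = c (c+d) ... (c+(T-1)d)
    return prod(c + i * d for i in range(T))

# Check the identity (alpha)_N = sum_t S^N_{t,d} (alpha|d)_t
alpha, d, N = 1.3, 0.5, 6
S = stirling_table(N, d)
lhs = rising(alpha, N, 1.0)
rhs = sum(S[N][t] * rising(alpha, t, d) for t in range(N + 1))
print(abs(lhs - rhs) < 1e-9)  # True
```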

You can see the three main differences. With the PYP:

1. the terms in $\vec\theta$ appear as simple powers, just like a multinomial likelihood, so this form is readily used in a hierarchy;
2. you now have to work with the generalised Stirling numbers $\mathcal{S}_{t_k, d}^{n_k}$; and
3. you have to introduce the new latent vector $\vec{t}$.

The key thing is that $\vec\theta$ only appears in the expression $\prod_{k=1}^K \theta_k^{t_k}$ which is a multinomial likelihood.   So, as said earlier, $t_k$ represents the subset of the count $n_k$ that will be passed up from the $\vec{p}$-node up to the $\vec{\theta}$-node. If $\vec{t}=\vec{n}$ then all the data is passed up, and the likelihood looks like $\prod_{k=1}^K \theta_k^{n_k}$, which is what you would get if $\vec{p}=\vec{\theta}$.

If we use a Dirichlet process (DP) rather than a PYP, that is, set the discount $d=0$, the only simplification is that $(\alpha|d)_T$ simplifies to $\alpha^T$.  This small change means that the above formula can be rewritten as:

$p\left( \vec{n},\vec{t}\,\middle\vert\, \alpha,\vec\theta,N,\mbox{\small DP-multi.}\right) ~=~ {N \choose \vec{n}} \frac{1}{(\alpha)_N} \prod_{k = 1}^K \mathcal{S}_{t_k, 0}^{n_k}(\alpha\theta_k)^{t_k}$

This has quite broad implications as the $t_k$ are now independent variables!  In fact, their posterior distribution takes the form:

$p\left( t_k\,\middle\vert\, n_k,\alpha,\theta_k\right)~=~ \mathcal{S}_{t_k, 0}^{n_k} \frac{(\alpha\theta_k)^{t_k}}{ (\alpha\theta_k)_{n_k} }$
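As a self-contained illustration (helper names are my own; with $d=0$ the $\mathcal{S}_{t, 0}^{n}$ reduce to the unsigned Stirling numbers of the first kind), this posterior normalises because $\sum_t \mathcal{S}_{t, 0}^{n}\,x^t = (x)_n$:

```python
from math import prod

def stirling1(nmax):
    # Unsigned Stirling numbers of the first kind: S[n][t], via
    # S^{n+1}_t = S^n_{t-1} + n * S^n_t (the d = 0 case of the generalised numbers)
    S = [[0] * (nmax + 1) for _ in range(nmax + 1)]
    S[0][0] = 1
    for n in range(nmax):
        for t in range(1, n + 2):
            S[n + 1][t] = S[n][t - 1] + n * S[n][t]
    return S

def t_posterior(n, alpha_theta):
    # p(t | n, alpha*theta) = S^n_{t,0} (alpha*theta)^t / (alpha*theta)_n
    S = stirling1(n)
    norm = prod(alpha_theta + i for i in range(n))  # rising factorial (alpha*theta)_n
    return [S[n][t] * alpha_theta**t / norm for t in range(n + 1)]

post = t_posterior(5, 0.7)
print(abs(sum(post) - 1.0) < 1e-9)  # True: the posterior normalises
print(post[0] == 0.0)               # True: t = 0 is impossible when n > 0
```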

The penalty for using the PYP or DP is that you now have to work with the generalised Stirling numbers and the latent vector $\vec{t}$.  With the right software, the Stirling numbers are easy to handle.  While my students have built their own Java packages for that, I use a C library available from MLOSS.  The latent vectors $\vec{t}$ at each node require different sampling algorithms.

Now one final note: this isn’t how we implement these.  Sampling the full range of $\vec{t}$ is too slow … there are too many possibilities.  Moreover, if sampling $\vec{t}$ alone, you ignore the remainder of the hierarchy.  For fast mixing of Markov chains, you want to sample a cluster of related variables.  The hierarchical CRP does this implicitly as it resamples and samples up and down the parent hierarchy.  So to achieve this same effect using the marginalised posteriors above, we have to Booleanise (turn into a Boolean) the $\vec{t}$‘s and sample a cluster of related Booleans up and down the hierarchy.  We figured out how to do this in 2011, in a paper at ECML.

## Wray on the history of the hierarchical Pitman-Yor process

November 15, 2014

Following on from Lancelot James’ post on the Pitman-Yor process (PYP) I thought I’d follow up with key events from the machine learning perspective, including for the Dirichlet process (DP).  My focus is the hierarchical versions, the HPYP and the HDP.

The DP started to appear in the machine learning community to allow “infinite” mixture models in the early 2000s.  Invited talks by Michael Jordan at ACL 2005 and ICML 2005 cover the main use here.  This, however, follows similar use in the statistical community.  The problem is that there are many ways to set up an infinite mixture model, or an approximation to one, although the DP is certainly elegant here.  So this was no real innovation for statistical machine learning.

The real innovation was the introduction of the hierarchical DP (HDP) with the JASA paper by Teh, Jordan, Beal and Blei appearing in 2006, and a related hierarchical Pitman-Yor language model at ACL 2006 by Yee Whye Teh.   There are many tutorials on this, but an early tutorial by Teh covers the range from DP to HDP, “A Tutorial on Dirichlet Processes and Hierarchical Dirichlet Processes”, and starts discussing the HDP at about slide 37.  It is the “Chinese restaurant franchise,” a hierarchical set of Chinese restaurants, that describes the idea.  Note, however, we never use these interpretations in practice because they require dynamic memory.

These early papers went beyond the notion of an “infinite mixture model” to the notion of a “hierarchical prior”.  I cover this in my tutorial slides discussing n-grams.  I consider the hierarchical Pitman-Yor language model to be one of the most important developments in statistical machine learning and a stroke of genius.  These were mostly based on algorithms using the Chinese restaurant franchise, built on the Chinese restaurant process (CRP), which is the distribution obtained by marginalising the probability vectors out of the DP/PYP.

Following this was a period of enrichment as various extensions of CRPs were considered, culminating in the Graphical Pitman-Yor Process developed by Wood and Teh, 2009.  CRPs continue to be used extensively in various areas of machine learning, particularly in topic models and natural language processing.  The standard Gibbs algorithms, however, are fairly slow and require dynamic memory, so do not scale well.

Variational algorithms were developed for the DP, and consequently the PYP, using the stick-breaking representation presented by Ishwaran and James in 2001, see Lancelot James’ comments.  The first of these was by Blei and Jordan 2004, and subsequently variations for many different problems were developed.

Perhaps the most innovative use of the hierarchical Pitman-Yor process is the Adaptor Grammar by Johnson, Griffiths and Goldwater in 2006.  To date, these have been under-utilised, most likely because of the difficulty of implementation. However, I would strongly recommend any student of HPYPs study this work carefully because it shows both a deep understanding of HPYPs and of grammars themselves.

Our work here started appearing in 2011, well after the main innovations.  Our contribution was to develop efficient block Gibbs samplers for the HPYP and the more general GPYP that require no dynamic memory.  These are not only significantly faster, they also result in significantly superior estimates.  For our non-parametric topic model, we even did a simple multi-core implementation.

• An introduction to our methods is in Lan Du‘s thesis chapters 1 and 2.  We’re still working on a good publication, but the general background theory appears in our technical report “A Bayesian View of the Poisson-Dirichlet Process,” by Buntine and Hutter.
• Our samplers are a collapsed version of the corresponding CRP samplers requiring no dynamic memory so are more efficient, require less memory, and usually converge to better estimates.
• An algorithm for the GPYP case appears in Lan Du‘s PhD thesis.
• An algorithm for splitting nodes in a HPYP is in our NAACL 2013 paper (see Du, Buntine and Johnson).

## Lancelot James on the history of the Pitman-Yor Process

November 10, 2014

Those of us in Machine Learning (read Computer Science) departments rely on the work of many statisticians.  While Jim Pitman (UCB) and colleagues have been instrumental in developing the fundamental theory, other statisticians such as Lancelot James (HKUST) have been extending the work and interpreting it so the rest of us can understand it.  Marcus Hutter (ANU) developed a bunch of formulas, and I remember him being shattered to discover his toughest, an “exact formula for the second order Stirling number”, was published by Toscano in 1939; and Lancelot cites Halmos 1944 on stick-breaking!

So it’s important to know our history.  Here is Lancelot’s short history of the Pitman-Yor process, which he wrote during correspondence with us; my students and I are most grateful for his many insights (some edits by me):

• The process arises from a comprehensive study of excursion lengths of Markov processes, where an excursion length is the time from one zero-crossing to the next.
• Excursions can be either positive or negative, so the (P_k) represent the lengths of excursions of a process on [0,1].  The sign of an excursion is not reflected in the law of (P_k); in fact it is independent of the length.
• A class of Bessel processes indexed by a parameter 0 ≤ α < 1 was studied in a series of works by Jim Pitman, Marc Yor and co-authors during the 1980s and 1990s, generalizing results for Brownian motion (the α = 1/2 case).
• This led to the study of (P_k) following a two-parameter Poisson-Dirichlet distribution with parameters (α, θ), denoted PD(α, θ); PD(1/2, 0) corresponds to Brownian motion and PD(1/2, 1/2) to the Brownian bridge.
• Its stick-breaking representation is formally derived in Perman, Pitman and Yor (1992).
• This process, though not its Bayesian aspects, is later discussed by Pitman and Yor.
• The corresponding random distribution function is treated in a formal Bayesian setting by Pitman.