## Training a Pitman-Yor process tree with observed data at the leaves (part 1)

November 12, 2014

A simple problem with hierarchical Pitman-Yor processes (PYPs) or Dirichlet processes (DPs) is one where you have a hierarchy of probability vectors and data only at the leaf nodes.  The task is to estimate the probabilities at the leaf nodes.  The hierarchy allows sharing across children without resorting to Bayesian model averaging, as I did for decision trees (of the Ross Quinlan variety) in my 1992 PhD thesis (journal article here), and as done for n-grams with the famous Context Tree Weighting compression method.

The technique used here is somewhat similar to Katz’s back-off smoothing for n-grams, but the discounts and back-off weight are estimated differently.  Yee Whye Teh created this model originally, and it is published in an ACL 2006 paper, though the real theory is buried in an NUS technical report (yes, one of those amazing unpublished manuscripts). However, his sampler is not very good, so the purpose of this note is to explain a better algorithm.  The weakness in his logic is exposed in the following quote (page 16 of the amazing NUS manuscript):

> However each iteration is computationally intensive as each $c_{uw\cdot}$ and $t_{uw}$ can potentially take on many values, and it is expensive to compute the generalized Stirling numbers $s_d (c, t)$.

We use simple approximations and caching to overcome these problems. His technique has nevertheless generated a lot of interest, and a lot of papers use it.  But the new sampler/estimator is a lot faster, gives substantially better predictive results, and requires no dynamic memory.  Note that when the data at the leaves is also latent, as with clustering, topic models, or HMMs, completely different methods are needed to enable faster mixing.  This note covers only the simplest case, where the data at the leaves is observed.

The basic task is to estimate the probability of the $k$-th outcome at a node $\theta$, call it $\hat{p}^\theta_k$, by smoothing the observed data $n^\theta_k$ at the node with the estimated probability at the parent node $\phi$, denoted $\hat{p}^\phi_k$.  The formula used by Teh is as follows:

$\hat{p}^\theta_k ~=~ \frac{n^\theta_k - t^\theta_k d}{N^\theta+c} + \frac{c+T^\theta \,d}{N^\theta+c} \,\, \hat{p}^\phi_k$             (1)

In this, the pair $(d,c)$ are the parameters of the PYP called discount and concentration respectively, where $0 \le d < 1$ and $c \ge -d$.  The $N$s and $T$s are totals, so $N^\mu=\sum_k n^\mu_k$ and $T^\mu=\sum_k t^\mu_k$ for all nodes $\mu$ in the graph.  The DP is the case where $d=0$.
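Equation (1) is simple enough to sketch directly.  Here is a minimal illustration; the function name and arguments (`n`, `t`, `d`, `c`, `parent_p`) are my own labels for the quantities above, not from any published code:

```python
def pyp_estimate(n, t, d, c, parent_p):
    """PYP-smoothed probability vector at a node, per Equation (1).

    n[k]: data count for outcome k at this node
    t[k]: table count for outcome k (counts passed up to the parent)
    d:    discount, 0 <= d < 1
    c:    concentration, c >= -d
    parent_p[k]: estimated probability of outcome k at the parent
    """
    N = sum(n)  # N^theta
    T = sum(t)  # T^theta
    return [
        (n[k] - t[k] * d) / (N + c) + (c + T * d) / (N + c) * parent_p[k]
        for k in range(len(n))
    ]
```

Note that the two mixture weights sum so that, whenever `parent_p` is a proper distribution, the result is too.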

For a discount of $d=0$, it’s easy to see that for large amounts of data the probability estimate converges to the observed frequency, so this behaves well.

If the data is at the leaf nodes, where do the counts $n^\mu_k$ for non-leaf nodes $\mu$ come from?  Easy: they are totalled from their children’s $t$ counts, $n^\mu_k=\sum_{\phi \in \mu} t^\phi_k$ (using $\phi \in \mu$ to denote that the $\phi$ are the children of $\mu$).
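This totalling is a one-liner; the function name and the list-of-lists representation of the children's $t$ vectors are illustrative choices of my own:

```python
def parent_counts(children_t):
    """Non-leaf counts n^mu_k are the sums of the children's table counts.

    children_t: a list of t-vectors, one per child, all of the same length K.
    """
    K = len(children_t[0])
    return [sum(t[k] for t in children_t) for k in range(K)]
```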

But what then are the $t^\mu_k$?  In the Chinese Restaurant Process (CRP) view of a PYP, these are the “number of tables” at the node.  To understand this, you could go and spend a few hours studying CRPs, but you don’t need to if you don’t know them already.  The way to think of it is:

The $t^\mu_k$ are the subset of counts $n^\mu_k$ that you pass from the node $\mu$ to its parent $\phi$ in the form of a multinomial message.  So the contribution from $\mu$ to $\phi$ is a likelihood of the form $\prod_k \phi_k^{t^\mu_k}$.  The more counts you pass up, the stronger you make the message; the fewer, the weaker.

The idea is that if we expect the probability vector at $\mu$ to be very similar to the vector at $\phi$, then most of the data should pass up, so $t^\mu_k$ is close to $n^\mu_k$. In this case the first term in Equation (1) shrinks and the loss is transferred via $T^\mu$, so the parent probability contributes more to the estimate. Conversely, if the probability vector at $\mu$ should be quite different, then only a small amount of data should pass up, and $t^\mu_k$ is closer to 1 when $n^\mu_k>0$.  Consequently $T^\mu$ is smaller, so the parent probability contributes less to the estimate.   Similarly, increasing the concentration makes the first term smaller and allows the parent probability to contribute more to the estimate, and vice versa.  Finally, note that for the DP the $t^\mu_k$ don’t seem to appear in Equation (1), but in reality they do; they are just hidden.  They have propagated up into the parent counts, so they are used in the parent probability estimate $\hat{p}^\phi_k$.  So the more of them there are, the more confident you are about your parent’s probability.

A pleasant property of PYP theory and sampling, shared by many other species-sampling distributions, is that all of this just works.   But there is some effort in sampling the $\vec{t}^\mu$.

Equation (1) is a recursive formula.  The task of estimating probabilities is now reduced to:

1. Build the tree of nodes and populate the data in counts $\vec{n}^\theta$ at leaf nodes $\theta$.
2. Run some algorithm to estimate the $\vec{t}^\mu$ for nodes $\mu$, and hence the $\vec{n}^\mu$ for non-leaf nodes $\mu$.
3. Estimate the probabilities from the root node down using Equation (1).
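Step 3 can be sketched as a top-down recursion, taking the $\vec{t}^\mu$ as given (estimating them is the missing bit discussed below).  The `Node` class, its field names, and the choice of base distribution passed in at the root are my own illustrative assumptions, not from the post:

```python
class Node:
    """One node of the tree; d and c are the per-node PYP parameters."""
    def __init__(self, d, c, children=None, n=None):
        self.d, self.c = d, c
        self.children = children or []
        self.n = n        # leaf data counts, or totals of children's t counts
        self.t = None     # table counts, assumed already estimated
        self.p = None     # estimated probability vector, filled in below

def estimate_down(node, parent_p):
    """Apply Equation (1) from the root down; parent_p at the root is
    whatever base distribution sits above the tree (e.g. uniform)."""
    N, T = sum(node.n), sum(node.t)
    node.p = [
        (node.n[k] - node.t[k] * node.d) / (N + node.c)
        + (node.c + T * node.d) / (N + node.c) * parent_p[k]
        for k in range(len(node.n))
    ]
    for child in node.children:
        estimate_down(child, node.p)
```

For example, with a root over two leaves, calling `estimate_down(root, [0.5, 0.5])` fills in `root.p` first and then each leaf's `p` using its parent's freshly computed estimate.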

So the whole procedure for probability estimation is very simple, except for the two missing bits:

• How do you estimate the $\vec{t}^\mu$ (and thus $\vec{n}^\mu$ for non-leaf nodes)?
• How do you estimate the discount and concentration parameters?  By the way, we can vary these per node too!

We will look at the first of these two tasks next.
