## Training a Pitman-Yor process tree with observed data at the leaves (part 1)

November 12, 2014

A simple problem with hierarchical Pitman-Yor processes (PYPs) or Dirichlet processes (DPs) arises when you have a hierarchy of probability vectors and data only at the leaf nodes. The task is to estimate the probabilities at the leaf nodes. The hierarchy is used to allow sharing across children without resorting to Bayesian model averaging, as I did for decision trees (of the Ross Quinlan variety) in my 1992 PhD thesis (journal article here) and as is done for n-grams by the famous Context Tree Weighting compression method.

The technique used here is somewhat similar to Katz's back-off smoothing for n-grams, but the discounts and back-off weight are estimated differently. Yee Whye Teh created this model originally, and it is published in an ACL 2006 paper, though the real theory is buried in an NUS technical report (yes, one of those amazing *unpublished* manuscripts). However, his sampler is not very good, so the purpose of this note is to explain a better algorithm. The error in his logic is exposed in the following quote (page 16 of the amazing NUS manuscript):

> However each iteration is computationally intensive as each [count and table count] can potentially take on many values, and it is expensive to compute the generalized Stirling numbers.

We use simple approximations and caching to overcome these problems. His technique, however, has generated a lot of interest, and a lot of papers use it. But the new sampler/estimate is a lot faster, gives substantially better predictive results, and requires no dynamic memory. Note that when *the data at the leaves is also latent*, such as with clustering, topic models, or HMMs, completely different methods are needed to enable faster mixing. This note is only about the simplest case, where the data at the leaves is observed.

The basic task is to estimate the probability of the $k$-th outcome at a node $n$, call it $p_{k,n}$, by smoothing the observed counts $c_{k,n}$ at the node with the estimated probability at the parent node $\pi(n)$, denoted $p_{k,\pi(n)}$. The formula used by Teh is as follows:

$$ p_{k,n} \;=\; \frac{c_{k,n} - d\,t_{k,n}}{C_n + \theta} \;+\; \frac{\theta + d\,T_n}{C_n + \theta}\,p_{k,\pi(n)} \qquad (1)$$

In this, the pair $(d,\theta)$ are the parameters of the PYP, called the *discount* and *concentration* respectively, where $0 \le d < 1$ and $\theta > -d$. The $C_n$s and $T_n$s are totals, so $C_n = \sum_k c_{k,n}$ and $T_n = \sum_k t_{k,n}$ for all nodes $n$ in the graph. The DP is the case where $d = 0$.
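As a concrete sketch of Equation (1), here is a minimal Python function computing the estimate at a single node, given the counts, table counts, and parent probabilities (the function name `pyp_estimate` is my own, not from the original paper):

```python
def pyp_estimate(c, t, parent_p, discount, concentration):
    """Predictive probabilities at one node per Equation (1).

    c, t, parent_p are parallel lists over outcomes k:
    counts c_{k,n}, table counts t_{k,n}, and parent probabilities p_{k,pi(n)}.
    """
    C = sum(c)                     # total count C_n at the node
    T = sum(t)                     # total table count T_n at the node
    denom = C + concentration
    return [
        (c[k] - discount * t[k]) / denom
        + (concentration + discount * T) / denom * parent_p[k]
        for k in range(len(c))
    ]
```

Note that if the parent probabilities sum to one, so do the resulting estimates, since the discounted mass $d\,T_n$ removed from the first term reappears in the back-off weight.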

For a discount of $d = 0$, it's easy to see that for large amounts of data the probability estimate converges to the observed frequency, so this behaves well.
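A quick numeric check of this convergence, using Equation (1) with $d = 0$ (the DP case; the helper name `dp_estimate` is mine):

```python
def dp_estimate(c, parent_p, theta):
    # Equation (1) with discount d = 0: the table counts drop out.
    C = sum(c)
    return [(ck + theta * pk) / (C + theta) for ck, pk in zip(c, parent_p)]

parent = [0.5, 0.5]
small = dp_estimate([3, 1], parent, 10.0)        # little data: pulled toward parent
big = dp_estimate([3000, 1000], parent, 10.0)    # lots of data: close to frequency 3/4
```

With only four observations the estimate sits near the parent's 0.5; with four thousand it is within a fraction of a percent of the observed frequency 0.75.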

*If the data is at the leaf nodes, where do the counts for non-leaf nodes come from?* Easy: they are totalled from the counts their children pass up (using $\mathcal{C}(n)$ to denote the children of node $n$), so $c_{k,n} = \sum_{m \in \mathcal{C}(n)} t_{k,m}$.

*But what then are the $t_{k,n}$?* In the Chinese Restaurant Process (CRP) view of a PYP, these are the "number of tables" at the node. To understand this, you could go and spend a few hours studying CRPs, but you don't need to if you don't know it already. The way to think of it is:

The $t_{k,n}$ are the subset of the counts $c_{k,n}$ that you pass from the node $n$ to its parent in the form of a multinomial message. So the contribution from $n$ to $\pi(n)$ is a likelihood of the form $\prod_k p_{k,\pi(n)}^{t_{k,n}}$. The more (or fewer) counts you pass up, the stronger (or weaker) you make the message.

The idea is that if we expect the probability vector at $n$ to be very similar to the vector at $\pi(n)$, then most of the data should pass up, so $t_{k,n}$ is close to $c_{k,n}$. In this case the first term in Equation (1) shrinks, and the lost mass is transferred via $T_n$, so the parent probability contributes more to the estimate. Conversely, if the probability vector at $n$ should be quite different, then only a small amount of data should pass up, and $t_{k,n}$ is closer to 1 whenever $c_{k,n} > 0$. Consequently $T_n$ is smaller, so the parent probability contributes less to the estimate. Similarly, increasing the concentration $\theta$ makes the first term smaller and allows the parent probability to contribute more to the estimate, and vice versa. Finally, note that for the DP the $t_{k,n}$ don't seem to appear in Equation (1), but in reality they do; they are just hidden. They have propagated up into the parent counts, so they are used in the parent probability estimate $p_{k,\pi(n)}$. So the more of them there are, the more confident you are about your parent's probability.
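To see this trade-off numerically, here is a small sketch (assuming the Equation (1) form above) comparing the back-off weight $(\theta + d\,T_n)/(C_n + \theta)$ placed on the parent when all counts are passed up versus when only one table per observed outcome is used:

```python
def parent_weight(C, T, discount, concentration):
    # Weight multiplying the parent probability p_{k,pi(n)} in Equation (1).
    return (concentration + discount * T) / (C + concentration)

c = [8, 4, 0]                           # counts c_{k,n} at the node
C = sum(c)
d, theta = 0.5, 1.0
T_full = sum(c)                         # pass everything up: t_k = c_k
T_min = sum(1 for ck in c if ck > 0)    # minimal: one table per seen outcome
w_full = parent_weight(C, T_full, d, theta)   # parent dominates the estimate
w_min = parent_weight(C, T_min, d, theta)     # node's own data dominates
```

Here `w_full` is roughly 0.54 while `w_min` is roughly 0.15, so the choice of table counts directly controls how much the node borrows from its parent.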

PYP theory and sampling have a property, shared by many other species sampling distributions, that this all just works. But there is some effort in sampling the $t_{k,n}$.

Equation (1) is a recursive formula. The *task of estimating probabilities* is now reduced to:

- Build the tree of nodes $n$ and populate the counts $c_{k,n}$ at the leaf nodes.
- Run some algorithm to estimate the $t_{k,n}$ for all nodes $n$, and hence the $c_{k,n}$ for non-leaf nodes $n$.
- Estimate the probabilities $p_{k,n}$ from the root node down using Equation (1).
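The three steps above can be sketched end to end in Python. The middle step here uses a deliberately crude placeholder for the table counts (one table per observed outcome, $t_{k,n} = \min(c_{k,n}, 1)$), standing in for the real estimation algorithm; all names are my own:

```python
class Node:
    def __init__(self, children=None, counts=None):
        self.children = children or []
        self.counts = counts   # leaf: list of c_{k,n}; internal: filled below
        self.tables = None     # t_{k,n}, filled below

def propagate_up(node, K):
    """Steps 1-2: total non-leaf counts from children, set crude table counts."""
    if node.children:
        for child in node.children:
            propagate_up(child, K)
        # Non-leaf counts are the table counts passed up by the children.
        node.counts = [sum(ch.tables[k] for ch in node.children) for k in range(K)]
    # Placeholder for the real sampler: one table per observed outcome.
    node.tables = [min(ck, 1) for ck in node.counts]

def estimate_down(node, parent_p, d, theta):
    """Step 3: apply Equation (1) from the root down."""
    C, T = sum(node.counts), sum(node.tables)
    node.p = [(node.counts[k] - d * node.tables[k]) / (C + theta)
              + (theta + d * T) / (C + theta) * parent_p[k]
              for k in range(len(parent_p))]
    for child in node.children:
        estimate_down(child, node.p, d, theta)

# Usage: two leaves under one root; the root backs off to a uniform base.
leaf1 = Node(counts=[5, 1, 0])
leaf2 = Node(counts=[0, 2, 2])
root = Node(children=[leaf1, leaf2])
propagate_up(root, 3)
estimate_down(root, [1 / 3, 1 / 3, 1 / 3], 0.5, 1.0)
```

The leaf estimates stay faithful to each leaf's own counts while being smoothed toward the shared root estimate.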

So the whole procedure for probability estimation is very simple, *except for the two missing bits*:

- How do you estimate the $t_{k,n}$ (and thus the $c_{k,n}$ for non-leaf nodes)?
- How do you estimate the discount and concentration parameters? By the way, we can vary these per node too!

We will look at the first of these two tasks next.
