
Training a Pitman-Yor process tree with observed data at the leaves (part 2)
November 13, 2014

So here is a recap of the previous post:
A simple problem with hierarchical Pitman-Yor processes (PYPs) or Dirichlet processes (DP) is where you have a hierarchy of probability vectors and data only at the leaf nodes. The task is to estimate the probabilities at the leaf nodes. The hierarchy is used to allow sharing across children …
We have a tree of nodes, each of which has a probability vector. The formula used estimates the probability for node $x$ from its parent $\mathrm{pa}(x)$ as follows:

$$p_{x,k} \;=\; \frac{n_{x,k} - d_x\, t_{x,k}}{\theta_x + N_x} \;+\; \frac{\theta_x + d_x\, T_x}{\theta_x + N_x}\; p_{\mathrm{pa}(x),k} \qquad (1)$$

In this, the pair $(d_x,\theta_x)$ are the parameters of the PYP, called discount and concentration respectively, where $0 \le d_x < 1$ and $\theta_x > -d_x$. The $t_{x,k}$ are magical auxiliary (latent) counts that make the PYP samplers work efficiently. They are, importantly, constrained by their corresponding counts $n_{x,k}$, and the counts are computed up from the leaves (using $C(x)$ to denote the children of $x$). The $N_x$s and $T_x$s are totals. These relationships are summarised in the table below.

constraint | $t_{x,k} \le n_{x,k}$
constraint | $t_{x,k} \ge 1$ whenever $n_{x,k} \ge 1$, and $t_{x,k} = 0$ when $n_{x,k} = 0$
propagating counts | $n_{x,k} = \sum_{c \in C(x)} t_{c,k}$ for non-leaf $x$
total | $N_x = \sum_k n_{x,k}$
total | $T_x = \sum_k t_{x,k}$
It is important to note here how the counts propagate upwards. In some smoothing algorithms, propagation is done with the raw counts, $n_{x,k} = \sum_{c \in C(x)} n_{c,k}$. This is not the case with us: what propagates is the table count $t_{c,k}$. Here the non-leaf nodes represent priors on the leaf nodes; they do not represent alternative models. PYP theory tells us how to make inference about these priors.
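To make Equation (1) and the count bookkeeping concrete, here is a minimal C sketch. The struct layout and names (`node`, `prob`) are mine for illustration only, not the author's library:

```c
#include <stddef.h>

/* A hypothetical node of the PYP tree; field names are illustrative only. */
typedef struct node {
    struct node *parent;   /* NULL at the root */
    double d, theta;       /* discount d_x and concentration theta_x */
    int    *n, *t;         /* per-outcome counts n_{x,k} and table counts t_{x,k} */
    int    N, T;           /* totals N_x = sum_k n_{x,k}, T_x = sum_k t_{x,k} */
} node;

/* Equation (1): smoothed probability of outcome k at node x, recursing up
 * the tree and falling back to the given root prior p_root[k] at the top. */
double prob(const node *x, int k, const double *p_root)
{
    double parent_pk = x->parent ? prob(x->parent, k, p_root) : p_root[k];
    return (x->n[k] - x->d * x->t[k]) / (x->theta + x->N)
         + (x->theta + x->d * x->T) / (x->theta + x->N) * parent_pk;
}
```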
Now the theory of this configuration appears in many places, but Marcus Hutter and I review it in Section 8 of an ArXiv paper, and it's worked through for this case in my recent tutorial in the section “Working the n-gram Model”. The posterior distribution for the case with data at the leaf nodes is given by:

$$p\left(\vec{t},\vec{n} \,\middle|\, \vec{d},\vec{\theta}\right) \;\propto\; \prod_k \left(p^\top_k\right)^{t_{\top,k}} \;\prod_{x} \frac{(\theta_x | d_x)_{T_x}}{(\theta_x)_{N_x}} \prod_k S^{\,n_{x,k}}_{t_{x,k},\,d_x} \qquad (2)$$

There are a lot of new terms here so let us review them:

$(\theta | d)_T$ | the Pochhammer symbol (a generalised version is used, and a different notation), which is a form of rising factorial, so it is given by $(\theta|d)_T = \theta\,(\theta+d)\,(\theta+2d)\cdots(\theta+(T-1)d)$
$(\theta)_N$ | a related rising factorial given by $(\theta)_N = \theta\,(\theta+1)\cdots(\theta+N-1)$
$\top$ | the root node of the hierarchy
$p^\top_k$ | the initial/prior probability of the k-th outcome at the root
$S^{N}_{M,d}$ | a generalised Stirling number of the second kind for discount $d$
Background theory for the Stirling number is reviewed in Section 5.2 of the ArXiv paper, and was developed many decades ago by mathematicians and statisticians. We have C and Java libraries for this; for instance, the C version is on MLOSS, so it is usually O(1) to evaluate.
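As a rough illustration of how such a cache can be built (this is not the MLOSS library's actual API), the generalised Stirling numbers can be tabulated in log space with the recurrence $S^{n+1}_{m,d} = S^{n}_{m-1,d} + (n - m d)\,S^{n}_{m,d}$, with $S^0_{0,d} = 1$:

```c
#include <math.h>
#include <stdlib.h>

/* Tabulate log S^n_{m,d} for 0 <= m <= n <= nmax using the recurrence
 *   S^{n+1}_{m,d} = S^n_{m-1,d} + (n - m*d) * S^n_{m,d},
 * with S^0_{0,d} = 1 and S^n_{m,d} = 0 when m > n or (m = 0, n > 0).
 * Entry [n*(nmax+1) + m] of the returned array holds log S^n_{m,d},
 * or -INFINITY where the number is zero. */
double *stirling_log_table(int nmax, double d)
{
    int stride = nmax + 1;
    double *logS = malloc(sizeof(double) * stride * stride);
    if (!logS) return NULL;
    for (int n = 0; n <= nmax; n++)
        for (int m = 0; m <= nmax; m++)
            logS[n * stride + m] = -INFINITY;
    logS[0 * stride + 0] = 0.0;                       /* log S^0_0 = log 1 */
    for (int n = 0; n < nmax; n++) {
        for (int m = 1; m <= n + 1; m++) {
            double a = logS[n * stride + m - 1];      /* S^n_{m-1,d} term */
            double b = (m <= n && n - m * d > 0.0)    /* (n - m d) S^n_{m,d} term */
                         ? logS[n * stride + m] + log(n - m * d)
                         : -INFINITY;
            double hi = fmax(a, b);                   /* log-sum-exp of the two terms */
            logS[(n + 1) * stride + m] =
                (hi == -INFINITY) ? -INFINITY
                                  : hi + log(exp(a - hi) + exp(b - hi));
        }
    }
    return logS;
}
```

Building the table once, up to the largest count seen in the data, is what makes later lookups effectively O(1).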
Now to turn this into a Gibbs sampler, we have to isolate the terms for each $t_{x,k}$ individually. Let $y$ be the parent of $x$. The constraints are:

constraint | $t_{x,k} \le n_{x,k}$
constraint | $t_{x,k} \ge 1$ whenever $n_{x,k} \ge 1$, and $t_{x,k} = 0$ when $n_{x,k} = 0$
propagating counts | $n_{x,k} = \sum_{c \in C(x)} t_{c,k}$ when $x$ is not a leaf
propagating counts | $n_{y,k} = t_{x,k} + \sum_{c \in C(y),\, c \ne x} t_{c,k}$
The propagating constraints do not apply when $x$ is the root, and in this case, where $\vec{p}^{\,\top}$ is given, the parent $y$ does not exist. For $y$ the parent of $x$, the constraints can be rearranged to become:

when $n_{x,k} = 0$ | $t_{x,k} = 0$
when $n_{x,k} = 1$ | $t_{x,k} = 1$
when $n_{x,k} > 1$ | $\max\!\left(1,\; t_{y,k} - \sum_{c \in C(y),\, c \ne x} t_{c,k}\right) \;\le\; t_{x,k} \;\le\; n_{x,k}$
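A small helper capturing my reading of these bounds might look as follows; `t_parent` and `other` are hypothetical arguments standing for $t_{y,k}$ and $\sum_{c \ne x} t_{c,k}$:

```c
/* Allowed range for t_{x,k} under the constraints above (my reading of them).
 * n_x_k    : the count n_{x,k} at node x,
 * t_parent : the parent's table count t_{y,k},
 * other    : sum of t_{c,k} over the siblings c != x of x under y.     */
typedef struct { int lo, hi; } t_range;

t_range allowed_range(int n_x_k, int t_parent, int other)
{
    t_range r;
    if (n_x_k == 0) { r.lo = r.hi = 0; return r; }   /* forced to zero */
    if (n_x_k == 1) { r.lo = r.hi = 1; return r; }   /* forced to one  */
    r.lo = t_parent - other;      /* keep t_{y,k} <= n_{y,k} = t_{x,k} + other */
    if (r.lo < 1) r.lo = 1;
    r.hi = n_x_k;
    return r;
}
```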
The terms in Equation (2) involving $t_{x,k}$ when $x$ is not the root are:

$$(\theta_x | d_x)_{T_x}\;\; S^{\,n_{x,k}}_{t_{x,k},\,d_x}\;\; \frac{S^{\,n_{y,k}}_{t_{y,k},\,d_y}}{(\theta_y)_{N_y}} \qquad (3)$$

where note that $t_{x,k}$ appears inside $T_x$, $n_{y,k}$ and $N_y$. When $x$ is the root, this becomes

$$\left(p^\top_k\right)^{t_{\top,k}}\;\; (\theta_\top | d_\top)_{T_\top}\;\; S^{\,n_{\top,k}}_{t_{\top,k},\,d_\top}. \qquad (4)$$
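Assuming the reconstruction above, the unnormalised weight of a candidate value $t$ for $t_{x,k}$ can be accumulated in log space. In this sketch the helper names are invented: `log_stirling(n, m, d)` stands for a lookup of $\log S^{n}_{m,d}$ in a precomputed table such as the one sketched earlier, and the `_rest` arguments are totals with the current contribution of $t_{x,k}$ removed.

```c
#include <math.h>

/* log (theta | d)_T = sum_{i=0}^{T-1} log(theta + i*d)  */
double log_poch(double theta, double d, int T)
{
    double s = 0.0;
    for (int i = 0; i < T; i++) s += log(theta + i * d);
    return s;
}

/* log (theta)_N : the ordinary rising factorial, i.e. increment 1 */
double log_rising(double theta, int N) { return log_poch(theta, 1.0, N); }

/* Unnormalised log weight of the candidate t_{x,k} = t for a non-root x,
 * following Equation (3).  The *_rest arguments have t_{x,k} removed, so
 * T_x = T_x_rest + t, n_{y,k} = n_yk_rest + t and N_y = N_y_rest + t. */
double log_weight_nonroot(int t, int n_x_k,
                          double d_x, double theta_x, int T_x_rest,
                          double d_y, double theta_y,
                          int t_y_k, int n_yk_rest, int N_y_rest,
                          double (*log_stirling)(int n, int m, double d))
{
    return log_poch(theta_x, d_x, T_x_rest + t)       /* (theta_x | d_x)_{T_x}      */
         + log_stirling(n_x_k, t, d_x)                /* S^{n_{x,k}}_{t, d_x}       */
         + log_stirling(n_yk_rest + t, t_y_k, d_y)    /* S^{n_{y,k}}_{t_{y,k}, d_y} */
         - log_rising(theta_y, N_y_rest + t);         /* 1 / (theta_y)_{N_y}        */
}
```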
The Gibbs sampler for the hierarchical model becomes as follows:

Gibbs sampler
For each node $x$ do:
    for each outcome $k$ do:
        sample $t_{x,k}$ according to Equation (3) or (4), satisfying the constraints in the last table above
    endfor
endfor
As noted by Teh, this can result in a very large range for $t_{x,k}$ when $n_{x,k}$ is large. We overcome this by constraining $t_{x,k}$ to be within plus or minus 10 of its last value, an approximation good enough for large $n_{x,k}$.
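Putting the sweep and the plus-or-minus-10 window together, one Gibbs draw of a single $t_{x,k}$ might look like the following sketch (the function names and callback interface are mine, not taken from any released code):

```c
#include <math.h>

/* One Gibbs draw of a single t_{x,k}.  Candidates run over [lo, hi]
 * (the constrained range), further clamped to within 10 of the current
 * value t_cur, as in the approximation described above.  logw(t, ctx)
 * returns the unnormalised log weight of candidate t via Equation (3)
 * or (4); unif() returns a uniform(0,1) variate. */
int sample_t(int lo, int hi, int t_cur,
             double (*logw)(int t, void *ctx), void *ctx,
             double (*unif)(void))
{
    if (t_cur - 10 > lo) lo = t_cur - 10;
    if (t_cur + 10 < hi) hi = t_cur + 10;
    if (lo >= hi) return lo;

    double max = -INFINITY, total = 0.0;
    for (int t = lo; t <= hi; t++) {          /* find max for log-space safety */
        double w = logw(t, ctx);
        if (w > max) max = w;
    }
    for (int t = lo; t <= hi; t++)
        total += exp(logw(t, ctx) - max);

    double u = unif() * total;                /* inverse-CDF draw */
    for (int t = lo; t <= hi; t++) {
        u -= exp(logw(t, ctx) - max);
        if (u <= 0.0) return t;
    }
    return hi;                                /* numerical fallback */
}
```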
Moreover, we can initialise the $t_{x,k}$ to reasonable values by propagating up, using the expected value formula from Section 5.3 of the ArXiv paper. We only bother doing this when $n_{x,k} > 1$. The resultant initialisation procedure works in levels: estimate the $t_{x,k}$ at the leaves, propagate these to the $n_{x,k}$ at the next level, estimate the $t_{x,k}$ at this next level, propagate these up, etc.

Initialisation
for each level, starting at the bottom, do:
    for each node $x$ at the current level do:
        for each outcome $k$ do:
            if $n_{x,k} \le 1$ then
                $t_{x,k} = n_{x,k}$
            elseif $d_x > 0$ then    # the PYP case
                $t_{x,k} = \frac{\Gamma(\theta_x + d_x + n_{x,k})\,\Gamma(\theta_x + 1)}{d_x\, \Gamma(\theta_x + d_x)\, \Gamma(\theta_x + n_{x,k})} - \frac{\theta_x}{d_x}$
            else                     # the DP case
                $t_{x,k} = \theta_x \left(\psi(\theta_x + n_{x,k}) - \psi(\theta_x)\right)$    # $\psi(\cdot)$ is the digamma function
            endif
        endfor
    endfor
    for each node $y$ directly above the current level do:
        $n_{y,k} = \sum_{c \in C(y)} t_{c,k}$ for each outcome $k$
    endfor
endfor
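The two expected-value formulas are easy to compute with `lgamma`; standard C has no digamma, so a small approximation is included. This is a hedged sketch of the initialisation step for a single count; the rounding to an admissible integer is my own choice (the $t_{x,k}$ are integer table counts), not something stated in the post:

```c
#include <math.h>

/* Digamma via the shift recurrence plus an asymptotic series;
 * math.h provides lgamma but no digamma. */
static double digamma(double x)
{
    double r = 0.0;
    while (x < 6.0) { r -= 1.0 / x; x += 1.0; }
    double f = 1.0 / (x * x);
    return r + log(x) - 0.5 / x
         - f * (1.0 / 12.0 - f * (1.0 / 120.0 - f / 252.0));
}

/* Expected number of tables for n customers, used to initialise t_{x,k}
 * when n_{x,k} > 1:
 *   PYP (d > 0): Gamma(theta+d+n) Gamma(theta+1) / (d Gamma(theta+d) Gamma(theta+n)) - theta/d
 *   DP  (d = 0): theta * (psi(theta+n) - psi(theta))
 * Rounded to the nearest admissible integer. */
static int expected_tables(int n, double d, double theta)
{
    double e;
    if (n <= 1) return n;
    if (d > 0.0)
        e = exp(lgamma(theta + d + n) + lgamma(theta + 1.0)
              - lgamma(theta + d) - lgamma(theta + n)) / d
          - theta / d;
    else
        e = theta * (digamma(theta + n) - digamma(theta));
    int t = (int)(e + 0.5);
    if (t < 1) t = 1;                /* keep t_{x,k} within its constraints */
    if (t > n) t = n;
    return t;
}
```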
This represents a moderately efficient sampler for the hierarchical model. To make this work efficiently you need to:
- set up the data structures so the various counts, totals and parent nodes can be traversed and accessed quickly;
- set up the sampling so leaf nodes and parents where $t_{x,k}$ is guaranteed to be 0 or 1 can be efficiently ignored;
- but, more importantly, we also need to sample the discounts and concentrations.
Note that sampling the discounts and concentrations is the most expensive part of the operation, because each time you change them the counts also need to be resampled. When the parameters are fixed, convergence is fast. We will look at this final sampling task next.