Visualising a topic model

March 25, 2016

I finally decided to write a proper visualiser for topic models. I used the WordCloud Python tool from AMueller (on GitHub), modified because I needed to input words with precomputed scores rather than raw text. Moreover, I wanted two dimensions for the words displayed: size (word frequency in the topic) and lightness (the degree to which the word is characterised by the topic, measured as frequency over document frequency). I also scale the final tag cloud according to the size of the topic in the corpus. The correlation between topics is computed from the document-topic proportions. All of these then go into GraphViz, where nodes are displayed as images, with a lot of careful weighting and organising of the number of topic correlations to display, edge weights, etc.
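
As a rough illustration of the two word dimensions and the topic correlation described above, here is a minimal Python sketch. It is not the modified WordCloud code: the function names, the lightness range of 0.2 to 0.8, the choice to draw more characteristic words darker, and the use of Pearson correlation are all assumptions made for the example.

```python
import math


def word_display_scores(topic_counts, doc_freq):
    """Map each word of one topic to a (size, lightness) pair.

    topic_counts: word -> count of the word within the topic
    doc_freq:     word -> number of corpus documents containing the word
    """
    total = sum(topic_counts.values())
    # Size: the word's relative frequency within the topic.
    size = {w: c / total for w, c in topic_counts.items()}
    # Characterisation: topic frequency over document frequency.
    ratio = {w: c / doc_freq[w] for w, c in topic_counts.items()}
    top = max(ratio.values())
    # Assumption: more characteristic words are drawn darker, so lightness
    # falls from 0.8 (least characteristic) towards 0.2 (most characteristic).
    return {w: (size[w], 0.8 - 0.6 * ratio[w] / top) for w in topic_counts}


def topic_correlation(doc_topic, a, b):
    """Pearson correlation between topics a and b.

    doc_topic is a list of rows, one per document, each row holding that
    document's topic proportions. Pearson is an assumption; the post only
    says the correlation comes from the document-topic proportions.
    """
    xs = [row[a] for row in doc_topic]
    ys = [row[b] for row in doc_topic]
    n = len(doc_topic)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    # A topic with constant proportions has no defined correlation; return 0.
    return cov / (sx * sy) if sx and sy else 0.0
```

With the real WordCloud package, sizes like these could be passed to its generate_from_frequencies method and the lightness applied through a custom colour function, though that wiring is left out here.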

Below are the results on ABC news articles from their website, 2003-2012, collected by Dr. Jinjing Li of NATSEM in Canberra. These images are about 4000 by 4000 pixels. You will not be able to view them unless you:

get on a big screen,
click on the image to enter image view mode,
then scroll down to bottom right, click on “View full size” to bring it up,
and then zoom around to view.

Banking in the Australian news (from ABC website 2003-2012).

Obesity in the Australian news (from ABC website 2003-2012).

To produce the banking one, I run the following commands with hca:

#  generate the topic model into result set B1
hca -Ang -v -K50 -C1000 -q2 bank B1
#  compute the diagnostics
hca -v -v -V -V -r0 -C0 bank B1
#  generate the image
topset2word.pl --dot "-Kfdp" --lang png  B1 BN1
