Archive for the ‘students’ Category


On the “world’s best tweet clusterer” and the hierarchical Pitman–Yor process

July 30, 2016

Kar Wai Lim has just been told that the university “confirmed the approval” of his PhD (though it hasn’t been “conferred” yet, so he’s not officially a Dr yet), and he spent the time post-submission pumping out journal and conference papers.  Ahhh, the unencumbered life of the fresh PhD!

This one:

“Nonparametric Bayesian topic modelling with the hierarchical Pitman–Yor processes”, Kar Wai Lim, Wray Buntine, Changyou Chen, Lan Du, International Journal of Approximate Reasoning 78 (2016) 172–191.

includes what I believe is the world’s best tweet clusterer.  It certainly blows away the state-of-the-art tweet pooling methods.  The main issue is that the current implementation only scales to a million or so tweets, not the 100 million or more expected in some communities.  Easily addressed with a bit of coding work.

We did this to demonstrate the rich, largely unexplored possibilities for semantic hierarchies using simple Gibbs sampling with Pitman–Yor processes.   Lan Du (Monash) started this branch of research.  I challenge anyone to do this particular model with variational algorithms 😉   The machine learning community in the last decade unfortunately got lost in the complexities of Chinese restaurant processes and stick-breaking representations, for which complex semantic hierarchies are, well, a bit of a headache!
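The two-parameter Chinese restaurant process underlying Pitman–Yor models is simple to simulate.  The sketch below is illustrative only — it shows the basic predictive seating rule, not the hierarchical model in the paper, and the function and parameter names are my own:

```python
import random

def pitman_yor_crp(n, discount=0.5, concentration=1.0, seed=0):
    """Seat n customers under a Pitman-Yor Chinese restaurant process.

    Predictive rule for customer i+1:
      P(existing table k) proportional to  n_k - discount
      P(new table)        proportional to  concentration + discount * (#tables)
    With discount = 0 this reduces to the ordinary Dirichlet-process CRP.
    """
    rng = random.Random(seed)
    counts = []       # number of customers at each table
    assignments = []  # table index chosen by each customer
    for _ in range(n):
        weights = [c - discount for c in counts]
        weights.append(concentration + discount * len(counts))
        r = rng.random() * sum(weights)
        for k, w in enumerate(weights):
            r -= w
            if r <= 0:
                break
        if k == len(counts):   # opened a new table
            counts.append(1)
        else:                  # joined an existing table
            counts[k] += 1
        assignments.append(k)
    return counts, assignments

counts, assignments = pitman_yor_crp(1000, discount=0.7)
```

With a larger discount, the table-size distribution develops the power-law tail that makes Pitman–Yor priors a good fit for word frequencies.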


Basic tutorial: Oldie but a goody …

November 7, 2015

A student reminded me of Gregor Heinrich’s excellent introduction to topic modelling, including a great introduction to the underlying foundations such as Dirichlet distributions and multinomials.  Great reading for all students!  See

  • G. Heinrich, Parameter estimation for text analysis, Technical report, Fraunhofer IGD, 15 September 2009 at his publication page.
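The Dirichlet–multinomial pair at the heart of that report is easy to play with.  A minimal sketch (my own helper names; the Dirichlet draw uses the standard normalised-Gamma construction):

```python
import random

def sample_dirichlet(alphas, rng):
    """Draw theta ~ Dirichlet(alphas) by normalising independent Gamma draws."""
    gammas = [rng.gammavariate(a, 1.0) for a in alphas]
    total = sum(gammas)
    return [g / total for g in gammas]

def sample_multinomial(n, probs, rng):
    """Draw category counts for n trials with the given probabilities."""
    counts = [0] * len(probs)
    for _ in range(n):
        r = rng.random()
        for k, p in enumerate(probs):
            r -= p
            if r <= 0:
                break
        counts[k] += 1
    return counts

rng = random.Random(1)
theta = sample_dirichlet([0.1, 0.1, 0.1], rng)   # small alphas give sparse draws
counts = sample_multinomial(50, theta, rng)       # e.g. word counts in a document
```

Small concentration parameters (alphas well below 1) give sparse probability vectors, which is exactly why Dirichlet priors suit topic models.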

Data Science Resources

October 26, 2015

For my main job, I am Director of the Master of Data Science.  This is a fast-paced field that is just as much industry as academia, and a lot of the really exciting stuff is applications.  To keep up you need to monitor the media.  There are too many resources to name or list them all, or to attempt any kind of thorough tracking.  I recommend, however, that students install a news aggregator on their tablet/smartphone/laptop and subscribe to some of the better and more relevant RSS feeds to keep track.
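If you are curious what an aggregator actually consumes, an RSS feed is just XML, and pulling the headlines out takes a few lines.  A toy sketch with the standard library (a real aggregator, or a library like feedparser, handles the many feed dialects properly):

```python
import xml.etree.ElementTree as ET

def feed_titles(rss_xml):
    """Extract the item titles from an RSS 2.0 feed document."""
    root = ET.fromstring(rss_xml)
    return [item.findtext("title") for item in root.iter("item")]

# A tiny hand-made feed, standing in for a real download:
sample = """<rss version="2.0"><channel><title>Example blog</title>
<item><title>Post one</title></item>
<item><title>Post two</title></item>
</channel></rss>"""

print(feed_titles(sample))  # -> ['Post one', 'Post two']
```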

All the big business and technology magazines have relevant sections on Data Science or Big Data:  Forbes, Harvard Business Review, O’Reilly, ZDNet, MIT Sloan Management Review, Information Week, Wired, InfoWorld, TechCrunch (big data) and TechCrunch (data science), … Each of these has a particular perspective, which is useful in understanding their contributions.  For instance, TechCrunch is a technology startup magazine whereas Forbes targets Fortune 500 companies.  The articles in this class of magazines are usually of good quality, although there is sometimes “commissioned” journalism or press releases for marketing.

Many technology blogs focus on Data Science.  The most popular, KDNuggets, has been in the business for almost two decades.  Many of these have email and RSS subscription services and Twitter feeds.  Some have a low signal-to-noise ratio, so it is easy to get drowned in content.  See also Quora’s “What are the best blogs for data scientists to read?” for more discussion.

There are two weekly newsletters that you should sign up to for great content in your email. The Data Science Weekly Newsletter has more of a technology orientation with, for instance, some popular machine learning content.  The O’Reilly Data Newsletter is more about industry and is essential reading for anyone who wants to remain current.

Most of the blogs are also coupled with curated information sources.  Other sites with curated information are Resources to Learn Data Science Online and the Big Data and Applications Knowledge Repository.  The second of these also has a good list of conferences.

A related category is the question-answering sites: Quora has Data Science and Big Data channels, though many other discussions are useful too.  There are also sites that record infographics; try queries for “data science” and “big data”.  These are seductive, and some are certainly informative.  Some notables here that go way beyond infographics are cheat sheets: the Machine Learning Cheat Sheet and the Probability Cheat Sheet.  These are handy academic references, and also a nice way to find out what you do not know.

Many sites give collections of data sets; perhaps the most notable here are: Awesome Public Datasets, Google’s public data directory, large data sets, …  The Internet Archive is a long-running source of free digital content (books, etc.).  There are many, many more such sites, especially as governments now support open data.

Finally, most terms and concepts are well explained in the Wikipedia, often with good diagrams and related discussion.  As one delves into the more esoteric aspects of statistics or computer science, the quality of Wikipedia’s entries drops off.  Wikipedia’s definition of Data Science, for instance, as “a continuation of the field data mining and predictive analytics” would be hotly contested by some, but others would find the distinctions not that important.

WikiBooks has now produced Data Science: An Introduction, which I haven’t looked at properly yet but the outline seems OK.  I am skeptical of such efforts because the typical academic author has a focused speciality and a list of axes to grind … not me of course, oh no, not me 😉


Some diversions

July 4, 2015

Quora’s answers on What are good ways to insult a Bayesian statistician?   Excellent.  I’m insulted 😉

Great Dice Data: How Tech Skills Connect.  “Machine learning” is there (case sensitive) under the Data Science cluster in light green.

Data scientist payscales: “machine learning” raises your expected salary but “MS SQL server” lowers it!

Cheat sheets: the Machine Learning Cheat Sheet and the Probability Cheat Sheet.   Very handy!


MLSS 2015 Sydney tutorial

February 23, 2015

This Sydney 2015 MLSS summer school is organised by Edwin Bonilla and held in Sydney Feb 16–25.  My tutorial is titled “Models for Probability/Discrete Vectors with Bayesian Non-parametric Methods.”  The final version of my slides is here in PDF.


Information for candidate research students

November 2, 2014

I wrote a page for candidate research students here.  Always happy to hear from you folks.


Some favourite tutorials

September 10, 2014

First, if you are starting out, you need to see How to do good research, get it published in SIGKDD and get it cited!  This is an amazing tutorial from Eamonn Keogh back in 2009, but nothing changes, right?   Lots of gems in there.

Here are some tutorials from others that I highly recommend, from the fabulous Video Lectures website.   The titles are pretty good descriptors of the content.  Ideally, this is what students need to know for research in topic models and related material: Bayesian probability, graphical models, MCMC, etc.

Also, if you’re into graphical models, the best set of lectures on theoretical material I know of is from Prof. Stephen Lauritzen (now retired), Graphical Models and Inference, a course presented at Oxford’s Department of Statistics some time ago.  He is the most outstanding researcher in this field.


Wikipedia on probability theory

July 11, 2013

I created a PDF map of probability theory, the stuff that matters for Bayesian analysis, using the concepts available from the Wikipedia, with clickable links to the actual pages. Open it and view at 600% to read the writing!  It should fit on an A2 page.

In some cases, critical stuff is missing, so I’ve just left an open box with a title there.  It’s broken into areas, and you follow the arcs backwards to get prerequisites.  Generally the Wikipedia material is pretty good!  Coverage of some areas is poor, though (graphical models with plates, Pitman–Yor processes, etc.).

So you can get a pretty good education from the Wikipedia.  All that’s missing is exercises.  Note that Wikipedia also has the concept of books, so for undergraduate statistics coverage you can see the Wikipedia Book on Statistics, but I like to see a map of relationships.

The original is in DOT, and I generate a big PDF file with clickable Wikipedia icons.    Originally I output it to SVG, but the viewers would not scale up enough to the size of the page, so I used PDF instead.
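The clickable-link trick is just Graphviz’s `URL` node attribute, which survives into PDF output.  A minimal sketch of generating such a DOT file programmatically (the node names and edge here are illustrative, not my actual map):

```python
def wiki_node(name, article):
    """A DOT node whose PDF rendering links to the given Wikipedia article."""
    url = "https://en.wikipedia.org/wiki/" + article
    return f'  "{name}" [shape=box, URL="{url}"];'

lines = ["digraph prob_map {"]
lines.append(wiki_node("Dirichlet distribution", "Dirichlet_distribution"))
lines.append(wiki_node("Multinomial distribution", "Multinomial_distribution"))
# Follow arcs backwards to find prerequisites:
lines.append('  "Dirichlet distribution" -> "Multinomial distribution";')
lines.append("}")
dot = "\n".join(lines)

# Render with Graphviz; URL attributes become clickable links in the PDF:
#   dot -Tpdf map.dot -o map.pdf
```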

Any suggestions?