Archive for the ‘data science’ Category


COVID-19 Missives

July 21, 2020

Various governmental and academic organisations across the world have been collecting data, building models, reporting statistics and so forth.

Largely, these are examples of how not to do data science. I’m not going to give specific examples for specific countries, but here are some general ones:

  • changing the definition of what it means to “die from COVID-19” or to “have COVID-19”
  • reporting test results without clear analysis of specificity and sensitivity of the tests and what that means in terms of errors
  • cross-country comparisons when their statistics are collected very differently
  • lack of comparison of statistics with comparable prior years’ deaths from different causes
  • lack of analysis of changes in deaths from other causes
  • using so-called “models” to predict future numbers, where the models appear to have little transparency and little evaluation or validation
  • lack of analysis/interpretation for different effects:
    • flu numbers are reportedly dropping in many places concurrently with COVID-19 cases, but why?
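
To make the point about sensitivity and specificity concrete, here is a minimal sketch in Python. The numbers (95% sensitivity, 98% specificity, 1% prevalence) are made up for illustration and do not describe any real COVID-19 test; the function name is my own.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a diagnostic test using Bayes' rule."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    true_neg = specificity * (1.0 - prevalence)
    false_neg = (1.0 - sensitivity) * prevalence
    ppv = true_pos / (true_pos + false_pos)  # P(infected | positive result)
    npv = true_neg / (true_neg + false_neg)  # P(not infected | negative result)
    return ppv, npv


# Hypothetical values, for illustration only
ppv, npv = predictive_values(sensitivity=0.95, specificity=0.98, prevalence=0.01)
print(f"P(infected | positive)     = {ppv:.3f}")
print(f"P(not infected | negative) = {npv:.4f}")
```

With these made-up numbers roughly two in three positive results would be false positives, which is why raw test counts reported without this analysis can mislead.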

Of course, the standard problems with healthcare data just confound all this. Hospitals and healthcare are in a sense worst-case scenarios for a data scientist, but certainly our hard workers in these communities share little blame. It’s a systemic problem due to complications from the mixture of stakeholders, patient privacy, different generations of equipment and services and a chronically under-funded enterprise trying to operate while at the same time trying to upgrade itself.

An example of more useful data, given all the errors in a lot of the data specifically about COVID-19, is the EuroMOMO data showing deaths across Europe (see, for instance, their graphs and maps). The spikes of COVID-19 are very clear, especially in older age groups, and we can be very sure these substantial peaks are caused by COVID-19.
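
The idea of comparing deaths against comparable prior years can be sketched in a few lines of Python. This is only a toy illustration: the weekly counts are invented, the function name is my own, and EuroMOMO’s actual baseline modelling is more sophisticated than a plain z-score.

```python
from statistics import mean, stdev


def excess_z_scores(baseline_years, current_year):
    """z-score of each week's deaths against the same week in prior years."""
    scores = []
    for week, observed in enumerate(current_year):
        history = [year[week] for year in baseline_years]
        mu, sigma = mean(history), stdev(history)
        # Assumes sigma > 0, i.e. the prior years are not all identical
        scores.append((observed - mu) / sigma)
    return scores


# Hypothetical weekly death counts (not real data): 3 prior years, 4 weeks
baseline = [
    [1000, 1010, 990, 1005],
    [1020, 1000, 1010, 995],
    [980, 990, 1000, 1000],
]
current = [1005, 1200, 1450, 1100]

for week, z in enumerate(excess_z_scores(baseline, current), start=1):
    flag = "  <-- well above baseline" if z > 4 else ""
    print(f"week {week}: z = {z:+.1f}{flag}")
```

Because it only needs all-cause death counts, this kind of comparison sidesteps the shifting definitions of a “COVID-19 death”.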

The kinds of things I’d have loved to have done are multi-country or multi-state/region analyses to tease out the effects of various key indicators like co-morbidities, age, healthcare and lifestyles across the regions, but these are hard. We can all guess that places like New York and Wuhan get it bad due to crowding and pollution, and Somalis in Sweden get it bad due to their chronic vitamin D deficiency, but it’s hard to be sure given all the confounding variables. Another is to analyse longitudinal data of individual patients on the progression of disease and hospitalisation together with their socio-health data. In Australia, this is probably impossible to get other than for those who end up in ICUs.

Maybe, sometime in the future, we’ll be able to look back and do better quality modelling and better quality analysis. But it’s also important to understand that this is how it is in real data science: lots of problems making simple statistics dangerous! Sure, there are some silly government and NGO decisions behind this too, but some of these problems are systemic and not caused by poor decisions.


Interview on AI with Ray Ram Thanni

December 7, 2019

I did an interview on high-level issues with AI for the Monash Blockchain Alliance, way back on 24th April 2019.  It’s up on YouTube, finally.


Caitie on Einstein A Go Go

October 27, 2019

Caitie and Wray at 3RRR

Caitie Doogan, through her Twitter activities, got invited to talk on a science radio programme on 3RRR:  Einstein A Go Go – 27 October 2019 (our section starts at 25:15 in). Dr Krystal, Dr Ray & Dr Shane were the three scientists asking really relevant questions.  Caitie’s work is at the interface of applications and machine learning, so is far more accessible to the science public.  I came along for the ride, and discussed my work with Turning Point.


Turning Point and Monash at Google

July 31, 2019

My Turning Point colleague Sam Campbell and I went to the Google AI Impact Challenge Accelerator event in London on 29th-31st July (see the video here; I get brief shots at 0:14 and 0:20 and Sam gets an interview at 0:49).  Sam is the in-house data science person at Turning Point and a graduate of the Monash Master of Data Science programme.  I’ve been coaching him on modern text-mining with deep neural networks, and we’re starting to realise just how accurate the predictions can become.

Following our Launchpad event in San Francisco, the London event was an in-depth technology review to help us design our system.  We had a number of very clued-in Google experts supporting us in designing our architecture.  My systems and internet applications experience is over 20 years old, so I get the general ideas but don’t know the modern specifics!

AI Impact Challenge Accelerator Tech Sprint

All the teams at the London tech event, 31/07/19.

Lots of the AI Impact folks from Google attended and we all agreed the commitment and support from Google was fabulous.  We’re in the top right, with Sam holding the “Turning Point” sign, me below him.

We learnt a lot about designing AI systems, and I saw some great tutorials on machine learning.  For human-centred design, see PAIR: People and AI Research.  We spoke to contributors such as Di Dang and Roxanne Pinto.  Their guidebook is an absolute treasure trove.  The machine learning group told us about using the cloud and their AutoML, and they presented some fabulous tutorials on things like data wrangling.


AI suicide surveillance with Turning Point

June 23, 2019


Wray, Dan and Debbie at Google Launchpad, 15/05/2019

Along with Prof. Dan Lubman’s team at Turning Point, and funded through Google’s innovative AI for Social Good programme, I’ll be developing an AI system to accelerate “coding” (a form of content analysis) of ambulance records so we can understand the nature of suicide in Australia.   The local press has it so:

Google taps local addiction service to build AI suicide surveillance system

As the Google blurb says:

By using AI tools to analyze these records, Turning Point, a national center within Eastern Health, will uncover critical suicide trends and potential points of intervention to better inform policy and public health responses.

For us researchers, it means unifying a bunch of technologies that I’ve been working in for a while like active learning, multi-label classification, multi-task learning and crowd-sourcing.  But most importantly, all these need to be placed in the context of doing accurate and properly monitored coding while at the same time trying to minimise costly expert (human) effort.  This is important stuff for NGOs and health organisations so we’re really excited by the application opportunities this can give us all.
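
To give a flavour of how active learning and multi-label classification fit together to save expert effort, here is a minimal Python sketch of uncertainty sampling. It is not the Turning Point system: the record ids, the probabilities and both function names are hypothetical, and a real pipeline would get the probabilities from a trained classifier.

```python
def uncertainty(label_probs):
    """Mean closeness to 0.5 across labels: 1.0 = maximally unsure, 0.0 = certain."""
    return sum(1.0 - 2.0 * abs(p - 0.5) for p in label_probs) / len(label_probs)


def select_for_expert(predictions, budget):
    """Pick the `budget` record ids whose predicted labels are least certain."""
    ranked = sorted(
        predictions,
        key=lambda record_id: uncertainty(predictions[record_id]),
        reverse=True,
    )
    return ranked[:budget]


# Hypothetical model outputs: record id -> probability for each of three labels
predictions = {
    "rec-001": [0.98, 0.02, 0.01],
    "rec-002": [0.55, 0.48, 0.10],
    "rec-003": [0.90, 0.60, 0.05],
    "rec-004": [0.50, 0.50, 0.50],
}
print(select_for_expert(predictions, budget=2))
```

The records the model is least sure about go to the human coders first, so each hour of expert time should improve the classifier as much as possible.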

In mid May, Dan Lubman, Debbie Scott and I flew off to Google’s Launchpad Space in San Francisco to spend a week with other members of the programme for a bootcamp, to brainstorm about our project and get coaching from Google’s experts.  Google has a lot of other plans for us too, in terms of supporting the development, which we are very grateful for!  The Google advisors were fabulous.  They have all sorts of practical tech and AI knowledge, and a real diversity of start-up experts.

Dan, Debbie and I left the event with a whole different perspective on what we could do and what we should be doing initially.  We then set about the process of making change within Turning Point.  Turning Point is a sophisticated organisation working on mental health.  Yes, they’d integrated all sorts of computer systems and interfaces in their daily work, in fact a key reason behind their success to date, but we came away from the Google event realising just how much more we could do with AI.


Fabulous data science tag cloud

June 2, 2018

This comes from PhD student Caitlin Doogan.

Tag Cloud on Data by Caitlin Doogan


Graduating MDS students

May 25, 2018

Our first larger batch of MDS students graduating.   Here are some who attended the ceremony.  Really great students!


MDS Graduation May 2018


The Big Tech Healthcare Invasion

April 25, 2018


The Big Tech Healthcare Invasion Infographic


Facebook and Data Science

April 6, 2018

My favorite topics in teaching, other than Bayesian statistics (“of course”), are about interesting applications, ethics and impact on society.  One of the things I always do is point out that many of the big technology companies are fundamentally “data” companies selling their consumer data to advertisers.  Lots of gnarly ethical issues here.  But the huge sleeper issue in all this is medical informatics, where medical research really needs consumer lifestyle data if it wants to make major breakthroughs in the lifestyle diseases that are gradually strangling the Western economies.  Even gathering lifestyle data is difficult (think diet, for instance), let alone dealing with the ethics and privacy involved.

Anyway, great piece by Jennifer Duke “You’re worth $2.54 to Facebook: Care to pay more?” in the Australian press today (SMH, The Age).  We had an insightful 15 minute discussion on the phone on Friday and I managed to get a worthy quote in her article.  Impressed by her broad knowledge of the topics.  Good to see our journalists know their stuff.

Reuters has an extensive piece outlining details and the election influence, Cambridge Analytica CEO claims influence on U.S. election, Facebook questioned.

The Conversation seems to be surfing the media hub-bub with a dozen or more articles from the academic community in the last week or so.  Here are some that caught my eye.

Some other background articles are:

  • An older article in the Huffington Post, Didn’t Read Facebook’s Fine Print? Here’s Exactly What It Says, comments on an older terms of service, but a lot still applies.

  • The Australian Government has fairly strong privacy laws under the Privacy Act and its amendments.  This is described at Guide to securing personal information, which has a broad definition of personal information that probably covers most of what Facebook keeps.  There is, though, a special class of information, sensitive information, which includes medical and financial details, silent phone numbers, etc., and requires a higher level of protection.
  • In 2013 Cambridge researchers Kosinski, Stillwell and Graepel published an article in PNAS (Proc. National Academy of Sciences of the USA) showing Private traits and attributes are predictable from digital records of human behavior.  If that sounds too technical, the short version is:
    If you’re a frequent user, Facebook probably knows your religion, sexual preferences and any serious diseases you might have, and major personality traits, even if you take great care not to expose them.
    Keep in mind this was the best known of a long series of research.  When this information is inferred (i.e., predicted using a statistical algorithm) it is called implicit information.
  • Note Facebook has been relaxing their privacy default settings over the years, The Evolution of Privacy on Facebook, according to Matt McKeon.  This makes their job of monetising their users easier.
  • Note it is not clear what the Privacy Act says about implicit information.  Note implicit information can be very hard to extract, and requires access to a fuller database to make inferences.

Personally, I believe online data privacy will evolve in fits and starts, but there are a lot of technical hurdles.  Online advertising, for instance, needs to turn around impressions at great speed and doesn’t have time to work through complex APIs, so I suspect they will need the personal data in some form on their own servers.  Sounds like a perfect application for cryptosystems to me, if it can be made fast enough.  As for data harvesting, well, I expect that will go on forever.





Trying out DataCamp this semester

February 21, 2018

Our Master of Data Science students explore and discuss a lot of things.  I got a lot of requests to include the excellent material from DataCamp:

DataCamp – who support data science education for free

So we’ll see how it goes.  Not sure how well I’ll get to integrate it, because this semester I’m working more on our introductory statistics class.