Archive for July, 2020


COVID-19 Missives

July 21, 2020

Various governmental and academic organisations across the world have been collecting data, building models, reporting statistics and so forth.

Largely, these are examples of how not to do data science. I’m not going to give specific examples for specific countries, but here are some general ones:

  • changing the definition of what it means to “die from COVID-19” or to “have COVID-19”
  • reporting test results without clear analysis of specificity and sensitivity of the tests and what that means in terms of errors
  • cross-country comparisons when their statistics are collected very differently
  • lack of comparison of statistics with comparable prior years’ deaths from different causes
  • lack of analysis of changes of death from other causes
  • using so-called “models” to predict future numbers, where the models appear to have little transparency, little evaluation or validation
  • lack of analysis/interpretation for different effects:
    • flu numbers are reportedly dropping in many places concurrently with COVID-19 cases, but why?
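
To make the point about test errors concrete, here is a minimal sketch of how sensitivity and specificity turn into false positives. The 95% sensitivity and 98% specificity are made-up figures for illustration, not those of any real test, and the function name is my own.

```python
def positive_predictive_value(sensitivity, specificity, prevalence):
    """Probability that a positive result is a true positive (Bayes' rule)."""
    true_pos = sensitivity * prevalence
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Hypothetical test characteristics, chosen only for illustration.
SENSITIVITY = 0.95
SPECIFICITY = 0.98

for prevalence in (0.001, 0.01, 0.1):
    ppv = positive_predictive_value(SENSITIVITY, SPECIFICITY, prevalence)
    print(f"prevalence {prevalence:.1%}: share of positives that are real = {ppv:.1%}")
```

With these assumed numbers, when only 1% of those tested are infected, roughly two in three positive results are false, which is why raw positive counts mean little without this analysis.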

Of course, the standard problems with health care data just confound all this. Hospitals and healthcare are in a sense worst-case scenarios for a data scientist, but certainly our hard workers in these communities share little blame. It’s a systemic problem due to complications from the mixture of stakeholders, patient privacy, different generations of equipment and services, and a chronically under-funded enterprise trying to operate while at the same time trying to upgrade itself.

A more useful source, given all the errors in much of the data specifically about COVID-19, is the EuroMOMO data showing deaths across Europe, for instance, their graphs and maps. The spikes of COVID-19 are very clear, especially in older age groups, and we can be very sure these substantial peaks are caused by COVID-19.
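
The idea behind this kind of all-cause comparison can be sketched very crudely: compare a week's deaths with the same week in earlier years. This is far simpler than EuroMOMO's own modelling, and the counts below are invented, not real data; the function name is my own.

```python
from statistics import mean, stdev


def excess_deaths(prior_years, current):
    """Compare a weekly death count with the same week in prior years.

    Returns (excess, z_score), using the prior-year mean as the baseline
    and the sample standard deviation as the measure of normal variation.
    """
    baseline = mean(prior_years)
    spread = stdev(prior_years)
    excess = current - baseline
    return excess, excess / spread


# Invented counts for one calendar week in one region (not real data).
prior = [1000, 1040, 980, 1020, 960]
excess, z = excess_deaths(prior, 1500)
print(f"excess deaths: {excess:.0f}, z-score: {z:.1f}")
```

A peak many standard deviations above the baseline is hard to explain away by changes in definitions or testing, which is what makes all-cause death counts attractive.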

The kinds of things I’d have loved to do are multi-country or multi-state/region analyses to tease out the effects of various key indicators like co-morbidities, age, healthcare, and lifestyles across the regions, but these are hard. We can all guess that places like New York and Wuhan get it bad due to crowding and pollution, and Somalis in Sweden get it bad due to their chronic vitamin D deficiency, but it’s hard to be sure given all the confounding variables. Another is to analyse longitudinal data of individual patients on the progression of disease and hospitalisation together with their socio-health data. In Australia, this is probably impossible to get other than for those who end up in ICUs.

Maybe sometime in the future we’ll be able to look back and do better quality modelling and better quality analysis. But it’s also important to understand that this is how it is in real data science: lots of problems making simple statistics dangerous! Sure, there are some silly government and NGO decisions behind this too, but some of these problems are systemic and not caused by poor decisions.