
Whither the Scientific Method

January 28, 2018

Long before the Industrial Age in Europe we had the Dark Ages. Popular culture tells us it was believed that the Earth was flat, witches caused the plague, and the ways of the world were decreed by kings, or God himself. While rationalist explanations of the world appeared independently in many ancient civilisations, the scientific method as we know it became prominent in the 19th century, as a remarkable series of scientific and engineering discoveries propelled the world into the industrial age. Indeed, Karl Pearson stated that “the scientific method is the sole gateway to the whole region of knowledge.”

With the pre-eminence of science in our modern society, controversies about science often occur in the media and public discussion, and the list of such areas is long. It doesn’t help that aspects of society, politics or religion have been falsely dressed up as “science,” so-called scientism. The expression “the science is settled” is a phrase from global warming skeptics that seeks to align global warming views with scientism (i.e., science is never settled, so how can global warming be settled?). Note that we can also view the statement “the science is settled” as a Socratic noble lie, thereby justifying its use in public discussion.

So apart from false applications of science, i.e., scientism, what flaws are there in the scientific method itself?

Flaws in the Science?

Medical science has suffered bad press in recent times. Testaments from famous and authoritative medical researchers about the flaws of published medical research abound, best known through John Ioannidis’s popular paper, provocatively titled “Why Most Published Research Findings Are False”. As an empirical computer scientist, I can assure you that flaws in research are not restricted to medical science; it’s just that medical science is perhaps our most societally important area of science.

Some of the discussed flaws in research concern the misuse of p-values through a variety of means. For an entertaining example, see John Bohannon’s “I Fooled Millions Into Thinking Chocolate Helps Weight Loss.” Other flaws include reliance on so-called surrogate endpoints (a biomarker such as a blood test is used as a substitute for a clinical endpoint such as a heart attack), and poorly matched motivations: for academics the pressure is “publish or perish”, but for industry it is “publish and profit”. Many lists of such flaws have been published.
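To see how little it takes to manufacture a “significant” finding, here is a minimal simulation sketch in Python. The 18 outcome measures and the group size are illustrative assumptions, loosely inspired by the Bohannon sting rather than taken from it: even when no effect exists at all, measuring enough outcomes on small groups will usually turn up a small p-value somewhere.

```python
# Minimal sketch: testing many outcome measures on small groups usually
# yields at least one "significant" p-value even when no effect exists.
# The 18 outcomes and group size of 15 are illustrative assumptions,
# loosely inspired by the Bohannon sting, not taken from it.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_outcomes = 18    # weight, cholesterol, sleep quality, ...
n_per_group = 15   # tiny groups, as in many underpowered studies

p_values = []
for _ in range(n_outcomes):
    treatment = rng.normal(0.0, 1.0, n_per_group)  # no real effect
    control = rng.normal(0.0, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    p_values.append(p)

print(f"smallest p-value across {n_outcomes} outcomes: {min(p_values):.3f}")
# Any "hit" here reflects multiple testing, not chocolate.
```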

In all, however, the scientific method holds up as a valid approach, because the flaws invariably amount to corruptions of the original method. One way the medical community addresses this is by adding an additional layer on top of the standard scientific method, often called the systematic review. This is where unbiased experts review a series of scientific studies on a particular question, make judgments about the quality of the scientific method and the evidence, and develop recommendations for healthcare. The systematic review is, if you like, quality control for the scientific method.

The End of Theory?

Another seeming assault on the scientific method comes from data science. In 2008 Chris Anderson of Wired published a controversial blog post about “The End of Theory”. The idea is that the deluge of data completely changes how we should progress with scientific discovery: we don’t need theory, he claims, we just extract information from the deluge of data. The responses, and there are many, came quickly. For instance, Massimo Pigliucci asked, “But, if we stop looking for models and hypotheses, are we still really doing science?”, and others questioned the veracity and appropriateness of much observational data, and hence its suitability as a subject of analysis.

Anderson’s “end of theory,” like John Horgan’s “end of science”, is not so much wrong as more complex than it first seems. The relationship between data science and the scientific method is not simple. To understand this, consider that the poster child for 19th century science was physics. Physics, a mathematical science, is fundamentally different from, say, modern medicine. In physics, Eugene Wigner’s notion of the “unreasonable effectiveness of mathematics” holds sway: from a concise theory we can derive enormous consequences. A relatively small number of well-chosen scientific hypotheses have uncovered vast regions of the engineering and physics universe. For instance, weather predictions are currently based on simulations built using the Newtonian laws of physics coupled with geophysical and weather data.

This imbalance (a small number of scientific hypotheses needed to justify a large area of science) suits the scientific method well. Peter Norvig, however, points out this is not feasible in areas such as biology and medical science, where the unreasonable effectiveness of mathematics does not hold. In these areas, the complexity of the underlying processes means we cannot necessarily simulate the impact of eating raw cocoa or drinking red wine on heart health, because the simulations or derivations from fundamental properties of nature are just too complex.

Norvig’s colleagues at Google, some of the founders of data science, instead refer to the unreasonable effectiveness of data. That is, the fundamental complexity of some sciences means we should instead be using data-driven processes for the discovery of scientific details.

Data Dredging

To understand how data science can change the scientific method, we need to look at how it should not change it. Statisticians like to talk derisively about data dredging, with p-hacking being the best-known example. As in the chocolate study mentioned above, this is where studies are repeated (in some way) until a significant p-value is obtained. They argue that data-driven discovery is dangerous. But this is the wrong viewpoint for data science: in complex areas like medical science or biology, we have many possible hypotheses, and our intuitions can be poor in these complex worlds.
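The arithmetic behind the statisticians’ concern is standard: if each of k independent tests has a false-positive rate of α, then the chance of at least one spurious “discovery” is

```latex
P(\text{at least one false positive}) \;=\; 1 - (1 - \alpha)^{k},
\qquad \text{e.g. } 1 - 0.95^{20} \approx 0.64 ,
```

so dredging through even twenty hypotheses without correction makes a false finding more likely than not.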

Computer science has an elegant theory of complexity called NP-completeness, which captures the notion that one may need to test an exponential number of candidates before finding one that works. A closer notion, though, is sample complexity: the number of training cases we need to learn a concept. In a sequence of experiments to uncover the causal structure of a domain, you are choosing the experiments sequentially, so closer still is the sample complexity of active learning. While some theory exists here, and the numbers are no longer exponential, the results are not great. A scientist could expect to be running an awful lot of experiments! So this is the situation we find ourselves in with hypothesis testing in the broader scientific world, where the unreasonable effectiveness of mathematics fails.
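For a rough sense of scale, the textbook PAC bound for a finite hypothesis class (quoted here only for orientation; it is not specific to active learning) says that to learn a concept to within error ε, with confidence 1 − δ, you need on the order of

```latex
m \;\ge\; \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)
```

examples, where |H| is the number of candidate hypotheses. The sample size grows only logarithmically in |H|, which sounds comforting, but when the candidate space is exponentially large that logarithm is itself proportional to the size of the problem, before any of the extra costs of sequential experimentation are counted.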

In the early days of machine learning I worked at Prof. Ross Quinlan’s lab in Sydney. We soon discovered our own version of Ioannidis’s flaws in medical science that applied to machine learning. We called it theory overfitting, in contrast to regular overfitting, which is an artifact of the bias-variance dilemma in statistics and machine learning. People tested a bunch of different theories on a small number of data sets, say five, and eventually found one that worked, so they wrote it up and published it. In truth it’s just another variant of p-hacking.
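A minimal sketch of the effect (all numbers are illustrative assumptions, not data from our lab): score many candidate “theories” that are in fact identical in quality on a handful of data sets, and the best of them will still look like a genuine advance.

```python
# Sketch of "theory overfitting": many equally good (i.e. equally
# mediocre) methods are scored on a handful of data sets, and the
# best-looking one gets written up. All numbers are illustrative.
import numpy as np

rng = np.random.default_rng(42)
n_methods = 50    # candidate "theories" / algorithm variants
n_datasets = 5    # the small benchmark suite

# Every method has the same true accuracy (0.75); observed scores
# differ only through evaluation noise on small test sets.
observed = rng.normal(loc=0.75, scale=0.03, size=(n_methods, n_datasets))
mean_scores = observed.mean(axis=1)

print("true accuracy of every method: 0.750")
print(f"best observed mean accuracy:   {mean_scores.max():.3f}")
# The winner looks like a real improvement, but the gap is pure
# selection over noise -- the machine-learning analogue of p-hacking.
```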

In data science, if we’re applying machine learning or neural network algorithms to a body of data, we are invariably trying to solve an NP-complete problem and are thus subject to overfitting or p-hacking. Even if we employ careful statistical methods to try to overcome this, we may subsequently be doing theory overfitting. However, if we don’t employ machine learning methods, we may never uncover reasonable hypotheses in the large pool of candidates. This is the conundrum of data science for the scientific method when used in broader non-mathematical domains.

Powering the Scientific Method

Organisations and hard-nosed businesses have this conundrum effectively solved. At Kaggle, for instance, and in TREC competitions, a test data set is always hidden from the machine learners and only used for a final validation, which acts like a (final) cycle of the scientific method. The initial “develop a general theory” step of the scientific method has been done with machine learning, which can be considered to be millions of embedded hypothesise-and-test cycles. Thus we have an epicycle view of the scientific method.
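A minimal sketch of this discipline in Python, using scikit-learn (the data set, model, and split sizes are illustrative assumptions, not anything Kaggle or TREC prescribes): explore hypotheses as freely as you like on the development data, but evaluate exactly once on a set the search never touched.

```python
# Sketch of the "hidden test set" discipline: explore hypotheses freely
# on the development data, then run one final validation on data that
# was never used during model selection.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier

X, y = load_breast_cancer(return_X_y=True)

# The hidden set plays the role of Kaggle's private leaderboard:
# it is set aside before any modelling starts.
X_dev, X_hidden, y_dev, y_hidden = train_test_split(
    X, y, test_size=0.25, random_state=0)

# Millions of hypothesise-and-test epicycles in miniature:
# cross-validated search over model variants on the development data only.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 200], "max_depth": [3, None]},
    cv=5)
search.fit(X_dev, y_dev)

# One final cycle of the scientific method: a single evaluation
# on data the search never saw.
print("development (cross-validated) score:", round(search.best_score_, 3))
print("final validation on hidden data:    ",
      round(search.score(X_hidden, y_hidden), 3))
```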

But applying this approach in the medical world is not straightforward. The medical research world keeps data registries that could feasibly be used to obtain data for discovery purposes. However, to obtain data one usually has to apply for ethics approval, demonstrating that the intended use of the data is sound. The ethics committees who oversee approval are the gatekeepers of data, and oftentimes they expect to see a valid scientific plan, not an open-ended discovery proposal. With the epicycle view of the scientific method, registries releasing data for discovery exercises would need to withhold some data for a final validation step, in order to preserve the validity of the scientific method.
