3.5. Sources of bias in computational biology

The wealth of Open resources in computational biology, from databases to predictive methods to ontologies, holds exciting possibilities and is a credit to the collaborative spirit of the field. It is still important, however, to look at these resources with a critical eye, in order to be aware of their limits.

3.5.1. Trusting the results of research

The imposing edifice of science provides a challenging view of what can be achieved by the accumulation of many small efforts in a steady objective and dedicated search for truth.

—Charles H. Townes

We all want to be able to trust the results of scientific research: not only our own, but everyone's, since science builds on itself and building on shaky ground wastes time and money. Moreover, scientific research is generally paid for by taxpayers, and the results it generates drive policy, drug treatments, and innovations. Everyone has a vested interest.

In all fields, science is a search for knowledge. And in all fields, there are concerns about what makes for bad, unreliable, useless, or biased research, and about what must be done (or not done) to uphold science's claim to truth, or at least to reliability.

In contrast to other fields, many bioinformatics datasets have been freely available and accessible on the internet since their inception; in this sense the field is far ahead of others. The issues which affect the reliability of science in general, however, are likely to be present in computational biology, too. This could have strong effects on the research that is reliant on these large ontologies and databases.

3.5.1.1. Science’s self correcting mechanism

Scientific results are often based on statistics, so it's inevitable that some proportion of published scientific results will not be true simply due to the sample on which the hypothesis was tested. The common wisdom is that this isn't a problem, as over time, researchers can double-check interesting scientific results, and the literature can be updated to reflect that. This is known as science's self-correcting mechanism. If a result can be replicated in a different circumstance by a different person, it reinforces the likelihood that the result is true. A replication doesn't have to reveal the exact same level of statistical significance or effect size to be successful, but (usually, depending on definitions) just a similar result.

3.5.1.2. What makes research trustworthy?

Table 3.2 Definitions relating to reproducibility, adapted from The Turing Way.

|                     | Data: same   | Data: different |
|---------------------|--------------|-----------------|
| Analysis: same      | Reproducible | Replicable      |
| Analysis: different | Robust       | Generalisable   |

There are different levels of trust that we might have in the results of research, ranging from a basic trust that the researchers didn't make any mistakes in their implementation (their code is doing what they thought it was) to a trust that the result will hold even in new contexts. Much of the academic discussion surrounding this hinges on the concept of reproducibility, for which there are many contrasting definitions. I like the definitions from The Turing Way[123], shown in Table 3.2. This table says, for example, that if you get the same result with the same data and the same analysis as the original research, then the result is reproducible; and if you get the same result when the data is the same but the analysis is different (e.g. a different implementation of the code, or a different specific analysis meant to measure the same thing), then the result is robust.

Although a generalisable result is the most desirable and interesting, as long as the research is reproducible, it can still positively contribute towards our joint scientific knowledge. This definition of reproducibility also requires that everything needed to run the experiment again is provided, including fine details of methods (in computational biology, this is often equivalent to code) and data. In the absence of this, science’s self-correcting mechanism is short-circuited.

3.5.2. The reproducibility crisis

In science consensus is irrelevant. What is relevant is reproducible results.

—Michael Crichton

The reproducibility crisis is the realisation that worryingly large proportions of research results do not replicate. Replication studies have found that only 11% of cancer research findings[124], 20-25% of drug-target findings[124,125], and 39% of psychology findings[126] could be reproduced. Surveys of researchers across disciplines reveal that more than 70% of scientists say they have failed to reproduce another paper’s result, and over 50% say they have failed to reproduce their own results[127]. It seems that science’s self-correcting mechanism is not working as intended.

3.5.3. Sources of irreproducibility and how to combat them

The surprising level of irreproducibility is thought to be explained by a range of factors[128], including poor data management, a lack of available materials and experimental details, publication bias, poor statistical knowledge, and questionable research practices such as HARKing and p-hacking. Although it is difficult to estimate, only a very small proportion of irreproducible research is thought to be due to fraudulent practices[129] (although these do still happen), and arguably much of it is explainable simply by our reliance on Null Hypothesis Significance Testing[130].

3.5.3.1. Null Hypothesis Significance Testing

To discuss some of these issues, we first have to understand how scientific hypotheses are usually tested and reported: Null Hypothesis Significance Testing (NHST). This reporting usually consists mostly of a p-value as a measure of statistical significance: the probability of obtaining a result at least this extreme by chance alone, assuming the null hypothesis is true. The threshold for significance, usually denoted by \(\alpha\), is most often set to 0.05, as recommended by Fisher; however, this is not necessarily the most sensible cut-off for science today[131,132], and different fields have differing cut-offs.

Statistical significance is the least interesting thing about the results. You should describe the results in terms of measures of magnitude –not just, does a treatment affect people, but how much does it affect them.

—Gene V. Glass

Despite the dominance of p-values as the main or only reported statistic across scientific fields, they do not imply that a result is interesting (the effect might be small or the hypothesis uninteresting), or even that it is likely to be true. Sometimes the p-value is not even reported, but only whether or not it crossed the p<0.05 threshold.
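
To make this concrete, here is a hedged sketch (using simulated data and the SciPy library, neither of which is taken from the cited sources) of reporting an exact p-value together with a measure of magnitude (Cohen's d), rather than only whether p crossed 0.05:

```python
# Hedged illustration with simulated data: report the exact p-value together
# with an effect size (Cohen's d), not just whether p crossed 0.05.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
treated = rng.normal(loc=0.5, scale=1.0, size=40)  # simulated "treatment" group
control = rng.normal(loc=0.0, scale=1.0, size=40)  # simulated "control" group

t_stat, p_value = stats.ttest_ind(treated, control)

# Cohen's d for equal group sizes: difference in means over the pooled standard deviation
pooled_sd = np.sqrt((treated.var(ddof=1) + control.var(ddof=1)) / 2)
cohens_d = (treated.mean() - control.mean()) / pooled_sd

print(f"t = {t_stat:.2f}, p = {p_value:.4f}, Cohen's d = {cohens_d:.2f}")
```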

3.5.3.2. P-hacking and HARKing


Fig. 3.5 Images illustrative of researchers' approaches to p-values and p-hacking. The left image is a popular tweet, while the right image is an xkcd comic.

The pressure on scientists to publish means that researchers may be tempted to (or may accidentally, due to statistical ignorance) employ data-mining tactics in order to harvest significant p-values. This practice is known as "p-hacking", and evidence for its existence can be found in the distributions of p-values in the scientific literature[134], as well as in popular culture (Fig. 3.5). It can include rerunning analyses with different models or covariates, collecting data until a significant p-value is reached, or performing twenty experiments and publishing only the one that reached significance, presenting its hypothesis as though it had been planned all along (HARKing: Hypothesising After the Results are Known). A small simulation of this "many experiments, one published result" scenario is sketched below.
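
As a hedged illustration (a simulation constructed for this section, not drawn from the cited studies), the sketch below runs twenty "experiments" on pure noise; with a threshold of 0.05, roughly one of them is expected to cross it by chance, which is exactly the kind of spurious finding that selective reporting turns into a false discovery.

```python
# Illustrative simulation of "perform 20 experiments, publish the significant one".
# All data are pure noise, so any "significant" result is a false positive.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
significant = []
for experiment in range(20):
    # Both groups are drawn from the same distribution: the null hypothesis is true.
    group_a = rng.normal(loc=0.0, scale=1.0, size=30)
    group_b = rng.normal(loc=0.0, scale=1.0, size=30)
    p_value = stats.ttest_ind(group_a, group_b).pvalue
    if p_value < 0.05:
        significant.append((experiment, round(p_value, 4)))

# With alpha = 0.05 we expect roughly one spurious "discovery" per 20 null experiments.
print(f"{len(significant)} of 20 null experiments crossed p < 0.05: {significant}")
```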

The first principle is that you must not fool yourself – and you are the easiest person to fool.

—Richard Feynman

There are several suggested tonics to the problem of uninformative and ubiquitous p-values. Reporting p-values as numbers (e.g. "p=0.0012") rather than in relation to a threshold (e.g. "p<0.05" or "the hypothesis was found to be highly significant") is a starting point. Information about statistical power and effect size should also be provided. In addition to giving readers a better idea of a paper's quality, this also allows science to self-correct a little more easily, since individual p-values can then be combined into more reliable p-values, using for example Fisher's method[135].
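
As a minimal sketch (the three p-values below are invented for illustration), exact p-values from independent studies of the same hypothesis can be combined with Fisher's method using SciPy:

```python
# Combine exactly-reported p-values from independent studies with Fisher's method.
from scipy.stats import combine_pvalues

p_values = [0.08, 0.04, 0.11]  # hypothetical p-values from three replications
statistic, combined_p = combine_pvalues(p_values, method="fisher")
print(f"Fisher's combined statistic = {statistic:.2f}, combined p = {combined_p:.4f}")
```

Note that this kind of combination is only possible when exact p-values are reported, which is part of the argument for reporting them.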

For cases where many hypotheses are being generated at once (for example in GWAS), multiple hypothesis corrections (e.g. the Bonferroni correction[136] or the False Discovery Rate[137]) can be employed to adjust the p-value to account for this.
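
A minimal sketch of both corrections, assuming the statsmodels library and a made-up set of p-values:

```python
# Adjust a set of p-values for multiple testing with the Bonferroni correction
# and with the Benjamini-Hochberg False Discovery Rate ("fdr_bh").
from statsmodels.stats.multitest import multipletests

raw_p = [0.001, 0.008, 0.020, 0.049, 0.300]  # made-up p-values from five tests

for method in ("bonferroni", "fdr_bh"):
    reject, adjusted_p, _, _ = multipletests(raw_p, alpha=0.05, method=method)
    print(method, [f"{p:.3f}" for p in adjusted_p], list(reject))
```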

3.5.3.3. Publication bias

Although with standard p-value and statistical power cut-offs, negative results are more likely to be true than positive ones[130], negative results are much harder to publish. This bias is likely to be responsible for the draw of questionable research practices like p-hacking. It also means that there are many unpublished negative results whose experiments are likely to be repeated, since there is no way for anyone to know that they have already been done. A highly powered negative result could be very interesting: for example, we know hardly anything about which genes do not appear to affect phenotypes, since these results are not published[90], yet they would help enormously with the challenge of creating a gold-standard data set for gene function prediction.

Publication bias is usually used to describe the bias against negative results, but there are other forms of publication bias which affect computational biology, for example the disparity in which genes are studied. Some genes are very famous, racking up thousands of publications, while others are entirely unstudied. Even looking only at human genes, there is a huge divide between the most and least studied genes. This means that many gene functions will be missing from resources such as Gene Ontology annotations for less well-studied genes.

Another example is the bias against replications. Repeats of studies are not commonly published (as they are not novel), which naturally discourages people from doing them, or at least from writing them up. This is true both for computational methodologies (where the code and computational environment needed to replicate the research are often not provided) and for experimental work, and it makes it difficult for science to self-correct in these areas.

3.5.3.4. Code and data availability

A lack of code and data availability, while not necessarily leading to wrong or untrustworthy results, also has a place in making research irreproducible since:

  1. Research cannot employ its self-correcting mechanism unless the experiment can be repeated.

  2. If the code or data is obscured, then many of the decisions behind them may also be obscured. Decisions we make in analysis[138,139] and even small details of implementation can affect the results of research, and whether we expect them to generalise to another similar context.

This is a problem even in computational fields: in computational biology, roughly a third of papers that use code still do not provide it at all[140]. Even then, providing the data or code somewhere is not in itself enough to overcome the problem of irreproducible methods: it must also be usable. The FAIR principles[141,142] provide a framework for ensuring this. In the context of software, this includes sharing your reproducible computational environment (for example, exact versions of the packages that you used).
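
As one hedged example of what sharing a computational environment can look like, the snippet below records the Python version and the exact versions of every installed package alongside an analysis; the output file name is arbitrary, and in practice tools such as pip freeze or conda environment exports serve the same purpose.

```python
# Write out the interpreter version and exact package versions used in an analysis,
# so they can be shared alongside the code and data. The file name is arbitrary.
import sys
from importlib.metadata import distributions

with open("environment_versions.txt", "w") as f:
    f.write(f"python {sys.version.split()[0]}\n")
    packages = sorted(distributions(), key=lambda d: (d.metadata["Name"] or "").lower())
    for dist in packages:
        f.write(f"{dist.metadata['Name']}=={dist.version}\n")
```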

While there is a growing movement towards transparent and available materials, and towards taking the time to create them (e.g. the slow science manifesto[143]), there is also friction against this due to research's deeply ingrained culture of "publish or perish". This incentivises doing the minimum possible to publish, and disincentivises spending time on quality control or providing useful metadata whenever it is possible to publish without them.