3.7. Summary

This chapter highlighted the importance of considering data provenance, data quality, and bias, while also celebrating the huge, comprehensive, collaborative data sets that characterise the field of computational biology. While in other fields data inaccessibility is a major barrier to reproducible research, this is the field that had an online database system that remote computers could access in the 1960s! All of this catalogued information, collected by researchers around the world, populates freely available databases and annotations, creating the controlled, shared vocabularies that fuel computational methodologies.

With such a treasure trove of data, from model organisms as well as humans, there is more opportunity than ever to use it to answer some of biology’s big questions, for example by making genome-wide phenotype predictions. Multi-omics approaches that combine data types have already been successful at elucidating mechanisms behind certain phenotypes[157,158,159]. Obtaining an accurate prediction of phenotype and protein function, even for a small class of variants, has the potential to greatly impact people, particularly if the prediction is explanatory, e.g. pointing to specific variants or protein domains as the cause of the phenotype.

Determining to what extent this data can currently be used for this purpose is the subject of the rest of this thesis.