3.7. Summary
This chapter described the pipelines that genotype and phenotype data pass through before it can be considered for protein function and phenotype prediction tasks.
This included an introduction to some of the database and software infrastructure that the fields of genomics, bioinformatics and computational biology are built on. While data inaccessibility is a major barrier to reproducible research in other fields, this is the field that had an online database system that remote computers could access in the 1960s! Huge quantities of catalogued information, collected by researchers around the world, populate freely available databases, vocabularies, and annotations, creating controlled and shared vocabularies that fuel computational methodologies. This chapter also briefly considered some of the potential sources of error and bias in these data, and attempts to overcome them.
With such a treasure trove of data, from model organisms as well as humans, there is more opportunity than ever to use this data to tackle some of biology’s big challenges, such as making genome-wide phenotype predictions. Multi-omics approaches that combine data types have already been successful at elucidating mechanisms behind certain phenotypes[162,163,164].
We should not forget that obtaining an accurate prediction of phenotype and protein function, even for a small class of variants, has the potential to greatly impact people, particularly if the prediction is explanatory, e.g. pointing to specific variants or protein domains as the cause of the phenotype. Determining to what extent this data can currently be used for this purpose is the subject of the rest of this thesis.