4.1. Introduction

The Snowflake algorithm is primarily a phenotype prediction method: it takes data about individuals' DNA as input and outputs predictions about which phenotypes each individual has. These predictions are based on how unusual an individual is for variants relating to each phenotype, and are made across a breadth of phenotypes and for missense variants across the protein-coding genome. Snowflake does this by combining existing predictions of variant deleteriousness from FATHMM[118] with associations between protein domains and phenotypes from DcGO[114], and then finding unusual combinations of these variants by clustering individuals against a diverse background cohort and looking for outliers. The phenotype prediction implicitly contains protein function predictions, due to the relationship between protein function and phenotype, and these are the key output of Snowflake. Focusing on protein domains enables predictions in proteins that have not been well studied, but restricts the number of predictions that Snowflake can make (since phenotypes can be caused by mutations that fall outside of domains). As a protein function predictor, Snowflake seeks rare combinations of SNPs which may influence a phenotype. In other words, Snowflake creates explanatory predictions: it looks for the mechanisms behind complex traits. Such complex traits are currently not well understood, but are thought to cause many human diseases.
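To make the idea of "how unusual an individual is" concrete, the following is a minimal sketch in Python and not the Snowflake implementation. It assumes that, for a single phenotype, each individual has already been reduced to a vector of per-variant scores (for example, a deleteriousness score weighted by allele count, for variants in domains associated with that phenotype). It also replaces the clustering step with a simpler stand-in: the distance of each individual from the cohort centroid, standardised across the cohort. The function name `phenotype_outlier_scores` and the input layout are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def phenotype_outlier_scores(cohort):
    """Return a standardised outlier score per individual for one phenotype.

    cohort: dict mapping an individual's id to a list of per-variant scores.
    All lists must have the same length. (Hypothetical input layout.)
    """
    ids = list(cohort)
    n_variants = len(cohort[ids[0]])

    # Centroid of the cohort: the mean score at each variant position.
    centroid = [mean(cohort[i][v] for i in ids) for v in range(n_variants)]

    # Euclidean distance of each individual from the centroid.
    distance = {
        i: sum((x - c) ** 2 for x, c in zip(cohort[i], centroid)) ** 0.5
        for i in ids
    }

    # Standardise the distances so that scores are comparable across phenotypes.
    mu = mean(distance.values())
    sigma = pstdev(distance.values())
    if sigma == 0:
        # Everyone is equally far from the centroid: nobody is an outlier.
        return {i: 0.0 for i in ids}
    return {i: (distance[i] - mu) / sigma for i in ids}


# Toy data (made up): four background individuals and one unusual query.
example_cohort = {
    "bg1": [0.0, 0.0, 0.0],
    "bg2": [0.1, 0.0, 0.0],
    "bg3": [0.0, 0.1, 0.0],
    "bg4": [0.0, 0.0, 0.1],
    "query": [0.9, 0.8, 0.9],
}
example_scores = phenotype_outlier_scores(example_cohort)
most_unusual = max(example_scores, key=example_scores.get)
```

In this toy cohort the individual labelled `query` receives the highest score, since its variant scores lie far from those of the background individuals. A centroid distance only captures the simplest kind of outlier; it would miss individuals who are unusual relative to a cohort that forms several distinct clusters, which is one reason a clustering-based approach is a natural choice.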

4.1.1. Motivation

In chapter 2, we discussed the theoretical mechanism by which phenotype arises from genotype. In summary: differences in DNA cause differences in cell functionality, which interact with the cell environment to create differences in overall phenotype.

As previously mentioned, many recognised phenotypes are medical disorders or their symptoms. Currently, to diagnose genetic illnesses, specific genes are often sequenced one at a time, since looking at whole genomes would be too time-consuming for clinical staff. Patients seeking diagnoses for rare genetic diseases describe the process as an “Odyssey”, with more than half undiagnosed at any given time. If whole-genome (or genotype) based phenotype prediction were possible, only one sample and one test would be needed to get a much fuller picture of a person’s health, and the long and tiring process of obtaining diagnoses for rare genetic diseases could be shortened. Applied to the plant and animal kingdoms, phenotype prediction could also be beneficial in veterinary science and agriculture.

The discovery of underlying mechanisms for complex traits remains a particular challenge.