4.1.2.1. Phenotype predictors and variant prioritisation
Biology databases are home to curated, open data that cover genomes across the tree of life, as well as cross-species ontologies of biological processes, diseases, and anatomical entities.
There are a number of recent phenotype prediction methods that have had some success in using these resources, either for variant prioritisation or as clinical diagnostic tools.
One class of “knowledge-based” methods uses knowledge from databases of experimental results (known associations between genes and phenotypes) as the basis for these predictions, for example Phen-Gen[160], dcGO[114], PhenoDigm[161], and PHIVE[162].
The better-performing methods in this class use associations between model organism phenotypes and orthologous genes to leverage the wealth of information that is collected from model organism experiments.
There are also “functional” methods, like FATHMM[118] and CADD[163], which instead use information about how molecules and their function may change with different nucleotide or amino acid substitutions, as well as conservation metrics, to prioritise variants.
These tools rank variants for deleteriousness, but do not link them to specific phenotypes.
Most successful methods of any kind now combine multiple sources of information; some combine both functional and knowledge-based sources.
This approach is used within Exomiser[164], which combines PHIVE with many other sources of information such as protein-protein interactions, cross-species phenotype associations, and variant frequency using a black-box classifier.
Phenolyzer[165] and Genomiser[166] also take similar approaches of combining many different sources of data.
The aim of these models is mostly to prioritise variants associated with diseases, and they are benchmarked by their ability to identify known variants.
Lists of known variants may be purpose-curated from the literature according to specific evidence, or may come from some subset of annotation databases (which in some cases the algorithm may have used as input data).
Each phenotype predictor often targets a specific use case (e.g. non-coding variants), and in combination with the varying validation methods used, it is difficult to compare the accuracy of all of these models directly.
For this reason, the CAFA competition is very useful in getting a more objective view of the capabilities of these kinds of tools.
Similar approaches have been used as clinical diagnostic tools.
PhenIX (Phenotypic Interpretation of eXomes)[167] is a version of PHIVE which is restricted to the “human disease-causing genome” (genes known to cause disease) to make it more suitable for clinical use in diagnosis of rare genetic diseases.
It also includes semantic similarity information between input symptoms and Human Phenotype Ontology terms, using the Phenomizer[168] algorithms.
For PhenIX, the measure of success is that it enabled skilled clinicians to find diagnoses for 11 out of 40 (28%) patients with rare genetic diseases who could not be diagnosed through other means.
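The core idea behind this kind of ontology-based matching can be sketched with a toy Resnik-style similarity calculation, where two terms are scored by the information content of their most informative shared ancestor. The term names, annotation counts, and helper functions below are invented for illustration and do not reproduce the Phenomizer implementation.

```python
import math

# Toy fragment of a phenotype ontology: child term -> parent terms.
# These term IDs and annotation counts are invented for illustration only.
parents = {
    "HP:ataxia": {"HP:abnormal_gait"},
    "HP:spastic_gait": {"HP:abnormal_gait"},
    "HP:abnormal_gait": {"HP:phenotypic_abnormality"},
    "HP:phenotypic_abnormality": set(),
}

# How many annotated diseases each term (or its descendants) is attached to.
annotation_counts = {
    "HP:ataxia": 40,
    "HP:spastic_gait": 25,
    "HP:abnormal_gait": 120,
    "HP:phenotypic_abnormality": 1000,
}
total = annotation_counts["HP:phenotypic_abnormality"]

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    found = {term}
    frontier = [term]
    while frontier:
        for parent in parents[frontier.pop()]:
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def information_content(term):
    """Rarer terms are more informative: IC = -log(p(term))."""
    return -math.log(annotation_counts[term] / total)

def resnik_similarity(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)

# Two gait phenotypes share the informative ancestor "abnormal gait",
# so they score higher than a pair that only meets at the ontology root.
print(resnik_similarity("HP:ataxia", "HP:spastic_gait"))
```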
While these examples are the most similar published work to Snowflake, they are all tested as variant prioritisation tools rather than phenotype predictors.
4.1.2.2. Clustering and outlier-detection in genetics
Clustering algorithms, particularly hierarchical methods, are commonly used in genetics for:
1. finding evolutionary relationships between DNA samples, for example in reconstructing phylogenetic trees and mapping haplotypes within populations[169], and
2. finding functional relationships between genes based on gene expression data[170,171].
For applications in group (1), individuals are generally separated into clusters based on their DNA variants, whereas for (2) samples are separated into clusters based on their gene expression.
Clustering methods are only very rarely used to cluster individuals in phenotype prediction or variant prioritisation tasks.
In one case, clustering individuals based on a combination of genotype and phenotype information has been applied to identify subtypes within emphysema[172] (a lung disease).
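A minimal sketch of clustering individuals by their genotypes, assuming a simple 0/1/2 alternate-allele-count encoding, might look like the following; the genotype matrix here is randomly generated, and the linkage and distance choices are illustrative rather than recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic genotype matrix: rows are individuals, columns are variants,
# entries count the alternate alleles (0, 1, or 2). Real data would come
# from variant calls rather than a random number generator.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 500))

# Agglomerative (hierarchical) clustering on pairwise distances between
# individuals; average linkage and Hamming distance are one common choice.
tree = linkage(genotypes, method="average", metric="hamming")

# Cut the tree into a fixed number of clusters, e.g. to look for subgroups.
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)
```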
4.1.2.3. Overcoming the curse of dimensionality through dimensionality reduction and feature selection
The “curse of dimensionality” is a phrase coined by Richard E Bellman, during his work on dynamic programming[173], but has since proved relevant in many different mathematical and data-driven fields.
While it’s used colloquially as a catch-all complaint about high-dimensional data, the “curse” specifically refers to the sparsity of data that increases exponentially with the number of dimensions.
This leads to various problems in different fields[174] including general difficulty in reaching statistical significance and reduced usefulness of clustering, distance, and outlier metrics.
This can easily be a problem in genetics since we have tens of thousands of genes, and hundreds of thousands of variants as dimensions that we may want to cluster over.
As Fig. 4.1 illustrates, increasing the number of dimensions that we cluster over makes it exponentially harder for us to identify clusters in the data given a fixed number of individuals in our cohort.
In extreme cases, all samples or individuals look equally distant from each other in the sparse, high-dimensional space.
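This distance-concentration effect can be demonstrated with random data: for a fixed number of points, the contrast between the nearest and furthest pairwise distances shrinks as dimensions are added. The sketch below uses uniform random points rather than genetic data.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_points = 100

# For a fixed number of points, watch the spread of pairwise distances
# shrink (relative to their size) as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000, 10000):
    points = rng.random((n_points, n_dims))
    distances = pdist(points)  # all pairwise Euclidean distances
    contrast = (distances.max() - distances.min()) / distances.min()
    print(f"{n_dims:>6} dimensions: relative contrast = {contrast:.3f}")
```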
The curse of dimensionality can be partially reduced by choosing a clustering or outlier detection method which is more robust to the number of dimensions.
However, these still have limits, and to overcome them it is necessary to reduce the number of dimensions in some way; this process is called feature selection.
This can be done through careful curation of important features, through variance cut-offs, or by dimensionality reduction methods like Principal Component Analysis (PCA) or Multi-dimensional Scaling (MDS), which project the data into a different coordinate system and then discard some of the newly calculated dimensions.
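As an illustration of how such a pipeline might be assembled, the sketch below applies a variance cut-off followed by PCA using scikit-learn; the synthetic matrix and the chosen thresholds are placeholders rather than recommendations.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

# Synthetic stand-in for an (individuals x variants) matrix.
rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=(100, 5000)).astype(float)

# Feature selection: drop near-constant variants before clustering.
filtered = VarianceThreshold(threshold=0.1).fit_transform(data)

# Dimensionality reduction: project onto the top principal components
# and discard the rest of the newly calculated dimensions.
reduced = PCA(n_components=10).fit_transform(filtered)

print(data.shape, "->", filtered.shape, "->", reduced.shape)
```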