4.1.2.1. Phenotype predictors and variant prioritisation
Biology databases are home to curated, open data that cover genomes across the tree of life, as well as cross-species ontologies of biological processes, diseases, and anatomical entities.
There are a number of recent phenotype prediction methods that have had some success in using these resources, either for variant prioritisation or as clinical diagnostic tools.
One class of “knowledge-based” methods uses knowledge from databases of experimental results (known associations between genes and phenotypes) as the basis for these predictions, for example Phen-Gen[160], dcGO[114], PhenoDigm[161], and PHIVE[162].
The better-performing methods in this class use associations between model organism phenotypes and orthologous genes to leverage the wealth of information that is collected from model organism experiments.
There are also “functional” methods, like FATHMM[118] and CADD[163], which instead use information about how molecules and their function may change with different nucleotide or amino acid substitutions, as well as conservation metrics, to prioritise variants.
These tools rank variants for deleteriousness, but do not link them to specific phenotypes.
Most successful methods of any kind now combine multiple sources of information; some combine both functional and knowledge-based sources.
This approach is used within Exomiser[164], which combines PHIVE with many other sources of information such as protein-protein interactions, cross-species phenotype associations, and variant frequency using a black-box classifier.
Phenolyzer[165] and Genomiser[166] also take similar approaches of combining many different sources of data.
The aim of these models is mostly to prioritise variants associated with diseases, and they are benchmarked by their ability to identify known variants.
Lists of known variants may be purpose-curated from the literature according to specific evidence, or may come from some subset of annotation databases (which in some cases the algorithm may have used as input data).
Each phenotype predictor often targets a specific use case (e.g. non-coding variants), and in combination with the varying validation methods used, it is difficult to compare the accuracy of all of these models directly.
For this reason, the CAFA competition is very useful in getting a more objective view of the capabilities of these kinds of tools.
Similar approaches have been used as clinical diagnostic tools.
PhenIX (Phenotypic Interpretation of eXomes)[167] is a version of PHIVE which is restricted to the “human disease-causing genome” (genes known to cause disease) to make it more suitable for clinical use in diagnosis of rare genetic diseases.
It also includes semantic similarity information between input symptoms and Human Phenotype Ontology terms, using the Phenomizer[168] algorithms.
For PhenIX, the measure of success is that it enabled skilled clinicians to find diagnoses for 11 out of 40 (28%) patients with rare genetic diseases who could not be diagnosed through other means.
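The core idea behind this kind of ontology-based matching can be sketched with a toy Resnik-style similarity calculation, where two terms are scored by the information content of their most informative shared ancestor. The term names, annotation counts, and helper functions below are invented for illustration and do not reproduce the Phenomizer implementation.

```python
import math

# Toy fragment of a phenotype ontology: child term -> parent terms.
# These term IDs and annotation counts are invented for illustration only.
parents = {
    "HP:ataxia": {"HP:abnormal_gait"},
    "HP:spastic_gait": {"HP:abnormal_gait"},
    "HP:abnormal_gait": {"HP:phenotypic_abnormality"},
    "HP:phenotypic_abnormality": set(),
}

# How many annotated diseases each term (or its descendants) is attached to.
annotation_counts = {
    "HP:ataxia": 40,
    "HP:spastic_gait": 25,
    "HP:abnormal_gait": 120,
    "HP:phenotypic_abnormality": 1000,
}
total = annotation_counts["HP:phenotypic_abnormality"]

def ancestors(term):
    """All ancestors of a term, including the term itself."""
    found = {term}
    frontier = [term]
    while frontier:
        for parent in parents[frontier.pop()]:
            if parent not in found:
                found.add(parent)
                frontier.append(parent)
    return found

def information_content(term):
    """Rarer terms are more informative: IC = -log(p(term))."""
    return -math.log(annotation_counts[term] / total)

def resnik_similarity(term_a, term_b):
    """IC of the most informative common ancestor of the two terms."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(information_content(t) for t in common)

# Two gait phenotypes share the informative ancestor "abnormal gait",
# so they score higher than a pair that only meets at the ontology root.
print(resnik_similarity("HP:ataxia", "HP:spastic_gait"))
```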
While these examples are the most similar published work to Snowflake, they are all tested as variant prioritisation tools rather than phenotype predictors.
4.1.2.2. Clustering and outlier-detection in genetics
Clustering algorithms, particularly hierarchical methods, are commonly used in genetics for:
1. finding evolutionary relationships between DNA samples, for example in reconstructing phylogenetic trees and mapping haplotypes within populations[169], and
2. finding functional relationships between genes based on gene expression data[170,171].
For applications in group (1), individuals are generally separated into clusters based on their DNA variants, whereas for (2) samples are separated into clusters based on their gene expression.
Clustering methods are only very rarely used to cluster individuals in phenotype prediction or variant prioritisation tasks.
In one case, clustering individuals based on a combination of genotype and phenotype information has been applied to identify subtypes within emphysema[172] (a lung disease).
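A minimal sketch of clustering individuals by their genotypes, assuming a simple 0/1/2 alternate-allele-count encoding, might look like the following; the genotype matrix here is randomly generated, and the linkage and distance choices are illustrative rather than recommendations.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Synthetic genotype matrix: rows are individuals, columns are variants,
# entries count the alternate alleles (0, 1, or 2). Real data would come
# from variant calls rather than a random number generator.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(20, 500))

# Agglomerative (hierarchical) clustering on pairwise distances between
# individuals; average linkage and Hamming distance are one common choice.
tree = linkage(genotypes, method="average", metric="hamming")

# Cut the tree into a fixed number of clusters, e.g. to look for subgroups.
labels = fcluster(tree, t=3, criterion="maxclust")
print(labels)
```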
4.1.2.3. Overcoming the curse of dimensionality through dimensionality reduction and feature selection
The “curse of dimensionality” is a phrase coined by Richard E Bellman, during his work on dynamic programming[173], but has since proved relevant in many different mathematical and data-driven fields.
While it’s used colloquially as a catch-all complaint about high-dimensional data, the “curse” specifically refers to the sparsity of data that increases exponentially with the number of dimensions.
This leads to various problems in different fields[174] including general difficulty in reaching statistical significance and reduced usefulness of clustering, distance, and outlier metrics.
This can easily be a problem in genetics since we have tens of thousands of genes, and hundreds of thousands of variants as dimensions that we may want to cluster over.
As Fig. 4.1 illustrates, increasing the number of dimensions that we cluster over makes it exponentially harder for us to identify clusters in the data given a fixed number of individuals in our cohort.
In extreme cases, all samples or individuals look equally distant from each other in the sparse, high-dimensional space.
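This distance-concentration effect can be demonstrated with random data: for a fixed number of points, the contrast between the nearest and furthest pairwise distances shrinks as dimensions are added. The sketch below uses uniform random points rather than genetic data.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
n_points = 100

# For a fixed number of points, watch the spread of pairwise distances
# shrink (relative to their size) as the number of dimensions grows.
for n_dims in (2, 10, 100, 1000, 10000):
    points = rng.random((n_points, n_dims))
    distances = pdist(points)  # all pairwise Euclidean distances
    contrast = (distances.max() - distances.min()) / distances.min()
    print(f"{n_dims:>6} dimensions: relative contrast = {contrast:.3f}")
```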
The curse of dimensionality can be partially reduced by choosing a clustering or outlier detection method which is more robust to the number of dimensions.
However, these still have limits, and to overcome them it is necessary to reduce the number of dimensions in some way; this process is called feature selection.
This can be done through careful curation of important features, through variance cut-offs, or by dimensionality reduction methods like Principal Component Analysis (PCA) or Multi-dimensional Scaling (MDS), which project the data into a different coordinate system and then discard some of the newly calculated dimensions.
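As an illustration of how such a pipeline might be assembled, the sketch below applies a variance cut-off followed by PCA using scikit-learn; the synthetic matrix and the chosen thresholds are placeholders rather than recommendations.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.decomposition import PCA

# Synthetic stand-in for an (individuals x variants) matrix.
rng = np.random.default_rng(0)
data = rng.integers(0, 3, size=(100, 5000)).astype(float)

# Feature selection: drop near-constant variants before clustering.
filtered = VarianceThreshold(threshold=0.1).fit_transform(data)

# Dimensionality reduction: project onto the top principal components
# and discard the rest of the newly calculated dimensions.
reduced = PCA(n_components=10).fit_transform(filtered)

print(data.shape, "->", filtered.shape, "->", reduced.shape)
```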