4. Phenotype prediction with Snowflake
This chapter describes the Snowflake algorithm for phenotype prediction that I developed in collaboration with Jan Zaucha, Ben Smithers and Julian Gough.
The development of Snowflake resulted in a patent[4], of which I am an author.
This chapter deals with the functionality and design of the Snowflake algorithm, while the next describes its application to the ALSPAC dataset.
At its heart, Snowflake is a CLI tool and private Python package that allows the user to detect outliers for each phenotype of interest, according to their genotype. Individuals with unusual combinations of variants in highly conserved protein domains associated with a phenotype score highly for that phenotype, i.e. they are indicated as likely to have it.
The original idea for Snowflake, and the initial Perl implementation, were Julian’s. Ben carried out the initial translation of the code from Perl to Python. Working from Ben’s translation, Jan and I worked together to increase the algorithm’s functionality and robustness, before forking the project into two versions, one owned by each of us.
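To make the idea of scoring "unusual combinations of variants" concrete, here is a minimal sketch of distance-based outlier scoring. It is not Snowflake's implementation: the function name `outlier_scores`, the input layout (one row per individual, one column per phenotype-associated variant, holding a hypothetical per-variant score), and the choice of mean Euclidean distance as the outlier measure are all assumptions made purely for illustration.

```python
from math import sqrt


def outlier_scores(genotype_scores):
    """Return one score per individual: mean Euclidean distance to all others.

    genotype_scores is a hypothetical input: a list of equal-length lists,
    one row per individual and one column per variant associated with the
    phenotype. Larger scores mean a more unusual combination of variants.
    """
    n = len(genotype_scores)
    if n < 2:
        raise ValueError("need at least two individuals to compare")
    scores = []
    for i, row in enumerate(genotype_scores):
        total = 0.0
        for j, other in enumerate(genotype_scores):
            if i == j:
                continue  # do not compare an individual with themselves
            total += sqrt(sum((a - b) ** 2 for a, b in zip(row, other)))
        scores.append(total / (n - 1))
    return scores


# Toy cohort (made-up numbers): three similar individuals and one
# individual with an unusual combination of variant scores.
cohort = [
    [0.0, 0.1, 0.0],
    [0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1],
    [0.9, 0.8, 0.9],
]
scores = outlier_scores(cohort)
most_unusual = max(range(len(scores)), key=scores.__getitem__)
```

In this toy cohort the last individual lies far from the other three, so it receives the highest score. The same comparison would be repeated once per phenotype, each time using only the variants associated with that phenotype.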
Contributions in this chapter
Writing part of the patent[4] relating to intrinsic dimensionality.
Software development to increase and test the algorithm’s functionality, including:
With Jan and Ben:
Running with different formats and numbers of inputs and background cohorts
Dealing with missing calls
Development of tools to create input files for Snowflake
Improvements to memory usage and speed
And individually:
Creation of inputs to Snowflake
Alternative clustering and scoring methods, particularly for intrinsic dimensionality
Scoring outputs
Further improvements to speed and memory usage
Multiple imputation for missing calls
Inclusion of dimensionality reduction