4. Phenotype prediction with Snowflake

This chapter describes the Snowflake algorithm for phenotype prediction, which I developed in collaboration with Jan Zaucha, Ben Smithers and Julian Gough. The development of Snowflake resulted in a patent[4], of which I am an author, and later a paper[165] (the latest iteration of the tool is now called Nomaly). This chapter covers the functionality and design of the Snowflake algorithm and its application to the ALSPAC dataset.

At its heart, Snowflake is a CLI tool and private Python package that detects, for each phenotype of interest, the individuals whose genotypes are outliers. Individuals with unusual combinations of variants in highly conserved protein domains associated with a phenotype score highly for that phenotype (that is, they are indicated as likely to have it).
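
As a purely illustrative sketch, and not the Snowflake implementation itself, the idea of scoring individuals by how unusual their combination of variants is can be shown with a toy example. Everything here is an assumption made for illustration: the function name `outlier_scores`, the per-variant weights standing in for domain conservation, and the use of a weighted distance from the cohort mean as the outlier measure.

```python
from math import sqrt


def outlier_scores(genotypes, weights):
    """Toy outlier score: weighted distance of each individual from the cohort mean.

    genotypes: one list per individual of allele counts (0, 1 or 2) per variant.
    weights:   one non-negative weight per variant (a hypothetical stand-in
               for how conserved the affected protein domain position is).
    """
    n_individuals = len(genotypes)
    n_variants = len(weights)

    # Mean allele count of each variant across the whole cohort.
    means = [
        sum(row[j] for row in genotypes) / n_individuals
        for j in range(n_variants)
    ]

    scores = []
    for row in genotypes:
        # Weighted squared deviation from the cohort mean, summed over variants.
        squared = sum(
            weights[j] * (row[j] - means[j]) ** 2 for j in range(n_variants)
        )
        scores.append(sqrt(squared))
    return scores


# Four individuals, three variants relevant to one phenotype (made-up data).
cohort = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 0],  # an unusual combination of variants
]
conservation = [1.0, 0.5, 0.2]  # hypothetical conservation weights

scores = outlier_scores(cohort, conservation)
most_unusual = max(range(len(scores)), key=scores.__getitem__)
```

In this made-up cohort the last individual receives the highest score, because they carry a rare combination of alleles at the most heavily weighted positions.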

Both the original idea for Snowflake and the initial Perl implementation were Julian’s. Ben carried out the initial translation of the code from Perl to Python. Working from Ben’s translation, Jan and I worked together on increasing the algorithm’s functionality and robustness, before forking the project into two versions, each of us taking ownership of one.

Contributions in this chapter

- Writing part of the patent[4] relating to intrinsic dimensionality.
- Software development to increase and test the algorithm’s functionality, including:
  - With Jan and Ben:
    - Running with different formats and numbers of inputs and background cohorts
    - Dealing with missing calls
    - Development of tools to create input files for Snowflake
    - Improvements to memory usage and speed
  - And individually:
    - Creation of inputs to Snowflake
    - Alternative clustering and scoring methods, particularly for intrinsic dimensionality
    - Confidence score
    - Scoring outputs
    - Further improvements to speed and memory usage
    - Multiple imputation for missing calls
    - Inclusion of dimensionality reduction
    - Testing Snowflake on the ALSPAC cohort
