4. Phenotype prediction with Snowflake

This chapter describes the Snowflake algorithm for phenotype prediction that I developed in collaboration with Jan Zaucha, Ben Smithers and Julian Gough. The development of Snowflake resulted in a patent[4], of which I am an author. This chapter deals with the functionality and design of the Snowflake algorithm, while the next describes its application to the ALSPAC dataset.

At its heart, Snowflake is a CLI tool and private Python package that allows the user to detect, for each phenotype of interest, individuals who are outliers according to their genotype. Individuals with unusual combinations of variants in highly conserved protein domains associated with a phenotype will score highly for (be indicated as likely to have) that phenotype.
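The idea of scoring individuals as genotype outliers can be illustrated with a toy sketch. This is not the Snowflake implementation, which is private and is described later in this chapter. It assumes, purely for illustration, that each individual is represented by a vector of allele dosages (0, 1 or 2) at the variants associated with one phenotype, and that "unusual" means far, on average, from the rest of the cohort in Euclidean distance. The function name `outlier_scores` and the example cohort are hypothetical.

```python
import math


def outlier_scores(genotypes):
    """Score each individual by its mean Euclidean distance to all others.

    `genotypes` is a list of equal-length dosage vectors, one per individual.
    This distance-based score is an illustrative assumption, not Snowflake's
    actual scoring method.
    """
    n = len(genotypes)
    if n < 2:
        raise ValueError("need at least two individuals to compare")
    scores = []
    for i, row_i in enumerate(genotypes):
        total = 0.0
        for j, row_j in enumerate(genotypes):
            if i != j:
                total += math.dist(row_i, row_j)
        scores.append(total / (n - 1))
    return scores


# Hypothetical cohort: four individuals, three phenotype-associated variants.
cohort = [
    [0, 0, 1],
    [0, 0, 1],
    [0, 1, 1],
    [2, 2, 0],  # an unusual combination of variants
]
scores = outlier_scores(cohort)
# Index of the individual with the highest outlier score.
most_unusual = max(range(len(cohort)), key=scores.__getitem__)
```

In this toy cohort the last individual carries a combination of variants that no one else shares, so it receives the highest score and would be flagged as the most likely outlier for the phenotype.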

The original idea for Snowflake, and its initial Perl implementation, were Julian’s. The initial translation of the code from Perl to Python was carried out by Ben. Working from Ben’s translation, Jan and I worked together on increasing the algorithm’s functionality and robustness, before forking the project into two versions, with each of us taking ownership of one.

Contributions in this chapter

  • Writing part of the patent[4] relating to intrinsic dimensionality.

  • Software development to increase and test the algorithm’s functionality, including:

    • With Jan and Ben:

      • Running with different formats and numbers of inputs and background cohorts

      • Dealing with missing calls

      • Development of tools to create input files for Snowflake

      • Improvements to memory usage and speed

    • And individually:

      • Creation of inputs to Snowflake

      • Alternative clustering and scoring methods, particularly for intrinsic dimensionality

      • Scoring outputs

      • Further improvements to speed and memory usage

      • Multiple imputation for missing calls

      • Inclusion of dimensionality reduction