5.1. Introduction

5.1.1. Motivation

To test Snowflake, I needed a data set with a wealth of both phenotype and genotype information.

5.1.2. The ALSPAC cohort study

The Avon Longitudinal Study of Parents and Children (ALSPAC)[82] is a cohort of over 14,000 families from the Avon area with children born in 1991-1992. It is also known as the “Children of the 90s” study. Many of these families continue to take part in the study to this day, and some of the original children’s own children have joined through an extension of the project: Children of the Children of the 90s (COCO90s).

A wealth of phenotype information (over 80,000 variables) has been collected from these families over the years through a series of voluntary surveys and clinics. In addition, over 9,000 of the children have been genotyped using 23andMe.

ALSPAC’s phenotype information, while extensive, is not mapped to phenotype terms in ontologies. All data held by ALSPAC can be searched for in the ALSPAC variable catalogue, after which it can be requested per variable or data type. At the time of writing, the cohort is around 30 years old, so there is little information about phenotypes that manifest later in life, such as Alzheimer’s or heart disease. Many phenotype terms may have no corresponding measurements, while others may be associated with many variables.

5.1.3. Experiment design

Due to the identifiable nature of the data, our ethics application did not allow us to access many different phenotypes to perform a cross-phenotype validation of the predictor. Instead, we were first granted access to the genotype data only, and were then allowed to request a small number of phenotypes of interest after running Snowflake.

5.1.3.1. Choosing phenotypes of interest

I created a shortlist of phenotypes of interest by first restricting the set of scores to phenotypes for which Snowflake makes a prediction within the ALSPAC cohort, then ordering this list by the phenotype confidence score, so that Snowflake could give confident predictions for the phenotypes that were requested. I then mapped these to ALSPAC phenotypes by searching the ALSPAC variable catalogue. This resulted in the following four phenotypes: MP:0001501 Abnormal sleep pattern (measured by FJCI250 Sleep symptom score), MP:0001933 Abnormal litter size (measured by mz010a Pregnancy size), MESH:D001259 Ataxia (measured by kw2030 Child ever thought to have a problem with clumsiness/coordination), and HP:0001249 Intelligence/intellectual disability (measured by f8ws150 Child had special needs).
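
The filter-then-rank step described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the `PhenotypeScore` record, its field names, the `top_n` cutoff, the placeholder term identifiers, and the example confidence values are all hypothetical, and it assumes that a higher confidence score indicates a more confident prediction. The final mapping to ALSPAC variables was a manual search of the variable catalogue and is not modelled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhenotypeScore:
    """Hypothetical record of one phenotype term scored by the predictor."""
    term_id: str          # ontology term identifier (placeholder values below)
    confidence: float     # phenotype confidence score (assumed: higher is better)
    has_prediction: bool  # whether a prediction is made within the cohort


def shortlist(scores, top_n):
    """Keep terms with a prediction in the cohort, ranked by confidence."""
    predicted = [s for s in scores if s.has_prediction]
    ranked = sorted(predicted, key=lambda s: s.confidence, reverse=True)
    return ranked[:top_n]


# Made-up values for illustration only; not results from the study.
example_scores = [
    PhenotypeScore("TERM:A", 0.72, True),
    PhenotypeScore("TERM:B", 0.95, False),  # dropped: no prediction in cohort
    PhenotypeScore("TERM:C", 0.88, True),
    PhenotypeScore("TERM:D", 0.10, True),
]

candidates = shortlist(example_scores, top_n=2)
print([s.term_id for s in candidates])
```

Filtering before ranking means that a term with a high confidence score but no prediction in the cohort (such as `TERM:B` in the made-up data) never reaches the shortlist.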