4.4. Preprocessing

In the preprocessing stage, the snowflake preprocess command considers all of the inputs together and filters them down to the parts that are useful, before the predictor is run.

In particular, snowflake preprocess calculates the following in every running mode:

  • A list of SNPs associated with each term (.snp files)

  • Sets of equivalent terms, i.e. terms which have the exact same set of SNPs associated with them (.polist files)
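The grouping of equivalent terms can be illustrated with a minimal sketch. It assumes a simple in-memory mapping from term to SNP identifiers; the function name, term names and SNP IDs are hypothetical, and this is not necessarily how snowflake itself builds the .polist files.

```python
from collections import defaultdict


def group_equivalent_terms(term_to_snps):
    """Group terms that have exactly the same set of SNPs associated with them."""
    groups = defaultdict(list)
    for term, snps in term_to_snps.items():
        # A frozenset is hashable and ignores order and duplicates,
        # so terms with identical SNP sets land under the same key.
        groups[frozenset(snps)].append(term)
    return [sorted(terms) for terms in groups.values()]


# Hypothetical term-to-SNP mapping, for illustration only.
term_to_snps = {
    "TERM:A": ["rs1", "rs2"],
    "TERM:B": ["rs2", "rs1"],
    "TERM:C": ["rs3"],
}
equivalent_sets = group_equivalent_terms(term_to_snps)
```

Here TERM:A and TERM:B end up in the same set because their SNP lists contain the same members, while TERM:C forms a set of its own.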

It also calculates the following if an input cohort is provided:

  • A list of overlapping SNPs between the background and input cohort

  • A combined VCF file containing only these SNPs, with ambiguous flips dealt with

In this section, I will run through and explain the preprocessing step for the 1000 Genomes data only (no input cohort). This represents an approximation of the maximum number of SNPs that snowflake can predict on, since the 1000 Genomes Project uses whole-genome sequencing (WGS).

4.4.1. Combining VCF files, a.k.a. missing SNPs and ambiguous flips

Due to the cost, far more humans have been genotyped than have had their whole genomes sequenced. Genotyped and WGS data look similar once in a VCF file, but the data cannot necessarily be treated the same in both cases.

4.4.1.1. Missing SNPs in VCF files

Many VCF files only store the differences between the individuals in the file, so a SNP being missing from a VCF file does not necessarily mean that the original sequencing or genotyping didn’t record the calls at that position.

When combining two genotyped files, we would want to discard all SNPs that are not measured by both chips. When combining a genotyped VCF file with a WGS VCF file, however, we usually want to keep all SNPs from the genotyped VCF, since these locations will also have been sequenced by WGS.
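The two cases can be expressed as a small rule for choosing which SNPs to keep. This is a sketch under the assumption that each file has already been reduced to a set of SNP identifiers; the function name, the `other_is_wgs` flag and the SNP IDs are hypothetical, and how calls absent from the WGS file are then filled in is not covered here.

```python
def snps_to_keep(genotyped_snps, other_snps, other_is_wgs=False):
    """Choose which SNP IDs to keep when combining a genotyped file with another file."""
    genotyped_snps = set(genotyped_snps)
    other_snps = set(other_snps)
    if other_is_wgs:
        # WGS covers every position on the chip, so a SNP absent from the
        # WGS file was still sequenced: keep everything the chip measured.
        return genotyped_snps
    # Two chips: only SNPs measured by both can be compared.
    return genotyped_snps & other_snps


# Hypothetical SNP identifiers, for illustration only.
chip_a = {"rs1", "rs2", "rs3"}
chip_b = {"rs2", "rs3", "rs4"}
wgs = {"rs1", "rs5"}

kept_two_chips = snps_to_keep(chip_a, chip_b)
kept_with_wgs = snps_to_keep(chip_a, wgs, other_is_wgs=True)
```

With two chips only rs2 and rs3 survive, whereas combining chip_a with the WGS file keeps all three of the chip's SNPs even though rs2 and rs3 do not appear in the WGS set.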

4.4.1.2. Ambiguous Flips

The majority of input data to the predictor is 23andMe data. In testing earlier versions of Snowflake with the 2500G background and a cohort of 23andMe genomes, it became clear that, for many phenotypes, the background was forming a separate cluster from the cohort. This led to the realisation that some 23andMe calls had the opposite ratio of wild type to mutant compared with the 2500G background. Some further reading revealed this to be a known problem[187], which may be due to ambiguous flips[188].

SNPs whose distribution in the input cohort is implausible given the background are therefore discarded using a cutoff.
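A minimal sketch of this kind of filter is given below. It rests on two assumptions that are not taken from the text: that "implausible" is measured as the absolute difference in alternate allele frequency between cohort and background, and that the cutoff is 0.4. The function names, frequencies and SNP IDs are hypothetical. The sketch also shows why some flips are ambiguous: for A/T and C/G SNPs the two alleles are each other's complement, so the alleles alone cannot reveal which strand a call was reported on.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def is_strand_ambiguous(ref, alt):
    """True for A/T and C/G SNPs, whose alleles look the same on either strand."""
    return COMPLEMENT.get(ref) == alt


def implausible_snps(cohort_freq, background_freq, cutoff=0.4):
    """Return SNP IDs whose alternate allele frequency differs by more than cutoff.

    The cutoff value and the choice of statistic are illustrative assumptions.
    """
    shared = cohort_freq.keys() & background_freq.keys()
    return {
        snp for snp in shared
        if abs(cohort_freq[snp] - background_freq[snp]) > cutoff
    }


# Hypothetical alternate allele frequencies, for illustration only.
cohort = {"rs1": 0.10, "rs2": 0.85, "rs3": 0.50}
background = {"rs1": 0.12, "rs2": 0.15, "rs3": 0.45}

to_discard = implausible_snps(cohort, background)
```

In this example only rs2 is flagged: its frequency in the cohort is roughly the mirror image of its frequency in the background, which is the pattern a flipped call would produce.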