4.3. Creating Snowflake inputs

This section describes the sources and pipelines for creating inputs to Snowflake, including necessary files that are packaged alongside it.

There are four main inputs required by Snowflake which must be created:

  1. dcGO phenotype mapping file: high-quality mappings between protein (supra)domains and phenotype terms from 16 biomedical ontologies - provided as part of Snowflake

  2. Background cohort: genetic data of diverse individuals (.vcf, VCF format) - default human option provided as part of Snowflake

  3. Consequence file: deleteriousness scores per SNP (.Consequence, format specific to snowflake) - default human option provided as part of Snowflake

  4. Input cohort: genetic data of person or people of interest (.vcf, VCF format) - provided by user

The dcGO mapping file works for all organisms, and only needs remaking if dcGO or SUPERFAMILY have significant updates. Alternatively, a slimmed version can be created that contains only the ontologies of interest to a specific organism, which reduces Snowflake’s running time; here I explain how I made such a slimmed version for the ontologies of interest to humans. The background cohort and consequence file must be created once per organism/genome build. Since I have only used Snowflake to predict human phenotypes, I walk through the creation of the human background cohort and consequence file. Lastly, the input cohort file must be created for each input cohort; here I run through the creation of a VCF file from 23andMe genotype files.

Consequences of input cohort SNPs

It is possible to create a consequence file for any cohort VCF file, i.e. it would be possible to make an input cohort consequence file. However, since Snowflake can only cluster the input cohort against the background for SNPs which overlap between the two cohorts, we only need one such file, for either the input or the background. I have chosen to create it for the background cohort, since that file can then be reused for many different input cohorts.

4.3.1. DcGO phenotype mapping file (human)

It is simple to create the dcGO mapping file, since dcGO provides the required files for download on the SUPERFAMILY website. The website provides two files of high-coverage mappings for each ontology supported by dcGO: one that maps between SCOP domains and ontology terms, and another that maps between SCOP supradomains and ontology terms. Currently, Snowflake does not support the inclusion of dcGO supradomain assignments.

High-coverage versus high-quality

DcGO provides two versions of its GO mappings: high-coverage and high-quality. The high-quality mapping contains only those associations between domains and phenotypes that are supported by single-domain proteins, while the high-coverage version also contains associations that have been inferred from known associations to multi-domain proteins. The high-coverage mapping contains roughly 10 times as many GO terms and 10 times as many protein domains as the high-quality version.

For all other (non-GO) ontologies, dcGO provides only the high-coverage version.

The high-coverage mappings were used in Snowflake to increase the coverage of the phenotype predictions, i.e. so that more SNPs and more phenotypes could be included.

The ontologies that contain interpretable phenotype terms for humans are:

  • Disease Ontology[178] (DO)

  • Human Phenotype Ontology[107] (HP)

  • Gene Ontology[101] (GO), specifically terms from the biological_process subontology.

  • Medical Subject Headings[182] (MeSH) - on the dcGO website this is found under CTD diseases

  • Mammalian Phenotype Ontology[181] (MP)

Some of these ontologies aren’t designed only (or even primarily) for humans, but they were chosen because they contain terms which are relevant to humans, and because the associations are made based on protein domains found in human proteins. While dcGO supports many other ontologies (e.g. Zebrafish and Xenopus ontologies) based on protein domains that can also be found in humans, I decided that the extra coverage offered by whichever of those terms were meaningful was not worth the additional time needed to run the phenotype predictor and to sift through the results.

To create the Snowflake input for humans, the files for the five relevant ontologies (DO, HP, GO, MESH, and MP) were downloaded and concatenated, and the GO cellular component and GO molecular function terms were removed (these term identifiers were extracted from the GO_subontologies field of the Domain2GO_supported_only_by_all.txt file).

import pandas as pd
import os
from myst_nb import glue

notebook_name = '3-creating-inputs.ipynb' # __file__ doesn't work in a jupyter notebook

input_dir = 'data/dcGO_input/downloaded'
HP_loc = os.path.join(input_dir, 'Domain2HP.txt')
MESH_loc = os.path.join(input_dir, 'Domain2CD.txt')
DO_loc = os.path.join(input_dir, 'Domain2DO.txt')
GO_loc = os.path.join(input_dir, 'Domain2GO_supported_only_by_all.txt') # high coverage
MP_loc = os.path.join(input_dir, 'Domain2MP.txt')

new_col_names = ['domain_type', 
                 'domain_sunid', 
                 'ont_term_id', 
                 'ont_term_name', 
                 'ont_subontologies', 
                 'information_content', 
                 'annotation_origin_1direct_0inherited']

# Read each dcGO mapping file, standardise the column names, and label each row
# with its source ontology.
hp = pd.read_csv(HP_loc, sep='\t', comment='#')
hp.columns = new_col_names
hp['ont_id'] = 'HP'

mesh = pd.read_csv(MESH_loc, sep='\t', comment='#')
mesh.columns = new_col_names
mesh['ont_id'] = 'MESH'

do = pd.read_csv(DO_loc, sep='\t', comment='#')
do.columns = new_col_names
do['ont_id'] = 'DO'

go = pd.read_csv(GO_loc, sep='\t', comment='#')
go.columns = new_col_names
go = go[go['ont_subontologies'] == 'biological_process']  # keep GO biological_process terms only
go['ont_id'] = 'GO_BP'

mp = pd.read_csv(MP_loc, sep='\t', comment='#')
mp.columns = new_col_names
mp['ont_id'] = 'MP'

all_onts = pd.concat([hp, mesh, do, go, mp], ignore_index=True)
outfile = 'data/dcGO_input/created/human_po.txt'
with open(outfile, 'w') as f:
    f.write(f'# file created at {pd.Timestamp.now()} by {notebook_name}.\n')
    f.write('# file contains high coverage mappings for HP, MESH (CD), DO, GOBP, and MP.\n')
    all_onts.to_csv(f, index=False)
    
display(all_onts.head())

def glue_counts_dcGO(df: pd.DataFrame, identifier: str):
    """
    Expects a DataFrame df that contains only ontology terms of interest.
    """
    glue(f"{identifier}_assignments", len(df))
    glue(f"{identifier}_terms", len(df['ont_term_id'].unique()))
    glue(f"{identifier}_domains", len(df['domain_sunid'].unique()))
    
    return None
    
glue_counts_dcGO(all_onts[all_onts['ont_id']=='HP'], 'HP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MESH'], 'MESH')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='DO'], 'DO')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='GO_BP'], 'GOBP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MP'], 'MP')
glue_counts_dcGO(all_onts, 'all')
from IPython.display import HTML 

lines_to_view = [8000,20000,50000]
view = HTML(all_onts.iloc[lines_to_view].to_html(index=False)) 
glue("excerpt-pofile", view)

| domain_type | domain_sunid | ont_term_id | ont_term_name | ont_subontologies | information_content | annotation_origin_1direct_0inherited | ont_id |
|---|---|---|---|---|---|---|---|
| fa | 55528 | HP:0000925 | Abnormality of the vertebral column | Phenotypic_abnormality | 1.123852 | 0 | HP |
| sf | 53822 | MESH:D013568 | Pathological Conditions, Signs and Symptoms | CTD_diseases | 0.581857 | 0 | MESH |
| sf | 48619 | GO:0006658 | phosphatidylserine metabolic process | biological_process | 2.596963 | 1 | GO_BP |

Fig. 4.5 An excerpt of the dcGO mapping file human_po.txt, showing mappings between phenotype terms from a range of ontologies, and SCOP sun IDs.

Fig. 4.5 shows the type of content inside the phenotype ontology file. The information_content field is a measure of how specific the phenotype term is: for example, Pathological Conditions, Signs and Symptoms is a very general term, while Abnormality of the vertebral column is more specific, and phosphatidylserine metabolic process is the most specific of the three shown.
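The exact formula dcGO uses is not reproduced here, but information content is conventionally calculated from annotation frequency, so rarer (and hence more specific) terms score higher: IC(t) = -log(n_t / N), where n_t is the number of domains annotated with term t (directly or by inheritance) and N is the total number of annotated domains.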

Table 4.1 Summary statistics of the coverage of human-related phenotype terms.

| Ontology | Number of dcGO assignments | Number of unique terms assigned | Number of unique domain sunids assigned |
|---|---|---|---|
| HP: Human phenotype ontology | 15984 | 1978 | 562 |
| MeSH: Medical Subject headings | 5761 | 620 | 473 |
| DO: Disease ontology | 7538 | 614 | 566 |
| GOBP: Gene ontology, biological process | 304064 | 9546 | 3135 |
| MP: Mammalian phenotype ontology | 24208 | 2528 | 719 |
| Total (all included ontologies) | 357555 | 15286 | 3146 |

Table 4.1 shows the number of assignments, unique phenotype terms, and unique domains covered in the dcGO phenotype mapping per ontology and in total. The file contains 357555 assignments, 15286 unique ontology terms, and 3146 unique domains. This constrains the proteins and phenotypes which Snowflake can make predictions about.

4.3.2. Background cohort

As described in the overview, Snowflake requires genetic “background” data to compare the individuals of interest against, i.e. so that meaningful clustering can take place. Although Snowflake has the functionality to be run with any background data set, the choice of data set is constrained since it must contain genetic data representative of the wider population.

4.3.2.1. Data acquisition: the 1000 Genomes project

The 1000 Genomes project[175,183] ran from 2008 to 2015, with the aim of comprehensively mapping common human genetic variation across diverse populations. The project sequenced individuals’ whole genomes, and released data in two main phases:

  • Phase 1: 37.9 million variants, in 1092 individuals, across 14 populations

  • Phase 3: 84.4 million variants, in 2504 individuals, across 26 populations

Data from the 1000 Genomes project are always used as Snowflake’s background cohort: data from Phase 3[175] are the default, while earlier experiments used data from Phase 1[183]. Here I describe the process of creating the phase 1 VCF file (1000G.vcf), but the same process is followed to create the larger phase 3 VCF file (2500G.vcf).

For both phases of the 1000 Genomes project, data are provided as VCF files for each chromosome. Both data sets are available through the International Genome Sample Resource[184] (IGSR); phase 1 VCFs (GRCh37) can be downloaded via FTP at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/, and phase 3 VCFs (GRCh37) can be downloaded at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.

After downloading these files, the per-chromosome VCF files were combined into one large VCF file (for all chromosomes). The consequence file must be created at this stage, before the final input VCF can be created.
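In practice this combination step can be scripted; the sketch below is one minimal way of doing it in Python (the paths are placeholders, and it assumes the per-chromosome files share an identical header, as the 1000 Genomes releases do). A dedicated tool such as bcftools concat does the same job more robustly.

import glob
import gzip

# Placeholder paths: per-chromosome VCFs downloaded from the 1000 Genomes FTP site.
chromosome_vcfs = sorted(glob.glob('data/1000G/ALL.chr*.vcf.gz'))
combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'

with open(combined_vcf, 'w') as out:
    # NB: sorted() gives lexicographic order (chr1, chr10, ...); re-order the list
    # if a karyotypically sorted output VCF is needed.
    for i, vcf_path in enumerate(chromosome_vcfs):
        with gzip.open(vcf_path, 'rt') as vcf:
            for line in vcf:
                if line.startswith('#'):
                    if i == 0:          # keep the header block from the first file only
                        out.write(line)
                else:
                    out.write(line)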

4.3.2.2. Create final input VCF

In order to create the final input VCF, we must remove SNPs from the VCF file that are not in the consequence file. This reduces the file size dramatically, since many recorded SNPs are either synonymous, nonsense, non-coding or multi-allelic. The VCF file is then Tabix-indexed[185], which increases the speed of Snowflake.
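A minimal sketch of this filtering and indexing step is given below. The file names are placeholders, and I assume the consequence file’s first two columns are chromosome and position (as in the excerpt shown later in Fig. 4.6); pysam provides the bgzip compression and Tabix indexing.

import pysam

combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'   # placeholder paths
consequence_file = 'data/1000G/1000G.consequence'
filtered_vcf = 'data/1000G/1000G.vcf'

# Collect the (chromosome, position) pairs present in the consequence file,
# assuming its first two columns are chromosome and position.
keep = set()
with open(consequence_file) as f:
    for line in f:
        if line.startswith('#'):
            continue
        chrom, pos = line.split('\t')[:2]
        keep.add((chrom, pos))

# Keep only VCF records whose (chromosome, position) appear in the consequence file.
with open(combined_vcf) as vcf, open(filtered_vcf, 'w') as out:
    for line in vcf:
        if line.startswith('#'):
            out.write(line)
            continue
        chrom, pos = line.split('\t')[:2]
        if (chrom, pos) in keep:
            out.write(line)

# Bgzip-compress and Tabix-index the filtered VCF.
pysam.tabix_compress(filtered_vcf, filtered_vcf + '.gz', force=True)
pysam.tabix_index(filtered_vcf + '.gz', preset='vcf', force=True)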

4.3.3. Consequence file

The consequence file contains the “consequences” of one amino acid being changed to another, which are used to weight which SNPs are the more important dimensions for clustering a phenotype. It also contains the mappings between SNPs and proteins, and between SNPs and domain architectures.

This file is generated using:

  • Ensembl’s VEP tool to map from chromosomes and locations to SNP type (e.g. missense/nonsense/nonsynonymous), and to protein ID

  • FATHMM for scoring the deleteriousness of missense mutations

  • SUPERFAMILY for mapping from protein IDs given by VEP to their domain architectures (SCOP family/superfamily)

4.3.3.1. Run the Variant Effect Predictor tool

In order to get a list of SNPs to input to FATHMM, we must first determine which SNPs in the background data are missense SNPs. This can be done using Ensembl’s VEP web tool[120], which takes a VCF as input. The first 10 columns of the combined VCF file are used as input to VEP, since only these columns are needed and making the file smaller reduces processing time.
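This trimming is equivalent to cut -f1-10 on the command line; a Python sketch (with placeholder paths) is:

combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'   # placeholder paths
vep_input = 'data/1000G/vep_input.vcf'

# The first 10 VCF columns are the 8 fixed fields plus FORMAT and one sample column.
with open(combined_vcf) as vcf, open(vep_input, 'w') as out:
    for line in vcf:
        if line.startswith('##'):
            out.write(line)  # keep meta-information lines unchanged
        else:
            out.write('\t'.join(line.rstrip('\n').split('\t')[:10]) + '\n')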

4.3.3.2. Query FATHMM and SUPERFAMILY for the SNPs of interest

To create the consequence file, the VEP output was first filtered to contain only biallelic missense SNPs, and was then used to query the FATHMM and SUPERFAMILY databases for the unweighted conservation scores and the domain assignments.
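The filtering might look something like the sketch below; the column names follow VEP’s default tab-delimited output (one row per variant–transcript consequence), the biallelic check simply requires a single alternative allele per variant, and the subsequent FATHMM and SUPERFAMILY database queries are omitted because they depend on the local database setup. Treat this as an illustration rather than the exact pipeline.

import pandas as pd

# Placeholder path: VEP's default tab-delimited output for the background cohort.
vep_cols = ['Uploaded_variation', 'Location', 'Allele', 'Gene', 'Feature', 'Feature_type',
            'Consequence', 'cDNA_position', 'CDS_position', 'Protein_position',
            'Amino_acids', 'Codons', 'Existing_variation', 'Extra']
vep = pd.read_csv('data/1000G/vep_output.txt', sep='\t', comment='#',
                  header=None, names=vep_cols)

# Keep rows annotated as missense.
missense = vep[vep['Consequence'].str.contains('missense_variant')]

# Keep only biallelic variants: a single alternative allele per uploaded variant.
n_alleles = missense.groupby('Uploaded_variation')['Allele'].nunique()
biallelic = missense[missense['Uploaded_variation'].isin(n_alleles[n_alleles == 1].index)]

# `biallelic` now contains the variants and protein substitutions used to query
# FATHMM and SUPERFAMILY.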

4.3.3.3. Summary

# c_df is the consequence file loaded as a DataFrame indexed by (#CHROM, POS)
# in a previous cell (not shown here).
glue('consequence-rows', len(c_df.index))
glue('consequence-unique-snps', len(c_df.index.unique()))
glue('consequence-unique-proteins', len(c_df['ENSP_id'].unique()))
glue('consequence-unique-families', len(c_df['FAMILY'].unique()))
glue('consequence-unique-supfam', len(c_df['SUPERFAMILY'].unique()))

# Find example SNPs for the excerpt: the first SNP in the file, one whose proteins span
# multiple SCOP families within a single superfamily, and one spanning multiple superfamilies.
regs_snp = c_df.iloc[0].name
multi_family_snp = False
multi_supfam_snp = False
for chrom, pos in c_df[c_df.index.duplicated(keep=False)].index.unique():
    families_per_snp = len(c_df.loc[(chrom, pos)]['FAMILY'].unique())
    superfamilies_per_snp = len(c_df.loc[(chrom, pos)]['SUPERFAMILY'].unique())
    if families_per_snp > 1 and superfamilies_per_snp == 1:
        multi_family_snp = (chrom, pos)
    elif superfamilies_per_snp > 1:
        multi_supfam_snp = (chrom, pos)
        
    if multi_family_snp and multi_supfam_snp:
        break

index_to_view = [regs_snp, multi_family_snp, multi_supfam_snp]
glue("excerpt-consequence", c_df.loc[index_to_view])

| #CHROM | POS | calls | snp_id | ENSP_id | prot_sub | HMM | position | ref_prob | mut_prob | SUPERFAMILY | Sup_e_val | FAMILY | Fam_e_val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 69224 | A/T | NaN | ENSP00000334393 | D45V | SF0037432 | 44 | 0.524940000001 | 4.26011 | 81321 | 2.93e-78 | 81320 | 0.0021 |
| 1 | 1290695 | G/C | NaN | ENSP00000307887 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.14e-10 | 48727 | 0.0099 |
| 1 | 1290695 | G/C | NaN | ENSP00000344998 | T35S | SF0047556 | 61 | 1.72866 | 2.45993 | 48726 | 0.000485 | 49159 | 0.077 |
| 1 | 1290695 | G/C | NaN | ENSP00000399229 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.29e-10 | 48727 | 0.0099 |
| 1 | 3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 3.73e-18 | 50730 | 0.017 |
| 1 | 3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.53e-57 | 48066 | 8.09e-05 |
| 1 | 3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 1.32e-18 | 50730 | 0.017 |
| 1 | 3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.27e-38 | 48066 | 0.00025 |

Fig. 4.6 An excerpt of the consequence file 2500G.consequence, showing mappings between SNPs, protein IDs, mutant and reference probabilities from FATHMM, and SCOP sun IDs (via SUPERFAMILY).

As we can see in Fig. 4.6, the consequence file can contain multiple rows for the same SNP, since a SNP may fall in multiple proteins (proteins can have overlapping reading frames). Less frequently, SNPs that fall in multiple proteins may also fall in more than one domain family, or even more than one superfamily.

The output .consequence file defines the upper limit on the number of SNPs Snowflake can base phenotype predictions on. In practice, the number used is often much smaller, since for SNPs to be used they must be:

  • measured in the input cohort (which is often genotyped, and therefore contains far fewer variants)

  • within protein domains that exist in the dcGO mapping file as well as the consequence file

  • SNPs where there is some variation between the background and input cohorts

Like the VCF file, the consequence file is then indexed with Tabix[185] to increase Snowflake’s speed.
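Because the consequence file is not a VCF, no Tabix preset applies; with pysam, the index can instead be built by pointing Tabix at the chromosome and position columns explicitly. This is a sketch assuming the column layout shown in Fig. 4.6 (chromosome first, position second) and a placeholder path.

import pysam

consequence_file = 'data/1000G/2500G.consequence'   # placeholder path

# Bgzip-compress the file, then build a Tabix index using the chromosome (column 0)
# and position (column 1) columns; lines starting with '#' are treated as headers.
pysam.tabix_compress(consequence_file, consequence_file + '.gz', force=True)
pysam.tabix_index(consequence_file + '.gz', seq_col=0, start_col=1, end_col=1,
                  meta_char='#', force=True)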

4.3.4. Input cohort

4.3.4.1. 23andMe file formats

The SNPs which are measured in a genotyping experiment depend on the chip used. Since launch, 23andMe have used a number of different Illumina chips for their genotyping service. These chips capture information for different SNPs, and vastly different numbers of SNPs. This means that in order to combine data from different chips, many loci cannot be used.

23andMe have their own file formats, which must be converted to VCF in order to use Snowflake. One of these formats is tab-separated and very similar to VCF, but for their API, 23andMe also stored genotype data in long strings (~1 million characters, twice the number of loci on the chip) of the form AAAACCTTTT__CC__, where every two characters correspond to a given SNP on the 23andMe chip, with an rsID, chromosome and position. To convert this compressed format to VCF, 23andMe provided a SNP map file (snps.data), which differs for each 23andMe chip and gives the rsID, chromosome and position for each index. It looks like:

# index is a key for the /genomes/ endpoint (2 base pairs per index).
# strand is always +1.
index   snp     chromosome      chromosome_position
0       rs41362547      MT      10044
1       rs28358280      MT      10550
2       rs3915952       MT      11251

So, according to this file, the first two characters of the genotype string correspond to rs41362547, then the next two correspond to rs28358280.

The create_data function of snowflake contains the functionality to create a VCF file from 23andMe API string formats, for three different chips.
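The sketch below illustrates the core of that conversion, but it is not the actual create_data implementation: the paths are placeholders, and producing a true VCF additionally requires reference alleles from the genome build, which the sketch omits.

import pandas as pd

# Placeholder paths: one individual's 23andMe API genotype string and the matching
# chip's snps.data file.
genotype_string = open('data/23andMe/genome_string.txt').read().strip()
snp_map = pd.read_csv('data/23andMe/snps.data', sep=r'\s+', comment='#')

# Each index covers two characters of the genotype string (one character per allele).
snp_map['call'] = [genotype_string[2 * i: 2 * i + 2] for i in snp_map['index']]

# Drop no-calls ('__') and keep the columns needed to build VCF-style records.
called = snp_map[snp_map['call'] != '__']
print(called[['snp', 'chromosome', 'chromosome_position', 'call']].head())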

4.3.4.2. Genome builds

The files currently generated for Snowflake use the GRCh37 (hg19) human genome build. This is the build currently supported by FATHMM, 23andMe, and phase 1 of 1000G.