4.3. Creating Snowflake inputs

This section describes the sources and pipelines for creating inputs to Snowflake, including necessary files that are packaged alongside it.

There are four main inputs required by Snowflake which must be created:

  1. dcGO phenotype mapping file: high-quality mappings between protein (supra)domains and phenotype terms from 16 biomedical ontologies - provided as part of Snowflake

  2. Background cohort: genetic data of diverse individuals (.vcf, VCF format) - default human option provided as part of Snowflake

  3. Consequence file: deleteriousness scores per SNP (.Consequence, format specific to snowflake) - default human option provided as part of Snowflake

  4. Input cohort: genetic data of person or people of interest (.vcf, VCF format) - provided by user

The dcGO mapping file works for all organisms, and only needs remaking if dcGO or SUPERFAMILY have significant updates. Alternatively, a slimmed version can be created that contains only the ontologies of interest to a specific organism, which reduces Snowflake’s running time; here I explain how I made such a slimmed version for the ontologies of interest to humans. The background cohort and consequence file must be created once per organism/genome build. Since I have only used Snowflake to predict human phenotypes, I walk through the creation of the human background cohort and consequence file. Lastly, the input cohort file must be created for each input cohort; here I run through the creation of a VCF file from 23andMe genotype files.

Consequences of input cohort SNPs

It is possible to create a consequence file for any cohort VCF file, i.e. it would be possible to make an input cohort consequence file. However, since Snowflake can only cluster the input cohort against the background for SNPs which overlap between the two cohorts, we only need one such file, for either the input or the background. I have chosen to create it for the background cohort, since that file can then be reused for many different input cohorts.

4.3.1. DcGO phenotype mapping file (human)

It is simple to create the dcGO mapping file, since dcGO provides the required files for download on the SUPERFAMILY website. The website provides two files of high-coverage mappings for each ontology supported by dcGO: one that maps between SCOP domains and ontology terms, and another that maps between SCOP supradomains and ontology terms. Currently, Snowflake does not support the inclusion of dcGO supradomain assignments.

High-coverage versus high-quality

DcGO provides two versions of its GO mappings: high-coverage and high-quality. The high-quality mapping contains only those associations between domains and phenotypes that are supported by single-domain proteins, while the high-coverage version also contains associations that have been inferred from known associations to multi-domain proteins. The high-coverage mapping contains roughly 10 times as many GO terms and 10 times as many protein domains as the high-quality version.

For all other (non-GO) ontologies, dcGO provides only the high-coverage version.

The high-coverage mappings were used in Snowflake to increase the coverage of the phenotype predictions, i.e. so that more SNPs and more phenotypes could be included.

The ontologies that contain interpretable phenotype terms for humans are:

  • Disease Ontology[178] (DO)

  • Human Phenotype Ontology[107] (HP)

  • Gene Ontology[101] (GO), specifically terms from the biological_process subontology.

  • Medical Subject Headings[182] (MeSH) - on the dcGO website this is found under CTD diseases

  • Mammalian Phenotype Ontology[181] (MP)

Some of these ontologies aren’t designed only (or even primarily) for humans, but they were chosen because they contain terms which are relevant to humans, and because the associations are made based on protein domains found in human proteins. While dcGO supports many other ontologies (e.g. Zebrafish and Xenopus ontologies) based on protein domains that can also be found in humans, I decided that the extra coverage offered by whichever of those terms were meaningful was not worth the additional time needed to run the phenotype predictor and to sift through the results.

To create the Snowflake input for humans, the files for the five relevant ontologies (DO, HP, GO, MESH, and MP) were downloaded and concatenated, and the GO cellular component and GO molecular function terms were removed (these term identifiers were extracted from the GO_subontologies field of the Domain2GO_supported_only_by_all.txt file).

import pandas as pd
import os
from myst_nb import glue

notebook_name = '3-creating-inputs.ipynb' # __file__ doesn't work in a jupyter notebook

input_dir = 'data/dcGO_input/downloaded'
HP_loc = os.path.join(input_dir, 'Domain2HP.txt')
MESH_loc = os.path.join(input_dir, 'Domain2CD.txt')
DO_loc = os.path.join(input_dir, 'Domain2DO.txt')
GO_loc = os.path.join(input_dir, 'Domain2GO_supported_only_by_all.txt') # high coverage
MP_loc = os.path.join(input_dir, 'Domain2MP.txt')

new_col_names = ['domain_type', 
                 'domain_sunid', 
                 'ont_term_id', 
                 'ont_term_name', 
                 'ont_subontologies', 
                 'information_content', 
                 'annotation_origin_1direct_0inherited']

# Read each dcGO mapping file, standardise the column names, and label each row
# with its source ontology.
hp = pd.read_csv(HP_loc, sep='\t', comment='#')
hp.columns = new_col_names
hp['ont_id'] = 'HP'

mesh = pd.read_csv(MESH_loc, sep='\t', comment='#')
mesh.columns = new_col_names
mesh['ont_id'] = 'MESH'

do = pd.read_csv(DO_loc, sep='\t', comment='#')
do.columns = new_col_names
do['ont_id'] = 'DO'

go = pd.read_csv(GO_loc, sep='\t', comment='#')
go.columns = new_col_names
go = go[go['ont_subontologies'] == 'biological_process']  # keep GO biological_process terms only
go['ont_id'] = 'GO_BP'

mp = pd.read_csv(MP_loc, sep='\t', comment='#')
mp.columns = new_col_names
mp['ont_id'] = 'MP'

all_onts = pd.concat([hp, mesh, do, go, mp], ignore_index=True)
outfile = 'data/dcGO_input/created/human_po.txt'
with open(outfile, 'w') as f:
    f.write(f'# file created at {pd.Timestamp.now()} by {notebook_name}.\n')
    f.write('# file contains high coverage mappings for HP, MESH (CD), DO, GOBP, and MP.\n')
    all_onts.to_csv(f, index=False)
    
display(all_onts.head())

def glue_counts_dcGO(df: pd.DataFrame, identifier: str):
    """
    Expects a DataFrame df that contains only ontology terms of interest.
    """
    glue(f"{identifier}_assignments", len(df))
    glue(f"{identifier}_terms", len(df['ont_term_id'].unique()))
    glue(f"{identifier}_domains", len(df['domain_sunid'].unique()))
    
    return None
    
glue_counts_dcGO(all_onts[all_onts['ont_id']=='HP'], 'HP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MESH'], 'MESH')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='DO'], 'DO')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='GO_BP'], 'GOBP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MP'], 'MP')
glue_counts_dcGO(all_onts, 'all')
from IPython.display import HTML 

lines_to_view = [8000,20000,50000]
view = HTML(all_onts.iloc[lines_to_view].to_html(index=False)) 
glue("excerpt-pofile", view)

| domain_type | domain_sunid | ont_term_id | ont_term_name | ont_subontologies | information_content | annotation_origin_1direct_0inherited | ont_id |
|---|---|---|---|---|---|---|---|
| fa | 55528 | HP:0000925 | Abnormality of the vertebral column | Phenotypic_abnormality | 1.123852 | 0 | HP |
| sf | 53822 | MESH:D013568 | Pathological Conditions, Signs and Symptoms | CTD_diseases | 0.581857 | 0 | MESH |
| sf | 48619 | GO:0006658 | phosphatidylserine metabolic process | biological_process | 2.596963 | 1 | GO_BP |

Fig. 4.5 An excerpt of the dcGO mapping file human_po.txt, showing mappings between phenotype terms from a range of ontologies, and SCOP sun IDs.

Fig. 4.5 shows the type of content inside the phenotype ontology file. The information_content field is a measure of how specific the phenotype term is: for example, Pathological Conditions, Signs and Symptoms is a very general term, while Abnormality of the vertebral column is more specific, and phosphatidylserine metabolic process is the most specific of the three shown.
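The exact formula dcGO uses is not reproduced here, but information content is conventionally calculated from annotation frequency, so rarer (and hence more specific) terms score higher: IC(t) = -log(n_t / N), where n_t is the number of domains annotated with term t (directly or by inheritance) and N is the total number of annotated domains.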

Table 4.1 Summary statistics of the coverage of human-related phenotype terms.

| Ontology | Number of dcGO assignments | Number of unique terms assigned | Number of unique domain sunids assigned |
|---|---|---|---|
| HP: Human phenotype ontology | 15984 | 1978 | 562 |
| MeSH: Medical Subject headings | 5761 | 620 | 473 |
| DO: Disease ontology | 7538 | 614 | 566 |
| GOBP: Gene ontology, biological process | 304064 | 9546 | 3135 |
| MP: Mammalian phenotype ontology | 24208 | 2528 | 719 |
| Total (all included ontologies) | 357555 | 15286 | 3146 |

Table 4.1 shows the number of assignments, unique phenotype terms, and unique domains covered in the dcGO phenotype mapping per ontology and in total. The file contains 357555 assignments, 15286 unique ontology terms, and 3146 unique domains. This constrains the proteins and phenotypes which Snowflake can make predictions about.

4.3.2. Background cohort

As described in the overview, Snowflake requires genetic “background” data to compare the individuals of interest against, i.e. so that meaningful clustering can take place. Although Snowflake has the functionality to be run with any background data set, the choice of data set is constrained since it must contain genetic data representative of the wider population.

4.3.2.1. Data acquisition: the 1000 Genomes project

The 1000 Genomes project[175,183] ran from 2008 to 2015, with the aim of comprehensively mapping common human genetic variation across diverse populations. The project sequenced individuals’ whole genomes, and released data in two main phases:

  • Phase 1: 37.9 million variants, in 1092 individuals, across 14 populations

  • Phase 3: 84.4 million variants, in 2504 individuals, across 26 populations

Data from the 1000 Genomes project are always used as Snowflake’s background cohort: data from Phase 3[175] are the default, while earlier experiments used data from Phase 1[183]. Here I describe the process of creating the phase 1 VCF file (1000G.vcf), but the same process is followed to create the larger phase 3 VCF file (2500G.vcf).

For both phases of the 1000 Genomes project, data are provided as VCF files for each chromosome. Both data sets are available through the International Genome Sample Resource[184] (IGSR); phase 1 VCFs (GRCh37) can be downloaded via FTP at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/, and phase 3 VCFs (GRCh37) can be downloaded at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/.

After downloading these files, the per-chromosome VCF files were combined into one large VCF file (for all chromosomes). The consequence file must be created at this stage, before the final input VCF can be created.
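In practice this combination step can be scripted; the sketch below is one minimal way of doing it in Python (the paths are placeholders, and it assumes the per-chromosome files share an identical header, as the 1000 Genomes releases do). A dedicated tool such as bcftools concat does the same job more robustly.

import glob
import gzip

# Placeholder paths: per-chromosome VCFs downloaded from the 1000 Genomes FTP site.
chromosome_vcfs = sorted(glob.glob('data/1000G/ALL.chr*.vcf.gz'))
combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'

with open(combined_vcf, 'w') as out:
    # NB: sorted() gives lexicographic order (chr1, chr10, ...); re-order the list
    # if a karyotypically sorted output VCF is needed.
    for i, vcf_path in enumerate(chromosome_vcfs):
        with gzip.open(vcf_path, 'rt') as vcf:
            for line in vcf:
                if line.startswith('#'):
                    if i == 0:          # keep the header block from the first file only
                        out.write(line)
                else:
                    out.write(line)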

4.3.2.2. Create final input VCF

In order to create the final input VCF, we must remove SNPs from the VCF file that are not in the consequence file. This reduces the file size dramatically, since many recorded SNPs are either synonymous, nonsense, non-coding or multi-allelic. The VCF file is then Tabix-indexed[185], which increases the speed of Snowflake.
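A minimal sketch of this filtering and indexing step is given below. The file names are placeholders, and I assume the consequence file’s first two columns are chromosome and position (as in the excerpt shown later in Fig. 4.6); pysam provides the bgzip compression and Tabix indexing.

import pysam

combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'   # placeholder paths
consequence_file = 'data/1000G/1000G.consequence'
filtered_vcf = 'data/1000G/1000G.vcf'

# Collect the (chromosome, position) pairs present in the consequence file,
# assuming its first two columns are chromosome and position.
keep = set()
with open(consequence_file) as f:
    for line in f:
        if line.startswith('#'):
            continue
        chrom, pos = line.split('\t')[:2]
        keep.add((chrom, pos))

# Keep only VCF records whose (chromosome, position) appear in the consequence file.
with open(combined_vcf) as vcf, open(filtered_vcf, 'w') as out:
    for line in vcf:
        if line.startswith('#'):
            out.write(line)
            continue
        chrom, pos = line.split('\t')[:2]
        if (chrom, pos) in keep:
            out.write(line)

# Bgzip-compress and Tabix-index the filtered VCF.
pysam.tabix_compress(filtered_vcf, filtered_vcf + '.gz', force=True)
pysam.tabix_index(filtered_vcf + '.gz', preset='vcf', force=True)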

4.3.3. Consequence file

The consequence file contains the “consequences” of one amino acid being changed to another, which are used to weight which SNPs are the more important dimensions for clustering a phenotype. It also contains the mappings between SNPs and proteins, and between SNPs and domain architectures.

This file is generated using:

  • Ensembl’s VEP tool to map from chromosomes and locations to SNP type (e.g. missense/nonsense/nonsynonymous), and to protein ID

  • FATHMM for scoring the deleteriousness of missense mutations

  • SUPERFAMILY for mapping from protein IDs given by VEP to their domain architectures (SCOP family/superfamily)

4.3.3.1. Run the Variant Effect Predictor tool

In order to get a list of SNPs to input to FATHMM, we must first determine which SNPs in the background data are missense SNPs. This can be done using Ensembl’s VEP web tool[120], which takes a VCF as input. The first 10 columns of the combined VCF file are used as input to VEP, since only these columns are needed and making the file smaller reduces processing time.
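This trimming is equivalent to cut -f1-10 on the command line; a Python sketch (with placeholder paths) is:

combined_vcf = 'data/1000G/1000G_all_chromosomes.vcf'   # placeholder paths
vep_input = 'data/1000G/vep_input.vcf'

# The first 10 VCF columns are the 8 fixed fields plus FORMAT and one sample column.
with open(combined_vcf) as vcf, open(vep_input, 'w') as out:
    for line in vcf:
        if line.startswith('##'):
            out.write(line)  # keep meta-information lines unchanged
        else:
            out.write('\t'.join(line.rstrip('\n').split('\t')[:10]) + '\n')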

4.3.3.2. Query FATHMM and SUPERFAMILY for the SNPs of interest

To create the consequence file, the VEP output was first filtered to contain only biallelic missense SNPs, and was then used to query the FATHMM and SUPERFAMILY databases for the unweighted conservation scores and the domain assignments.
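The filtering might look something like the sketch below; the column names follow VEP’s default tab-delimited output (one row per variant–transcript consequence), the biallelic check simply requires a single alternative allele per variant, and the subsequent FATHMM and SUPERFAMILY database queries are omitted because they depend on the local database setup. Treat this as an illustration rather than the exact pipeline.

import pandas as pd

# Placeholder path: VEP's default tab-delimited output for the background cohort.
vep_cols = ['Uploaded_variation', 'Location', 'Allele', 'Gene', 'Feature', 'Feature_type',
            'Consequence', 'cDNA_position', 'CDS_position', 'Protein_position',
            'Amino_acids', 'Codons', 'Existing_variation', 'Extra']
vep = pd.read_csv('data/1000G/vep_output.txt', sep='\t', comment='#',
                  header=None, names=vep_cols)

# Keep rows annotated as missense.
missense = vep[vep['Consequence'].str.contains('missense_variant')]

# Keep only biallelic variants: a single alternative allele per uploaded variant.
n_alleles = missense.groupby('Uploaded_variation')['Allele'].nunique()
biallelic = missense[missense['Uploaded_variation'].isin(n_alleles[n_alleles == 1].index)]

# `biallelic` now contains the variants and protein substitutions used to query
# FATHMM and SUPERFAMILY.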

4.3.3.3. Summary

# c_df is the consequence file loaded as a DataFrame indexed by (#CHROM, POS)
# in a previous cell (not shown here).
glue('consequence-rows', len(c_df.index))
glue('consequence-unique-snps', len(c_df.index.unique()))
glue('consequence-unique-proteins', len(c_df['ENSP_id'].unique()))
glue('consequence-unique-families', len(c_df['FAMILY'].unique()))
glue('consequence-unique-supfam', len(c_df['SUPERFAMILY'].unique()))

# Find example SNPs for the excerpt: the first SNP in the file, one whose proteins span
# multiple SCOP families within a single superfamily, and one spanning multiple superfamilies.
regs_snp = c_df.iloc[0].name
multi_family_snp = False
multi_supfam_snp = False
for chrom, pos in c_df[c_df.index.duplicated(keep=False)].index.unique():
    families_per_snp = len(c_df.loc[(chrom, pos)]['FAMILY'].unique())
    superfamilies_per_snp = len(c_df.loc[(chrom, pos)]['SUPERFAMILY'].unique())
    if families_per_snp > 1 and superfamilies_per_snp == 1:
        multi_family_snp = (chrom, pos)
    elif superfamilies_per_snp > 1:
        multi_supfam_snp = (chrom, pos)
        
    if multi_family_snp and multi_supfam_snp:
        break

index_to_view = [regs_snp, multi_family_snp, multi_supfam_snp]
glue("excerpt-consequence", c_df.loc[index_to_view])

| #CHROM | POS | calls | snp_id | ENSP_id | prot_sub | HMM | position | ref_prob | mut_prob | SUPERFAMILY | Sup_e_val | FAMILY | Fam_e_val |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 69224 | A/T | NaN | ENSP00000334393 | D45V | SF0037432 | 44 | 0.524940000001 | 4.26011 | 81321 | 2.93e-78 | 81320 | 0.0021 |
| 1 | 1290695 | G/C | NaN | ENSP00000307887 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.14e-10 | 48727 | 0.0099 |
| 1 | 1290695 | G/C | NaN | ENSP00000344998 | T35S | SF0047556 | 61 | 1.72866 | 2.45993 | 48726 | 0.000485 | 49159 | 0.077 |
| 1 | 1290695 | G/C | NaN | ENSP00000399229 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.29e-10 | 48727 | 0.0099 |
| 1 | 3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 3.73e-18 | 50730 | 0.017 |
| 1 | 3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.53e-57 | 48066 | 8.09e-05 |
| 1 | 3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 1.32e-18 | 50730 | 0.017 |
| 1 | 3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.27e-38 | 48066 | 0.00025 |

Fig. 4.6 An excerpt of the consequence file 2500G.consequence, showing mappings between SNPs, protein IDs, mutant and reference probabilities from FATHMM, and SCOP sun IDs (via SUPERFAMILY).

As we can see in Fig. 4.6, the consequence file can contain multiple rows for the same SNP, since a SNP may fall in multiple proteins (proteins can have overlapping reading frames). Less frequently, SNPs that fall in multiple proteins may also fall in more than one domain family, or even more than one superfamily.

The output .consequence file defines the upper limit on the number of SNPs Snowflake can base phenotype predictions on. In practice, the number used is often much smaller, since for SNPs to be used they must be:

  • measured in the input cohort (which is often genotyped, and therefore contains far fewer variants)

  • within protein domains that exist in the dcGO mapping file as well as the consequence file

  • SNPs where there is some variation between the background and input cohorts

Like the VCF file, the consequence file is then indexed with Tabix[185] to increase Snowflake’s speed.
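Because the consequence file is not a VCF, no Tabix preset applies; with pysam, the index can instead be built by pointing Tabix at the chromosome and position columns explicitly. This is a sketch assuming the column layout shown in Fig. 4.6 (chromosome first, position second) and a placeholder path.

import pysam

consequence_file = 'data/1000G/2500G.consequence'   # placeholder path

# Bgzip-compress the file, then build a Tabix index using the chromosome (column 0)
# and position (column 1) columns; lines starting with '#' are treated as headers.
pysam.tabix_compress(consequence_file, consequence_file + '.gz', force=True)
pysam.tabix_index(consequence_file + '.gz', seq_col=0, start_col=1, end_col=1,
                  meta_char='#', force=True)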

4.3.4. Input cohort

4.3.4.1. 23andMe file formats

The SNPs which are measured in a genotyping experiment depend on the chip used. Since launch, 23andMe have used a number of different Illumina chips for their genotyping service. These chips capture information for different SNPs, and vastly different numbers of SNPs. This means that in order to combine data from different chips, many loci cannot be used.

23andMe have their own file formats, which must be converted to VCF in order to use Snowflake. One of these formats is tab-separated and very similar to VCF, but for their API, 23andMe also stored genotype data in long strings (~1 million characters, twice the number of loci on the chip) of the form AAAACCTTTT__CC__, where every two characters correspond to a given SNP on the 23andMe chip, with an rsID, chromosome and position. To convert this compressed format to VCF, 23andMe provided a SNP map file (snps.data), which differs for each 23andMe chip and gives the rsID, chromosome and position for each index. It looks like:

# index is a key for the /genomes/ endpoint (2 base pairs per index).
# strand is always +1.
index   snp     chromosome      chromosome_position
0       rs41362547      MT      10044
1       rs28358280      MT      10550
2       rs3915952       MT      11251

So, according to this file, the first two characters of the genotype string correspond to rs41362547, then the next two correspond to rs28358280.

The create_data function of snowflake contains the functionality to create a VCF file from 23andMe API string formats, for three different chips.
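The sketch below illustrates the core of that conversion, but it is not the actual create_data implementation: the paths are placeholders, and producing a true VCF additionally requires reference alleles from the genome build, which the sketch omits.

import pandas as pd

# Placeholder paths: one individual's 23andMe API genotype string and the matching
# chip's snps.data file.
genotype_string = open('data/23andMe/genome_string.txt').read().strip()
snp_map = pd.read_csv('data/23andMe/snps.data', sep=r'\s+', comment='#')

# Each index covers two characters of the genotype string (one character per allele).
snp_map['call'] = [genotype_string[2 * i: 2 * i + 2] for i in snp_map['index']]

# Drop no-calls ('__') and keep the columns needed to build VCF-style records.
called = snp_map[snp_map['call'] != '__']
print(called[['snp', 'chromosome', 'chromosome_position', 'call']].head())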

4.3.4.2. Genome builds

The files currently generated for Snowflake use the GRCh37 (hg19) human genome build. This is the build currently supported by FATHMM, 23andMe, and phase 1 of 1000G.