Creating Snowflake inputs
Contents
4.3. Creating Snowflake inputs¶
This section describes the sources and pipelines for creating inputs to Snowflake, including necessary files that are packaged alongside it.
There are four main inputs required by Snowflake which must be created:
dcGO phenotype mapping file: high-quality mappings between protein (supra)domains and phenotype terms from 16 biomedical ontologies - provided as part of Snowflake
Background cohort: genetic data of diverse individuals (
.vcf
, VCF format) - default human option provided as part of SnowflakeConsequence file: deleteriousness scores per SNP (
.Consequence
, format specific tosnowflake
) - default human option provided as part of SnowflakeInput cohort: genetic data of person or people of interest (
.vcf
, VCF format) - provided by user
The dcGO mapping file works for all organisms and only needs remaking if dcGO or SUPERFAMILY have significant updates. Alternatively, a slimmed version can be created that contains only the ontologies of interest to a specific organism which reduces Snowflake’s running time, and here I explain how I made a slimmed version of this file for ontologies of interest to humans. The background cohort and consequence file must be created once for each organism/genome build. Here, since I have only used Snowflake in predicting human phenotypes, I walk through the creation of the human background cohort and consequence file. Lastly, the input cohort file must be created per input cohort. Here I run through the creation of a VCF file from 23andMe genotype files.
Consequences of input cohort SNPs
It is possible to create a consequence file for any cohort VCF file, i.e. it would be possible to make an input cohort consequence file. However, since Snowflake can only cluster our input cohort against the background for SNPs which overlap between the two cohorts, we only need to create one such file for either the input or the background. Here I’ve chosen to do it for the background cohort since then this file can be reused for many different input cohorts.
4.3.1. DcGO phenotype mapping file (human)¶
It is simple to create the dcGO mapping file since dcGO provides the required files for download on the SUPERFAMILY website. The website provides 2 files of high-coverage mappings for each ontology supported by dcGO: one that maps between SCOP domains and ontology terms, and another that maps between SCOP supradomains and ontology terms. Currently, Snowflake does not support the inclusion of dcGO supra-domain assignments.
High-coverage versus high-quality
DcGO provides two versions of it’s GO mappings: high-coverage and high-quality. The high-quality mapping contains only those associations between domains and phenotypes that are supported by single-domain proteins, while the high-coverage version contains some associations that have been inferred from known associations to multi-domain proteins. The high-coverage mapping contains roughly 10 times as many GO terms and 10 times as many protein domains compared to the high-quality version.
For all other (non-GO) ontologies, dcGO provides only the high-coverage version.
The high-coverage mappings were used in Snowflake to increase the coverage of the phenotype predictions, i.e. so that more SNPs could be included and more phenotypes.
The ontologies that contain interpretable phenotype terms for humans are:
Disease Ontology[178] (DO)
Human Phenotype Ontology[107] (HP)
Gene Ontology[101] (GO), specifically terms from the
biological_process
subontology.MEdical Subject Headings[182], (MESH) - on the dcGO website this is found under CTD diseases
Mammalian Phenotype Ontology[181] (MP)
Some of these ontologies aren’t designed only (or even primarily) for humans, but since they contain terms which are relevant to humans and the associations are made based on protein domains found in human proteins, these ontologies were chosen. While dcGO supports many other ontologies (e.g. Zebrafish and Xenopus ontologies) based on protein domains that can be found in humans, I made the decision that the trade-off between the additional time needed to run the phenotype predictor and sift through the results of these ontology terms was not worth the increase in coverage gained from whichever of these terms were meaningful.
To create the Snowflake input for humans, files for the five relevant ontologies were downladed (for DO, HP, GO, MESH, and MP), then concatenated, and the GO cellular component and GO molecular function terms were removed (these term identifiers were extracted from the GO_subontologies
field of the Domain2GO_supported_only_by_all.txt
file).
import pandas as pd
import os
from myst_nb import glue
notebook_name = '3-creating-inputs.ipynb' # __file__ doesn't work in a jupyter notebook
input_dir = 'data/dcGO_input/downloaded'
HP_loc = os.path.join(input_dir, 'Domain2HP.txt')
MESH_loc = os.path.join(input_dir, 'Domain2CD.txt')
DO_loc = os.path.join(input_dir, 'Domain2DO.txt')
GO_loc = os.path.join(input_dir, 'Domain2GO_supported_only_by_all.txt') # high coverage
MP_loc = os.path.join(input_dir, 'Domain2MP.txt')
new_col_names = ['domain_type',
'domain_sunid',
'ont_term_id',
'ont_term_name',
'ont_subontologies',
'information_content',
'annotation_origin_1direct_0inherited']
hp = pd.read_csv(HP_loc, sep='\t', comment= '#')
hp.columns = new_col_names
hp['ont_id'] = 'HP'
mesh = pd.read_csv(MESH_loc, sep='\t', comment= '#')
mesh.columns = new_col_names
mesh['ont_id'] = 'MESH'
do = pd.read_csv(DO_loc, sep='\t', comment= '#')
do.columns = new_col_names
do['ont_id'] = 'DO'
go = pd.read_csv(GO_loc, sep='\t', comment= '#')
go.columns = new_col_names
go = go[go['ont_subontologies'] == 'biological_process']
go['ont_id'] = 'GO_BP'
mp = pd.read_csv(MP_loc, sep='\t', comment= '#')
mp.columns = new_col_names
mp['ont_id'] = 'MP'
all_onts = pd.concat([hp, mesh, do, go, mp], ignore_index=True)
outfile = 'data/dcGO_input/created/human_po.txt'
with open(outfile, 'w') as f:
f.write(f'# file created at {pd.Timestamp.now()} by {notebook_name}.\n')
f.write('# file contains high coverage mappings for HP, MESH (CD), DO, and GOBP.\n')
all_onts.to_csv(f, index=False)
display(all_onts.head())
def glue_counts_dcGO(df: pd.DataFrame, identifier: str):
"""
Expects a DataFrame df that contains only ontology terms of interest.
"""
glue(f"{identifier}_assignments", len(df))
glue(f"{identifier}_terms", len(df['ont_term_id'].unique()))
glue(f"{identifier}_domains", len(df['domain_sunid'].unique()))
return None
glue_counts_dcGO(all_onts[all_onts['ont_id']=='HP'], 'HP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MESH'], 'MESH')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='DO'], 'DO')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='GO_BP'], 'GOBP')
glue_counts_dcGO(all_onts[all_onts['ont_id']=='MP'], 'MP')
glue_counts_dcGO(all_onts, 'all')
from IPython.display import HTML
lines_to_view = [8000,20000,50000]
view = HTML(all_onts.iloc[lines_to_view].to_html(index=False))
glue("excerpt-pofile", view)
domain_type | domain_sunid | ont_term_id | ont_term_name | ont_subontologies | information_content | annotation_origin_1direct_0inherited | ont_id |
---|---|---|---|---|---|---|---|
fa | 55528 | HP:0000925 | Abnormality of the vertebral column | Phenotypic_abnormality | 1.123852 | 0 | HP |
sf | 53822 | MESH:D013568 | Pathological Conditions, Signs and Symptoms | CTD_diseases | 0.581857 | 0 | MESH |
sf | 48619 | GO:0006658 | phosphatidylserine metabolic process | biological_process | 2.596963 | 1 | GO_BP |
Fig. 4.5 shows the type of content inside the phenotype ontology file.
the information_content
field is a measure of how specific the phenotype term is, for example Pathological Conditions, Signs and Symptoms is a very general term, while Abnormality of the vertebral column is more specific and phosphatidylserine metabolic process is the most specific shown.
Ontology |
Number of dcGO assignments |
Number of unique terms assigned |
Number of unique domain |
---|---|---|---|
HP: Human phenotype ontology |
15984 |
1978 |
562 |
MeSH: Medical Subject headings |
5761 |
620 |
473 |
DO: Disease ontology |
7538 |
614 |
566 |
GOBP: Gene ontology, biological process |
304064 |
9546 |
3135 |
MP: Mammalian phenotype ontology |
24208 |
2528 |
719 |
Total (all included ontologies) |
357555 |
15286 |
3146 |
Table 4.1 shows the number of assignments, unique phenotype terms, and unique domains covered in the dcGO phenotype mapping per ontology and in total. The file contains 357555 assignments, 15286 unique ontology terms, and 3146 unique domains. This constrains the proteins and phenotypes which Snowflake can make predictions about.
4.3.2. Background cohort¶
As described in the overview, Snowflake requires genetic “background” data to compare the individuals we are interested in against, i.e. so that meaningful clustering can take place. Although Snowflake has the functionality to be run with any background data set, the choice of data set is constrained since it must contain representative genetic data from from the entire population.
4.3.2.1. Data acquisition: the 1000 Genomes project¶
The 1000 Genomes project[175,183] ran from 2008 to 2015, with the aim of comprehensively mapping common human genetic variation across diverse populations. The project sequenced individuals whole genomes, and released data in two main phases:
Phase 1: 37.9 million variants, in 1092 individuals, across 14 populations
Phase 3: 84.4 million variants, in 2504 individuals, across 26 populations
Data from the 1000 Genomes project are always used for the background cohort to Snowflake, with data from the 1000 Genomes project Phase 3[175] used as a default, and earlier experiments using data from Phase 1[183].
Here I describe the process of creating the phase 1 VCF file (1000G.vcf
), but the same process is followed to create the larger phase 3 VCF file (2500G.vcf
).
For both phases of the 1000 Genomes project, data are provided as VCF files for each chromosome.
Both data sets are available through the International Genome Sequencing Resource[184] (IGSR); phase 1 VCFs (GRCh37) can be downloaded via FTP at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/integrated_call_sets/
, and phase 3 VCFs (GRCh37) can be downloaded at ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/
.
After downloading these files, the per-chromosome VCF files were combined into one large VCF file (for all chromosomes). The consequence file must be created at this stage, before the final input VCF can be created.
4.3.2.2. Create final input VCF¶
In order to create the final input VCF, we must remove SNPs from the VCF file that are not in the consequence file. This reduces the file size dramatically, since many recorded SNPs are either synonymous, nonsense, non-coding or multi-allelic. The VCF file is then Tabix-indexed[185], which increases the speed of Snowflake.
4.3.3. Consequence file¶
The consequence file contains the “consequences” of one amino acid being changed to another, which is used to weight which SNPs are the more important dimensions for clustering a phenotype. It also contains the mapping between SNPs and proteins, and SNPs and domain architecture.
This file is generated using:
Ensembl’s VEP tool to map from chromosomes and locations to SNP type (e.g. missense/nonsense/nonsynonymous), and to protein ID
FATHMM for scoring the deleteriousness of missense mutatations
SUPERFAMILY for mapping from protein IDs given by VEP to their domain architectures (SCOP family/superfamily)
4.3.3.1. Run the Variant Effect Predictor tool¶
In order to get a list of SNPs to input to FATHMM, we must first determine which SNPs in the background data are missense SNPs. This can be done using Ensembl’s VEP web tool[120], which takes a VCF as input. This first 10 columns of the combined VCF file is used as input to VEP since only these columns are needed, and making the file smaller reduces processing time.
4.3.3.2. Query FATHMM and SUPERFAMILY for the SNPs of interest¶
To get the consequence file, VEP output was first filtered to contain only biallelic missense SNPs, and was then used as input to query the FATHMM and SUPERFAMILY databases for the unweighted conservation scores and the domain assignments.
4.3.3.3. Summary¶
glue('consequence-rows', len(c_df.index))
glue('consequence-unique-snps', len(c_df.index.unique()))
glue('consequence-unique-proteins', len(c_df['ENSP_id'].unique()))
glue('consequence-unique-families', len(c_df['FAMILY'].unique()))
glue('consequence-unique-supfam', len(c_df['SUPERFAMILY'].unique()))
regs_snp = c_df.iloc[0].name
multi_family_snp = False
multi_supfam_snp = False
for chrom, pos in c_df[c_df.index.duplicated(keep=False)].index.unique():
families_per_snp = len(c_df.loc[(chrom, pos)]['FAMILY'].unique())
superfamilies_per_snp = len(c_df.loc[(chrom, pos)]['SUPERFAMILY'].unique())
if families_per_snp > 1 and superfamilies_per_snp == 1:
multi_family_snp = (chrom, pos)
elif superfamilies_per_snp > 1:
multi_supfam_snp = (chrom, pos)
if multi_family_snp and multi_supfam_snp:
break
index_to_view = [regs_snp, multi_family_snp, multi_supfam_snp]
glue("excerpt-consequence", c_df.loc[index_to_view])
calls | snp_id | ENSP_id | prot_sub | HMM | position | ref_prob | mut_prob | SUPERFAMILY | Sup_e_val | FAMILY | Fam_e_val | ||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
#CHROM | POS | ||||||||||||
1 | 69224 | A/T | NaN | ENSP00000334393 | D45V | SF0037432 | 44 | 0.524940000001 | 4.26011 | 81321 | 2.93e-78 | 81320 | 0.0021 |
1290695 | G/C | NaN | ENSP00000307887 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.14e-10 | 48727 | 0.0099 | |
1290695 | G/C | NaN | ENSP00000344998 | T35S | SF0047556 | 61 | 1.72866 | 2.45993 | 48726 | 0.000485 | 49159 | 0.077 | |
1290695 | G/C | NaN | ENSP00000399229 | T136S | SF0042359 | 95 | 1.83872 | 2.56474 | 48726 | 3.29e-10 | 48727 | 0.0099 | |
3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 3.73e-18 | 50730 | 0.017 | |
3392588 | T/C | NaN | ENSP00000367629 | Y479H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.53e-57 | 48066 | 8.09e-05 | |
3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0040099 | 5 | 2.69907 | 4.0507 | 50729 | 1.32e-18 | 50730 | 0.017 | |
3392588 | T/C | NaN | ENSP00000408887 | Y183H | SF0050917 | 5 | 2.69907 | 4.0507 | 48065 | 3.27e-38 | 48066 | 0.00025 |
As we can see in An excerpt of the consequence file 2500G.consequence, showing mappings between SNPs, protein IDs, mutant and reference probabilities from FATHMM, and SCOP sun IDs (via SUPERFAMILY)., the consequence file can contain multiple rows for the same SNPs since the SNP may appear in multiple proteins, since proteins can have overlapping reading frames. Less frequently, SNPs that fall in multiple proteins may also fall in more than one domain families or even superfamilies.
The output .consequence
file defines the upper limit of the number of SNPs Snowflake can make phenotype predictions based on.
In practice, this will often be a much smaller number as for SNPs to be used they must be:
measured in the input cohort (which is often genotyped, and therefore contains far fewer variants)
within protein domains that exist in the dcGO mapping file in addition to the consequence file.
a SNP where there is some variation between the background and input
Like the VCF file, the Consequence file is the indexed with Tabix[185] to increase Snowflake’s speed.
4.3.4. Input cohort¶
4.3.4.1. 23andMe file formats¶
The SNPs which are measured in a genotyping experiment depends on the chip used. Since launch, 23andMe have used a number of different Illumina chips for their genotyping service. These chips capture information for different SNPs, and vastly different numbers of SNPs. This means that in order to combine data from different chips, many loci can not be used.
23andMe have their own file formats, which must be convered to VCF in order to use Snowflake.
One of these formats is tab separated and very similar to VCF, but for their API, 23andMe also stored genotype data in long strings (~1 million characters, twice the number of loci on the chip) of the form AAAACCTTTT__CC__
, where every 2 nucleotides corresponds to a given SNP on the 23andMe chip with a rsID, chromosome and position.
To convert this compressed format to VCF, 23andMe provided a genotype snp map file (snps.data
), which is different for each 23andMe chip, that gives rsIDs, chromosome and position for each index, which looks like:
# index is a key for the /genomes/ endpoint (2 base pairs per index).
# strand is always +1.
index snp chromosome chromosome_position
0 rs41362547 MT 10044
1 rs28358280 MT 10550
2 rs3915952 MT 11251
So, according to this file, the first two characters of the genotype string correspond to rs41362547
, then the next two correspond to rs28358280
.
The create_data
function of snowflake contains the functionality to create a VCF file from 23andMe API string formats, for three different chips.
4.3.4.2. Genome builds¶
The files currently generated for snowflake use the GRC37 (hg19) human genome build. This is the version that is currently supported by FATHMM, 23andMe, and phase 1 of 1000G.