6.3. Data

This section describes the gene expression data used for creating and validating Filip, including it’s provenance as well as any necessary data-cleaning.

I also describe the benchmarking data I used to develop Filip.

In addition to this, Filip requires the input of a protein function or phenotype prediction method, but this (and the data required for this) is described in Validation method.

6.3.1. Expression data: FANTOM5

The Filip method requires expression data to inform whether or not predictions should be filtered out. The FANTOM5 data set was chosen for this purpose.

FANTOM5 represents one of the most comprehensive collections of expression data in terms of tissue and cell type. It consists of expression data, captured using the CAGE technique. FANTOM5 collected a combination of human, mouse, health, and disease data, as well as time courses and cell perturbations. At the time of developing it was the latest data output of the FANTOM consortium.

My reasoning for choosing FANTOM5 data as the input gene expression data to test Filip was:

  • The data set has a good coverage of different tissue types, which I hoped would be helpful in Filip having a good coverage of predictions.

  • The data set has an ontology of samples, which is already linked to Uberon tissue terms, making the mapping process much easier.

  • For the purpose of Filip (getting measure of whether a cell is meaningfully expressed in a tissue of interest), choosing bulk RNA-Seq over scRNA-Seq makes sense, as it is a measure of many more cells.

I chose the version of the FANTOM5 data that:

  • had been reprocessed using the hg38 reference genome (the original FANTOM5 data was processed using hg19)[212].

  • contained annotated information about the samples, as this information could be used to aid in mapping.

  • available in TPM format.

6.3.1.1. Data files and acquisition

I downloaded the following files from the FANTOM website:

6.3.1.2. Initial FANTOM5 data cleaning: sample info file

import helper_c05.fantom_sample_clean as fsc

# read in CAGE header and samples info files:
expression_header = fsc.read_CAGE_header()
samples_info = fsc.read_samples_info()

# calculate replicates information:
rep_info = fsc.get_tech_bio_reps(expression_header, samples_info)

# clean samples info files:
samples_info = fsc.restrict_samples(samples_info)
samples_info = fsc.update_sample_info_labels(samples_info, rep_info)
samples_info = fsc.clean_samples_info(samples_info, rep_info)

6.3.1.2.1. Sample categories

Restriction to primary cell and tissue samples:

The human FANTOM5 sample information file contains four categories of samples (in the Characteristics [Category] field):

  • time courses: RNA extracted from samples being measured over time as cells change types during cell development and differentiation (783 samples), e.g. 'FF:12265-130A6' - Lymphatic Endothelial cells response to VEGFC, 01hr20min, biol_rep1 (MM XIX - 6).

  • primary cells: RNA extracted from cultures of cells recently isolated from tissues, before undergoing proliferation with nutrients specific to the cell type (561 samples), e.g. 'FF:11216-116B1' - Urothelial cells, donor0.

  • cell lines: RNA extracted from immortal cell lines (which unlike primary cells can keep undergoing division indefinitely) (268 samples).

  • tissues: RNA extracted from post-mortem tissues, which may be pooled or individual donors (183 samples), e.g. 'FF:10012-101C3' - brain, adult, pool1.

  • fractionations: RNA extracted from parts of cells (fractionations) (21 samples), e.g. 'FF:14310-155C8' - 'Fibroblast - Aortic Adventitial donor3 (cytoplasmic fraction)'.

I restricted the data set to only tissues and primary cells, as these represent the closest approximations to in vivo biology. Immortal cell lines are often expressed differently than their primary counterparts[214,215], and time courses and fractionations do not represent any particular tissue.

Sample Type:

As mentioned, tissues can come from a pool, or individual donor. This information can be found in the Charateristics [description] field. I combined this information with information from the Characteristics [Category] field to create an additional Sample Type field that describes whether a sample is a tissue - pool, tissue - donor or primary cells sample.

Technical and biological replicates:

The FANTOM accession numbers are per sample, not per measurement. Samples for which there are repeat measurements (technical replicates) will show up multiple times in the expression file. FANTOM technical and biological replicates are indicated in long labels of the annotated expression FANTOM file, by the inclusion of “tech_rep” or “biol_rep” in the long sample labels e.g. tpm.Dendritic%20Cells%20-%20monocyte%20immature%20derived%2c%20donor1%2c%20tech_rep1.CNhs10855.11227-116C3.hg38.nobarcode. These were used to create additional fields for the human samples table.

Note: there is an error in the original transcript expression file for one of these identifiers (tpm.Dendritic%20Cells%20-%20monocyte%20immature%20derived%2c%20donor1%2c%20rep2.CNhs11062.11227-116C3.hg38.nobarcode) such that it is missing the “tech” part of the the replicate label. There is a hard-coded fix for this accession when I read in the input file and the FANTOM data curation team was informed.

After restricting the data set to primary cell and tissue type samples, there are 58 remaining samples which have biological replicates (between 2 and 3 replicates each), and 8 sets of samples with technological replicates (2 replicates each).

Age and age range:

The age of the sample source donor(s) is available through two fields in the human sample information file: Characteristics [Developmental stage], and Characteristics [Age]. These fields contain description-like text, which are somewhat inconsistent, for example, “3 year old child”, “3 years old child”, “25 year old”, “76” and “76 years old adult” all feature in the same column, amongst other errors. These were standardised into a new field (Age (years)). This field does not seek to include multiple ages (i.e. when the sample comes from a pool of donors). There is a complementary (i.e. no overlap) field (Age range (years)), which contains age ranges for the 46 samples that contain multiple ages. In both columns, some samples contain fetal samples, in which case, I convert age (range) to a negative decimal (converted to years before birth).

There were also some discrepancies between ages and developmental stages in the FANTOM human samples file. For example, sample FF:10027-101D9 is labelled as thymus, adult, pool1 in the Description field, but as 0.5,0.5,0.83 years old infant in the Developmental Stage field. Sample FF:10209-103G2 had an age of ‘M’ and a sex of ‘28’. I reported both these discrepancies: and the latter has since been fixed in the FANTOM file, and for the former, I hardcode the age to NaN.

Sex:

The Characteristics [Sex] field contains information about the sex of the sample source donor(s). Similarly to age, due to the consortium nature of FANTOM5, the entries of this field are not consistently labelled. They undergo data cleaning into 4 categories: male, female, mixed (pool with both male and female samples), and unlabelled.

Disease and tissue mapping:

The disease status of samples (e.g. healthy/non-healthy) is not straight-forwardly labelled in the human sample file, so requires some basic text-mining (and cross-referencing with ontology terms). Similarly, there is a Characteristics[Tissue] field in the human samples file containing some manually mapped tissue types, but as I point out with an example in the exploratory data analysis, these do not contain ideal mappings for Filip.

The continued data processing of these components is described in the methdology section, after the introduction of uberon-py (the package developed to do this).

6.3.1.3. Initial FANTOM5 data cleaning: expression file

from helper_c05 import fantom_tpm_clean as tpm_clean

# Create list of allowed primary + tissue samples to prevent reading in whole tpm file:
long_ids_to_keep, long_ids_to_new_ff = fsc.long_ids_to_restricted_samples(samples_info, expression_header)
dtypes = tpm_clean.get_dtypes(expression_header)
tpm_file = '../c08-combining/data/experiments/fantom/hg38_fair+new_CAGE_peaks_phase1and2_tpm_ann.osc.txt'
cage_tpm = tpm_clean.read_and_clean_tpm(tpm_file, long_ids_to_keep, long_ids_to_new_ff, dtypes)
protein_tpm = tpm_clean.get_protein_tpm(cage_tpm)

The tidied and restricted sample data, is combined with the FANTOM5 CAGE peaks expression data file and processed to create a protein-centric expression file. The CAGE peaks have already been cleaned by FANTOM (labeled as “fair”) meaning that CAGE peaks do not overlap.

CAGE peaks with associated proteins:

The CAGE peaks represent all kinds of mRNA transcripts, not only those which map to protein-coding gene, for example “RNA genes” representing pseduogenes or long non-coding RNAs. The FANTOM file provides mappings to Uniprot IDs (uniprot_id), and these are used to discard the CAGE peaks that do not map to protein-coding genes: this takes us from 209912 to 58592 rows (CAGE peaks).

CAGE peaks mapped to one gene only:

CAGE peaks are mapped to genes based on overlap with the gene, so it is not always clear which gene a CAGE peak maps to. For simplicity, and to remove the potential of wrongly mapped genes being used in Filip, protein-coding CAGE peaks (those which are mapped to at least one uniprot_id by FANTOM) but that map to multiple genes are removed. These can be found by looking at either the hgnc_id or entrezgene_id gene identifier columns. The choice of gene ID matters, since there are discrepancies between gene ID databases: in this case, choosing hgnc_id finds all those CAGE peaks found by using entrezgene_id, and more, so these are removed. This represents a total of 579 CAGE peaks that map to multiple genes according to the given identifiers.

Proteins that map to multiple genes:

For Filip, the expression was calculated per protein (since it is protein function predictions that it is filtering), rather than per CAGE peak (summing the TPMs of all CAGE peaks mapped to a protein to get the total for that protein) as in the original data, or as is often presented per gene. This gave 56554 rows of “protein expression” data.

Of these, there were then 59 rows of data (corresponding to 21 proteins) for which each protein maps to multiple genes. This happens when different genes are translated to make identical protein products, for example the H4 human histone protein is encoded by 14 different genes at different loci, across three different chromosomes. It used to be the case that Uniprot would map these genes to the same Uniprot ID, but more recently different Uniprot IDs are used to capture where the proteins came from. These small number of rows were also removed for simplicity.

6.3.1.4. Exploratory Data Analysis

Samples:

After restricting the samples to those which are primary cells or tissues, there were 744 remaining samples.

from helper_c05 import fantom_sample_eda as f_eda
sex_donut, tissues_samples, nan_age_count, collaborators_providers = f_eda.create_plot_dfs(samples_info)
fig = f_eda.create_plotly_plots(samples_info, sex_donut, tissues_samples, nan_age_count, collaborators_providers)
fig.show('notebook')
../_images/blank.png

Fig. 6.2 (a) sex: a donut plot showing the sex labels of samples. (b) collaborators and providers: a stacked histogram showing the 10 most common collaborators, and 10 most common providers. (c) age: a histogram of age of sample donors (this does not include the 46 samples which have age ranges due to pooled donors of various ages). (d) tissues and sample types: a histogram showing the 50 most common tissues, spread across the different types of samples (primary cells, tissue donors, and tissue pools).

Sample metadata: Looking at the FANTOM5 data (see Fig. 6.2), overall we see that the samples are very varied, across ages, sex, sample providers, and collaborators, although (d) shows that the majority of samples are primary cell samples, and very few are tissue - pool samples. Secondly, we can see that after careful cleaning, some metadata is missing, i.e. 38.4% of samples have unknown sex (a), most collaborators did not label the sample provider (b), and most samples do not have a labelled age (c).

Sample Tissues: In Fig. 6.2 subplot (d), we can also note some interesting things about the tissue types provided by the Fantom Human Samples file. 30 primary cell samples are labeled ANATOMICAL SYSTEM. If we look closer at these samples, we can see that it is theoretically possible to map some of these samples to tissues (see Fig. 6.3).

There is also the question of how general or specific the human sample categories are. There are 101 samples which are mapped to blood (Fig. 6.3 (d)), but when we come to map the FANTOM5 tissues to phenotypes, this may be too broad a category. Similarly, there are 47 with less than three samples each (not pictured) that may be too narrow to map to phenotypes, and a more accurate picture of that phenotype would come from taking a more general tissue.

anatomical_system_samples = f_eda.anat_system_tbl(samples_info, chosen_samples = [2, 10, 15, 20])
Charateristics [description] Characteristics[Tissue] Characteristics [Category]
Charateristics [ff_ontology]
FF:11923-125H6 Fibroblast - Gingival, donor7 (aggressive peri... ANATOMICAL SYSTEM primary cells
FF:11933-125I7 Olfactory epithelial cells, donor1 ANATOMICAL SYSTEM primary cells
FF:11938-126A3 gamma delta positive T cells, donor2 ANATOMICAL SYSTEM primary cells
FF:11960-126C7 Smooth muscle cells - airway, asthmatic, donor1 ANATOMICAL SYSTEM primary cells

Fig. 6.3 An example of four ANANTOMICAL SYSTEM tissues, with tissue-specific cells, indicating that they could be mapped to tissues. For example sample FF:11922-125H5 is a gingival fibroblast, which are one of the main constituent cells of gum tissue.

We can also see in Fig. 6.3 that this data set, though having undergone some data cleaning, still contains disease samples (e.g. “aggressive periodontitis”).

Protein-centric TPM:

from helper_c05 import fantom_tpm_eda as fte
fig = fte.create_distribution_plot(samples_info, protein_tpm)
fig.show('notebook')
../_images/blank.png

Fig. 6.4 (a) box-plots showing the distribution of mean (TPM+1) values (note: logarithmic x axis) for the top 10 most common tissue and primary cell samples in the FANTOM5 human data. (b) density-plots showing the distribution of mean (TPM+1) values for the top 10 most common tissue and primary cell samples in the FANTOM5 human data on log-log axes.

As expected, Fig. 6.4 shows similar distributions of expression per tissue since the data is TPM normalised (since TPM normalises samples by sample library size to account for sequencing depth), with the characteristic long tail.

6.3.2. Cell, tissue, and phenotype mapping data

I also used the following datasets to aid in mapping to a common set of identifiers:

6.3.3. “Training” set: CAFA2

During development, I tested Filip by comparing DcGO only and Filip-plus-DcGO on data from the 2nd round of the CAFA competition: CAFA2. I chose to use the CAFA2 data because rather than a larger set of annotations (such as those available from SwissProt-KB or GOA) because it provided a way of validating on unknown targets. I.e. if I made predictions with DcgO using the version of GO from the time the challenge was launched, and I use the groundtruth data provided by CAFA2, then I could compare my results with those in the CAFA2 competition and I could look at my results on unknown targets. Although Filip was not literally “trained” on this data in a machine-learning sense (it doesn’t have any formalised parameters), I had access to the “groundtruth” data as I was developing CAFA2.

This was the most recent round of CAFA for which there were “groundtruth” data available at the time of development.

6.3.3.1. Data files and acquisition

The data consisted of:

  • CAFA2 targets: a list of proteins which the CAFA2 competition was soliciting predictions for.

  • CAFA2 ground truth data: experimentally validated associations between proteins and GO terms, divided by category.

Both of which could be found within the CAFA2 paper[116]’s Supplementary Material.

CAFA2 targets: CAFA2 provided targets from species across the tree of life: bacteria (10 species), archaea (7 species), and eukaryotes (10 species). Since tissue-specific gene expression data (which Filip requires) is not available for all species, I only used the human targets (in data/CAFA2-targets/eukarya/sp_species.9606.tfa).

The sp_species.9606.tfa file is a FASTA file containing information about 20257 proteins, each with a CAFA2 identifier (e.g. T96060000001), Uniprot Entry Name (the mnemonic identifier for the protein, e.g. 1433B_HUMAN), and the amino acid sequence, as in the following excerpt:

>T96060000001 1433B_HUMAN
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSS
WRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFY
LKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRLGLALNFSVFY
YEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGD
AGEGEN

CAFA2 benchmark: The CAGA2 benchmark data was available in the /data/benchmark directory of the CAFA2 Supplementary Data. It includes:

  • Lists of different types of targets for which there is groundtruth data (in /data/benchmark/lists): each line of these files is a CAFA2 protein identifier (e.g. T96060015767). The lists are separated into different files according to species, source phenotype ontology (e.g. HP, GO), and protein type (type1 = No Knowledge, type2 = Limited Knowledge). There are 7 files for human.

  • Groundtruth associations (in /data/benchmark/groundtruth): tab-separated CAFA protein identifiers and phenotype ontology terms, e.g. T96060000002    HP:0000348), organised into 8 separate files by source phenotype ontology, and whether the proteins are experimentally annotated to the exact term, or whether an association can be inferred due to a ontology relationship.