Data
Contents
6.3. Data¶
This section describes the gene expression data used for creating and validating Filip, including it’s provenance as well as any necessary data-cleaning.
I also describe the benchmarking data I used to develop Filip.
In addition to this, Filip requires the input of a protein function or phenotype prediction method, but this (and the data required for this) is described in Validation method.
6.3.1. Expression data: FANTOM5¶
The Filip method requires expression data to inform whether or not predictions should be filtered out. The FANTOM5 data set was chosen for this purpose.
FANTOM5 represents one of the most comprehensive collections of expression data in terms of tissue and cell type. It consists of expression data, captured using the CAGE technique. FANTOM5 collected a combination of human, mouse, health, and disease data, as well as time courses and cell perturbations. At the time of developing it was the latest data output of the FANTOM consortium.
My reasoning for choosing FANTOM5 data as the input gene expression data to test Filip was:
The data set has a good coverage of different tissue types, which I hoped would be helpful in Filip having a good coverage of predictions.
The data set has an ontology of samples, which is already linked to Uberon tissue terms, making the mapping process much easier.
For the purpose of Filip (getting measure of whether a cell is meaningfully expressed in a tissue of interest), choosing bulk RNA-Seq over scRNA-Seq makes sense, as it is a measure of many more cells.
I chose the version of the FANTOM5 data that:
had been reprocessed using the hg38 reference genome (the original FANTOM5 data was processed using hg19)[212].
contained annotated information about the samples, as this information could be used to aid in mapping.
available in TPM format.
6.3.1.1. Data files and acquisition¶
I downloaded the following files from the FANTOM website:
the FANTOM5 CAGE peaks expression data containing expression in counts per CAGE peak, and mappings to transcript id (ensembl ENST id), HGNC id and entrez gene ID. The long sample labels in this file are also a source of metadata (including sample identifiers (FANTOM accession numbers)).
FANTOM’s human sample information file containing text descriptions about sample, for example FANTOM accession numbers, tissue, age, sex, disease, etc, which is necessary for data cleaning.
the FANTOM5 ontology containing an obo file mapping between FANTOM accession numbers, Uberon and cell ontology (CL) terms.
6.3.1.2. Initial FANTOM5 data cleaning: sample info file¶
import helper_c05.fantom_sample_clean as fsc
# read in CAGE header and samples info files:
expression_header = fsc.read_CAGE_header()
samples_info = fsc.read_samples_info()
# calculate replicates information:
rep_info = fsc.get_tech_bio_reps(expression_header, samples_info)
# clean samples info files:
samples_info = fsc.restrict_samples(samples_info)
samples_info = fsc.update_sample_info_labels(samples_info, rep_info)
samples_info = fsc.clean_samples_info(samples_info, rep_info)
6.3.1.2.1. Sample categories¶
Restriction to primary cell and tissue samples:
The human FANTOM5 sample information file contains four categories of samples (in the Characteristics [Category]
field):
time courses: RNA extracted from samples being measured over time as cells change types during cell development and differentiation (
783
samples), e.g.'FF:12265-130A6'
- Lymphatic Endothelial cells response to VEGFC, 01hr20min, biol_rep1 (MM XIX - 6).primary cells: RNA extracted from cultures of cells recently isolated from tissues, before undergoing proliferation with nutrients specific to the cell type (
561
samples), e.g.'FF:11216-116B1'
- Urothelial cells, donor0.cell lines: RNA extracted from immortal cell lines (which unlike primary cells can keep undergoing division indefinitely) (268 samples).
tissues: RNA extracted from post-mortem tissues, which may be pooled or individual donors (
183
samples), e.g.'FF:10012-101C3'
- brain, adult, pool1.fractionations: RNA extracted from parts of cells (fractionations) (
21
samples), e.g.'FF:14310-155C8'
-'Fibroblast - Aortic Adventitial donor3 (cytoplasmic fraction)'
.
I restricted the data set to only tissues and primary cells, as these represent the closest approximations to in vivo biology. Immortal cell lines are often expressed differently than their primary counterparts[214,215], and time courses and fractionations do not represent any particular tissue.
Sample Type:
As mentioned, tissues can come from a pool, or individual donor.
This information can be found in the Charateristics [description]
field.
I combined this information with information from the Characteristics [Category]
field to create an additional Sample Type
field that describes whether a sample is a tissue - pool
, tissue - donor
or primary cells
sample.
Technical and biological replicates:
The FANTOM accession numbers are per sample, not per measurement.
Samples for which there are repeat measurements (technical replicates) will show up multiple times in the expression file.
FANTOM technical and biological replicates are indicated in long labels of the annotated expression FANTOM file, by the inclusion of “tech_rep” or “biol_rep” in the long sample labels e.g. tpm.Dendritic%20Cells%20-%20monocyte%20immature%20derived%2c%20donor1%2c%20tech_rep1.CNhs10855.11227-116C3.hg38.nobarcode
.
These were used to create additional fields for the human samples table.
Note: there is an error in the original transcript expression file for one of these identifiers (tpm.Dendritic%20Cells%20-%20monocyte%20immature%20derived%2c%20donor1%2c%20rep2.CNhs11062.11227-116C3.hg38.nobarcode
) such that it is missing the “tech” part of the the replicate label.
There is a hard-coded fix for this accession when I read in the input file and the FANTOM data curation team was informed.
After restricting the data set to primary cell and tissue type samples, there are 58
remaining samples which have biological replicates (between 2 and 3 replicates each), and 8
sets of samples with technological replicates (2 replicates each).
Age and age range:
The age of the sample source donor(s) is available through two fields in the human sample information file: Characteristics [Developmental stage]
, and Characteristics [Age]
.
These fields contain description-like text, which are somewhat inconsistent, for example, “3 year old child”, “3 years old child”, “25 year old”, “76” and “76 years old adult” all feature in the same column, amongst other errors.
These were standardised into a new field (Age (years)
).
This field does not seek to include multiple ages (i.e. when the sample comes from a pool of donors).
There is a complementary (i.e. no overlap) field (Age range (years)
), which contains age ranges for the 46
samples that contain multiple ages.
In both columns, some samples contain fetal samples, in which case, I convert age (range) to a negative decimal (converted to years before birth).
There were also some discrepancies between ages and developmental stages in the FANTOM human samples file.
For example, sample FF:10027-101D9
is labelled as thymus, adult, pool1 in the Description field, but as 0.5,0.5,0.83 years old infant in the Developmental Stage field.
Sample FF:10209-103G2
had an age of ‘M’ and a sex of ‘28’.
I reported both these discrepancies: and the latter has since been fixed in the FANTOM file, and for the former, I hardcode the age to NaN
.
Sex:
The Characteristics [Sex]
field contains information about the sex of the sample source donor(s).
Similarly to age, due to the consortium nature of FANTOM5, the entries of this field are not consistently labelled.
They undergo data cleaning into 4 categories: male, female, mixed (pool with both male and female samples), and unlabelled.
Disease and tissue mapping:
The disease status of samples (e.g. healthy/non-healthy) is not straight-forwardly labelled in the human sample file, so requires some basic text-mining (and cross-referencing with ontology terms).
Similarly, there is a Characteristics[Tissue]
field in the human samples file containing some manually mapped tissue types, but as I point out with an example in the exploratory data analysis, these do not contain ideal mappings for Filip.
The continued data processing of these components is described in the methdology section, after the introduction of uberon-py
(the package developed to do this).
6.3.1.3. Initial FANTOM5 data cleaning: expression file¶
from helper_c05 import fantom_tpm_clean as tpm_clean
# Create list of allowed primary + tissue samples to prevent reading in whole tpm file:
long_ids_to_keep, long_ids_to_new_ff = fsc.long_ids_to_restricted_samples(samples_info, expression_header)
dtypes = tpm_clean.get_dtypes(expression_header)
tpm_file = '../c08-combining/data/experiments/fantom/hg38_fair+new_CAGE_peaks_phase1and2_tpm_ann.osc.txt'
cage_tpm = tpm_clean.read_and_clean_tpm(tpm_file, long_ids_to_keep, long_ids_to_new_ff, dtypes)
protein_tpm = tpm_clean.get_protein_tpm(cage_tpm)
The tidied and restricted sample data, is combined with the FANTOM5 CAGE peaks expression data file and processed to create a protein-centric expression file. The CAGE peaks have already been cleaned by FANTOM (labeled as “fair”) meaning that CAGE peaks do not overlap.
CAGE peaks with associated proteins:
The CAGE peaks represent all kinds of mRNA transcripts, not only those which map to protein-coding gene, for example “RNA genes” representing pseduogenes or long non-coding RNAs.
The FANTOM file provides mappings to Uniprot IDs (uniprot_id
), and these are used to discard the CAGE peaks that do not map to protein-coding genes: this takes us from 209912
to 58592
rows (CAGE peaks).
CAGE peaks mapped to one gene only:
CAGE peaks are mapped to genes based on overlap with the gene, so it is not always clear which gene a CAGE peak maps to.
For simplicity, and to remove the potential of wrongly mapped genes being used in Filip, protein-coding CAGE peaks (those which are mapped to at least one uniprot_id
by FANTOM) but that map to multiple genes are removed.
These can be found by looking at either the hgnc_id
or entrezgene_id
gene identifier columns.
The choice of gene ID matters, since there are discrepancies between gene ID databases: in this case, choosing hgnc_id
finds all those CAGE peaks found by using entrezgene_id
, and more, so these are removed.
This represents a total of 579
CAGE peaks that map to multiple genes according to the given identifiers.
Proteins that map to multiple genes:
For Filip, the expression was calculated per protein (since it is protein function predictions that it is filtering), rather than per CAGE peak (summing the TPMs of all CAGE peaks mapped to a protein to get the total for that protein) as in the original data, or as is often presented per gene.
This gave 56554
rows of “protein expression” data.
Of these, there were then 59
rows of data (corresponding to 21
proteins) for which each protein maps to multiple genes.
This happens when different genes are translated to make identical protein products, for example the H4 human histone protein is encoded by 14 different genes at different loci, across three different chromosomes.
It used to be the case that Uniprot would map these genes to the same Uniprot ID, but more recently different Uniprot IDs are used to capture where the proteins came from.
These small number of rows were also removed for simplicity.
6.3.1.4. Exploratory Data Analysis¶
Samples:
After restricting the samples to those which are primary cells or tissues, there were 744
remaining samples.
from helper_c05 import fantom_sample_eda as f_eda
sex_donut, tissues_samples, nan_age_count, collaborators_providers = f_eda.create_plot_dfs(samples_info)
fig = f_eda.create_plotly_plots(samples_info, sex_donut, tissues_samples, nan_age_count, collaborators_providers)
fig.show('notebook')
Sample metadata: Looking at the FANTOM5 data (see Fig. 6.2), overall we see that the samples are very varied, across ages, sex, sample providers, and collaborators, although (d) shows that the majority of samples are primary cell samples, and very few are tissue - pool samples. Secondly, we can see that after careful cleaning, some metadata is missing, i.e. 38.4% of samples have unknown sex (a), most collaborators did not label the sample provider (b), and most samples do not have a labelled age (c).
Sample Tissues:
In Fig. 6.2 subplot (d), we can also note some interesting things about the tissue types provided by the Fantom Human Samples file. 30
primary cell samples are labeled ANATOMICAL SYSTEM. If we look closer at these samples, we can see that it is theoretically possible to map some of these samples to tissues (see Fig. 6.3).
There is also the question of how general or specific the human sample categories are. There are 101
samples which are mapped to blood (Fig. 6.3 (d)), but when we come to map the FANTOM5 tissues to phenotypes, this may be too broad a category. Similarly, there are 47
with less than three samples each (not pictured) that may be too narrow to map to phenotypes, and a more accurate picture of that phenotype would come from taking a more general tissue.
anatomical_system_samples = f_eda.anat_system_tbl(samples_info, chosen_samples = [2, 10, 15, 20])
Charateristics [description] | Characteristics[Tissue] | Characteristics [Category] | |
---|---|---|---|
Charateristics [ff_ontology] | |||
FF:11923-125H6 | Fibroblast - Gingival, donor7 (aggressive peri... | ANATOMICAL SYSTEM | primary cells |
FF:11933-125I7 | Olfactory epithelial cells, donor1 | ANATOMICAL SYSTEM | primary cells |
FF:11938-126A3 | gamma delta positive T cells, donor2 | ANATOMICAL SYSTEM | primary cells |
FF:11960-126C7 | Smooth muscle cells - airway, asthmatic, donor1 | ANATOMICAL SYSTEM | primary cells |
We can also see in Fig. 6.3 that this data set, though having undergone some data cleaning, still contains disease samples (e.g. “aggressive periodontitis”).
Protein-centric TPM:
from helper_c05 import fantom_tpm_eda as fte
fig = fte.create_distribution_plot(samples_info, protein_tpm)
fig.show('notebook')