7.3. Data Wrangling¶

Before the data sets could be combined, substantial data wrangling was necessary. The details of these processes - obtaining, checking, mapping identifiers, and excluding irrelevant data - are described in this section. Ontolopy was developed and used to do much of this mapping, and parts of the wrangling mentioned here form examples in the Ontolopy chapter.

The steps required to obtain consistently formatted and labelled data can be described as follows:

Obtaining the raw expression per gene for healthy human tissues
- Data acquisition
- (Where required) Mapping from transcript to gene
- (Where required) Filtering out disease samples
- (Where required) Filtering out non-human samples
Mapping from sample name to UBERON tissue using Ontolopy.
Mapping from UBERON tissues to tissue groups using Ontolopy.
Aggregating metadata

7.3.1. Obtaining raw expression per gene for healthy human tissues¶

As mentioned in Data Acquisition, for the HPA, GTeX and HDBR experiments, count data were available through the ExpressionAtlas R package[276], and the FANTOM dataset was downloaded directly.

7.3.1.1. Mapping from transcript to gene¶

This step was only required for the FANTOM dataset.

FANTOM provides mappings to gene IDs based on proximity of genes to peaks according to Ensembl. Gene expression was then calculated by summing over transcripts mapped to genes. The transcripts were already mapped to HGNC gene identifiers in the downloaded FANTOM file and Ensembl’s Biomart was used to obtain a mapping from HGNC gene identifiers to ENSG gene identifiers, in order to match the gene expression atlas format.

Any transcripts which mapped to multiple genes were discarded, as were any HGNC ids which did not map to ENSG ids.

7.3.1.2. Filtering out disease samples¶

The HDBR and HPA experiments contained only healthy samples.

GTEx Although GTEx contained clinical data, no disease-related phenotypes were removed from the data set, since the disease column contains only values of “normal” and the only clinical variables (as described in the clinical_variables column) in the dataset were sun exposure or lack thereof for skin tissues. I judged these to be within the normal range of environments that we would expect skin to be subjected to.

FANTOM The FANTOM sample ontology was used to remove samples which are models for diseases. Samples which are disease models are identified using the is_model_for relationship and these relationships are propagated to the children terms based on the is_a relationship. For example, FF:11558-120D1 (Fibroblast - skin spinal muscular atrophy, donor2) would be removed from the set of samples, since: FF:11558-120D1 (Fibroblast - skin spinal muscular atrophy, donor2) is_a FF:0000251 (human fibroblast - skin spinal muscular atrophy sample) is_model_for DOID:12377 (spinal muscular atrophy).

7.3.1.3. Filtering out non-human samples¶

The GTEx, HDBR, and HPA experiments contained only human samples.

FANTOM The FANTOM5 data set also contains non-human (mouse) samples. The FANTOM sample ontology (which was downloaded from here) was used to look-up which FANTOM samples are human samples, i.e. have an is_a relationship to the term FF:0000210 (human sample) directly or indirectly.

7.3.2. Mapping to UBERON¶

Mapping from samples to Uberon tissue required the development of a small Python package Ontolopy. To create input to this package, informal tissue names (e.g. blood, kidney) were taken from the experimental design files (or the human sample information file for FANTOM) to create a map of samples to informal tissue names. For FANTOM, the FANTOM ontology could also be used to create a more fine-grained mapping of samples to tissues based on FANTOM sample identifiers and/or cell type (CL) identifiers.

HPA The HPA samples were mapped using exact matches to Uberon names. Three types of sample did not have exact matches: transformed skin fibroblast, suprapubic skin, and ebv-transformed lymphocyte. I manually mapped suprapubic skin to UBERON:0001415 Skin of pelvis, and excluded the other two (corresponding to excluding 869 samples).

HDBR For HDBR, tissue names from the “organism part’ column of the column data file were matched to Uberon names and synonyms from the Uberon extended ontology. The 96 unmatched terms corresponding to mixed brain tissues and brain fragments were defaulted to the more general Uberon Brain term.

FANTOM Since an experimental design file could not be obtained for FANTOM via GxA, additional sample information was obtained via the FANTOM5 website, namely the human sample information file and the FANTOM5 ontology.

FANTOM also contains time courses of cell differentiation (cells changing from one type to another) as well measures of perturbed cells. Since these samples do not have a well-defined locality in the body given by cell or tissue type, they were not used in the combined dataset. Such samples were filtered out using the human sample information file.

Since the FANTOM data had both an ontology file and the human sample information file, both were used to map to Uberon. The disagreements between the two mappings revealed some inconsistencies with the data set: these are described in the previous section, as they demonstrate a potential use case for Ontolopy.

from plotly import graph_objects as go
from plotly.subplots import make_subplots

# from myst_nb import glue
import textwrap
import pandas as pd

funnel_plot_transcripts = pd.read_csv('data/experiments/fantom/funnel_plot_transcripts.tsv', sep='\t', index_col=0, header=0)
funnel_plot_samples = pd.read_csv('data/experiments/fantom/funnel_plot_samples.tsv', sep='\t', index_col=0, header=0)

fig = make_subplots(rows=1, cols=2, horizontal_spacing=0.3)


transcript_y = ["<br>".join(textwrap.wrap(x, 15)) for x in list(funnel_plot_transcripts.index)]
fig.add_trace(
    go.Funnel(name='Transcripts', 
              y=transcript_y,
              x=list(funnel_plot_transcripts['Number of transcripts'])),
    row=1,
    col=1
)

sample_y = ["<br>".join(textwrap.wrap(x, 27)) for x in list(funnel_plot_samples.index)]
fig.add_trace(
    go.Funnel(name='Samples', 
              y=sample_y,
              x=list(funnel_plot_samples['Number'])),
    row=1,
    col=2
)

fig.update_layout(showlegend=False,
                  autosize=False,
                  width=800,
                  height=400,
                  margin=dict(
                      l=50,
                      r=50,
                      b=50,
                      t=50,
                      pad=2
                  )
)

# fig.write_image('../images/fantom_funnel.png')
# fig.write_html('../images/fantom_funnel.html')
fig.show()
# glue('transcript-funnel', fig, display=False)

Fig. 7.3 Funnel plot showing the data cleaning pipeline for FANTOM transcripts/genes (left) and samples (right), along with the number which remained after each stage of data cleaning.¶

The amount of data that flows through the processing pipeline for the FANTOM5 dataset can be seen in Fig. 7.3.

7.3.3. Aggregating Metadata¶

To create consistent metadata for the samples (e.g. age, developmental stage, replicate status, etc), information was extracted from multiple sources (including GxA and additional data from each experiment), and sometimes manually curated or corrected.

HPA, HBDR, and GTEx: Metadata about the experiments was collected from multiple sources, primarily the column data files accessed via ExpressionAtlas. This metadata was used to describe the experimental design for ComBat. The metadata collected includes (where available), sample identifier, individual identifier, age (exact), age (range), developmental stage, tissue type (as Uberon term), sex, experiment, biological replicate identifier and technical replicate identifier.

Both age variables are given in years and may include negative values (e.g. for a developing fetus). The age (range) variable contains uneven ranges, since this allows there to be an age-related factor that is compatible across the experiments. These values had to be converted to common units, since they were incompatible between experiments, and age-related terms were missing in GxA for GTEx and HPA. For GTEx it was possible to acquire this information via its own website.

FANTOM: The metadata aggregation for the FANTOM dataset is described in detail in Section 6.5.

7.3.3.1. Tissue groups¶

While the data is in general mapped to the most specific Uberon terms possible, 10 broader tissue groups (e.g. “brain”, “connective tissue”) were identified by hand and the individual samples were mapped to these groups using Ontolopy’s Relations() function. This level of specificity is useful for comparing between experiments, since many experiments describe some tissues more specifically than others. For example, there are many FANTOM5 tissues labelled “brain”, but many HDBR experiments are labelled as more specific parts of the brain.

7.3.4. Final Experimental Design¶

import pandas as pd 
from myst_nb import glue

design_file = 'data/design.csv'
design = pd.read_csv(design_file, sep=',' ,index_col=0,
                     usecols=['Sample ID','Experiment','Tissue (UBERON)'])

tissue_groups_df = pd.read_csv('data/tissues_groups.csv', index_col=0)

design['Group name'] = design['Tissue (UBERON)'].map(tissue_groups_df['Group name'])

tissue_groups = list(design['Group name'].dropna().unique())
mod = pd.DataFrame(index = design.index)
for tissue_group in tissue_groups:
    # Note: zeros correspond to a normal fold change of 1 (as mod contains logfold changes)
    mod[tissue_group] = (design['Group name']==tissue_group).astype('int')

overall_design = mod.copy()
overall_design['Experiment'] = design['Experiment']
overall_design = overall_design.groupby('Experiment').sum()
overall_design['Samples per experiment'] = overall_design.sum(axis=1)
overall_design.loc['Total'] = overall_design.sum(axis=0)

percentage = overall_design.sum(axis=0)/float(overall_design.loc['Total', 'Samples per experiment'])
tissue_groups_to_discard = percentage.index[percentage<0.01]
overall_design.drop(columns=tissue_groups_to_discard, inplace=True)
mod.drop(columns=tissue_groups_to_discard, inplace=True)
tissue_groups.remove(tissue_groups_to_discard)

glue('design-balance', overall_design)
glue('mod-head', mod.head())

mod.to_csv('data/simulated/group_mod.csv')

	brain	central nervous system	connective tissue	respiratory system	cardiovascular system	digestive system	skin of body	reproductive system	renal system	muscle tissue	Samples per experiment
Experiment
FANTOM5	75	30	28	28	203	50	5	37	37	0	528
GTEx	1879	253	757	721	3328	2483	1302	1246	94	1493	13556
HDBR	356	222	0	0	0	0	0	0	0	0	578
HPA	0	3	7	8	9	56	6	42	10	9	158
Total	2310	508	792	757	3540	2589	1313	1325	141	1502	14820

Fig. 7.4 A table showing the number of samples in each category, by tissue group and experiment. Note that the design is not balanced: there are some categories that do not overlap at all.¶

Fig. 7.4 shows the experimental design of the combined data set. Since it is not balanced, it is not likely to be suitable for batch-correction algorithms such as ComBat or ComBat-Seq.

Phenotype from Genotype

Data Wrangling

Contents