logo

Phenotype from Genotype

Front Matter

  • Declaration
  • Abstract
  • Acknowledgements
  • Table of Contents

Background

  • 1. Thesis style and philosophy
  • 2. How phenotype arises from genotype
    • 2.1. Big questions: What is genetically determined, and how?
    • 2.2. Biological molecules: DNA, RNA, Proteins and the central dogma of molecular biology.
    • 2.3. A closer look at DNA: Genomes, Genes, and Genetic Variation
    • 2.4. Looking more closely at proteins: function, structure and classification
    • 2.5. Phenotype
    • 2.6. Summary: how genotype and phenotype are linked
  • 3. How the link between genotype and phenotype is researched
    • 3.1. Sequencing and microarrays
    • 3.2. From genotype to phenotype: what is measured
    • 3.3. Ontologies
    • 3.4. Predictive computational methods
    • 3.5. Sources of bias in computational biology
    • 3.6. Proteome Quality Index
    • 3.7. Summary

Phenotype prediction

  • 4. Phenotype prediction with Snowflake
    • 4.1. Introduction
    • 4.2. Snowflake Algorithm
    • 4.3. Creating Snowflake inputs
    • 4.4. Preprocessing
    • 4.5. Considerations for Clustering SNPs
    • 4.6. Testing Snowflake on ALSPAC data
    • 4.7. Discussion

Tissue-specific gene expression

  • 5. Filtering computational predictions with tissue-specific expression information
    • 5.1. Introduction
    • 5.2. Algorithm
    • 5.3. Data
    • 5.4. Validation method
    • 5.5. Filip results
    • 5.6. Discussion and Future work
  • 6. Ontolopy
    • 6.1. Introduction
    • 6.2. Functionality
    • 6.3. Ontolopy tools and practices
    • 6.4. Example uses: mapping samples to diseases or phenotypes
    • 6.5. Example use: mapping samples to tissue-related phenotypes
    • 6.6. Discussion
    • 6.7. Future Work
  • 7. Combining RNA-seq datasets
    • 7.1. Introduction
    • 7.2. Data Acquisition
    • 7.3. Data Wrangling
    • 7.4. Results and discussion

Concluding remarks

  • 8. Concluding remarks

End Matter

  • Appendix
  • Bibliography
Theme by the Executable Book Project
Contents
  • 6.5.1. Creating sample-to-tissue mappings
    • 6.5.1.1. Load data and pre-filter
    • 6.5.1.2. Mapping by ontology
    • 6.5.1.3. Mapping by name
    • 6.5.1.4. Combining mappings
    • 6.5.1.5. Finding inconsistencies
      • 6.5.1.5.1. Finding samples that are missing annotations to tissues
      • 6.5.1.5.2. Missing Uberon or CL annotation
      • 6.5.1.5.3. Mislabelled sample
      • 6.5.1.5.4. Imprecise annotation to tissue
    • 6.5.1.6. Mapping overview
  • 6.5.2. Creating tissue-to-phenotype mappings
    • 6.5.2.1. Propagating relationships up the tree using part_of
    • 6.5.2.2. Propagating “down” the tree: has_part
    • 6.5.2.3. Propagating down the tree: inverse of part_of
    • 6.5.2.4. Combining previous mappings
  • 6.5.3. Creating sample-to-tissue-phenotype mappings
    • 6.5.3.1. Final mapping

Example use: mapping samples to tissue-related phenotypes

Contents

  • 6.5.1. Creating sample-to-tissue mappings
    • 6.5.1.1. Load data and pre-filter
    • 6.5.1.2. Mapping by ontology
    • 6.5.1.3. Mapping by name
    • 6.5.1.4. Combining mappings
    • 6.5.1.5. Finding inconsistencies
      • 6.5.1.5.1. Finding samples that are missing annotations to tissues
      • 6.5.1.5.2. Missing Uberon or CL annotation
      • 6.5.1.5.3. Mislabelled sample
      • 6.5.1.5.4. Imprecise annotation to tissue
    • 6.5.1.6. Mapping overview
  • 6.5.2. Creating tissue-to-phenotype mappings
    • 6.5.2.1. Propagating relationships up the tree using part_of
    • 6.5.2.2. Propagating “down” the tree: has_part
    • 6.5.2.3. Propagating down the tree: inverse of part_of
    • 6.5.2.4. Combining previous mappings
  • 6.5.3. Creating sample-to-tissue-phenotype mappings
    • 6.5.3.1. Final mapping

6.5. Example use: mapping samples to tissue-related phenotypes¶

This section presents a more sizable example of using Ontolopy, the task it was developed for: mapping from samples to tissue-related phenotypes. This is a substantial challenge, in part because it requires:

  1. Mapping over many different types of ontology terms (FF, UBERON, CL, GO).

  2. Using more complex relations such as part_of.

  3. Mapping from text as well as using the mapping from ontology, which then requires combining the two mappings into an overall mapping and investigating any disagreements between mappings.

This task can be generally divided into the following parts:

  1. Creating sample (FF) to tissue (UBERON) mapping, including looking at disagreements between mappings.

  2. Tissue (UBERON) to phenotype (GO Biological Process - GOBP) mapping.

  3. Combining the above, to create the final sample (FF) to tissue-related phenotype (GO) mapping.

We are using the same input data as described in the previous section for this example.

6.5.1. Creating sample-to-tissue mappings¶

To create the mapping between FANTOM sample ID (FF:XXXXX-XXXXX) and tissue (UBERON:XXXXXX), we use the Uberon class. The Uberon class has three useful functions for creating this mapping:

  1. sample_map_by_ont: creates a mapping via ontology.

  2. sample_map_by_name: creates a mapping via sample or tissue names.

  3. get_overall_tissue_mappings: combines the two mappings to create a more comprehensive overall mapping.

6.5.1.1. Load data and pre-filter¶

In order to do this, I load the input FANTOM5 ontology and sample information files.

import ontolopy as opy
import pandas as pd 
from myst_nb import glue
import time
import numpy as np
import logging
import plotly.graph_objects as go_
from plotly.subplots import make_subplots

notebook_start = time.time()

# Read in files:
# -------------
fantom_obo_file = '../c08-combining/data/experiments/fantom/ff-phase2-170801.obo.txt'
fantom_samples_info_file = '../c08-combining/data/experiments/fantom/fantom_humanSamples2.0.csv'
uberon_obo_file = '../c08-combining/data/uberon_ext_210321.obo' 

# Uberon OBO:
uberon_obo = opy.load_obo(
    file_loc=uberon_obo_file, 
    ont_ids=['GO', 'UBERON','CL','FMA'], 
)

uberon_obo_tissue_only = opy.load_obo(
    file_loc=uberon_obo_file, 
    ont_ids=['GO', 'UBERON','FMA'], 
)

# FANTOM OBO:
fantom_obo = opy.load_obo(
    file_loc=fantom_obo_file, 
    ont_ids=['CL', 'FF', 'GO', 'UBERON', 'DOID'],
)

fantom_obo_no_cl = opy.load_obo(
    file_loc=fantom_obo_file, 
    ont_ids=['FF', 'GO', 'UBERON', 'DOID'],
)

# FANTOM Samples Info file:
samples_info = pd.read_csv(fantom_samples_info_file, index_col=1)

6.5.1.2. Mapping by ontology¶

The sample_map_by_ont function is a wrapper function which calls relations.Relations, and excludes too-general Uberon tissues such as anatomical structure, tissue, anatomical system, embryo, and multi fate stem cell. We use a merged sample (FANTOM) and tissue (Uberon) ontology as input.

The inclusion of the Cell Ontology (CL) terms (which are included in the Uberon OBO file) is important to retrieve a mapping for as many samples as possible. Table 6.1 shows how the inclusion of CL terms in the input ontology significantly changes the mapping coverage, and that mapping via ontology alone (with CL terms used) is fairly good.

# Find mappings:
# -------------

total = len(samples_info.index.unique())

def get_mapped(mapping):
    mapped = mapping[~mapping['to'].isna()]
    unmapped = mapping[mapping['to'].isna()]
    return mapped, unmapped

# Uberon tissue only (no CL cells)
start = time.time()
merged_tissue_only = opy.uberon_from_obo(uberon_obo_tissue_only.merge(fantom_obo_no_cl))
sample_to_tissue_mapping_tissue_only = merged_tissue_only.sample_map_by_ont(samples_info.index)
mapped_nocl, unmapped_nocl = get_mapped(sample_to_tissue_mapping_tissue_only)
glue("mapped-uberon-no-cl",f"{len(mapped_nocl)} ({100*len(mapped_nocl)/float(total):.2f}%)", False)
glue("unmapped-uberon-no-cl",f"{len(unmapped_nocl)} ({100*len(unmapped_nocl)/float(total):.2f}%)", False)
glue("time-uberon-no-cl", f"{time.time()-start:.2f} seconds", False)

# Uberon tissue and CL cells 
start = time.time()
merged = opy.uberon_from_obo(uberon_obo.merge(fantom_obo))
sample_to_tissue_mapping = merged.sample_map_by_ont(samples_info.index)
mapped_cl, unmapped_cl = get_mapped(sample_to_tissue_mapping)
glue("mapped-uberon-cl",f"{len(mapped_cl.index.unique())} ({100*len(mapped_cl.index.unique())/float(total):.2f}%)", False)
glue("unmapped-uberon-cl",f"{len(unmapped_cl.index.unique())} ({100*len(unmapped_cl.index.unique())/float(total):.2f}%)", False)
glue("time-uberon-cl", f"{time.time()-start:.2f} seconds", False)
Table 6.1 Table comparing the difference in coverage (mappable samples) for different mapping techniques.¶

Mapping name

Number (and percentage) of all mapped samples

Number (and percentage) of all unmapped samples

Run time

By ontology: using Uberon tissues only

441 (24.28%)

1375 (75.72%)

0.11 seconds

By ontology: using Uberon tissues and CL cells

1457 (80.23%)

359 (19.77%)

0.13 seconds

By name: using tissue column

1263 (69.55%)

553 (30.45%)

8.99 seconds

Combined: combining by name and by ontology including CL

1652 (90.97%)

164 (9.03%)

N/A

# Show some examples of things that aren't matched that should be (not mappable by ontology alone), e.g. caudate nucleus
tissue_col = 'Characteristics[Tissue]'

expect_to_map = samples_info[samples_info.index.isin(unmapped_cl.index) & ( (~samples_info[tissue_col].isna()) & ~samples_info[tissue_col].isin(['ANATOMICAL SYSTEM', 'UNDEFINED_TISSUE_TYPE', 'unclassifiable']))]
not_expect_map = samples_info[samples_info.index.isin(unmapped_cl.index) & (samples_info[tissue_col].isna() | samples_info[tissue_col].isin(['ANATOMICAL SYSTEM', 'UNDEFINED_TISSUE_TYPE', 'unclassifiable']))]

glue("not_expect_map", len(not_expect_map), False)
glue("expect_map", len(expect_to_map), False)
glue("num_expect_map_tissues", len(expect_to_map[tissue_col].unique()), False)

def list_to_text(lst):
    text = ', '.join(lst)
    return text[:text.rindex(', ') + 1] + ' and' + text[text.rindex(', ')+1:]

glue("tissues_list", list_to_text(expect_to_map[tissue_col].unique()))

chosen_examples = ['FF:10379-105H1',  # caudate nucleus
                   'FF:10422-106C8',  # blood (lymphoma)
                   'FF:10558-107I9', # osteoblast
                   'FF:11794-124C3', # t-cell
                  ]
cell_col = 'Characteristics [Cell type]'
desc_col = 'Charateristics [description]'
expect_to_map_table = expect_to_map.loc[chosen_examples][[desc_col, tissue_col, cell_col]]
glue("expect-to-map-table", expect_to_map_table, False)

The unmapped samples do contain some samples that we wouldn’t expect to be able to map to tissue (defined as those that do not have a tissue provided in the samples information file, or that are labelled as unclassifiable, UNDEFINED_TISSUE_TYPE, or ANATOMICAL SYSTEM): these account for 161 of the 359 (19.77%) unmapped samples. This, however, leaves 198 samples which we would expect to map, spread across 16 tissues (caudate nucleus, blood, placenta, bone, chorioamniotic membrane, ovary, lung, retroperitoneum, soft tissue, skin, connective tissue, thyroid, stomach, skeletal muscle, Buffy coat, and bone marrow).

Charateristics [description] Characteristics[Tissue] Characteristics [Cell type]
from
FF:10379-105H1 caudate nucleus, adult, donor10258 caudate nucleus CELL MIXTURE - tissue sample
FF:10422-106C8 Burkitt's lymphoma cell line:DAUDI blood b cell
FF:10558-107I9 osteosarcoma cell line:HS-Os-1 bone osteoblast
FF:11794-124C3 CD4+CD25+CD45RA- memory regulatory T cells exp... blood T cell

Fig. 6.5 Four FANTOM5 samples that we would expect to be able to map using Ontolopy based on the information in the samples information file (selected columns shown here), but that do not map using Ontolopy with the FANTOM5 ontology.¶

Fig. 6.5 show four examples of such samples. The existence of such tissues, means that mapping via name as well as by ontology could prove useful.

6.5.1.3. Mapping by name¶

# map by name
uberon_obo_tissue_only = opy.uberon_from_obo(uberon_obo_tissue_only)

start = time.time()
mapped_by_tissue_name = uberon_obo_tissue_only.sample_map_by_name(samples_info[tissue_col], xref='FMA')
mapped_name, unmapped_name = get_mapped(mapped_by_tissue_name)

glue("time-name", f"{time.time()-start:.2f} seconds", False)
glue("mapped-name", f"{len(mapped_name.index.unique())} ({100*len(mapped_name.index.unique())/float(total):.2f}%)", False)
glue("unmapped-name", f"{len(unmapped_name.index.unique())} ({100*len(unmapped_name.index.unique())/float(total):.2f}%)", False)

The Uberon.sample_map_by_name function simply looks up the strings provided (in this case those from the Characteristics[Tissue] column of the sample information file) and checks if any Uberon terms in the provided ontology has a matching name or synonym. The term name is preferred over synonyms, and where there are no exactly matching term names, but there are multiple possible synonyms (e.g. bladder is a synonym for urinary bladder and bladder organ), we decide by whether either of the terms are linked to the Foundational Model of Anatomy (FMA) ontology, as this is a human-specific ontology by using the xref='FMA' option. Since we are only looking for Uberon terms, it doesn’t make any difference whether we use the “tissue only” or “including CL” versions of the Uberon ontology that we read in earlier, aside from a negligible difference in run time).

unmapped_name['name_matched_on'].unique()
array(['unclassifiable', 'hippocampus', 'medial temporal gyrus', nan,
       'olfactory apparatus', 'eye - vitreous humor',
       'eye - muscle inferior rectus', 'skeletal muscle - soleus muscle',
       'Skin - palm', 'tongue epidermis (fungiform papillae)',
       'eye - muscle superior', 'eye - muscle lateral',
       'eye - muscle medial',
       'Fingernail (including nail plate, eponychium and hyponychium)',
       'bone', 'chorioamniotic membrane', 'soft tissue', 'palate',
       'cartilage', 'Buffy coat', 't3', 't4', 't5', 't6', 't7', 't8',
       't9', 't10', 't11', 't12', 't13', 't14', 't15', 't16', 't17',
       't18', 't19', 't20', 't21', 't22', 't23', 't24', 't25', 't26',
       'adipose', 'UNDEFINED_TISSUE_TYPE'], dtype=object)

Ontolopy restricts the tissues mapped by name to NARROW, EXACT, and BROAD synonyms (other synonyms include “RELATED”). These synonyms usually includes what we want, but will miss some less closely related synonyms. Looking at the tissues that are unmapped by name can help us identify any that we might want to treat differently. For example, cartilage doesn’t map to UBERON:0002418 cartilage tissue as the synonym cartilage is RELATED. It’s also useful to see that the formatting of the some names, e.g. “Fingernail (including nail plate, eponychium and hyponychium)” and “eye - vitreous humor” prevent the Ontolopy algorithm from recognising the names. We could map these to more standardised names with a dictionary and rerun the algorithm if we had no other option, but in this case we simply use the by-ontology mapping to map terms with unmappable tissue names.

Charateristics [description] Characteristics[Tissue] Mapped by ontology Mapped by name
FF ID
FF:10040-101F4 frontal lobe, adult, pool1 frontal lobe frontal cortex frontal lobe
FF:10055-101H1 uterus, fetal, donor1 uterus embryonic uterus uterus
FF:10075-102A3 lung, right lower lobe, adult, donor1 lung lower lobe of right lung lung

Fig. 6.6 Comparisons of sample mappings by ontology (inlcluding CL) and by name (by tissue column of samples information file), illustrating the tendency of by-name mappings to be less precise.¶

As Table 6.1 shows, mapping by name function is quite slow, taking 8.99 seconds. The mappings coverage is 1263 (69.55%).

This measure doesn’t show that the mapping is constrained to be less precise than the mapping via ontology - of course this is particular to the data as it depends on how the tissues were labelled. In this case, FANTOM5 labelled the tissues fairly broadly, so we see examples like FF:10075-102A3 (lung, right lower lobe) and FF:14331-155F2 (Fibroblast - Aortic Adventitial) in Fig. 6.6 - this is particularly common when the samples are cell types.

# Examples of things that would be unmappable by text alone "Anatomical system"
tissue_col = 'Characteristics[Tissue]'

anatomical_system = samples_info.loc[samples_info[samples_info[tissue_col]=='ANATOMICAL SYSTEM'].index.intersection(mapped_cl.index)][[desc_col, tissue_col]]

anatomical_system['Mapped by ontology'] = [merged[x]['name'] for x in mapped_cl.loc[anatomical_system.index]['to']]
anatomical_system['Mapped by name'] = [merged[x]['name'] for x in mapped_by_tissue_name.loc[anatomical_system.index]['to']]

illustrative_samples = ['FF:11966-126D4', 
                        'FF:11941-126A6', 
                        'FF:11937-126A2', 
                        'FF:11936-126A1', 
                        'FF:11930-125I4', 
                        'FF:11927-125I1']

examples_rows = anatomical_system.loc[illustrative_samples]
glue("anatomical-system-examples-table", examples_rows, False)

In addition, there are samples that are not usefully mapped at all (or completely missing a mapping) using the by-name approach, that we can get by-ontology. One subset of these is the tissues which are labelled “ANATOMICAL SYSTEM”: these map to the Uberon term anatomical system, but this is too general to be useful. Fig. 6.7 shows how these terms can be mapped to more localised terms

This also shows the one limitation of mapping by ontology, in that we might expect that samples such as FF:11936-126A1 to be mapped to the more specific (and useful) UBERON:0001997 olfactory epithelium. This is not the case since there is a “missing” annotation between CL:0002167 olfactory epithelial cell and olfactory epithelium, since the definition of olfactory epithelial cell is still under discussion and development.

Charateristics [description] Characteristics[Tissue] Mapped by ontology Mapped by name
from
FF:11966-126D4 Smooth muscle cells - airway, control, donor1 ANATOMICAL SYSTEM respiratory system smooth muscle anatomical system
FF:11941-126A6 Mast cell, expanded, donor8 ANATOMICAL SYSTEM immune system anatomical system
FF:11937-126A2 gamma delta positive T cells, donor1 ANATOMICAL SYSTEM immune system anatomical system
FF:11936-126A1 Olfactory epithelial cells, donor4 ANATOMICAL SYSTEM epithelium anatomical system
FF:11930-125I4 Mallassez-derived cells, donor3 ANATOMICAL SYSTEM jaw region anatomical system
FF:11927-125I1 Fibroblast - Gingival, donor9 (control) ANATOMICAL SYSTEM gingiva anatomical system

Fig. 6.7 An example of ANATOMICAL SYSTEM tissue samples, with tissue-specific cells, which are unmapped by-name, but have useful by-ontology mappings.¶

There are also some benefits to the name based mapping. A quirk of the ontology-based mapping is that many cell types are identified as having being part of the immune system, however, this isn’t a well-defined locality. It’s possible for samples from different physical locations (e.g. liver, blood) to map to the immune system term. For the FANTOM5 data at least, the names better describe locations that these samples came from.

6.5.1.4. Combining mappings¶

To get the best of both mappings, we need to combine them using the Uberon.get_overall_tissue_mappings function. This function creates both an overall mapping and a list of disagreements. Where only one mapping covers a term, it is trivial to do this (the overall mapping uses the present mapping, and there are no disagreements). When both mappings are present and one term is an ancestor of another, we say there is no disagreement and choose the more specific mapping e.g. if mapping by ontology gives us photoreceptor array, but mapping by name gives us eye, then because photoreceptor array is part_of eye and eye is_a sense organ, we would use the overall mapping by ontology since photoreceptor array is the more specific term. When both mappings are present and there is no relationship between them, this is when we say there is a disagreement, and we can choose which mapping we give precedence to, by default it is the ontology mapping.

Before we create the overall mapping, we will remove the immune system by-ontology mappings, for the reasons discussed above.

# Remove immune system mappings
immune_system = 'UBERON:0002405'
for ff in sample_to_tissue_mapping[(sample_to_tissue_mapping['to'] == immune_system) & sample_to_tissue_mapping.index.isin(mapped_by_tissue_name[~mapped_by_tissue_name['to'].isna()].index)].index:
    sample_to_tissue_mapping.loc[ff] = [np.nan, np.nan, np.nan]

# Combine mappings
overall, disagreements = merged.get_overall_tissue_mappings(
    map_by_ont=sample_to_tissue_mapping, 
    map_by_name=mapped_by_tissue_name,
    rel = ['is_a', 'part_of', 'continuous_with', 'has_potential_to_develop_into', 'develops_into']
)

unmapped_overall = overall[overall['mapped_by'].isna()]
mapped_overall = overall[~overall['mapped_by'].isna()]
glue('overall-unique-tissues', len(overall['overall'].unique()), False)

# Add overall to table
def format_mapped(mapped, total):
    return f"{len(mapped)} ({100*len(mapped)/float(total):.2f}%)"

glue("mapped-overall", format_mapped(mapped_overall, total), False)
glue("unmapped-overall", format_mapped(unmapped_overall, total), False)

The FANTOM5 data contains different categories of samples including tissues, time courses, immortal cell lines, fractionations and purturbations, and primary cells. Some of these categories might not map in the way that we might want them to because although they might be a cell type that is usually localised to a tissue, they are unusual since they represent unusual in-between developing tissues (e.g. stem cells) or cancerous immortal cell lines. This is likely to have led to uncertainties in the sample ontology file, so by restricting to primary cell and tissue samples, we might get a more accurate picture of the percentage of mappable samples that Ontolopy can reach.

# Glue all values for the primary cell and tissues version of the table `coverage-tissue-primary-cell`
def primary_and_tissues(mapped, unmapped, samples_info, name_string):
    """Removes non primary cell and tissue samples from the `mapped` and `unmapped` samples."""
    category_col = 'Characteristics [Category]'
    categories = ['primary cells', 'tissues']
        
    category_samples = samples_info[samples_info[category_col].isin(categories)].index
    total = len(category_samples)
    
    new_mapped = mapped[mapped.index.isin(category_samples)]
    new_unmapped = unmapped[unmapped.index.isin(category_samples)]
    
    glue(f"{name_string}_mapped_pt", format_mapped(new_mapped, total), False)
    glue(f"{name_string}_unmapped_pt", format_mapped(new_unmapped, total), False)

    return new_mapped, new_unmapped

mapped_nocl_pt, unmapped_nocl_pt = primary_and_tissues(mapped_nocl, unmapped_nocl, samples_info, 'nocl')
mapped_cl_pt, unmapped_cl_pt = primary_and_tissues(mapped_cl, unmapped_cl, samples_info, 'cl')
mapped_name_pt, unmapped_name_pt = primary_and_tissues(mapped_name, unmapped_name, samples_info, 'name')
mapped_overall_pt, unmapped_overall_pt = primary_and_tissues(mapped_overall, unmapped_overall, samples_info, 'overall')
Table 6.2 Table showing the difference in coverage for different mappings, when restricting the samples to tissue and primary cell samples.¶

Mapping name

Number (and percentage) of primary cell and tissue mapped samples

Number (and percentage) of primary cell and tissue unmapped samples

By ontology: using Uberon tissues only

220 (29.57%)

524 (70.43%)

By ontology: using Uberon tissues and CL cells

700 (94.09%)

44 (5.91%)

By name: using tissue column

652 (87.63%)

92 (12.37%)

Overall mapping (combining by-ontology with CL, and by-name mappings)

739 (99.33%)

5 (0.67%)

sex_col = 'Characteristics [Sex]'
age_col = 'Characteristics [Age]'
remaining_unmapped = samples_info.loc[unmapped_overall_pt.index][[desc_col, sex_col, age_col, tissue_col]]
glue("unmapped-table", remaining_unmapped)

Table 6.2 shows that the missing mapping seen in Table 6.1 can be explained by the presence of sample types such as developing tissues and immortal cell lines (models for diseases), i.e. not healthy adult tissues. Again there was a benefit in combining mappings. The only remaining unmapped tissues were:

  1. unclassifiable reference RNA samples (from different providers) - shown in Fig. 6.8, from mixed donors and cell types: this is reassuring as we would hope that these would not be mapped to a tissue.

  2. Two Buffy coat sample of reticulocytes. We could map these by hand to the UBERON “blood” term.

Charateristics [description] Characteristics [Sex] Characteristics [Age] Characteristics[Tissue]
sample_id
FF:10000-101A1 Clontech Human Universal Reference Total RNA, ... mixed NaN unclassifiable
FF:10002-101A5 SABiosciences XpressRef Human Universal Total ... mixed NaN unclassifiable
FF:10007-101B4 Universal RNA - Human Normal Tissues Biochain,... mixed NaN unclassifiable
FF:11931-125I5 CD34 cells differentiated to erythrocyte linea... NaN NaN Buffy coat
FF:11932-125I6 CD34 cells differentiated to erythrocyte linea... NaN NaN Buffy coat

Fig. 6.8 Table showing the remaining samples which could not be automatically mapped to an Uberon tissue using Ontolopy. All three are reference RNA samples.¶

# Create disagreements table
category_col = 'Characteristics [Category]'
categories = ['primary cells', 'tissues']

category_samples = samples_info[samples_info[category_col].isin(categories)].index

disagreements_pt = disagreements[disagreements.index.isin(category_samples)]
disagreements_pt.loc[:,'by name text'] = pd.Series([merged[x]['name'] for x in disagreements_pt['by_name']], disagreements_pt.index)
disagreements_pt.loc[:,'by ont text'] = pd.Series([merged[x]['name'] for x in disagreements_pt['by_ont']], disagreements_pt.index)

disagreements_table = disagreements_pt.drop_duplicates(subset=['by name text','by ont text'])[['by name text', 'by ont text']]
disagreements_table.loc[:,'description'] = pd.Series(samples_info[desc_col].loc[disagreements_table.index], index=disagreements_table.index)

glue('number-disagreements', len(disagreements_pt), False)
glue('number-types-disagreements', len(disagreements_table), False)
disagreements_table = disagreements_table[['description','by name text', 'by ont text']]
glue('disagreements-table', disagreements_table)

6.5.1.5. Finding inconsistencies¶

By comparing the results of both ontology and text based searches, Ontolopy can find inconsistencies between the two representations which sign post to issues with samples data and how it is presented, or in the ontologies that it is linked to (in this case Uberon and CL): I give an example of each type. I found this approach very useful, as it allowed me to feed back my discoveries to the maintainers of these ontologies and datasets in order to improve them, and has resulted in improvements to several of these resources.

There were two main ways in which inconsistencies were found:

  1. Through looking at samples which are not mapped by one method or another.

  2. By looking at the disagreements output which compares the mapping that Ontolopy finds using one file and method (text in sample information file), to that it finds using the other (terms in sample ontology file).

For the FANTOM5 data, disagreements between these mappings revealed problems in the biological ontologies and experiment metadata that were provided to the package in order to create the mappings. These discrepancies may be a lack of specificity, incompleteness in, or disagreement between FANTOM, CL, or Uberon annotations, either in creating ontologies or annotating tissues to samples. The process of mapping FANTOM to Uberon tissues found twenty-two such disagreements, of which FANTOM, Uberon, and CL where appropriate have been informed via GitHub issues, some of which have already sparked changes in the ontologies.

Four different types of example are described below, to give an idea of how multiple mappings may be used to improve annotation.

A full list of disagreements can be seen in Fig. 6.9. There were 32 disagreements/inconsistencies found using Ontolopy. These disagreements can affect multiple (replicate) samples, for a total of 96 samples.

description by name text by ont text
sample_id
FF:10277-104E7 optic nerve, donor1 neuron projection bundle connecting eye with b... cranial nerve II
FF:11207-116A1 Endothelial Cells - Aortic, donor0 aorta artery
FF:11216-116B1 Urothelial cells, donor0 urinary bladder urothelium
FF:11219-116B4 Mesenchymal Stem Cells - Vertebral, donor1 spinal cord vertebra
FF:11220-116B5 Sebocyte, donor1 zone of skin skin sebaceous gland
FF:11234-116D1 Smooth Muscle Cells - Brain Vascular, donor1 brain vasculature
FF:11242-116D9 Ciliary Epithelial Cells, donor1 camera-type eye epithelium
FF:11248-116E6 Anulus Pulposus Cell, donor1 spinal cord annulus fibrosus disci intervertebralis
FF:11252-116F1 Nucleus Pulposus Cell, donor1 spinal cord nucleus pulposus
FF:11266-116G6 Endothelial Cells - Thoracic, donor1 internal thoracic artery thoracic aorta
FF:11269-116G9 Fibroblast - Dermal, donor1 zone of skin dermis
FF:11271-116H2 Hair Follicle Dermal Papilla Cells, donor1 hair follicle dermal papilla
FF:11272-116H3 Keratinocyte - epidermal, donor1 zone of skin skin epidermis
FF:11273-116H4 Mammary Epithelial Cell, donor1 breast mammary gland
FF:11279-116I1 Preadipocyte - subcutaneous, donor1 adipose tissue hypodermis
FF:11280-116I2 Preadipocyte - visceral, donor1 heart connective tissue
FF:11291-117A4 Synoviocyte, donor1 synovial membrane of synovial tendon sheath synovial membrane of synovial joint
FF:11393-118C7 Endothelial Cells - Lymphatic, donor3 capillary lymphatic vessel
FF:11453-119A4 Bronchial Epithelial Cell, donor4 lung bronchus
FF:11469-119C2 Preadipocyte - perirenal, donor1 kidney perirenal fat
FF:11493-119E8 Meningeal Cells, donor1 meningeal cluster blood-cerebrospinal fluid barrier
FF:11499-119F5 Perineurial Cells, donor1 spinal cord perineurium
FF:11513-119H1 Smooth Muscle Cells - Tracheal, donor1 lung trachea
FF:11518-119H6 Renal Mesangial Cells, donor1 kidney connective tissue
FF:11535-120A5 Fibroblast - Villous Mesenchymal, donor1 trophoblast placenta
FF:11590-120G6 Alveolar Epithelial Cells, donor2 lung renal glomerulus
FF:11752-123G6 mesenchymal precursor cell - cardiac, donor1 heart mesenchyme
FF:11758-123H3 mesenchymal precursor cell - ovarian cancer me... ovary connective tissue
FF:11842-124H6 mesenchymal precursor cell - ovarian cancer ri... bone marrow right ovary
FF:11933-125I7 Olfactory epithelial cells, donor1 anatomical system epithelium
FF:12226-129F3 nasal epithelial cells, donor1 nasal cavity epithelium
FF:12238-129G6 chorionic membrane cells, donor1 chorion membrane egg chorion

Fig. 6.9 Table showing all types of disagreements found using Ontolopy, with example samples.¶

6.5.1.5.1. Finding samples that are missing annotations to tissues¶

When we look at the samples that we would expect to map by ontology, but that don’t, after filtering for tissues and primary cells only, we see that there are just two types of samples:

  1. One sample FF:10379-105H1 which is missing is_a: FF:0010164 ! human caudate nucleus - adult donor sample in the FANTOM5 ontology file

  2. 21 T-cell samples, all of which appear not to have been fully classified (i.e. contain the following line in the ontology file comment: Changed from previous label. TODO: full classification). I could map all of them to the term for T-cell, whereas someone with more knowledge of T-cells could more accurately map these samples to more specific cell types.

I can use Ontolopy to add these mappings to the merged ontology to improve the by ontology mapping if I needed to: this would help me to find additional mappings, for example, to immune system as well as blood.

6.5.1.5.2. Missing Uberon or CL annotation¶

Example: Missing annotation Bronchus part_of some Lung

One type of problem that can be revealed is a missing link in an ontology.

An example of this that was found using the FANTOM data set was that there was no formal relation in the Uberon ontology between Bronchus and Lung, despite the fact that the description text for Bronchus says “the upper conducting airways of the lung”.

This was found because the sample FF:11511-119G8 (Bronchial Epithelial Cell, donor1) is mapped by name to UBERON:0002048 Lung, but by ontology to UBERON:0002185 Bronchus. This was flagged as inconsistent because there are no relations in the Uberon ontology between these terms.

Similar missing annotations were discovered between Aorta and Artery, Hair follicle and Dermal papilla, and Skeletal muscle myoblast and Skeletal muscle fiber, and Trophoblast and Placenta.

6.5.1.5.3. Mislabelled sample¶

Sometimes samples are simply mislabelled, this can happen in any file type.

Example: FF:11590-120G6 should be labelled Alveolar Epithelial Cells not Renal Glomerular Endothelial Cells

The FANTOM sample ontology file contains two samples named Renal Glomerular Endothelial Cells, donor2: FF:11590-120G6 and FF:11594-120H1. One of these is a mislabelled sample, and it is actually an Alveolar Epithelial Cell sample. The mistake is only for the name in the FANTOM ontology file, but not the tissue annotation.

Example: FF:11842-124H6 should be labelled ovary not bone marrow in the samples information file The tissue column of the samples information file lists sample FF:11842-124H6 as a bone marrow sample, despite being an ovarian cancer sample.

6.5.1.5.4. Imprecise annotation to tissue¶

Example: Nucleus pulpopus as Spinal cord

Several FANTOM5 tissues are labelled by name colloquially, rather than precisely. For example, both Nucleus pulpopus and Vertebra are labelled Spinal cord (although the spinal cord itself is considered disjoint from these entities by definition, and in the Uberon ontology). It’s for this reason that the ontology mapping is preferred over the labelled sample name in creating the overall FANTOM sample-to-tissue mapping.

Example2: FF:11423-118G1 is_a dermal melanocyte

Sometimes the text in the samples information file can help us to reach better mappings in the sample ontology file. For example sample FF:11423-118G1 (and five other similar samples) are mapped to CL:0000148 (melanocyte), which is a cell that can come from many different parts of the body (skin, heart, eyes, etc), so Ontolopy can only map this term to several tissues (some of which this cell will not have come from) and only if the child mapping functionality is used. However, since the sample was labelled as coming from the “skin”, it’s clear that this sample would have been better annotated to CL:0002482 (dermal melanocyte).

6.5.1.6. Mapping overview¶

Using Ontolopy we can get a coverage of all samples that we would expect to map to a localised tissue (defining this as primary cell and tissue samples excluding reference RNA). These mappings correspond to 157 unique tissues.

6.5.2. Creating tissue-to-phenotype mappings¶

by-ontology mapping

Here we are only using an ontology based mapping, but if we had information in the samples information file about phenotype (e.g. disease), we could also use this to do an additional name based mapping if we wanted to.

The approach to the creation of the tissue-to-phenotype mappings is different to that we just took for sample-to-tissue mappings in that we are only doing a by-ontology mapping, rather than also mapping by-name and then comparing. However, it is also a more complex example of a by-ontology mapping since we are asking more than one question to the ontology and adding them together. For all these questions, we start with the 157 tissues that we are interested in finding mappings for as source terms, and we use opy.Relations’s mode='all' option to find all of the Gene Ontology targets=['GO'] terms that are related to them.

We are interested broadly in tissues where a phenotype can take place, so this could be something on the level of proteins (calcium signalling), cells (cell motility), or tissue (protein secretion). This will affect what settings (particularly allowed_relations) we use when we make calls to Relations.

Only Gene Ontology Biological Process terms are related to phenotypes. The quickest way to retrieve only these is to ask for all GO terms and then filter them afterwards. After loading the GO basic ontology, we can easily retrieve a list of Biological Process terms.

go_obo_file = '../c08-combining/data/go-basic.obo' 
go_obo = opy.load_obo(
    file_loc=go_obo_file, 
    ont_ids=['GO', 'UBERON','CL','FMA'], 
)

biological_processes = [x for x in go_obo.terms if ('biological_process' in go_obo[x]['namespace'])]

6.5.2.1. Propagating relationships up the tree using part_of¶

Our first example of looking for relations between tissues and phenotypes will include the part_of relation. Since ontologies are often represented by DAGs, relationships are usually generally in one direction. While there is also the has_part relationship that we will look at shortly, part_of is preferred in the Uberon ontology with almost 10 times as many instances (15,486 compared to 1,703).

We first combine the GO ontology with the uberon ontology, which will simply help us to be able to look up the names of the GO terms to present the output in a more accessible format. This doesn’t make a difference to the number of mappings, only to the relation_text field of the output (which will contain names instead of GO term IDs if available).

We then use the opy.Relations class with mode='all', and allowed_relations including is_a, part_of (as mentioned), and some relationships which typically define relationships between tissues and phenotypes is_model_for, capable_of, capable_of_part_of, and the GO relation which is defined by Ontology to capture references to GO terms within definitions. This retrieves the mappings in 0.07 seconds.

The Relations class returns a dataframe with the same format whether you use the default mode (finding any mapping that looks like the targets) or the all mode; both are indexed by source terms. For all mode, however, there can be multiple mappings for each source term, so the dataframe contains lists of mappings. This dataframe isn’t too easy on the eyes (or analysis), so Ontolopy also has a helpful method called format_all which reformats the Relations output dataframe when the all mode is used into an easier-to-work-with multi-indexed dataframe. Example output of this can be seen in Fig. 6.10.

# Merge GO obo:
go_uberon = uberon_obo.merge(go_obo)

# Retrieve tissue-phenotype mapping:
start = time.time()
source_tissues = overall['overall'].unique()[1:]
relations = ['GO', 'is_a', 'is_model_for', 'part_of', 'capable_of', 'capable_of_part_of']
phenotype_mapping = opy.Relations(
    allowed_relations=relations, 
    ont=go_uberon, 
    sources=source_tissues, 
    targets=['GO'], 
    mode='all'
)
glue("time-tissue-up", f"{(time.time()-start):.2f} seconds", True)

# Create formatted version (easy to work with):
formatted_phenotype_mapping = phenotype_mapping.format_all(ont=go_uberon, targets=['GO'])
formatted_phenotype_mapping.index.set_names(['Tissue', 'Phenotype'], inplace=True)
formatted_phenotype_mapping = formatted_phenotype_mapping[formatted_phenotype_mapping.index.get_level_values('Phenotype').isin(biological_processes)]

# Create table view
chosen_rows = [('UBERON:0001255', 'GO:0048731'), # urinary bladder
               ('UBERON:0001255', 'GO:0007275'),
               ('UBERON:0001255', 'GO:0048856'),
               ('UBERON:0001255', 'GO:0032502'),
               ('UBERON:0001255', 'GO:0008015'),
               ('UBERON:0000955', 'GO:0050890'), # brain
               ('UBERON:0000955', 'GO:0048856'),
               ('UBERON:0000955', 'GO:0007275'),
               ('UBERON:0000955', 'GO:0021551')]
phenotype_mapping_table = formatted_phenotype_mapping[['relation_text']].loc[chosen_rows]
pd.set_option("display.max_colwidth", 600)
glue('phenotype-mapping-table', phenotype_mapping_table)
pd.reset_option("display.max_colwidth")
relation_text
Tissue Phenotype
UBERON:0001255 GO:0048731 urinary bladder part of lower urinary tract part of renal system GO renal system development is a system development
GO:0007275 urinary bladder part of lower urinary tract part of renal system GO renal system development is a system development part of multicellular organism development
GO:0048856 urinary bladder part of lower urinary tract part of renal system GO renal system development is a system development is a anatomical structure development
GO:0032502 urinary bladder part of lower urinary tract part of renal system GO renal system development is a system development is a anatomical structure development is a developmental process
GO:0008015 urinary bladder part of lower urinary tract part of renal system GO renal system process involved in regulation of blood volume is a renal system process involved in regulation of systemic arterial blood pressure part of regulation of systemic arterial blood pressure is a regulation of blood pressure part of blood circulation
UBERON:0000955 GO:0050890 brain capable of cognition
GO:0048856 brain is a organ part of anatomical system GO system development is a anatomical structure development
GO:0007275 brain is a organ part of anatomical system GO system development part of multicellular organism development
GO:0021551 brain part of central nervous system GO central nervous system morphogenesis

Fig. 6.10 Table showing a view of the output of Relations.format_all for the tissue-to-phenotype mapping.¶

As Fig. 6.10 shows, mappings contain a mixture of mappings to specific GO terms like brain and cognition, and very general phenotype terms, like urinary bladder and anatomical structure development.

# Calculate number of mapped GO terms:
go_counts = pd.DataFrame(formatted_phenotype_mapping.index.get_level_values(1).value_counts())
go_counts.columns = ['Frequency']
name = []
for go in go_counts.index:
    try:
        name.append(go_uberon[go]['name'])
    except KeyError:
        name.append(None)
go_counts.loc[:, 'name'] = pd.Series(name, go_counts.index)
go_counts_table = go_counts.head(20)
glue("go-counts-table", go_counts_table)

to_remove = go_counts_table.iloc[:9]['name'].to_list() + ['cellular process']

glue("num-go-to-remove", len(to_remove))
glue("lst-go-to-remove", list_to_text(to_remove))
glue("num-go-mapped", len(go_counts))

As we’ve seen in other example use cases, it’s possible to use the exclude option when retreiving the mapping, to exclude any terms that you might wish to avoid, for example very general terms if you have a list of these. Since we didn’t know this, we found the 20 most frequently mapped GO terms (seen in Fig. 6.11), out of 260 mapped overall. From this list 10 tissues to remove were then manually identified: developmental process, biological_process, anatomical structure development, multicellular organismal process, multicellular organism development, system development, system process, single-organism process, single-organism developmental process, and cellular process.

Frequency name
GO:0032502 142 developmental process
GO:0008150 142 biological_process
GO:0048856 141 anatomical structure development
GO:0032501 141 multicellular organismal process
GO:0007275 141 multicellular organism development
GO:0048731 140 system development
GO:0003008 108 system process
GO:0044699 64 single-organism process
GO:0044767 63 single-organism developmental process
GO:0050877 49 nervous system process
GO:0009653 46 anatomical structure morphogenesis
GO:0007399 35 nervous system development
GO:0048513 35 animal organ development
GO:0007417 34 central nervous system development
GO:0021551 34 central nervous system morphogenesis
GO:0050890 30 cognition
GO:0003013 25 circulatory system process
GO:0007586 24 digestion
GO:0022600 24 digestive system process
GO:0008015 23 blood circulation

Fig. 6.11 Table showing the frequency that phenotype (GO) terms are mapped to the provided tissue terms, for the top 20 GO terms.¶

# Calculate number of mappings per UBERON term
def mapping_to_tissue_counts(df):
    """
    Input formatted df
    """
    counts = df.groupby(level=['Tissue']).size()
    counts = pd.DataFrame(counts, columns = ['Number mapped phenotypes'])
    
    # Add zeros to counts
    unmapped_tissue_phenotype = []
    for tissue in source_tissues:
        if not tissue in counts.index:
            counts.loc[tissue] = [0]
            unmapped_tissue_phenotype.append(tissue)
            
    return counts, unmapped_tissue_phenotype

counts, unmapped_tissue_phenotype = mapping_to_tissue_counts(formatted_phenotype_mapping)

glue("unmapped-tissue-up", len(unmapped_tissue_phenotype))
glue("lst-unmapped-tissue-phenotype", list_to_text([merged[tissue]['name'] for tissue in unmapped_tissue_phenotype]))

# remove to_remove and then recalculate:
to_remove_ids = list(go_counts_table.iloc[:9].index) + ['GO:0009987']
formatted_phenotype_mapping_filtered = formatted_phenotype_mapping.drop(to_remove_ids, level='Phenotype')
counts_less, unmapped_tissue_phenotype_less = mapping_to_tissue_counts(formatted_phenotype_mapping_filtered)

num_mapped_up = len(formatted_phenotype_mapping_filtered.index.get_level_values("Tissue").unique())
glue("mapped-tissue-up", f"{num_mapped_up} ({100*num_mapped_up/float(len(source_tissues)):.2f}%)")
# glue("mapped-tissue-up", len(formatted_phenotype_mapping_filtered.index.get_level_values("Tissue").unique()))
glue("num-unique-phen-up",  len(formatted_phenotype_mapping_filtered.index.get_level_values("Phenotype").unique()))

glue("num-unmapped-tissue-phenotype-num", len(unmapped_tissue_phenotype_less))
glue("lst-unmapped-tissue-phenotype-lst", list_to_text([merged[tissue]['name'] for tissue in unmapped_tissue_phenotype_less]))

Fig. 6.13 (b) shows us that after removing very general terms, the majority of terms have 1-20 phenotypes mapped to them.

A small number have more, and a small number have no mappings. There are 24 terms in (b) which do not have a mapping except for the very general terms. These terms are: adipose tissue, pancreas, blood, umbilical cord, throat, zone of skin, breast, skin of palm of manus, cerebrospinal fluid, anatomical system, epithelium, pelvic region of trunk, skin of body, retroperitoneal space, connective tissue, thoracic segment of trunk, neck, mediastinum, omentum, perirenal fat, amnion, chorion membrane, insect adult prothoracic segment, and insect adult mesothoracic segment. Clearly there are phenotypes that affect these tissues (with the exception of the obsolete term), so the lack of mapping here may represent missing relationships or terms within the gene ontology. An important one for our data set is blood (since we have many such tissue samples): there are GO phenotype terms relating to blood such as blood circulation and blood coagulation, so why don’t we get mappings to these terms?

6.5.2.2. Propagating “down” the tree: has_part¶

The problem above happens because the annotation to these phenotype terms is not carried out at the level of tissue (blood) but at the level of cell type (blood cell). In order to retrieve these terms, instead of propagating up the tree (to more general terms) we need to look down the tree (to more specific terms).

has_part and part_of

Ontological relations have strict definitions which help allow us to reason based on these relations, for example by A has_part B, we mean that A always has B as a part, while B part_of A means that whenever B exists it is part of A.

This means that has_part and part_of are not inverses[187], i.e. if A part_of B, that does not necessarily mean that B has_part A. For example, wherever human ovaries exist, they are part_of humans, but whever humans exist, they don’t necessarily has_part human ovaries.

There are also other terms which denote similar relations, such as composed_primarily_of.

One way is to use Relations including relations in allowed_relations that denote having something as a part: has_part and composed_primarily_of. From here onwards, I’ll use has_part as a shorthand for both of these terms. It doesn’t make sense to run Relations with both has_part and part_of, since by running both up and down the tree, it could lead to technically true, but uninteresting and potentially misleading mappings. A simple fictional example of this would be mapping little toe and big toe nail by finding the relation little toe part of toes has part big toe has part big toe nail.

Including phenotypes that are only relevant for part of the tissue makes sense for tissue samples like blood where we have all parts of the blood in our sample (e.g. the sample will certainly contain blood cells and plasma). However, they may make less sense for a tissue sample like heart, where we don’t know if the sample came from the right ventricle or the left ventricle and there may be phenotype terms which are specific to a part of the anatomy we didn’t sample from. With this in mind, our choices are:

  1. don’t include has_part relations, and miss phenotype mappings that are made at the level of constituent parts

  2. include has_part relations, but be aware that some samples may map to phenotypes that they are not capable of if the sample-to-tissue mapping is not specific enough.

In our case, option (2) is preferable, particularly because the FANTOM5 dataset contains many blood samples, and otherwise we would be missing phenotype mappings entirely for these samples. Running this is otherwise very similar.

# Retrieve tissue-phenotype mapping (propagating down using "has_part"):
# TODO: Add test to Ontolopy checking that the order of checking the new terms doesn't matter - add issue
start = time.time()
relations = ['GO', 'is_a', 'is_model_for', 'has_part', 'capable_of', 'capable_of_part_of', 'channel_for', 'composed_primarily_of']
phenotype_mapping_down = opy.Relations(
    allowed_relations=relations, 
    ont=go_uberon, 
    sources=source_tissues, 
    targets=['GO'], 
    mode='all'
)
glue("time-tissue-down-1", f"{(time.time()-start):.2f} seconds", True)

# Format and calculate counts of phenotypes per tissue:
formatted_phenotype_mapping_down = phenotype_mapping_down.format_all(ont=go_uberon, targets=['GO'])
formatted_phenotype_mapping_down.index.set_names(['Tissue', 'Phenotype'], inplace=True)
formatted_phenotype_mapping_down = formatted_phenotype_mapping_down[formatted_phenotype_mapping_down.index.get_level_values('Phenotype').isin(biological_processes)]
formatted_phenotype_mapping_down_filtered = formatted_phenotype_mapping_down.drop(list(set(to_remove_ids) & set(formatted_phenotype_mapping_down.index.get_level_values('Phenotype'))), level='Phenotype')

glue("num-unique-phen-down-1", len(formatted_phenotype_mapping_down_filtered.index.get_level_values('Phenotype').unique()))
num_mapped_down1 = len(formatted_phenotype_mapping_down_filtered.index.get_level_values("Tissue").unique())
glue("mapped-tissue-down-1", f"{num_mapped_down1} ({100*num_mapped_down1/float(len(source_tissues)):.2f}%)")

counts_down1, _ = mapping_to_tissue_counts(formatted_phenotype_mapping_down)
counts_down1_less, unmapped_down1_less = mapping_to_tissue_counts(formatted_phenotype_mapping_down_filtered)

# Combine to find as yet unmapped 
yet_unmapped_down1 = list(set(unmapped_tissue_phenotype_less) & set(unmapped_down1_less))
glue("unmapped-tissue-down-1", len(yet_unmapped_down1))
glue("lst-unmapped-updown-less", list_to_text([merged[x]['name'] for x in yet_unmapped_down1]))

In Table 6.3, which compares the number of mapped tissues and phenotypes for different tissue-to-phenotype mapping methods, we can see that this method maps more phenotypes, but for less tissues.

Although less tissues have been mapped overall, we can tell they do capture previously unmapped tissues since the overall number of unmapped tissues (after removal of too-general terms) reduces from 24 with propagating up only to 10 which aren’t mapped by either method. While propagating down therefore improves the overall mapping coverage, looking at the tissues which remain unmapped gives us a clue as to what further improvements we can make.

The terms which remain unmapped are connective tissue, epithelium, adipose tissue, cerebrospinal fluid, mediastinum, umbilical cord, perirenal fat, retroperitoneal space, omentum, and anatomical system. The give-away term is skin of body, since looking at subfigure (d) in Fig. 6.13, by hovering over the top most well-mapped tissue we can see that it is stratum basale of epidermis, which is part of the epidermis which is in turn part of the skin of body. The reason skin of body doesn’t have a mapping is because while the epidermis is part_of the skin of body, the skin of body does not have the has_part relation to epidermis.

6.5.2.3. Propagating down the tree: inverse of part_of¶

In addition to the has_part approach, we could use the inverse of the part_of relations, however the definitions of these terms mean that the inverse of part_of means something like can have part. This would mean that in addition to the risk of potentially including mapping to more specific parts of the body that weren’t in our sample (as we discussed for the has_part approach), we might sometimes include mappings to tissues that were not even present in the species or gender from which the sample came. The species problem is the much more pressing concern since Uberon is a multi-species ontology containing many non-human-specific terms and therefore we could end up mapping human samples to terms like GO:0035844 cloaca development.

One solution to this in Ontolopy is to define relations that look something like A can_have_human_part B from the information in the ontology files, by using the inverse of part_of relations only where there is an external reference (xref) to the FMA human anatomy ontology. We need to do this semi-manually as Ontolopy does not currently contain tools for automatically defining new relations. Once we’ve created this new relationship, we can ask for relations including it in the same query as has_part. One downside of this approach is that relations found using this kind of self-defined relation will not be able to use the simple reasoning that Ontolopy is capable of (i.e. collapsing relations by using definitions like is_a\(\cdot\)part_of == part_of, since such equivalences are not defined).

Sex-specific phenotype mappings

We could also create a relation like A can_have_part_in_female B (and an analagous term for male) when B part_of A and B part_of UBERON:0003100 female organism. We could then cross-reference the sex of our samples from the sample information file to ensure that we don’t create mappings between e.g. male-only samples and ovaries. This isn’t done here, since it will only affect a very small number of mappings (given that many of the samples are mixed/unknown sexes, that there are only a small number of sexual dimorphic tissues, and that these are generally mapped at the level of specific sexes already), and wouldn’t illustrate a different aspect of using Ontolopy. Such mappings, if they exist, are simply not included. If they are excluded, it means that we simply do not map non sex-specific tissues like gonad to either testes or ovary-related phenotypes, so we might be missing such mappings.

We could also do the same for sex-specific tissue-phenotype mappings, should we want to. This could be useful, depending on the experiment data in question and the resulting sample-tissue mappings we’d previously attained.

# Create new ontology term 'can_have_human_part' as inverse of 'part_of' with 'xref'=='FMA':
for term in go_uberon.terms:
    if term.split(':')[0] == 'UBERON':
        try:
            parts_of = go_uberon[term]['part_of']
        except KeyError:
            parts_of = []
            
        try:
            xrefs = go_uberon[term]['xref']
        except:
            xrefs = []
        
        if not any([xref.startswith('FMA') for xref in xrefs]):  # not human, e.g. cloaca
            continue
        
        for part in parts_of:
            if part.split(':')[0] == 'UBERON':
                try:
                    go_uberon[part]['can_have_human_part'].add(term)
                except:
                    go_uberon[part]['can_have_human_part'] = {term}

# Create mapping:
start = time.time()
phenotype_mapping_down2 = opy.Relations(
    allowed_relations=relations+['can_have_human_part'], 
    ont=go_uberon, 
    sources=source_tissues, 
    targets=['GO'], 
    mode='all'
)
glue("time-tissue-down-2", f"{(time.time()-start):.2f} seconds", True)

# Format:
formatted_phenotype_mapping_down2 = phenotype_mapping_down2.format_all(ont=go_uberon, targets=['GO'])
formatted_phenotype_mapping_down2.index.set_names(['Tissue', 'Phenotype'], inplace=True)
formatted_phenotype_mapping_down2 = formatted_phenotype_mapping_down2[formatted_phenotype_mapping_down2.index.get_level_values('Phenotype').isin(biological_processes)]
formatted_phenotype_mapping_down2_filtered = formatted_phenotype_mapping_down2.drop(list(set(to_remove_ids) & set(formatted_phenotype_mapping_down2.index.get_level_values('Phenotype'))), level='Phenotype')
num_mapped_down2 = len(formatted_phenotype_mapping_down2_filtered.index.get_level_values("Tissue").unique())
glue("mapped-tissue-down-2", f"{num_mapped_down2} ({100*num_mapped_down2/float(len(source_tissues)):.2f}%)")
glue("num-unique-phen-down-2", len(formatted_phenotype_mapping_down2_filtered.index.get_level_values('Phenotype').unique()))

# Calculate counts:
counts_down2, _ = mapping_to_tissue_counts(formatted_phenotype_mapping_down2)
counts_down2_less, unmapped_down2_less = mapping_to_tissue_counts(formatted_phenotype_mapping_down2_filtered)

yet_unmapped_down2 = list(set(unmapped_tissue_phenotype_less).intersection(set(unmapped_down1_less), set(unmapped_down2_less)))
glue("unmapped-tissue-down-2", len(yet_unmapped_down2))
glue("lst-unmapped-updown2-less", list_to_text([merged[x]['name'] for x in yet_unmapped_down2]))

assert(len(set(formatted_phenotype_mapping_down.index) - set(formatted_phenotype_mapping_down2.index))==0)

# Create examples of new relations table
examples = [2, 9, 19, 45, 59, 105]
new_samples = list(set(formatted_phenotype_mapping_down2_filtered.index)-set(formatted_phenotype_mapping_down_filtered.index))
glue('new-down2', len(new_samples))
pd.set_option("display.max_colwidth", 600)
table_down2_examples = formatted_phenotype_mapping_down2_filtered.loc[[new_samples[i] for i in examples]][['relation_text']]
glue('table-down2-examples', table_down2_examples)
pd.reset_option("display.max_colwidth")

This resulting mapping contains 4968 additional tissue-phenotype mappings (not found in either the has_part or the part_of approach). Some examples of these additional tissue-phenotype mappings that were found using this can_have_human_part approach are given in Fig. 6.12. Overall, this mapping covers 133 (85.26%) tissues and 449 phenotypes.

The yet unmapped tissues (by any method) are now: connective tissue, epithelium, cerebrospinal fluid, adipose tissue, mediastinum, umbilical cord, perirenal fat, retroperitoneal space, and omentum. While this list still contains tissues that we would expect to map to GOBP phenotypes, the lack of these terms in our searches now means that they are simply missing annotations. For example we have no mapping for adipose tissue despite the fact that GOBP terms exists for adipose tissue development and fat cell proliferation, but there is no cross-ontology mapping of these term in the current version of the ontology, so there is no way Ontolopy could pick them up.

relation_text
Tissue Phenotype
UBERON:0002022 GO:0048858 insula is a cerebral hemisphere gray matter can have human part cerebral cortex can have human part hippocampal formation can have human part hippocampus alveus is a central nervous system white matter layer composed primarily of white matter can have human part gracile fasciculus GO gracilis tract morphogenesis is a central nervous system projection neuron axonogenesis is a central nervous system neuron axonogenesis is a axonogenesis is a neuron projection morphogenesis is a cell projection morphogenesis
UBERON:0016525 GO:0019226 frontal lobe can have human part anterior segment of paracentral lobule is a regional part of brain composed primarily of neural tissue has part neuron capable of transmission of nerve impulse
UBERON:0000473 GO:2000147 testis can have human part seminal vesicle can have human part duct of seminal vesicle channel for seminal vesicle fluid is a seminal fluid capable of part of positive regulation of flagellated sperm motility is a positive regulation of cilium-dependent cell motility is a positive regulation of cell motility
UBERON:0011595 GO:0048870 jaw region can have human part tooth bud can have human part odontogenic papilla is a developing mesenchymal condensation composed primarily of mesenchyme condensation cell is a mesenchymal cell is a motile cell capable of cell motility
GO:0006897 jaw region is a organism subdivision has part external soft tissue zone has part musculature can have human part muscle tissue can have human part endomysium is a reticular tissue can have human part reticuloendothelial system composed primarily of phagocyte capable of phagocytosis is a endocytosis
UBERON:0001255 GO:0060562 urinary bladder is a viscus is a trunk region element is a organ can have human part vasculature of organ is a vasculature can have human part capillary bed is a epithelial plexus is a epithelial tube GO epithelial tube morphogenesis

Fig. 6.12 An example of six of the 4968 additional Uberon-GOBP mappings found using can_have_human_part.¶

6.5.2.4. Combining previous mappings¶

To create the final tissue-phenotype mapping, we combine the propagating up (part_of) mapping with the larger propagating down (can_have_human_part) mapping, by simply appending the new lines of the DataFrame.

‘can_have_human_part’ query completely contains ‘has_part’ query.

Because the allowed_relations for the mappings of the can_have_human_part query completely contain those for the has_part query, so do the found relations. This means that we only need to combine the part_of and can_have_human_part mappings to get the most complete set.

As we can see in Table 6.3, the overall combined mapping covers 147 (94.23%) of the Uberon tissues searched for and maps to 510 unique GOBP tissues. Since this is greater than any other individual mapping, we can see that it is necessary to combine different mapping types to get high (>90%) coverage of tissues.

# COMBINING:
# Add new mappings:
new_mappings =  set(formatted_phenotype_mapping.index) - set(formatted_phenotype_mapping_down2.index)
up_down_phenotype_mapping_formatted = formatted_phenotype_mapping_down2.copy()
for i in new_mappings:
    up_down_phenotype_mapping_formatted = up_down_phenotype_mapping_formatted.append(formatted_phenotype_mapping.loc[i])

# filter:
up_down_phenotype_mapping_filtered = up_down_phenotype_mapping_formatted.drop(list(set(to_remove_ids) & set(up_down_phenotype_mapping_formatted.index.get_level_values('Phenotype'))), level='Phenotype')
                               
# counts per phenotype:
counts_updown, _ = mapping_to_tissue_counts(up_down_phenotype_mapping_formatted)        
counts_updown_less, unmapped_updown_less = mapping_to_tissue_counts(up_down_phenotype_mapping_filtered)

# calc stats:
glue("num-unique-phen-combined", len(up_down_phenotype_mapping_filtered.index.get_level_values('Phenotype').unique()))
num_mapped_combined = len(up_down_phenotype_mapping_filtered.index.get_level_values('Tissue').unique())
glue("mapped-tissue-combined", f"{num_mapped_combined} ({100*num_mapped_combined/float(len(source_tissues)):.2f}%)")
# Create bar charts:
fig = make_subplots(rows=4, cols=2,
                    subplot_titles=(
                        "(a) all phenotype terms,<br>propagating up `part_of`", 
                        "(b) general phenotypes removed,<br>propagating up `part_of`",
                        "(c) all phenotype terms,<br>propagating down `has_part`",
                        "(d) general phenotypes removed,<br>propagating down `has_part`",
                        "(e) all phenotype terms,<br>propagating down<br>`can_have_human_part`",
                        "(f) general phenotypes removed,<br>propagating down<br>`can_have_human_part`",
                        "(g) all phenotype terms,<br>propagating both up and down",
                        "(h) general phenotypes removed,<br>propagating both up and down"
                    ),
                    shared_yaxes=True,
                   horizontal_spacing=0.05,
                   )

# ROW 1: (a) all phenotype terms propagating up
fig.add_trace(
    go_.Bar(y=counts.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index], 
           name='all, up',
           marker_color = 'slateblue',
           ),
    row=1, col=1
)
# ROW 1 (b) general phenotypes removed propagating up
fig.add_trace(
    go_.Bar(y=counts_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='filtered, up',
           marker_color = 'darkslateblue',
           ),
    row=1, col=2
)
# ROW 2: (c) all phenotype terms propagating down: has_part
fig.add_trace(
    go_.Bar(y=counts_down1.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_down1.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='all, down: has_part',
           marker_color = 'lightsalmon',
           ),
    row=2, col=1
)
# ROW 2: (d) general phenotypes removed propagating down
fig.add_trace(
    go_.Bar(y=counts_down1_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_down1_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='filtered, down: has_part',
           marker_color = 'tomato',
           ),
    row=2, col=2
)    

# ROW 3: (e) all phenotype terms, down can_have_human_part
fig.add_trace(
    go_.Bar(y=counts_down2.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_down2.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='all, down: can_have_human_part',
           marker_color = 'orchid',
           ),
    row=3, col=1
)    
# ROW 3: (f) general phenotypes removed, down can_have_human_part
fig.add_trace(
    go_.Bar(y=counts_down2_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_down2_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='filtered, down: can_have_human_part',
           marker_color = 'darkorchid',
           ),
    row=3, col=2
)    

# ROW 4: (g) all phenotype terms, propagating both up and down
fig.add_trace(
    go_.Bar(y=counts_updown.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_updown.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='all, both up and down',
           marker_color = 'palevioletred',
           ),
    row=4, col=1
)    
# ROW 4: (h) general phenotypes removed, propagating both up and down
fig.add_trace(
    go_.Bar(y=counts_updown_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'], 
           x=[f"{merged[x]['name']}, {x}" for x in counts_updown_less.sort_values(by='Number mapped phenotypes')['Number mapped phenotypes'].index],
           name='filtered, both up and down',
           marker_color = 'mediumvioletred',
           ),
    row=4, col=2
)    

fig.update_layout(showlegend=False, width=800, height=600)
fig.update_yaxes(range=[0,150], tickvals=list(range(0,151,25)))
fig.update_yaxes(title_text='Number<br>phenotypes<br>mapped to', col=1)
fig.update_annotations(font_size=12)  # subplot titles are annotations in plotly

fig.update_xaxes(showticklabels=False)
fig.update_xaxes(title_text='Tissue term', row=4)
fig.show('notebook')
../_images/blank.png

Fig. 6.13 Bar charts showing the number of tissue-phenotype mappings (number of phenotypes mapped to each tissue - roll mouse over to see tissue names) for (a) all phenotype terms propagating down, (b) general phenotypes removed propagating down, (c) all phenotype terms propagating up, (d) general phenotypes removed propagating down, (e) all phenotype terms propagating both up and down, and (f) general phenotypes removed propagating both up and down.¶

There are 9 unmapped tissues (which map to zero phenotypes), and all other tissues map to between 2 and 131 phenotype terms, (as we can see in figure Fig. 6.13). The number of mappings varies smoothly in this range with more general tissues and organs broadly appearing to have higher numbers of mappings than very specific tissues. We can also see in Fig. 6.13 that the can_have_human_part mapping makes up the majority of the mappings in the final combined mapping.

Table 6.3 Table comparing the difference in coverage (mappable tissues), phenotypes, time taken, and unmapped tissues for different mapping techniques.¶

Mapping name

Tissue coverage: number (percent) tissues mapped (this mapping only)

Number of unique phenotype mapped to (by this mapping only)

Number tissues remaining unmapped (by this or any previous mapping)

Time to retrieve mapping

Propagating up

132 (84.62%)

250

14

0.07 seconds

Propagating down using has_part

92 (58.97%)

248

10

0.07 seconds

Propagating down using can_have_human_part

133 (85.26%)

449

9

0.45 seconds

Combined (all of the above)

147 (94.23%)

510

N/A

N/A

6.5.3. Creating sample-to-tissue-phenotype mappings¶

Once we have both the sample-to-tissue and tissue-to-phenotype mappings, we can combine them to get the sample-to-tissue-phenotype mappings: mappings between samples and phenotypes that occur in the tissue type of that sample. There isn’t a built-in Ontolopy function to do this, but since Ontolopy objects are built on top of Pandas DataFrames, they are fairly easy to work with.

Since we chose to map primary cell and tissue samples only, there are many samples which are not mapped. Null mappings are included in the output, and where mapping by name is used, it is recorded as a mapped_by_name_to relationship in the relation path, e.g. FF:11453-119A4.mapped_by_name_to~UBERON:0002048.is_a~UBERON:0000171.capable_of~GO:0007585 or in text Bronchial Epithelial Cell, donor4 mapped by name to lung is a respiration organ capable of respiratory gaseous exchange by respiratory system. Since relation strings can now contain FANTOM5, CL, UBERON and GO terms, I first merge the GO ontology into the merged FANTOM5 and Uberon ontologies, so that the names of all terms can be found for the relation text.

mapped_by_name_relation = 'mapped_by_name_to'

# merge ontology
merged = merged.merge(go_obo)

sample_to_tissue = {}
for sample, row_i in overall.iterrows():
    if not pd.isna(row_i['overall']):
        tissue = row_i['overall']
    else: # no mapping from sample to tissue
        tissue = np.nan
        phenotype = np.nan
        sample_to_tissue[(sample, tissue, phenotype)] = [np.nan, np.nan]
        continue
        
    try:
        mappings = up_down_phenotype_mapping_filtered.xs(tissue, level ='Tissue')
        for phenotype, row_j in mappings.iterrows():
            if row_i['mapped_by'] in ['ontology', 'both (same)']:
                relation_path = sample_to_tissue_mapping.loc[sample, 'relation_path'].replace(tissue, row_j['relation_path'])          
            elif row_i['mapped_by'] == 'name':
                relation_path = f"{sample}.{mapped_by_name_relation}~{tissue}".replace(tissue, row_j['relation_path'])
            else:
                logging.warning(f"{row_i['mapped_by']} has unexpected format")
                continue
            sample_to_tissue[(sample, tissue, phenotype)] = [relation_path, 
                                                             opy.relation_path_to_text(relation_path, merged)]
    except KeyError:
        # Save phenotype as NaN if no phenotype mapping:
        sample_to_tissue[(sample, tissue, np.nan)] = [sample_to_tissue_mapping.loc[sample, 'relation_path'], 
                                                      opy.relation_path_to_text(sample_to_tissue_mapping.loc[sample, 'relation_path'], merged)]
        

# Save out overall mapping:
sample_to_tissue_df = pd.DataFrame.from_dict(sample_to_tissue,
                                      orient='index',
                                      columns=['relation_path', 'relation_text'])
sample_to_tissue_df.index = pd.MultiIndex.from_tuples(sample_to_tissue_df.index, names=["sample", "tissue", "phenotype"])
mapping_file_path = '../c06-filter/data/created/fantom-go-mapping.csv'
sample_to_tissue_df.to_csv(mapping_file_path, sep = '\t')
display(sample_to_tissue_df.head(20))

6.5.3.1. Final mapping¶

# Glue final mapping statistics:
glue('sample-phen-rows', len(sample_to_tissue_df))
glue('total-sample-phen', len(sample_to_tissue_df.index.dropna()))

df = sample_to_tissue_df.reset_index()

glue('sample-phen-coverage-cat', f"{100*len(df[~df['phenotype'].isna() & df['sample'].isin(category_samples)]['sample'].unique())/float(len(df[df['sample'].isin(category_samples)]['sample'].unique())):.2f}%")
glue('sample-phen-coverage-all', f"{100*len(df[~df['phenotype'].isna()]['sample'].unique())/float(len(df['sample'].unique())):.2f}%")

glue('sample-tissue-nan', len(df[~df['tissue'].isna() & df['phenotype'].isna()]))
glue('total-sample-nan', len(df[df['tissue'].isna()]))

glue('notebook-time-mapping-example', f"{time.time()-notebook_start:.0f} seconds")

# Why are samples unmapped? 
# Count number of rows per unique tissue without phenotype mapping.
rows = []
for uberon in df[df['phenotype'].isna()]['tissue'].unique():
    if type(uberon) == float:
        if np.isnan(uberon):
            num = len(df[df['phenotype'].isna() & df['tissue'].isna()])
            row = ['NaN', 'Unmapped to tissue', num]
    else:
        num = len(df[df['phenotype'].isna() & (df['tissue'] == uberon)])
        row = [uberon, merged[uberon]['name'], num]
    rows.append(row)

unmapped_why = pd.DataFrame(rows, columns = ['Uberon ID', 'Uberon Name', 'Number samples mapped to tissue']);
unmapped_why.set_index('Uberon ID', inplace=True)
glue("unmapped-why", unmapped_why)

There are 65953 rows of the sample-to-tissue mapping DataFrame in total. This includes some NaN values, so it contains 65608 mappings from sample to phenotype; equivalent to a sample coverage of 92.34% of filtered (tissue and primary cell) samples or 81.00% of all samples. It also includes an additional 181 mappings from sample to tissue (but not to phenotype), and 164 samples with no mapping to tissue or phenotype. Fig. 6.14 shows why almost 10% of samples are unmapped in more detail: many samples map to the same unmapped tissues, particularly adipose tissue, epithelium, or connective tissue.

Uberon Name Number samples mapped to tissue
Uberon ID
NaN Unmapped to tissue 164
UBERON:0001013 adipose tissue 72
UBERON:0002331 umbilical cord 8
UBERON:0001359 cerebrospinal fluid 1
UBERON:0000483 epithelium 62
UBERON:0003693 retroperitoneal space 2
UBERON:0002384 connective tissue 28
UBERON:0003728 mediastinum 1
UBERON:0003688 omentum 6
UBERON:0005406 perirenal fat 1

Fig. 6.14 Table showing how many samples are mapped to each unmappable tissue, showing why the coverage of samples isn’t higher.¶

With Ontolopy, this complex task is relatively quick: it took 45 seconds to run this whole notebook on a laptop without any parallelisation. Also recall that this section is merely an example of an application of Ontolopy: this same process could be done for other datasets that provide an ontology and/or a sample information file.

previous

6.4. Example uses: mapping samples to diseases or phenotypes

next

6.6. Discussion

By Natalie Zelenka
© Copyright 2020.