6.2. Algorithm

To overcome the problem of predictors making erroneous predictions because they lack gene expression information, I have created a lightweight tool which allows researchers to filter their phenotype or protein function predictions using tissue-specific gene expression data.

Drawing on the noble tradition of scientists naming things badly, I call this tool Filip, as it is for Filtering predictions.

6.2.1. Overview

Fig. 6.1 illustrates Filip’s two-step approach, which aims to filter out predictions for proteins that are not created in the tissue of interest (i.e. the tissue related to the predicted phenotype). The filter is a simple rule-based tool, designed to be used on top of any protein function predictor, but it provides the most value for predictors that rely on structural or sequence similarity.

../_images/filip.png

Fig. 6.1 An illustration showing how Filip works. It’s a two-step process where protein-phenotype predictions are expected as input. In step 1, preprocessing, proteins are mapped to genes and phenotypes are mapped to tissues. In step 2, filtering, Filip filters out any predictions for which the gene is not expressed in the tissue.

6.2.2. Inputs

Three types of input are needed for Filip:

  1. Protein function predictions

  2. Normalised gene expression data

  3. A map from gene expression samples to Uberon tissues

6.2.2.1. Protein function predictions

Protein function predictions must be links between protein identifiers and phenotype terms from the GOBP, HP, MP or DOID ontologies. This is the standard format for CAFA competitions.
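For illustration, such an input might be loaded as a small table like the one below (the identifiers, column names, and scores are made up for the example; they are not a prescribed file format):

```python
import pandas as pd

# A sketch of a predictions input: one row per protein-phenotype link,
# with an optional confidence score (example identifiers only, not taken
# from a real predictor's output).
predictions = pd.DataFrame(
    [
        ("Q92793", "HP:0001249", 0.87),
        ("P38398", "GO:0006281", 0.42),
    ],
    columns=["protein", "phenotype_term", "score"],
)
```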

6.2.2.2. Gene expression file

If Filip were a filter coffee machine, the gene expression (GE) file would be the (reusable) filter: it is the part that determines what can and cannot pass through, and it can be used with any kind of input predictions (coffee). Once we have the GE file, it can be reused with any protein function predictor, as long as that predictor outputs phenotype terms related to the samples in our GE file.

The user must also determine a cut-off: the minimum gene expression level that counts as “expressed”. The higher the cut-off, the more genes will count as unexpressed, and therefore the more predictions will be filtered out of the original list.
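As a sketch of what the cut-off does, assuming a genes-by-samples matrix of normalised expression values (the gene and sample identifiers below are made up):

```python
import pandas as pd

# Tiny made-up genes-by-samples matrix of normalised expression values.
expression = pd.DataFrame(
    {"sample_0001": [5.2, 0.0], "sample_0002": [0.3, 2.1]},
    index=["ENSG00000005339", "ENSG00000012048"],
)

cutoff = 1.0  # minimum expression level to count as "expressed"

# Boolean matrix: True wherever a gene counts as expressed in a sample.
# Raising the cut-off shrinks this set, so more predictions are removed.
is_expressed = expression >= cutoff
```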

6.2.2.3. Sample-tissue map

Some GE datasets will include a sample-to-Uberon map as part of their metadata (e.g. FANTOM5). For those that don’t, the ontolopy Python package can be used to map between sample tissue names and their Uberon tissue terms.
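Whichever route is used, the result is a simple lookup from sample identifiers to Uberon terms, along the lines of the hand-written example below (the sample identifiers are invented):

```python
# Hand-written example of a sample-to-Uberon map (invented sample IDs).
sample_to_tissue = {
    "sample_0001": "UBERON:0002107",  # liver
    "sample_0002": "UBERON:0000955",  # brain
}
```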

6.2.3. Step 1: Preprocessing

The preprocessing step outputs:

  1. A phenotype-to-sample map, which stores the column indices in the gene expression file that Filip should use for filtering each phenotype.

  2. A protein-to-gene map, which maps between proteins present in the input predictions and genes present in the input GE file.

Mapping between phenotypes and samples is the most involved part of Filip: it relies on Ontolopy (described in the next chapter) to create this mapping.
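A rough sketch of what these two outputs look like, using the invented identifiers from the snippets above and a hand-written phenotype-to-tissue link standing in for the Ontolopy mapping:

```python
# Hand-written stand-in for the Ontolopy phenotype-to-tissue mapping
# (a liver-related phenotype, for illustration only).
phenotype_to_tissue = {"HP:0001392": "UBERON:0002107"}

sample_to_tissue = {  # as sketched in the previous section
    "sample_0001": "UBERON:0002107",  # liver
    "sample_0002": "UBERON:0000955",  # brain
}

# (1) Phenotype-to-sample map: for each phenotype, the GE columns
#     (samples) whose tissue matches that phenotype's tissue.
phenotype_to_samples = {
    phenotype: [s for s, t in sample_to_tissue.items() if t == tissue]
    for phenotype, tissue in phenotype_to_tissue.items()
}
# {'HP:0001392': ['sample_0001']}

# (2) Protein-to-gene map: protein identifiers from the predictions
#     linked to gene identifiers (rows) in the GE file (example pairs).
protein_to_gene = {"Q92793": "ENSG00000005339", "P38398": "ENSG00000012048"}
```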

6.2.4. Step 2: Filtering

The filtering step takes the original inputs, the preprocessing outputs, and a GE cut-off as input. It outputs a reduced list of predictions that remain valid, i.e. those whose genes are expressed above the cut-off on average across the relevant samples.
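Putting the pieces together, the filtering logic amounts to something like the function below. This is a sketch using the names from the earlier snippets, not Filip’s actual interface, and it glosses over how predictions that cannot be mapped are handled:

```python
def filter_predictions(predictions, expression, protein_to_gene,
                       phenotype_to_samples, cutoff):
    """Keep only predictions whose gene's mean expression, across the
    samples matched to the predicted phenotype, is at or above the
    cut-off. Sketch only; argument names follow the earlier snippets."""
    keep = []
    for idx, row in predictions.iterrows():
        gene = protein_to_gene.get(row["protein"])
        samples = phenotype_to_samples.get(row["phenotype_term"], [])
        if gene is None or not samples or gene not in expression.index:
            continue  # this sketch simply drops predictions it cannot map
        if expression.loc[gene, samples].mean() >= cutoff:
            keep.append(idx)
    return predictions.loc[keep]

# Example use with the objects defined in the earlier snippets:
# filtered = filter_predictions(predictions, expression, protein_to_gene,
#                               phenotype_to_samples, cutoff)
```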