6.4. Validation method

To test Filip fairly, I entered it in the third CAFA competition (CAFA3), where it could be independently assessed by other researchers. Each CAFA participant can enter up to three methods, so I entered DcGO alone and DcGO plus Filip in order to compare their performance.

6.4.1. Test set: CAFA3

After initial development, I entered both DcGO alone and Filip-plus-DcGO into the CAFA3 competition in order to test Filip on an unseen dataset.

This meant that I did not download the CAFA3 ground truth, as this analysis was done by the CAFA3 team. I downloaded only the CAFA3 targets, which continue to be available through the CAFA website.

Again, I used only the human targets (file target.9606.fasta). As for CAFA2, this is a FASTA file, this time containing 20197 target proteins.
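As a minimal sketch of how the targets in such a file can be counted, the following parses FASTA records using only the standard library. The function name `read_fasta` and the example headers and sequences are hypothetical and are not taken from the CAFA3 file.

```python
from typing import Iterable, Iterator, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from the lines of a FASTA file."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts: emit the previous one, if any.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines may be wrapped over several lines.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


# Made-up example records (hypothetical IDs and sequences).
example = """>T96060000001 EXAMPLE1_HUMAN
MKV
LLA
>T96060000002 EXAMPLE2_HUMAN
MSTQ
""".splitlines()

records = list(read_fasta(example))
n_targets = len(records)
print(n_targets)
```

For a real file, the same function can be given an open file handle instead of a list of lines, for example `read_fasta(open("target.9606.fasta"))`.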

6.4.2. Filip inputs for validation

As previously described, three types of input are needed for Filip:

  1. Protein function predictions.

  2. Normalised gene expression data.

  3. A map from gene expression samples to Uberon tissues.

I described the gene expression data and metadata for (2) and (3) used for validation in the previous section.

6.4.2.1. Creating protein function predictions (DcGO)

I used DcGO as a test since I knew that its structure-centric prediction method did not include any gene expression information.

To create the input to DcGO, I used:

The script to create the UniprotKB IDs is available here, and the script to create the input for DcGO is here. Then, to create the DcGO-only entry, I used the DcGOR library[219] (the dcAlgoPredictMain function).

import pandas as pd
from myst_nb import glue

# TODO: Make drop-down. Glue table.
# Read the tab-separated DcGO submission file, skipping its first four lines.
# The file has no column header row; the first column (protein) becomes the index.
dcgo_predictions = pd.read_csv('data/created/dcgo_submission.txt', sep='\t', skiprows=4, index_col=0, header=None)
# Remove the closing 'END' line of the submission file.
dcgo_predictions.drop('END', inplace=True)
dcgo_predictions.columns = ['phenotype', 'confidence']
dcgo_predictions.index.rename('protein', inplace=True)
# Make a preview of the table and the summary counts available to the text.
glue('dcgo-predictions-view', dcgo_predictions.head())
glue('dcgo-cafa2-predictions', len(dcgo_predictions.index), display=False)
glue('dcgo-cafa2-proteins', len(dcgo_predictions.index.unique()), display=False)
glue('dcgo-cafa2-phenotypes', len(dcgo_predictions.phenotype.unique()), display=False)

The DcGO predictions contain only 15192949 of 20257 proteins and 15749 phenotype terms (all of which are GO terms).

6.4.3. Running Filip

I used an early version of ontolopy to map between Uberon tissues and phenotypes. I describe this process in detail in Section 7.5. For CAFA3, I used the phenotypes present in the DcGO predictions as targets, and looked for mappings that include only Uberon terms (not Cell Ontology terms).

I chose the cut-off (50 TPM) by plotting the distribution of TPM expression and choosing a value below which there appeared to be little noise between biological and technical replicates.

6.4.4. Validation methodology

This confidence score allows for a range of possible sets of predictions, depending on the threshold parameter \(\tau\). Precision (the proportion of selected items that are relevant) and recall (the proportion of relevant items that are selected) are defined in terms of true positives \(t_p\), false positives \(f_p\), and false negatives \(f_n\):

\(precision = p = \frac{t_p}{t_p + f_p}\)

\(recall = r = \frac{t_p}{t_p + f_n}\)

Precision-recall curves are generally used to validate a predictor's performance, but the \(F_1\) measure combines precision and recall into a single measure of performance:

\(F_1 = 2 \frac{precision \cdot recall}{precision + recall}\)

Since precision and recall both vary with \(\tau\), so does \(F_1\). The \(F_{max}\) score is the maximum \(F_1\) achieved over all values of \(\tau\).
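As an illustration of the definitions above, the following is a minimal sketch of computing \(F_{max}\) by sweeping \(\tau\). It pools all (protein, term) pairs into a single count of \(t_p\), \(f_p\) and \(f_n\), which is a simplification and not a reimplementation of the official CAFA assessment. The function names, toy scores and threshold grid are all hypothetical.

```python
from typing import Dict, Iterable, Optional, Set, Tuple

Pair = Tuple[str, str]  # (protein, term)


def precision_recall(predicted: Set[Pair], truth: Set[Pair]) -> Tuple[float, float]:
    """Precision and recall from pooled counts of tp, fp and fn."""
    tp = len(predicted & truth)
    fp = len(predicted - truth)
    fn = len(truth - predicted)
    # Assumption: precision and recall are taken as 0 when undefined.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def f1(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 if both are 0)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def f_max(scores: Dict[Pair, float], truth: Set[Pair],
          thresholds: Iterable[float]) -> Tuple[float, Optional[float]]:
    """Return the best F1 and the first threshold tau at which it is reached."""
    best_f1, best_tau = 0.0, None
    for tau in thresholds:
        # A prediction is kept if its confidence is at least tau.
        predicted = {pair for pair, conf in scores.items() if conf >= tau}
        score = f1(*precision_recall(predicted, truth))
        if score > best_f1:
            best_f1, best_tau = score, tau
    return best_f1, best_tau


# Toy data (hypothetical proteins, terms and confidences).
scores = {
    ("P1", "GO:A"): 0.9,
    ("P1", "GO:B"): 0.4,
    ("P2", "GO:A"): 0.7,
    ("P2", "GO:C"): 0.2,
}
truth = {("P1", "GO:A"), ("P2", "GO:A"), ("P2", "GO:B")}
thresholds = [i / 100 for i in range(1, 101)]

best_f1, best_tau = f_max(scores, truth, thresholds)
print(round(best_f1, 3), best_tau)
```

On this toy data, the best threshold falls just above 0.4, where both remaining predictions are correct (precision 1) and two of the three true associations are recovered (recall 2/3), giving \(F_1 = 0.8\).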

CAFA validation can be either term-centric or protein-centric. For each option, submissions are assessed separately per species, and separately for wholly unknown and partially known genes.

6.4.4.1. Limitations of validation method

There is no penalty for making a broad guess, nor any reward for making a precise one. This is one of the reasons that the naive method does so well: for example, it is not penalised for guessing that the root term of the GO Biological Process Ontology (BPO), Biological Process, is related to every gene.
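A toy calculation makes this concrete. It is a sketch only, assuming that the ground truth is propagated up to the root so that every annotated gene is also annotated to Biological Process (GO:0008150). The gene names and the child term identifiers are placeholders, not real data.

```python
ROOT = "GO:0008150"  # biological_process, the root of the BPO

# Toy ground truth (hypothetical genes; child term IDs are placeholders).
# Assumption: annotations are propagated, so every gene also has the root term.
truth = {
    ("geneA", ROOT), ("geneA", "GO:child1"),
    ("geneB", ROOT), ("geneB", "GO:child2"),
    ("geneC", ROOT),
}
genes = {gene for gene, _ in truth}

# A maximally broad guess: predict only the root term, for every gene.
broad_guess = {(gene, ROOT) for gene in genes}

tp = len(broad_guess & truth)  # every root prediction is correct
fp = len(broad_guess - truth)  # no incorrect predictions
fn = len(truth - broad_guess)  # the specific terms are missed

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1_score = 2 * precision * recall / (precision + recall)
print(precision, recall, f1_score)
```

Here the uninformative guess has perfect precision and still recovers three of the five true associations, so it receives a high \(F_1\) despite saying nothing specific about any gene.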

Due to the nature of the validation set, it’s possible that the best-scoring CAFA methods simply predict which associations are likely to be discovered soon (i.e. associations to genes people are currently studying, which is well-predicted by genes that have recently been studied).