Front Matter
Background
Phenotype prediction
Snowflake
Tissue-specific gene expression
Concluding remarks
End Matter
Chapter 1: Introduction
1.1 Unusual stylistic choices in this thesis
1.2 Research philosophy
Chapter 2: Biology Background
2.1 Big questions: What is genetically determined, and how?
2.1.1 History of inheritable traits
2.5.3 The future computational biologists want
2.2 Biological molecules: DNA, RNA, Proteins and the central dogma of molecular biology
2.2.1 DNA
2.2.1.2 “DNA makes RNA”, a.k.a. transcription
2.2.1.3 “RNA makes Proteins”, a.k.a. translation
2.2.1.4 “… and proteins do everything.”
2.3 A closer look at DNA: Genomes, Genes, and Genetic Variation
2.3.1 Genomes
2.3.2 The exome and the proteome
2.3.3 Genes
2.3.3.1 “A gene for X”
2.3.3.2 Units of heritability
2.3.4 Things that are not genes
2.3.5 Indels and Copy Number Variations
2.3.6 Single Nucleotide Polymorphisms
2.3.6.1 Non-synonymous SNVs
2.3.6.2 Synonymous SNVs
2.4 Looking more closely at proteins: function, structure and classification
2.4.1 Protein structure: Primary, Secondary, Tertiary, and Quaternary
2.4.1.1 Quaternary structures: protein domains
2.4.1.2 Disorder
2.4.1.3 Classifying proteins by domain: families and superfamilies
2.5 Phenotype
2.5.1 What is phenotype?
2.5.2 How do proteins influence phenotype?
2.5.2.1 Limits
2.5.2.2 Ethical considerations
2.6 Summary: how genotype and phenotype are linked
Chapter 3: Computational Biology Background
3.1 Sequencing and microarrays
3.1.1 Sequencing
3.1.1.1 Cap Analysis of Gene Expression
3.1.2 Alignment and assembly
3.1.3 Microarrays
3.2 From genotype to phenotype: what is measured
3.2.1 DNA
3.2.1.1 Whole genomes
3.2.1.2 The human reference genome
3.2.1.3 Genes
3.2.1.4 Variants
3.2.2 RNA
3.2.2.1 RNA Sequence and Structure
3.2.2.2 Gene Expression
3.2.2.3 RNA-Seq bioinformatics pipeline
3.2.3 Proteins
3.2.3.1 Protein Sequence
3.2.3.2 Protein Abundance
3.2.3.3 Protein Structure
3.2.4 Phenotypes
3.2.5 Measuring the connection between genotype and phenotype
3.2.5.1 Genome-Wide Association Studies
3.2.5.2 Gene Knockouts
3.2.5.3 Biological Pathways
3.3 Ontologies
3.3.1 What are ontologies?
3.3.2 How are ontologies created, maintained, and improved?
3.3.3 Examples of ontologies
3.3.3.1 Gene Ontology
3.3.3.2 Uberon Ontology
3.3.3.3 Other Ontologies
3.3.4 Why are ontologies useful?
3.3.4.1 Term enrichment
3.3.5 File formats
3.5 Sources of bias in computational biology
3.5.1 Trusting the results of research
3.5.1.1 Science’s self-correcting mechanism
3.5.2 The reproducibility crisis
3.5.3.1 Null Hypothesis Significance Testing
3.5.3.2 P-hacking and HARKing
3.5.3.3 Publication bias
3.6 Proteome Quality Index
3.6.3 PQI features
3.6.2 PQI metrics
3.6.6 Potential improvements
3.7 Summary
Chapter 4: Phenotype prediction with Snowflake
4.1 Introduction
4.1.1 Motivation
4.1.2 Related work
4.1.2.1 Phenotype predictors and variant prioritisation
4.1.2.2 Clustering and outlier detection in genetics
4.1.2.3 Overcoming the curse of dimensionality through dimensionality reduction and feature selection
4.2 Snowflake Algorithm
4.2.1 Approach
4.2.2 How does it work?
4.2.2.1 SNPs are mapped to phenotype terms using DcGO and dbSNP
4.2.2.2 SNPs are given deleteriousness scores using FATHMM
4.2.2.3 Comparison to a background via clustering
4.2.4.5 Confidence score per phenotype
4.2.3 Functionality
4.2.4 Features added to the predictor
4.2.4.1 Different running modes
4.2.4.2 Adding SNP-phenotype associations from dbSNP
4.2.4.3 Dealing with missing calls
4.2.4.4 Reducing dimensionality
4.3 Creating Snowflake inputs
4.3.1 DcGO phenotype mapping file (human)
4.3.2 Background cohort
4.3.2.1 Data acquisition: the 1000 Genomes project
4.3.2.2 Create final input VCF
4.3.3 Consequence file
4.3.3.1 Run the Variant Effect Predictor tool
4.3.3.2 Query FATHMM and SUPERFAMILY for the SNPs of interest
4.3.3.3 Summary
4.3.4 Input cohort
4.3.4.1 23andMe file formats
4.3.4.2 Genome builds
4.4 Preprocessing
4.4.1 Combining VCF files, a.k.a. missing SNPs and ambiguous flips
4.4.1.1 Missing SNPs in VCF files
4.4.1.2 Ambiguous Flips
4.5 Considerations for Clustering SNPs
4.5.1 Combinations of SNPs
4.5.2 Choice of clustering methodology
4.5.2.1 Choice of distance metric
4.6 Discussion
4.6.1 Limitations
4.6.1.1 Genotype data
4.6.1.2 Equivalent terms
4.6.1.3 Coverage: Synonymous SNPs, nonsense and non-coding variants
4.6.1.4 Localised expression
4.6.2 Ethics self-assessment
Chapter 5: Predicting phenotypes of the ALSPAC cohort using Snowflake
5.1 Introduction
5.1.2 The ALSPAC cohort study
5.1.3 Experiment Design
5.1.3.1 Choosing phenotypes of interest
5.2 Discussion
5.2.1 Selection of phenotypes
5.2.2 Overlap between training and validation data
Chapter 6: Filtering computational predictions with tissue-specific expression information
6.1 Introduction
6.1.1 Motivation: improving phenotype and protein function prediction
6.1.2 When are transcripts “expressed”?
6.2 Algorithm
6.2.1 Overview
6.2.2 Inputs
6.2.2.1 Protein function predictions
6.2.2.2 Gene expression file
6.2.2.3 Sample-tissue map
6.2.3 Step 1: Preprocessing
6.2.4 Step 2: Filtering
6.3 Data
6.3.1 Expression data: FANTOM5
6.3.1.1 Data files and acquisition
6.3.1.2 Initial FANTOM5 data cleaning: sample info file
6.3.1.3 Initial FANTOM5 data cleaning: expression file
6.3.1.4 Exploratory Data Analysis
6.3.3 “Training” set: CAFA2
6.4 Validation method
6.4.1 Test set: CAFA3
6.4.2 Filip inputs for validation
6.4.2.1 Creating protein function predictions (DcGO)
6.4.3 Running Filip
6.4.4 Validation Methodology
6.4.4.1 Limitations of validation method
6.5 Filip results
6.5.1 CAFA2
6.5.2 CAFA3
6.6 Discussion and Future work
6.6.1 Coverage
6.6.1.1 Practical difficulties in finding and creating alternative input data
6.6.2 Wrongly filtered out tissues
6.6.3 Future work
6.6.3.1 Speed
6.6.3.2 Protein abundance
Chapter 7: Ontolopy
7.1 Introduction
7.1.1 Motivation
7.1.2 OBO files
7.1.2.1 Anatomy of an OBO file
7.1.3 Purpose
7.1.4 Other available tools
7.2 Functionality
7.2.1 Structure
7.2.2 Working with OBO ontologies
7.2.2.1 The Obo class
7.2.2.2 Merging ontologies
7.2.2.3 Loading ontologies from file
7.2.2.4 Downloading OBO files
7.2.3 Finding relationships
7.2.3.1 The Relations class
7.2.3.2 Converting “relation paths” to text
7.2.4 Creating Uberon Mappings
7.2.4.1 The Uberon class
7.2.4.2 Mapping from sample to tissue via name using Uberon.sample_map_by_name
7.2.4.3 Mapping from sample to tissue via ontology term using Uberon.sample_map_by_ont
7.2.4.4 Getting overall mappings and finding disagreements using Uberon.get_overall_tissue_mappings
7.3 Ontolopy tools and practices
7.3.1 Practices
7.3.2 Tools
7.4 Example uses: mapping samples to diseases or phenotypes
7.4.1 Inputs
7.4.1.1 FANTOM5
7.4.1.2 Uberon
7.4.2 Example 1: Finding disease-related samples
7.4.3 Example 2: Finding tissues that are capable of cell differentiation
7.5 Example use: mapping samples to tissue-related phenotypes
7.5.1 Creating sample-to-tissue mappings
7.5.1.1 Load data and pre-filter
7.5.1.2 Mapping by ontology
7.5.1.3 Mapping by name
7.5.1.4 Combining mappings
7.5.1.6 Mapping overview
7.5.2 Creating tissue-to-phenotype mappings
7.5.2.1 Propagating relationships up the tree using part_of
7.5.2.2 Propagating “down” the tree: has_part
7.5.2.3 Propagating down the tree: inverse of part_of
7.5.2.4 Combining previous mappings
7.5.3 Creating sample-to-tissue-phenotype mappings
7.5.3.1 Final mapping
7.6 Discussion
7.6.1 Usefulness
7.6.2 Usability
7.6.3 Limitations
7.6.3.1 You still need to understand the structure of the ontology
7.6.3.2 “Missing” functionality
7.6.3.3 Improving choosing from multiple synonym options
7.7 Future Work
7.7.1 v2.0.0
7.7.2 Other potential improvements to Ontolopy
7.7.2.1 Text-search and fuzzy-matching
7.7.2.2 Functionality for more complex queries
7.7.2.3 opy.Go
7.7.2.4 Integration with Pronto
7.7.2.5 Ontology validity
7.7.3 Miscellaneous
Chapter 8: Combining RNA-seq datasets
8.1 Introduction
8.1.1 Motivation
8.1.2 Challenges in combining gene expression data sets
8.1.2.1 Harmonising metadata
8.1.2.2 Batch effects
8.2 Data Acquisition
8.2.1 Criteria for choosing datasets
8.2.1.1 Gene expression vs protein abundance
8.2.1.2 Gene expression vs transcript expression
8.2.1.3 Inclusion of CAGE data
8.2.1.4 Excluding disease-focused experiments
8.2.2 Method of searching
8.2.3 Eligible data sets
8.2.3.1 FANTOM5
8.2.3.2 Human Protein Atlas
8.2.3.3 Genotype-Tissue Expression
8.2.3.4 Human Developmental Biology Resource
8.2.4 Data acquisition
8.3 Data Wrangling
8.3.1 Obtaining raw expression per gene for healthy human tissues
8.3.1.1 Mapping from transcript to gene
8.3.2 Mapping to UBERON
8.3.3 Aggregating Metadata
8.3.3.1 Tissue groups
8.3.4 Final Experimental Design
8.4 Results and discussion
8.4.1 Example: Tissue-specific expression comparison
8.4.2 Batch effects
8.4.3 Combining omics data sets is an opportunity to improve existing resources
8.4.4 Future Work
8.4.4.1 Mapping improvements
8.4.4.2 Batch effect removal
8.4.4.3 Tissue-specific vs cell-specific
Chapter 9: Concluding remarks
Appendix
Bibliography