4.7. Discussion

4.7.1. Background

  • The 1000 Genomes Project has different priorities to Snowflake: it does not prioritise rare SNPs, which are the variants most likely to cause rare diseases.

  • It provides as diverse a background set as we could obtain, but it is still not very diverse.

  • The size and diversity of the background set constrains how many SNPs we can include.

  • As with PQI, the results are very sensitive to our choice of background set; this sensitivity remains to be tested (one possible test is sketched below).
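
One way to test this sensitivity would be to bootstrap-resample the background cohort, re-score the cohort against each resample, and measure how stable the resulting ranks are. A minimal sketch follows; `score_cohort` is a hypothetical stand-in for Snowflake's scoring pipeline, not an existing interface.

```python
# A sketch of a background-sensitivity test, assuming genotypes are held
# in an (n_individuals x n_snps) array. `score_cohort` is a hypothetical
# stand-in for Snowflake's scoring pipeline.
import numpy as np
from scipy.stats import spearmanr

def rank_stability(genotypes, score_cohort, n_boot=100, seed=0):
    """Mean Spearman correlation between baseline scores and scores
    computed against bootstrap-resampled background cohorts."""
    rng = np.random.default_rng(seed)
    n = genotypes.shape[0]
    baseline = score_cohort(genotypes, background=genotypes)
    correlations = []
    for _ in range(n_boot):
        resampled = genotypes[rng.integers(0, n, size=n)]
        scores = score_cohort(genotypes, background=resampled)
        correlations.append(spearmanr(baseline, scores)[0])
    return float(np.mean(correlations))  # near 1.0 suggests robustness
```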

4.7.2. Difficulty in finding a test set

The Snowflake project could be considered “blue-sky”, curiosity-led research. The motivation for creating Snowflake was our curiosity about whether the resources of computational biology could be used for the practical outcome of creating phenotype predictions. This was far from incremental, since other leading approaches predicted phenotypes on a phenotype-by-phenotype basis, or restricted the problem to prioritising variants. We can only test Snowflake on data sets with both genetic and phenotypic information across many phenotypes, which means it is very difficult to test conclusively (we have very low statistical power over all phenotypes).

It is disappointing that the phenotype predictor does not produce statistically significant results. However, the phenotype predictor may yet be useful for revealing candidate SNPs for certain kinds of diseases, and when a suitable data set becomes available (e.g. through the growing number of publicly available genotypes on platforms such as OpenSNP[209]), this method will still be ready to be tested. An alternative validation would be to experimentally test a prediction (e.g. with knockouts) for a phenotype with a particularly interesting distribution of scores.


4.7.3. Limitations

4.7.3.1. Genotype data

Genotype chips contain only a small fraction of the known disease-causing variants. For example, 23andMe tests for only 3 of the thousands of known variants in the BRCA1 and BRCA2 genes implicated in hereditary cancer.

4.7.3.2. Equivalent terms

Despite much development effort, there remain some idiosyncrasies in the predictor. For example, dcGO can map multiple terms to the same set of SNPs. These can sometimes be a diverse group of phenotypes that do not tend to co-occur in individuals, and when this happens it is likely that we cannot make a good prediction. A semantic similarity measure, such as GOGO[210] or Wang’s method[211], could be used to check for this, and the confidence score updated accordingly. Constraining dcGO to use only more closely related species, rather than the whole tree of life, might also be preferable for this task; however, this would be a trade-off, as the change would also reduce the predictor’s coverage of both phenotypes and variants.
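
As an illustration, below is a minimal sketch of Wang's method over a toy ontology. The terms are purely illustrative; a real implementation would load the GO or phenotype ontology DAG (e.g. from an OBO file). The weights 0.8 and 0.6 are the semantic contribution factors Wang's method conventionally assigns to “is_a” and “part_of” edges.

```python
# A minimal sketch of Wang's semantic similarity over a toy ontology.
# TERM_PARENTS maps each term to (parent, semantic contribution factor)
# pairs; a real implementation would load the ontology DAG from an OBO
# file rather than hard-coding it.
TERM_PARENTS = {
    "T:A": [],
    "T:B": [],
    "T:C": [("T:A", 0.8)],
    "T:D": [("T:A", 0.8), ("T:B", 0.6)],
    "T:E": [("T:C", 0.8), ("T:D", 0.8)],
}

def s_values(term):
    """S-value of `term` and each ancestor: the maximum product of edge
    weights along any path from `term` up to that ancestor."""
    s = {term: 1.0}
    queue = [term]
    while queue:
        child = queue.pop()
        for parent, weight in TERM_PARENTS[child]:
            contribution = s[child] * weight
            if contribution > s.get(parent, 0.0):
                s[parent] = contribution
                queue.append(parent)
    return s

def wang_similarity(a, b):
    sa, sb = s_values(a), s_values(b)
    shared = set(sa) & set(sb)
    return sum(sa[t] + sb[t] for t in shared) / (sum(sa.values()) + sum(sb.values()))

# Terms mapped to the same SNP set could be treated as equivalent only
# when their pairwise similarity is high:
print(round(wang_similarity("T:E", "T:D"), 3))  # 0.706
```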

4.7.3.3. Coverage of variants: Synonymous SNPs, nonsense and non-coding variants

There are also clearly many aspects of the molecular biology mentioned in chapter 2 that are not represented in the model used by the phenotype predictor, for example nonsense mutations, synonymous SNPs, regulatory networks, and non-coding variants. Updating the predictor to include these could potentially give it enough power to be validated on existing data sets.

For example, non-coding variants could be included by extending dcGO annotations to SNPs in linkage disequilibrium with annotated variants, and scoring them using the non-coding version of FATHMM, FATHMM-XF[212].
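
A hypothetical sketch of the annotation-extension step is below. Both inputs are assumptions rather than part of the current pipeline: `annotations` maps SNPs to phenotype terms (from dcGO via protein domains), and `ld_partners` maps each SNP to the SNPs in LD with it (e.g. r² ≥ 0.8, precomputed with a tool such as PLINK). Non-coding partners picked up this way would then be scored with FATHMM-XF.

```python
# A hypothetical sketch of extending dcGO SNP annotations to variants in
# linkage disequilibrium. Both inputs are assumed, not part of
# Snowflake's current pipeline.
def extend_annotations(annotations, ld_partners):
    """Copy each annotated SNP's phenotype terms onto its LD partners."""
    extended = {snp: set(terms) for snp, terms in annotations.items()}
    for snp, terms in annotations.items():
        for partner in ld_partners.get(snp, ()):
            extended.setdefault(partner, set()).update(terms)
    return extended

# Toy example: rs2 is non-coding but in LD with the annotated rs1.
print(extend_annotations({"rs1": {"HP:0001250"}}, {"rs1": ["rs2"]}))
```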

4.7.3.4. Localised expression

Another example is that dcGO does not take account of the environment of the cell (e.g. tissue-specific gene expression) in its predictions. Although a domain that is statistically associated with a phenotype can be present in a protein, there is no guarantee that the protein will have the opportunity to impact the phenotype, i.e. that its gene will actually be transcribed in the relevant tissue.

In investigating some of the ALSPAC phenotype predictions, I identified that some of the predicted dcGO relations between proteins and ontology terms may not be expressed in the tissue of interest. This makes sense: dcGO makes predictions on the basis of structure, but it is common in molecular biology for cells, proteins, or genes to have theoretical functionality that is repressed or silenced by another mechanism. For example, most human transposable elements are silenced, and in this case, repressors prevent gene expression in some cell types. Filtering out predictions for SNPs in such repressed genes is therefore a potential route to improve Snowflake, and this is the focus of the next part.
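
A minimal sketch of such a filter is below, assuming a table of median expression per gene per tissue is available (e.g. GTEx median TPM). The `gene` column, tissue names, and the 1 TPM cutoff are illustrative assumptions, not calibrated choices.

```python
# A sketch of an expression filter. `expression` is assumed to be a
# DataFrame of median expression indexed by gene with one column per
# tissue (e.g. GTEx median TPM); `predictions` has a "gene" column.
import pandas as pd

def filter_by_expression(predictions: pd.DataFrame,
                         expression: pd.DataFrame,
                         tissue: str,
                         min_tpm: float = 1.0) -> pd.DataFrame:
    """Drop predictions whose gene is not expressed in `tissue`."""
    expressed = expression.index[expression[tissue] >= min_tpm]
    return predictions[predictions["gene"].isin(expressed)]
```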

4.7.4. Ethics self-assessment

Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should. – Dr Ian Malcolm, Jurassic Park, Michael Crichton

Like the creation of dinosaurs, the Snowflake methodology itself (rather than a particular use of it) is not the sort of research that usually requires ethical review by Institutional Review Boards (IRBs). This is because most IRBs focus on issues of informed consent, data privacy, and other matters that could cause legal problems for universities, while Snowflake’s core methodology uses only publicly available data. As I mentioned previously, however, there are more general (wider, societal) ethical considerations relating to research in phenotype prediction.

With this in mind, I performed a self-assessment of the worst-case scenario outcomes of this research, in order to understand potential issues and consider what precautions should be put in place to avoid them. These extend beyond this research itself, imagining future deployments. To do this, I used the Data Hazards framework, which I am currently helping to develop with the data science research community. Table 4.2 contains the hazards that I felt applied to Snowflake, the reasons why, and what I recommend could be done to prevent these worst-case scenarios.

Table 4.2 The seven data hazards which I assessed as applying to Snowflake.

Label: Contains Data Science
Description: Data Science is being used in this output, and any negative outcomes of using this work are not the responsibility of “the algorithm” or “the software”, but of the people using it.
Reason for applying: Snowflake uses data, makes predictions, and uses unsupervised learning.
Relevant safety precautions: When Snowflake is deployed in new contexts (e.g. patent licenses sold), it should be done with the understanding that the licensee becomes accountable for using it responsibly.

Label: Reinforces existing biases
Description: Reinforces unfair treatment of individuals and groups. This may be due to, for example, input data, algorithm or software design choices, or society at large.
Reason for applying: The project does not check that the algorithm works equally well for non-white groups, and we would expect it to work less well for them, since they are under-represented in the input data linking variants and diseases[213].
Relevant safety precautions: Snowflake’s efficacy should be tested separately for each demographic that any deployment may affect.

Label: Ranks or classifies people
Description: Rankings and classifications of people are hazards in their own right and should be handled with care.
Reason for applying: The project does not check that the algorithm works equally well for minority groups, who are less likely to be represented in the input data linking variants and diseases.
Relevant safety precautions:

  • Snowflake’s efficacy should be tested separately for minority groups before deployment outside research (e.g. in healthcare).

  • Appropriate phenotype terms should be curated before deployment (e.g. removing social behaviours, “intelligence”-related terms, etc.).

  • When or whether to share rankings should be considered carefully.

Label: Lacks Community Involvement
Description: This applies when technology is being produced without input from the community it is supposed to serve.
Reason for applying: The communities of people with the phenotypes have no current involvement in this process.
Relevant safety precautions: Relevant communities should be asked about their feelings towards phenotype prediction before deployment, in order to curate a list of appropriate phenotype terms.

Label: Danger of misuse
Description: There is a danger of misusing the algorithm, technology, or data collected as part of this work.
Reason for applying: The phenotype predictor is not expected to be accurate for all phenotypes, but it could nonetheless be used to try to predict phenotypes that are caused by the environment, or by regions of DNA it does not consider, if these are described as genetic phenotypes elsewhere in the literature.
Relevant safety precautions: If deployed outside of research, Snowflake should first be tested on different types of phenotypes, so that the ones it does work for are understood.

Label: Difficult to understand
Description: There is a danger that the technology is difficult to understand. This could be because the technology itself is hard to interpret (e.g. neural nets), or because of its implementation (e.g. the code is hidden and we are not allowed to see exactly what it is doing).
Reason for applying: Snowflake doesn’t use “black-box” machine learning (e.g. deep learning), but it has closed source code and a complicated data pipeline.
Relevant safety precautions:

  • If published for research, the code should be open-sourced, thoroughly documented, and tested.

  • If provided to members of the public, explainers should be created, similar to those that 23andMe provides.

Label: Privacy hazard
Description: This technology may risk the privacy of individuals whose data is processed by it.
Reason for applying: Individuals’ genetic data is required to run the phenotype predictor. This carries many privacy risks, for example identification, use by insurers, or being contacted by unknown relatives.
Relevant safety precautions:

  • Ensure there is explicit and well-informed consent from any future participants/users.

  • Store data securely.

Despite the tongue-in-cheek use of the Jurassic Park quote opening this subsection, I do think that phenotype prediction is something we should attempt, due to its potential to help people. In “stop[ping] to think” about it, however, I applied 7 of the 11 existing data hazard labels, and set out some specific precautions that I hope will be seriously considered by anyone using the method further. While some of these may seem far-fetched, Snowflake has already been trialled by a genomic analysis company for use in clinical decision support.

The question of whether we “could” predict phenotype accurately is also a major ethical barrier to using it at present: it is currently not clear to what extent, or for which types of variants, the phenotype predictor works. The next chapter explains my attempts to validate the predictor using the ALSPAC dataset.

4.7.5. Future work

4.7.5.1. Dependencies, interoperability & simulating data

The phenotype predictor relies heavily on all of its input data: dcGO, FATHMM, and the background cohort. dcGO decides which SNPs we consider at all for a phenotype; FATHMM decides to what extent SNPs within that set would be interesting if we see a rare combination; and the background cohort defines how rare a combination appears. A limitation of this design is that it is hard to test Snowflake’s approach to combining these types of data and clustering independently of these inputs.

I believe a synthetic (simulated) dataset would be important for testing any future iteration of Snowflake.
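
A minimal sketch of what such a simulation could look like is below, under deliberately simplified assumptions: 50 phenotype-linked SNPs with known alt-allele frequencies, Hardy-Weinberg genotypes for the background cohort, and one “case” individual made homozygous for the 5 rarest alleles. The rarity score is a crude stand-in for Snowflake’s actual scoring, and all names and parameters are illustrative.

```python
# A sketch of a simulated test set with a planted signal. The rarity
# score below is a crude stand-in for Snowflake's scoring pipeline.
import numpy as np

rng = np.random.default_rng(42)
n_background, n_snps = 1000, 50

freqs = rng.uniform(0.05, 0.5, n_snps)                       # alt-allele frequencies
background = rng.binomial(2, freqs, (n_background, n_snps))  # genotypes: 0/1/2

case = rng.binomial(2, freqs)
causal = np.argsort(freqs)[:5]   # the 5 rarest alt alleles
case[causal] = 2                 # homozygous rare at the causal sites

def rarity_score(genotypes, freqs):
    """-log probability of each genotype under Hardy-Weinberg, summed."""
    probs = np.choose(genotypes, [(1 - freqs) ** 2,
                                  2 * freqs * (1 - freqs),
                                  freqs ** 2])
    return -np.log(probs).sum(axis=-1)

# Any reasonable scorer should rank the planted case above nearly all of
# the background cohort for this simulated phenotype:
print(rarity_score(case, freqs) > np.percentile(rarity_score(background, freqs), 99))
```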

4.7.5.2. Update input components

All three of Snowflake’s input components (dcGO, FATHMM and the background cohort) have many possible choices - and while it is most important to find good test data, should that be found, finding the best choices of components would be a priority.

For example, Snowflake uses FATHMM-MKL rather than the newer and much more accurate FATHMM-XF. FATHMM-MKL is also constrained to build 37 (GRCh37) of the human reference genome, which is no longer the current assembly.
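
Any move to FATHMM-XF would therefore also involve lifting existing GRCh37 coordinates over to GRCh38. A minimal sketch of one way to do this, using the pyliftover package (CrossMap or UCSC’s liftOver tool are alternatives; the coordinate shown is illustrative):

```python
# A sketch of lifting GRCh37 (hg19) coordinates over to GRCh38 with
# pyliftover. Note that pyliftover positions are 0-based.
from pyliftover import LiftOver

lo = LiftOver("hg19", "hg38")  # fetches the UCSC chain file on first use

def lift(chrom, pos):
    """Return (chrom, pos) on GRCh38, or None if the locus doesn't map."""
    hits = lo.convert_coordinate(chrom, pos)
    return hits[0][:2] if hits else None

print(lift("chr1", 1000000))
```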

4.7.6. Conclusions

While the results of my application of Snowflake to ALSPAC were disappointing, my technical contributions to Snowflake included finding and fixing crucial bugs, which allowed it to go on to its latest and most successful iteration as Nomaly[165].

Snowflake (and Nomaly) represents a highly novel approach to phenotype and variant function prediction, and it is possible that its limitations can be overcome as new datasets become available.