4.6. Discussion

4.6.1. Limitations

4.6.1.1. Genotype data

Genotype chips contain only a small fraction of the known disease-causing variants. For example, 23andMe tests for only 3 of the thousands of known variants in the BRCA1 and BRCA2 genes implicated in hereditary cancer.

4.6.1.2. Equivalent terms

Despite much development effort, some idiosyncrasies remain in the predictor. For example, dcGO can map multiple terms to the same set of SNPs. These can sometimes be a diverse group of phenotypes which do not tend to co-occur in individuals, and when this occurs it is likely that we cannot make a good prediction. A semantic similarity measure, such as GOGO[191] or Wang’s method[192], could be used to check for this, and the confidence score updated accordingly.
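To illustrate how such a check could work, the sketch below implements Wang’s graph-based semantic similarity over a hypothetical toy ontology fragment. For simplicity it assumes every edge is an is_a relation with the usual semantic contribution factor of 0.8; a real implementation would parse the full ontology DAG and distinguish edge types.

```python
# Sketch of Wang's semantic similarity over a toy term DAG.
# Assumption: every edge is an is_a relation with contribution factor 0.8.

def s_values(term, parents, w=0.8):
    """S-values: the semantic contribution of each ancestor of `term`.

    The term itself contributes 1.0; each step towards the root decays
    the contribution by `w`, keeping the maximum over all paths.
    """
    S = {term: 1.0}
    stack = [term]
    while stack:
        t = stack.pop()
        for p in parents.get(t, []):
            v = w * S[t]
            if v > S.get(p, 0.0):  # keep the best (maximum) contribution
                S[p] = v
                stack.append(p)
    return S

def wang_similarity(a, b, parents, w=0.8):
    """Similarity = shared S-values over total S-values of both terms."""
    Sa, Sb = s_values(a, parents, w), s_values(b, parents, w)
    common = set(Sa) & set(Sb)
    if not common:
        return 0.0
    return sum(Sa[t] + Sb[t] for t in common) / (
        sum(Sa.values()) + sum(Sb.values()))

# Hypothetical toy DAG: child -> list of parents.
parents = {
    "root": [],
    "B": ["root"], "C": ["root"],
    "D": ["B"], "E": ["B", "C"],
}
print(wang_similarity("D", "E", parents))  # siblings sharing "B": ~0.507
```

A set of terms mapped to the same SNPs could then be flagged as incoherent if their pairwise similarities fall below some threshold, and the confidence score lowered accordingly.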

4.6.1.3. Coverage: Synonymous SNPs, nonsense and non-coding variants

There are also clearly many aspects of the molecular biology mentioned in chapter 2 that are not represented in the model used by the phenotype predictor: for example, nonsense mutations, synonymous SNPs, regulatory networks, and non-coding variants. Updating the predictor to include these could potentially give it enough power to be validated on existing data sets.

For example, non-coding variants could be included by extending dcGO annotations to SNPs in linkage disequilibrium, and using the non-coding version of FATHMM, FATHMM-XF[193].
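A first pass at this could simply copy each SNP’s dcGO-derived annotations to any SNP in strong LD with it. The sketch below is a minimal illustration of that idea; the data structures (an annotation dict and a list of pairwise r² values) and the 0.8 threshold are my own assumptions, not part of Snowflake.

```python
# Hypothetical sketch: propagate term annotations between SNPs in strong LD.

def extend_annotations(annotations, ld_pairs, r2_threshold=0.8):
    """annotations: {snp: set of term IDs}.
    ld_pairs: iterable of (snp_a, snp_b, r_squared) tuples.
    Returns a new dict in which SNPs in strong LD share each other's terms.
    """
    extended = {snp: set(terms) for snp, terms in annotations.items()}
    for a, b, r2 in ld_pairs:
        if r2 >= r2_threshold:
            extended.setdefault(a, set()).update(annotations.get(b, set()))
            extended.setdefault(b, set()).update(annotations.get(a, set()))
    return extended

annotations = {"rs123": {"GO:0005102"}}
ld_pairs = [("rs123", "rs456", 0.95),   # strong LD: annotation is copied
            ("rs123", "rs789", 0.40)]   # weak LD: pair is ignored
extended = extend_annotations(annotations, ld_pairs)
print(extended["rs456"])  # {'GO:0005102'}
```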

4.6.1.4. Localised expression

Another example is that dcGO does not take account of the environment of the cell (e.g. tissue-specific gene expression) in its predictions. Although domains which are statistically associated with a phenotype can be present in a protein, there is no guarantee that the protein will have the opportunity to impact the phenotype, i.e. that the gene encoding it will be transcribed.

In investigating some of the ALSPAC phenotype predictions, I identified that some of the dcGO-predicted relations between proteins and ontology terms may involve genes that are not expressed in the tissue of interest. This makes sense, since dcGO makes predictions on the basis of structure, but it is common in molecular biology for cells, proteins or genes to have theoretical functionality that is repressed or silenced by another mechanism: for example, most human transposable elements are silenced, and in this case, repressors prevent gene expression in some cell types. Filtering out predictions for SNPs in these repressed genes is therefore a potential route to improving Snowflake, and this is the focus of the next part.
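To give a concrete flavour of such a filtering step, the sketch below drops predictions whose gene falls below an expression cutoff in the tissue of interest. The record format, the (gene, tissue) → TPM lookup, and the 1.0 TPM cutoff are illustrative assumptions only.

```python
# Hypothetical sketch: filter predictions by tissue-specific expression.

def filter_by_expression(predictions, expression, tissue, tpm_cutoff=1.0):
    """predictions: iterable of (snp, gene, term) triples.
    expression: {(gene, tissue): median TPM}; missing genes count as 0.
    Keeps only predictions whose gene is expressed in the given tissue.
    """
    return [
        (snp, gene, term)
        for snp, gene, term in predictions
        if expression.get((gene, tissue), 0.0) >= tpm_cutoff
    ]

predictions = [("rs1", "GENE_A", "HP:0000001"),
               ("rs2", "GENE_B", "HP:0000002")]
expression = {("GENE_A", "brain"): 5.2,   # expressed in brain: kept
              ("GENE_B", "brain"): 0.1}   # repressed in brain: dropped
print(filter_by_expression(predictions, expression, "brain"))
```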

4.6.2. Ethics self-assessment

Your scientists were so preoccupied with whether or not they could, they didn’t stop to think if they should. – Dr Ian Malcolm, Jurassic Park, Michael Crichton

Like the creation of dinosaurs, the Snowflake methodology itself (rather than a particular use of it) is not the sort of research that usually requires ethical review by Institutional Review Boards (IRBs). This is because most IRBs focus on issues of informed consent, data privacy, and other matters which could cause legal problems for universities, while Snowflake’s core methodology uses only publicly available data. As I previously mentioned, there are more general (wider, societal) ethical considerations relating to research in predicting phenotype.

With this in mind, I performed a self-assessment of the worst-case scenario outcomes of this research, in order to understand potential issues and to think about what precautions should be put in place to avoid them. These extend beyond this research itself, imagining future deployments. To do this, I used the Data Hazards framework: a framework which I am currently developing with the data science research community. Table 4.2 contains the hazards that I felt applied to Snowflake, the reasons why, and what I recommend could be done to prevent these worst-case scenarios.

Table 4.2 The seven data hazards which I assessed as applying to Snowflake, with each hazard’s label description, the reason for applying it, and the relevant safety precautions.

Contains Data Science
Label description: Data Science is being used in this output, and any negative outcomes of using this work are not the responsibility of “the algorithm” or “the software”, but of the people using it.
Reason for applying: Snowflake uses data, makes predictions, and uses unsupervised learning.
Safety precautions: When Snowflake is deployed in new contexts (e.g. patent licenses sold), it should be done with the understanding that the licensee becomes accountable for using it responsibly.

Reinforces existing biases
Label description: Reinforces unfair treatment of individuals and groups. This may be due to, for example, input data, algorithm or software design choices, or society at large.
Reason for applying: The project does not check that the algorithm works equally well for non-white ancestry groups, and we would expect it to work less well for them since they are less represented in the input data linking variants and diseases[194].
Safety precautions: Snowflake’s efficacy should be tested separately for each demographic that any deployment may affect.

Ranks or classifies people
Label description: Rankings and classifications of people are hazards in their own right and should be handled with care.
Reason for applying: The project does not check that the algorithm works equally well for minority groups, who are less likely to be represented in the input data linking variants and diseases.
Safety precautions:
  • Snowflake’s efficacy should be tested separately for minority groups before deployment outside research (e.g. in healthcare).
  • Appropriate phenotype terms should be curated before deployment (e.g. removing social behaviours, “intelligence”-related terms, etc.).
  • When or whether to share rankings should be considered carefully.

Lacks Community Involvement
Label description: This applies when technology is being produced without input from the community it is supposed to serve.
Reason for applying: The communities of people with the phenotypes have no current involvement in this process.
Safety precautions: Relevant communities should be asked about their feelings towards phenotype prediction before deployment, in order to curate a list of appropriate phenotype terms.

Danger of misuse
Label description: There is a danger of misusing the algorithm, technology, or data collected as part of this work.
Reason for applying: The phenotype predictor is not expected to be accurate for all phenotypes, but it could even be used to try to predict phenotypes that are caused by the environment or by regions of DNA it does not consider, if these are defined as genetic phenotypes in other literature.
Safety precautions: If deployed outside of research, Snowflake should first be tested on different types of phenotypes, so that the ones it does work for are understood.

Difficult to understand
Label description: There is a danger that the technology is difficult to understand. This could be because the technology itself is hard to interpret (e.g. neural nets), or because of its implementation (i.e. code is hidden and we are not allowed to see exactly what it is doing).
Reason for applying: Snowflake doesn’t use “black-box” machine learning (e.g. deep learning), but has closed source code and a complicated data pipeline.
Safety precautions:
  • If published for research, the code should be open-sourced, thoroughly documented, and tested.
  • If provided to members of the public, explainers should be created similar to those that 23andMe has.

Privacy hazard
Label description: This technology may risk the privacy of individuals whose data is processed by it.
Reason for applying: Individuals’ genetic data is required to run the phenotype predictor. This has many privacy risks, for example identification, use by insurers, or being contacted by unknown relatives.
Safety precautions:
  • Ensure there is explicit and well-informed consent from any future participants/users.
  • Store data securely.

Despite the tongue-in-cheek use of the Jurassic Park quote opening this subsection, I do think that phenotype prediction is something that we should attempt, due to its potential to help people. In “stop[ping] to think” about it, however, I applied 7 of the 11 existing data hazard labels, and set out some specific precautions for using it that I hope will be seriously considered by anyone taking the method further. While some of these may seem far-fetched, Snowflake has already been trialled by a genomic analysis company for use in clinical decision support.

The question of whether we “could” predict phenotype accurately also remains a major ethical barrier to using it at present: it is currently not clear to what extent, or for which types of variants, the phenotype predictor works. The next chapter explains my attempts to validate the predictor using the ALSPAC dataset.