4.6. Discussion¶
4.6.1. Limitations¶
4.6.1.1. Genotype data¶
Genotype chips contain only a small fraction of the known disease-causing variants. For example, 23andMe tests for only 3 of thousands of known variants on the BRCA1 and BRCA2 genes implicated in hereditary cancer.
4.6.1.2. Equivalent terms¶
Despite much development effort, there remain some idiosyncrasies to the predictor. For example, dcGO can map multiple terms to the same set of SNPs. These can sometimes be a diverse group of phenotypes that do not tend to co-occur in individuals; when this happens, it is unlikely that we can make a good prediction. A semantic similarity measure, such as GOGO[191] or Wang's method[192], could be used to check for this and to update the confidence score accordingly.
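To illustrate the idea, the sketch below downweights a prediction's confidence by the mean pairwise semantic similarity of the terms mapped to the same SNP set. This is not Snowflake's actual code: the `pairwise_sim` function stands in for a real measure such as GOGO or Wang's method, and the term similarities are toy values for demonstration only.

```python
# Sketch: scale a confidence score by the semantic coherence of a term set.
# `pairwise_sim` is a placeholder for a real similarity measure (e.g. Wang's).
from itertools import combinations

def mean_pairwise_similarity(terms, pairwise_sim):
    """Mean similarity over all unordered pairs of terms (1.0 for a single term)."""
    pairs = list(combinations(terms, 2))
    if not pairs:
        return 1.0
    return sum(pairwise_sim(a, b) for a, b in pairs) / len(pairs)

def adjusted_confidence(score, terms, pairwise_sim):
    """Downweight the original score when the mapped terms are diverse."""
    return score * mean_pairwise_similarity(terms, pairwise_sim)

# Toy similarity lookup (illustrative values, not real GO term similarities):
toy_sim = {frozenset({"HP:0001250", "HP:0002133"}): 0.9,  # two seizure terms
           frozenset({"HP:0001250", "HP:0000574"}): 0.1,  # seizure vs. eyebrow
           frozenset({"HP:0002133", "HP:0000574"}): 0.1}
sim = lambda a, b: toy_sim[frozenset({a, b})]

coherent = adjusted_confidence(1.0, ["HP:0001250", "HP:0002133"], sim)
diverse = adjusted_confidence(1.0, ["HP:0001250", "HP:0002133", "HP:0000574"], sim)
```

A coherent pair of terms keeps most of its confidence (0.9 here), while the diverse trio is downweighted to roughly a third.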
4.6.1.3. Coverage: Synonymous SNPs, nonsense and non-coding variants¶
There are also clearly many aspects of the molecular biology mentioned in chapter 2 that are not represented in the model used by the phenotype predictor: for example, nonsense mutations, synonymous SNPs, regulatory networks, and non-coding variants. Updating the predictor to include these could give it enough power to be validated on existing data sets.
For example, non-coding variants could be included by extending dcGO annotations to SNPs in linkage disequilibrium, and by using the non-coding version of FATHMM, FATHMM-XF[193].
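A minimal sketch of the LD-extension idea is given below. The annotation dictionary, LD tuples, and r² threshold are all illustrative assumptions about the data model, not Snowflake's real implementation.

```python
# Sketch: propagate dcGO-style annotations from coding SNPs to SNPs in
# strong linkage disequilibrium (LD) with them. Threshold is an assumption.

def extend_annotations(annotations, ld_pairs, r2_threshold=0.8):
    """Copy each annotated SNP's terms to SNPs in strong LD with it.

    annotations: dict of SNP id -> set of ontology terms
    ld_pairs: iterable of (snp_a, snp_b, r2) tuples
    """
    extended = {snp: set(terms) for snp, terms in annotations.items()}
    for a, b, r2 in ld_pairs:
        if r2 < r2_threshold:
            continue
        # LD is symmetric, so propagate in both directions where possible.
        for src, dst in ((a, b), (b, a)):
            if src in annotations:
                extended.setdefault(dst, set()).update(annotations[src])
    return extended

coding = {"rs1": {"GO:0006915"}}
ld = [("rs1", "rs_noncoding", 0.95), ("rs1", "rs_far", 0.2)]
extended_map = extend_annotations(coding, ld)
# rs_noncoding inherits GO:0006915; rs_far (weak LD) gains nothing
```

In practice the LD pairs would come from a reference panel (e.g. population-matched LD data), and the r² cutoff would need tuning.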
4.6.1.4. Localised expression¶
Another example is that dcGO does not take account of the environment of the cell (e.g. tissue-specific gene expression) in its predictions. Although a domain that is statistically associated with a phenotype may be present in a protein, there is no guarantee that the protein will be transcribed and so have the opportunity to affect the phenotype.
In investigating some of the ALSPAC phenotype predictions, I identified that some of the predicted dcGO relations between proteins and ontology terms may not be expressed in the tissue of interest. This makes sense, since dcGO makes predictions on the basis of structure, but it is common in molecular biology for cells, proteins, or genes to have theoretical functionality that is repressed or silenced by another mechanism: for example, most human transposable elements are silenced, or, in this case, repressors prevent gene expression in some cell types. Filtering out predictions for SNPs in these repressed genes is therefore a potential route to improve Snowflake, and this is the focus of the next part.
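The filtering step described above can be sketched as follows. This is an assumption about how such a filter might work, not existing Snowflake code: the gene names, TPM values, and the choice of a simple TPM cutoff are all illustrative.

```python
# Sketch: drop SNP-phenotype predictions whose gene is not expressed in the
# tissue relevant to the phenotype, using a simple TPM cutoff (assumption).

def filter_by_expression(predictions, expression, tpm_cutoff=1.0):
    """Keep only predictions whose gene is expressed in the relevant tissue.

    predictions: list of dicts with 'gene', 'tissue', and 'term' keys
    expression: dict of (gene, tissue) -> TPM value
    """
    return [p for p in predictions
            if expression.get((p["gene"], p["tissue"]), 0.0) >= tpm_cutoff]

preds = [{"gene": "GENE_A", "tissue": "brain", "term": "HP:0001250"},
         {"gene": "GENE_B", "tissue": "brain", "term": "HP:0001250"}]
tpm = {("GENE_A", "brain"): 25.0, ("GENE_B", "brain"): 0.1}
kept = filter_by_expression(preds, tpm)  # only the GENE_A prediction survives
```

A real version would need tissue-specific expression data (e.g. from a resource like GTEx) and a principled mapping from phenotype terms to relevant tissues.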
4.6.2. Ethics self-assessment¶
Your scientists were so preoccupied with whether or not they could, they didn't stop to think if they should. – Dr Ian Malcolm, Jurassic Park, Michael Crichton
Like the creation of dinosaurs, the Snowflake methodology itself (rather than a particular use of it) is not the sort of research that usually requires ethical review by Institutional Review Boards (IRBs). This is because most IRBs focus on issues of informed consent, data privacy, and other matters which could cause legal problems for universities, while Snowflake’s core methodology uses only publicly available data. As I previously mentioned, there are more general (wider, societal) ethical considerations relating to research in predicting phenotype.
With this in mind, I performed a self-assessment of the worst-case scenario outcomes of this research, in order to understand potential issues and to think about what precautions should be put in place to avoid them. These extend beyond this research itself, imagining future deployments. To do this, I used the Data Hazards framework, which is currently under development and which I am helping to develop with the data science research community. Table 4.2 contains the hazards that I felt applied to Snowflake, the reasons why, and what I recommend could be done to prevent these worst-case scenarios.
| Label name | Label description | Reason for applying | Relevant safety precautions |
|---|---|---|---|
| Contains Data Science | Data Science is being used in this output, and any negative outcomes of using this work are not the responsibility of "the algorithm" or "the software", but of the people using it. | Snowflake uses data, makes predictions, and uses unsupervised learning. | When Snowflake is deployed in new contexts (e.g. patent licenses sold), it should be done with the understanding that the licensee becomes accountable for using it responsibly. |
| Reinforces existing biases | Reinforces unfair treatment of individuals and groups. This may be due to, for example, input data, algorithm or software design choices, or society at large. | The project does not check that the algorithm works just as well for non-white races, and we would expect it to work less well for them since they are less represented in the input data linking variants and diseases[194]. | Snowflake's efficacy should be tested separately for each demographic that any deployment may affect. |
| Ranks or classifies people | Rankings and classifications of people are hazards in their own right and should be handled with care. | The project does not check that the algorithm works just as well for minority groups, who are less likely to be represented in the input data linking variants and diseases. | |
| Lacks Community Involvement | This applies when technology is being produced without input from the community it is supposed to serve. | The communities of people with the phenotypes have no current involvement in this process. | Relevant communities should be asked about their feelings towards phenotype prediction before deployment, in order to curate a list of appropriate phenotype terms. |
| Danger of misuse | There is a danger of misusing the algorithm, technology, or data collected as part of this work. | The phenotype predictor is not expected to be accurate for all phenotypes. It could even be used to try to predict phenotypes that are caused by the environment or by regions of DNA it does not consider, if these are defined as genetic phenotypes in other literature. | If deployed outside of research, it should first be established which types of phenotypes Snowflake does and does not work for. |
| Difficult to understand | There is a danger that the technology is difficult to understand. This could be because the technology itself is hard to interpret (e.g. neural nets), or because of its implementation (i.e. the code is hidden and we cannot see exactly what it is doing). | Snowflake doesn't use "black-box" machine learning (e.g. deep learning), but it has closed-source code and a complicated data pipeline. | |
| Privacy hazard | This technology may risk the privacy of individuals whose data is processed by it. | Individuals' genetic data is required to run the phenotype predictor. This carries many privacy risks, for example identification, use by insurers, or being contacted by unknown relatives. | |
Despite the tongue-in-cheek use of the Jurassic Park quote opening this subsection, I do think that phenotype prediction is something that we should attempt, due to its potential to help people. In "stop[ping] to think" about it, however, I applied 7 of the 11 existing data hazard labels, and set out some specific precautions for using it that I hope will be seriously considered by anyone taking the method further. While some of these may seem far-fetched, Snowflake has already been trialled by a genomic analysis company for use in clinical decision support.
The question of whether we "could" predict phenotype accurately is also a major ethical barrier to using it at present. Currently, it is not clear to what extent, or for which types of variants, the phenotype predictor works. The next chapter explains my attempts to validate the predictor using the ALSPAC dataset.