Splice Site Predictor
Mark Wadsworth; mew225@gmail.com; Route Y ID: mwadswor; Mentor: Dr. Perry G. Ridge, Biology
Abstract
One unintended consequence of the advent of next-generation sequencing is the sheer number of genomic variations requiring interpretation. Mutations in splice sites have been shown to contribute to the development of cancer [1] and dementia [2], among other potentially deadly disorders. Roughly 14 million people are diagnosed with cancer every year [3], and roughly 7 million with dementia [4]. Because these diseases cause an incredible amount of suffering, scientists in all fields are driven to search for ways to identify and treat them. Splice site variants have been particularly difficult to interpret and have been largely ignored by the bioinformatics community; most programs used for predicting the effects of these variants are out of date.
This software will give researchers a great advantage in studying the effects of splice site variants, because they will be able to see which variants are most likely to affect splicing and in what ways. Although knowing definitively how a variant affects splicing requires experimentation, this algorithm will help researchers focus on the most destructive variants, saving them both time and money. The code for this project will be freely available to the community, so researchers can readily identify variants of interest and predict their biological significance. Although I did not finish the project, it has been handed over to another lab member to complete.
Methods
I wrote the algorithm in Java. I started in Python, but as the program became more complex I realized that I needed a faster, more versatile programming language, and Java was the logical choice. I started by using ANNOVAR [5] to convert files and then annotate the variants in the genome. I then parsed that information and used it in the analysis. Next, I used the Maximum Entropy Scan algorithm to score the splicing variants. I used all of this information to decide whether each variant is most probably damaging. We still need to link the program to the Protein Database and PubMed to give users more information. We are planning to present this software at the BIOT conference this fall.
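The annotate, parse, and score steps described above can be pictured with a minimal Java sketch. It is not the project's actual code: the class and method names, the simplified four-column tab-separated row layout, and the toy scorer are all assumptions made for illustration. In particular, the toy scorer only checks for a canonical "GT" in an assumed 9-base donor window and stands in for the Maximum Entropy score, which it does not implement.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleFunction;

/** Illustrative sketch only; names and file layout are assumptions, not the project's code. */
class SpliceVariantSketch {

    /** One annotated variant with its reference and mutated splice-site windows. */
    static final class Variant {
        final String gene;
        final String refSite;
        final String altSite;

        Variant(String gene, String refSite, String altSite) {
            this.gene = gene;
            this.refSite = refSite;
            this.altSite = altSite;
        }
    }

    /**
     * Keeps only rows annotated as "splicing".
     * Assumed (hypothetical) tab-separated layout: annotation, gene, refSite, altSite.
     */
    static List<Variant> parseSplicing(List<String> rows) {
        List<Variant> kept = new ArrayList<>();
        for (String row : rows) {
            String[] f = row.split("\t");
            if (f.length >= 4 && f[0].equals("splicing")) {
                kept.add(new Variant(f[1], f[2], f[3]));
            }
        }
        return kept;
    }

    /** Score change caused by the variant; a negative value means a weaker site. */
    static double scoreDelta(Variant v, ToDoubleFunction<String> scorer) {
        return scorer.applyAsDouble(v.altSite) - scorer.applyAsDouble(v.refSite);
    }

    /**
     * Toy placeholder scorer, NOT the Maximum Entropy model: returns 1.0 when an
     * assumed 9-base donor window (3 exon + 6 intron bases) has "GT" at the first
     * two intron positions, otherwise 0.0.
     */
    static double toyDonorScore(String window) {
        return window.length() >= 5 && window.startsWith("GT", 3) ? 1.0 : 0.0;
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList(
            "splicing\tGENE1\tCAGGTAAGT\tCAGATAAGT",
            "exonic\tGENE2\tCAGGTAAGT\tCAGGTAAGT");
        for (Variant v : parseSplicing(rows)) {
            double delta = scoreDelta(v, SpliceVariantSketch::toyDonorScore);
            System.out.println(v.gene + "\t" + delta);
        }
    }
}
```

Passing the scorer in as a function keeps the filtering and comparison logic independent of the scoring model, so a real splice-site scorer could replace the placeholder without changing the rest of the sketch.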
Results and Discussion
One of the main difficulties we found was the lack of reliable data on splice sites. All of the available data is at least a decade old and thus cannot be relied upon. A future line of work that would really benefit the community would be to test empirically which splice site variants actually damage the function of the protein. If that dataset were then made publicly available, we would be able to understand a lot more about how splice site variants actually work.
This difficulty meant that we were unable to say decisively whether a variant was damaging. All we can say is that it is most likely damaging or not likely damaging, with varying degrees of certainty across that spread.
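One way to express such a hedged call in code is to return a label together with a rough certainty value rather than a yes/no answer. The sketch below is only an illustration under stated assumptions: the class name, the caller-supplied cutoff, and the certainty formula (distance of the score drop from the cutoff, capped at 1) are hypothetical and are not taken from the project.

```java
/** Illustrative sketch only; the cutoff and certainty formula are assumptions. */
class SpliceCallSketch {

    /** A hedged call: a label plus a rough certainty between 0 and 1. */
    static final class Call {
        final String label;
        final double certainty;

        Call(String label, double certainty) {
            this.label = label;
            this.certainty = certainty;
        }
    }

    /**
     * Labels a variant from the drop in splice-site score (reference minus variant).
     * Certainty grows with the distance of the drop from the cutoff and is capped at 1.
     */
    static Call classify(double refScore, double altScore, double cutoff) {
        if (cutoff <= 0) {
            throw new IllegalArgumentException("cutoff must be positive");
        }
        double drop = refScore - altScore;
        String label = drop >= cutoff ? "most likely damaging" : "not likely damaging";
        double certainty = Math.min(1.0, Math.abs(drop - cutoff) / cutoff);
        return new Call(label, certainty);
    }

    public static void main(String[] args) {
        // Hypothetical scores and cutoff, chosen only to exercise the method.
        Call call = classify(8.0, 2.0, 3.0);
        System.out.println(call.label + " (certainty " + call.certainty + ")");
    }
}
```

A drop that sits exactly at the cutoff yields a certainty of 0, which mirrors the point that borderline variants cannot be called decisively without experimental data.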
This project has really helped me as a scientist because I have had to confront very difficult problems and work out how to solve them. I have also met with defeat in multiple aspects of the project and each time had to overcome the problem; the lack of reliable data described above is one example. I am very grateful to the ORCA Committee for giving me the chance to perform this research. I have already felt the benefits of the project in my first week of graduate school.
Citations
1. Friedman, Lori. “Confirmation of BRCA1 by analysis of germline mutations linked to breast and ovarian cancer in ten families”. Nature Genetics 8 (4): 399–404. doi: 10.1038/ng1294-399.
2. Hutton, Mike; Lendon, Corinne; Rizzu, Patrizia; Baker, Matt (18 June 1998). “Association of missense and 5′-splice-site mutations in tau with the inherited dementia FTDP-17”. Nature 393: 702–705. doi: 10.1038/31508.
3. World Health Organization. Fact Sheet 297. Accessed Sept. 10, 2015.
http://www.who.int/mediacentre/factsheets/fs297/en/
4. World Health Organization. Fact Sheet 362. Accessed Sept. 10, 2015.
http://www.who.int/mediacentre/factsheets/fs362/en/
5. Wang K, Li M, Hakonarson H. ANNOVAR: Functional annotation of genetic variants from next-generation sequencing data. Nucleic Acids Research, 38:e164, 2010.