Michael Porter and Dr. John Prince, Department of Chemistry and Biochemistry
Introduction
Proteins are an integral part of the cell. They are responsible for metabolism, DNA replication, transport, and responses to changes in the environment. Because they carry out so many cellular functions, proteins are often drug targets. Proteins are made by ribosomes, which translate mRNA into protein. The end of translation is signaled by a stop codon in the mRNA. However, in some organisms such as viruses and yeast, the stop codon may be bypassed in an event known as translational readthrough.
Translational readthrough, also known as stop codon readthrough, produces proteins with an extended sequence. Given our current knowledge of the genome, these extended proteins are unknown and uncharacterized. Since proteins are often good candidates for drug targets, the more we know about the set of proteins in our bodies, the better able we are to create new drugs. Identifying readthrough in humans could radically change our understanding of the cell and genomic expression. Because readthrough can be regulated at the translational level, it gives the cell more control over the ratio between the normal and extended versions of a protein. Understanding what triggers readthrough could also prove useful in treating diseases such as cystic fibrosis and Duchenne muscular dystrophy, which are caused by the premature termination of protein translation. Readthrough is a translation-level event and therefore cannot be detected by genetic sequencing. Mass spectrometry is a powerful tool for protein identification, but it relies on databases of known protein sequences, which do not contain readthrough products. To overcome these shortcomings, we created a custom database against which to compare the mass spectrometry data.
Methodology
Using human mRNA sequences from the UCSC Genome Browser, we created a custom protein database. The mRNA sequences were translated normally as well as with several translation-level errors that can cause readthrough. This produced a large and extremely redundant database, which led to prohibitively long search times for the software. We took a novel approach: each protein was digested with trypsin in silico, and the resulting peptide list was condensed so that each peptide appeared only once. This resulted in a much smaller peptide database and consequently faster search times.
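The digestion and condensing step can be sketched as follows. This is a minimal illustration rather than the pipeline used in the study: it assumes the common trypsin rule (cleave after K or R unless the next residue is P), allows no missed cleavages, and uses a hypothetical minimum peptide length of 6. The function names and the two example sequences are invented for illustration.

```python
def trypsin_digest(protein, min_length=6):
    """Cleave after K or R unless the next residue is P (assumed rule);
    no missed cleavages are generated."""
    peptides, start = [], 0
    for i, residue in enumerate(protein):
        next_residue = protein[i + 1] if i + 1 < len(protein) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide without K/R
    # min_length is a hypothetical cutoff, not a value from the study
    return [p for p in peptides if len(p) >= min_length]


def build_peptide_database(proteins, min_length=6):
    """Map each unique peptide to the set of protein IDs that contain it."""
    database = {}
    for protein_id, sequence in proteins.items():
        for peptide in trypsin_digest(sequence, min_length):
            database.setdefault(peptide, set()).add(protein_id)
    return database


# Invented example: a normal protein and its readthrough-extended form
proteins = {
    "normal": "MAGTLLEKSSPDVR",
    "readthrough": "MAGTLLEKSSPDVRQWLTEEAK",
}
peptide_db = build_peptide_database(proteins)
for peptide, sources in sorted(peptide_db.items()):
    print(peptide, sorted(sources))
```

In this sketch the two proteins yield five peptides in total but only three unique entries, and the peptide that comes only from the extension is the one that can distinguish the readthrough form. Keeping the set of source proteins alongside each peptide is one way to support mapping identified peptides back to proteins later.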
A dataset consisting of 2,212 mass spectrometry files from across the human proteome was downloaded from the PRIDE Archive (PXD000561) and searched. We used the freely available programs MSGF+, X!Tandem, and OMSSA to search the mass spectrometry files. A decoy database was searched concurrently to help determine the statistical significance of each identification. After searching, the results from the three search engines were pooled and filtered at a false discovery rate of 1%. The resulting peptides were then matched back to the proteins from which they could have originated. After protein inference was performed, we filtered the list so that a readthrough protein was kept only if its normal parent sequence was also identified.
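The role of the decoy database in setting the false discovery rate can be illustrated with a simple target-decoy estimate. The sketch below is not the procedure used in the study: it assumes the pooled identifications already share a comparable score in which higher is better, and it estimates the FDR at each score threshold as the number of decoy hits divided by the number of target hits. The function name and the example identifications are hypothetical.

```python
def filter_by_fdr(psms, max_fdr=0.01):
    """Keep target identifications above the lowest score threshold at
    which the estimated FDR (decoys / targets) is still <= max_fdr.

    psms: list of (peptide, score, is_decoy); higher score = better.
    """
    ranked = sorted(psms, key=lambda psm: psm[1], reverse=True)
    targets = decoys = 0
    cutoff_index = 0
    for index, (_, _, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= max_fdr:
            cutoff_index = index  # deepest rank that still meets the FDR
    return [pep for pep, _, is_decoy in ranked[:cutoff_index] if not is_decoy]


# Invented identifications: (peptide, score, is_decoy)
psms = [
    ("MAGTLLEK", 9.0, False),
    ("QWLTEEAK", 8.0, False),
    ("KAEETLWQ", 7.0, True),   # decoy hit
    ("SSPDVR", 6.0, False),
]
accepted = filter_by_fdr(psms, max_fdr=0.01)
print(accepted)
```

With these four invented identifications, only the two targets ranked above the decoy pass a 1% threshold. In practice, scores from different search engines are not directly comparable and would need to be put on a common scale before pooling.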
Results
After correcting for redundancies, 620 unique readthrough protein candidates were found, with 64 of those appearing in at least ten samples. Aligning the protein and mRNA sequences of the candidates did not reveal any common sequences. Previous studies show that some stop codons, such as UGA, are more likely than others to undergo readthrough, but the stop codons of our candidates showed no deviation from what would normally be expected in the cell.
Discussion
Mass spectrometry is a powerful tool for protein identification, but only proteins that are present in a database can be identified from the mass spectra. Although the spectra for readthrough proteins may already be present in previously obtained data, the proteins remain unidentified because their sequences are absent from existing databases. By predicting the sequences of readthrough proteins, we created a database targeted to their identification. To our knowledge, this is the first attempt at using mass spectrometry to identify translational readthrough without prior knowledge of which proteins are undergoing readthrough. Our approach allows for fast identification of readthrough without the need for specialized, targeted assays.
Conclusion
The proteins identified by this study serve as a starting point for further research into translational readthrough and as targets for future studies to verify their existence and determine their function. We are now able to search any human dataset for translational readthrough, and our approach can easily be extended to any organism for which a genomic sequence exists. The implications of readthrough in human development and in diseases such as cancer can now be studied on a much broader scale.