Corinne Sexton and Faculty Mentor: Perry Ridge, Biology
Bacteriophages are viruses that specifically target bacteria. With antibiotic resistance
on the rise, some researchers are looking to bacteriophages as a viable treatment
alternative. Phage therapy would be effective for several reasons: phages are 1) highly
specific to their host bacteria, 2) very effective at lysing bacterial targets, 3) nontoxic
to humans, and 4) easy and cost-effective to manufacture (Oliveira et al., 2015).
Additionally, bacteriophages could be used effectively to treat bacterial
infections in plants and animals of agricultural importance.
One major roadblock to the therapeutic use of phages is the lack of understanding of
their genetic content. Diversity within the phage world is immense, and understanding
that diversity poses a major problem to those researching phages. Newly sequenced
phage genomes reveal that the majority of proteins share little or no sequence
similarity with known phage proteomes. However, it is likely that the majority of these
proteins have functions and structures similar to those of known proteins. Our algorithm
compares predicted secondary structures rather than amino acid sequences to facilitate
the annotation of newly sequenced phage proteomes, opening the door to functional
studies of phage proteins and, ultimately, to the use of phages as human and
agricultural therapeutics.
After evaluating several methods for translating amino acid sequences into secondary
structure predictions, we chose DeepCNF as the most appropriate software for this
project (Wang, Peng, Ma, & Xu, 2016). It reports around 84% accuracy for 3-structure
predictions. The only drawback to this method was its long runtime. I ran around 60,000
jobs on the Fulton supercomputer to translate 8.7 million bacterial protein sequences
into secondary structure predictions. This entire process took around 6 months to complete.
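The figures above give a rough sense of batch size: 8.7 million sequences spread over roughly 60,000 jobs is about 145 sequences per job. A minimal Python sketch of that kind of batching follows. The function name `batch_records` and the record layout are illustrative assumptions, not the scripts actually used on the Fulton supercomputer.

```python
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, str]  # (identifier, amino acid sequence)


def batch_records(records: Iterable[Record], batch_size: int) -> Iterator[List[Record]]:
    """Yield consecutive lists of at most batch_size records (hypothetical helper)."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    batch: List[Record] = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # final, possibly smaller, batch
        yield batch


# Rough arithmetic from the approximate figures in the text.
total_sequences = 8_700_000
total_jobs = 60_000
sequences_per_job = total_sequences // total_jobs
```

Each batch could then be written to its own input file and submitted as one job; how the work was actually divided is not stated in this report.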
I also used DeepCNF to translate all documented viral protein sequences, a
substantially smaller set of 190,000 in total. This took about a week. DeepCNF
reports secondary structure in terms of either 8 structures or 3 structures. I took
both outputs for the bacterial and the viral proteins and created 2 different BLAST
databases: a 3-structure database and an 8-structure database. Results are
described below.
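To be searchable with BLAST, each predicted structure string has to be stored as a sequence record. The sketch below formats (identifier, structure string) pairs as FASTA text. The helper name `to_fasta`, the example identifiers, and the use of H, E, and C as the three structure letters are assumptions for illustration; the exact encoding used to build the databases is not described here.

```python
from typing import Iterable, Tuple


def to_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Format (identifier, structure string) pairs as FASTA text (hypothetical helper)."""
    lines = []
    for identifier, structure in records:
        lines.append(f">{identifier}")
        # Wrap long structure strings at a fixed line width.
        for start in range(0, len(structure), width):
            lines.append(structure[start:start + width])
    return "".join(line + "\n" for line in lines)


# Invented example records, not data from this project.
example_records = [
    ("bact_protein_1", "CCHHHHHHCCEEEEECC"),
    ("viral_protein_1", "CCCEEEEECCHHHH"),
]
fasta_text = to_fasta(example_records, width=10)
```

A file in this form is the usual input for building a BLAST database.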
We found that the 8-structure database performs poorly when results for sample
sequences are compared with known BLAST hits. A possible reason is that
DeepCNF reports only around 72% accuracy for 8-structure translations versus 84%
accuracy for 3-structure translations.
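One way to see why 8-structure predictions score lower: every 8-structure label falls into one of the 3 broader classes, so a prediction that confuses two similar labels is wrong in 8 structures yet can still be correct in 3. The sketch below uses a commonly used reduction (H, G, I to helix; E, B to strand; everything else to coil) on made-up strings. The mapping DeepCNF itself applies and the helper names are assumptions here.

```python
# Commonly used 8-to-3 reduction; assumed here, not taken from this report.
EIGHT_TO_THREE = {
    "H": "H", "G": "H", "I": "H",   # helix types
    "E": "E", "B": "E",             # strand types
    "T": "C", "S": "C", "L": "C",   # turn, bend, loop treated as coil
}


def reduce_to_three(structure: str) -> str:
    """Collapse an 8-structure string to 3 structures; unknown labels become coil."""
    return "".join(EIGHT_TO_THREE.get(label, "C") for label in structure)


def per_residue_accuracy(predicted: str, actual: str) -> float:
    """Fraction of positions where the predicted label equals the actual label."""
    if len(predicted) != len(actual):
        raise ValueError("strings must have the same length")
    if not actual:
        return 0.0
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)


# Invented example: the prediction confuses G with H, T with S, B with E, S with L.
actual_8 = "LHHHHGGGTEEEEBSL"
predicted_8 = "LHHHHHHHSEEEEELL"

accuracy_8 = per_residue_accuracy(predicted_8, actual_8)
accuracy_3 = per_residue_accuracy(reduce_to_three(predicted_8),
                                  reduce_to_three(actual_8))
```

Because the reduction is many-to-one, any position that matches in 8 structures also matches in 3, so the 3-structure accuracy can never be lower.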
The 3-structure database performs very well when using a blastn command to
compare known protein sequences. The hits for known proteins run through the
3-structure database match the results we find when using BLAST online, which gives
us confidence in our method. Most importantly, for those proteins with no known
annotation, we get consistent hits across bacterial and viral translated proteins,
which leads us to conclude that secondary structure BLAST may be a viable way
to annotate proteins of unknown function.
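To compare hits across bacterial and viral entries, BLAST's 12-column tabular output (`-outfmt 6`) can be reduced to the best hit per query. The sketch below parses that format and keeps the highest bit score for each query. The helper name `best_hits` and all sample rows and identifiers are invented for illustration; they are not results from this project.

```python
from typing import Dict, Tuple

# Default -outfmt 6 columns: qseqid sseqid pident length mismatch gapopen
# qstart qend sstart send evalue bitscore


def best_hits(tabular_text: str) -> Dict[str, Tuple[str, float, float]]:
    """Map each query to (subject, evalue, bitscore) of its highest-scoring hit."""
    best: Dict[str, Tuple[str, float, float]] = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, evalue, bitscore)
    return best


# Invented rows in -outfmt 6 layout.
sample_output = "\n".join([
    "\t".join(["query_1", "bact_protein_7", "92.5", "120", "9", "0",
               "1", "120", "5", "124", "1e-40", "150"]),
    "\t".join(["query_1", "viral_protein_3", "97.0", "118", "3", "0",
               "1", "118", "2", "119", "2e-55", "210"]),
    "\t".join(["query_2", "bact_protein_9", "81.3", "75", "14", "0",
               "4", "78", "10", "84", "3e-12", "88.4"]),
])
top = best_hits(sample_output)
```

Checking whether a query's top bacterial and viral hits carry the same annotation is one possible way to look for the consistency described above.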
In conclusion, the preliminary results of this project are exciting. We know that our
method works to identify previously annotated proteins, and we hope to validate its
identifications of lesser-known proteins. It may be worth translating other
organisms’ proteins to enlarge the database even further. As more
phage proteins are annotated, we will be able to manipulate phages for therapeutic
use. This algorithm takes us a step closer to that goal.
Oliveira, Hugo, et al. (2015). Unexploited opportunities for phage therapy. Frontiers in
Pharmacology, 6.
Wang, Sheng, Peng, Jian, Ma, Jianzhu, & Xu, Jinbo. (2016). Protein Secondary
Structure Prediction Using Deep Convolutional Neural Fields. Scientific Reports, 6,
18962. doi: 10.1038/srep18962