Genome Annotation of Novel Viruses and Discovery of Critical Sequences in Genes Via Comparative Analysis of B4 Mycobacteriophage

Cameron Sargent and Dr. Sandra Burnett, Microbiology and Molecular Biology

Introduction

Over the past few years at Brigham Young University and other colleges nationwide, members of the Phage Hunters program have worked to find and analyze novel mycobacteriophages in order to develop new treatment and research methods for the pathogen Mycobacterium tuberculosis. This pathogen causes one of the deadliest diseases in the world today, and its traits, including a slow growth cycle and the ability to remain in an asymptomatic latent state for years before becoming virulent, greatly hinder our ability to detect, treat, and cure infection and even to perform laboratory research on the bacterium. Bacteriophages, viruses that infect specific host bacteria, present a possible means of circumventing such hurdles. These viruses are simple biological systems that can be used to detect, manipulate, and kill M. tuberculosis. However, successful applications of bacteriophages in combating bacteria have been limited in number, largely because the interactions between viruses and their host bacteria are not sufficiently understood. My research contributes to this understanding by demonstrating a novel and effective approach to the analysis of viral genomes and functions. In this project, I first annotated six genomes of mycobacteriophages discovered by students in the Phage Hunters program here at BYU; that is, I predicted their gene coding regions and gene functions. I then compared three of these genomes to each other and to other genomes from their classification group, the mycobacteriophage B4 subcluster, to find regions of the genes that showed high conservation. High conservation, or high similarity, in a region of DNA suggests that the region is significant enough to the function of its protein product that mutations or other modifications render the protein defective and prevent the normal viral reproduction cycle.
At this point I focused on one specific protein common to these B4 viruses, a recombination directionality factor (RDF) protein that plays a critical role in viral propagation. I used protein folding software to predict the conformation of this protein in each of its viruses, located the regions of these predicted structures that were highly conserved, and compared them to excisionase (Xis), a protein with the same function from Enterobacteria phage lambda. Studies performed elsewhere have determined how Xis functions, and my analysis showed that the conserved regions I identified by comparing genomes correspond to the functionally critical regions identified by others [1]. This agreement supports the conclusion that identifying protein functional regions via comparative genomics can produce the same results as costlier, more difficult alternatives such as NMR spectrometry or X-ray crystallography.

Methodology

Genomes of the mycobacteriophages Adawi, Bane1, Bane2, Fredward, PhrostyMug, and SargentShorty9, all isolated and sequenced through the Phage Hunters program at BYU, were annotated using DNA Master (Version 5.22.8) with GeneMark and Glimmer auto-annotation. Gene calls were determined by considering the following criteria: auto-annotation, BLASTn alignment E-values less than 0.001, GeneMark (Version 2.5) gene coding potential map predictions using Mycobacterium smegmatis as a model, start codon sequences, and Shine-Dalgarno scores above 200 using the Karlin position-specific scoring matrix (PSSM) for moderately-to-highly expressed genes. Annotated genomes were then finalized using Sequin (Version 1.0) and submitted to NCBI’s GenBank. Genomes of the B4 mycobacteriophages (Adawi, Bane1, Bane2, ChrisNMich, Cooper, Frederick, KayaCho, Stinger, and Zemanar) were then analyzed using Phamerator to find 26 gene sequences that were common to B4 phages but absent from almost all other phages. The viruses from this subcluster were further analyzed using genome dotplots created by Gepard (Version 1.30). All mycobacteriophage genomes in the phagesdb.com database were then also analyzed using Phamerator to locate the 88 gene sequences that were most common across mycobacteriophages as a whole. Putative functions of the protein products of the 114 total sequences were then assigned using BLASTn results for similar sequences whose function had previously been determined [2]. These gene sequences were also examined using gene alignment maps and pairwise identity matrices generated by Geneious (Version 5.6.6) using ClustalW algorithms to find regions of high conservation. At this point, one of the sequences from the B4 mycobacteriophages was studied in depth because of its suggested function as an RDF. The RDF proteins encoded by this sequence in several viruses were then folded using the RaptorX online structure prediction software (http://raptorx.uchicago.edu).
Folded protein structures were then compared with the previously analyzed gene sequences to locate highly conserved regions, and with Xis, a protein with the same function from phage lambda, to locate functional regions [1].
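
The conservation step described above can be illustrated with a minimal Python sketch. It is not the Geneious or ClustalW implementation used in this study; it assumes the sequences have already been aligned to equal length, and the phage names and sequences in `toy` are invented for illustration. The function names are likewise hypothetical. The sketch computes a pairwise percent identity for each pair of aligned sequences and then slides a fixed-width window along the alignment to find the window with the most columns that are identical in every sequence.

```python
from itertools import combinations


def pairwise_identity(a, b):
    """Percent identity over aligned columns where neither sequence has a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)


def identity_matrix(aligned):
    """Map each pair of sequence names to its pairwise percent identity."""
    return {
        (m, n): pairwise_identity(aligned[m], aligned[n])
        for m, n in combinations(aligned, 2)
    }


def most_conserved_window(aligned, width):
    """Return (start, identical_columns) for the first window of the given
    width that contains the most columns identical across all sequences."""
    seqs = list(aligned.values())
    # A column counts as identical when every sequence has the same non-gap base.
    identical = [len(set(col)) == 1 and col[0] != "-" for col in zip(*seqs)]
    best_start, best_count = 0, -1
    for start in range(len(identical) - width + 1):
        count = sum(identical[start:start + width])
        if count > best_count:
            best_start, best_count = start, count
    return best_start, best_count


# Invented toy alignment (not data from this study).
toy = {
    "phageA": "ATGCCGTACGTTAGC",
    "phageB": "ATGACGTACGTTCGC",
    "phageC": "TTGCCGTACGTAAGC",
}

if __name__ == "__main__":
    for pair, pct in identity_matrix(toy).items():
        print(pair, round(pct, 1))
    print(most_conserved_window(toy, 6))
```

In a real analysis the window width would be chosen to match the feature of interest, for example 18 to scan for a segment the size of the conserved region reported in the Results.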

Results

Following annotation of the six novel viral genomes, I submitted these genomes to NCBI’s GenBank; so far five of these have been processed and published online (see Table 1 for GenBank Accession Numbers), adding to the collection of data necessary for studies using comparative genomics. While comparing the viruses of the B4 subcluster, I also discovered that one of these viruses, KayaCho, was sufficiently dissimilar from the other B4 phages to be classified separately. Although KayaCho is certainly more similar to the B4 phages than to other phages, weak genome dotplot alignments, genome BLASTn results, and gene ClustalW pairwise alignments indicated that KayaCho should be placed into its own subcluster. The other B4 phage genomes showed much higher similarity to each other, and more similarity even to phages in different B subclusters, than to KayaCho.
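
To illustrate what a genome dotplot measures, the following Python sketch uses a simplified word-match approach. It is not Gepard's algorithm, and the function names, the word length `k`, and the test sequences are assumptions made for illustration only. Each reported point (i, j) means that the word of length k starting at position i in the first sequence also occurs at position j in the second. Closely related genomes produce a long unbroken diagonal of such points, while a weak alignment produces a sparse or broken one.

```python
from collections import defaultdict


def dotplot_matches(seq_a, seq_b, k=4):
    """Return all (i, j) where the k-mer at seq_a[i] equals the k-mer at seq_b[j]."""
    index = defaultdict(list)
    for j in range(len(seq_b) - k + 1):
        index[seq_b[j:j + k]].append(j)
    return [
        (i, j)
        for i in range(len(seq_a) - k + 1)
        for j in index.get(seq_a[i:i + k], [])
    ]


def shared_word_fraction(seq_a, seq_b, k=4):
    """Fraction of k-mer positions in seq_a that have at least one match in seq_b."""
    windows = len(seq_a) - k + 1
    if windows <= 0:
        return 0.0
    matched = {i for i, _ in dotplot_matches(seq_a, seq_b, k)}
    return len(matched) / windows


if __name__ == "__main__":
    # Invented sequences, not data from this study.
    a = "ATGCCGTACGTTAGC"
    print(dotplot_matches(a, a, k=4))
    print(shared_word_fraction(a, "GGGGGGGGGGGGGGG", k=4))
```

The shared-word fraction is only a rough summary of the plot; it is offered here as one hypothetical way to put a number on how "weak" a dotplot alignment is, not as the criterion used to separate KayaCho.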

The genome and protein analysis identified the putative DNA binding sites of the RDF proteins from the B4 subcluster phages, excluding KayaCho and including Acadian, a B5 phage whose RDF sequence Phamerator grouped with the B4 RDF sequences because of its high similarity. RDF functions by removing the viral DNA from the bacterial chromosome when the lysogenic phage begins its process of replication. In this process, RDF binds to a specific DNA sequence, bends the DNA to facilitate the excision of viral DNA, and recruits integrase, the protein that directly inserts and removes the viral genome. Consequently, the DNA interaction mechanism of RDF is central to its function. Gene sequence alignments identified the region of highest conservation, an 18 base pair segment with 94.8% sequence similarity and 15 identical bases. Protein folding predicted that this region forms an α-helix with multiple basic residues (3 Arg and 1 Lys) aligned on one side of the helix. Studies performed by others have demonstrated that the Xis protein from lambda also contains an α-helix with aligned basic residues, and that this helix is the DNA sequence recognition and primary binding site of the protein [1]. This correspondence indicates that the highly conserved region is indeed the region most important to the function of the protein. The Xis protein also contains a “wing” motif, located separately from the α-helix, that includes a basic residue that allows it to bind to the DNA. This structure was also observed in the folded RDF proteins of the B4 phages, once again with high genomic conservation.
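
The observation that basic residues line up on one side of the helix can be checked numerically with a helical-wheel calculation. The Python sketch below is illustrative only and was not part of this study's workflow. It assumes an ideal α-helix in which each residue is rotated 100 degrees from the previous one, treats only Arg (R) and Lys (K) as basic, and uses invented peptide sequences rather than the actual RDF helix. The score is the length of the mean direction vector of the basic residues around the helix axis: values near 1 mean they cluster on one face, and values near 0 mean they are spread around the helix.

```python
import cmath
import math

BASIC = {"R", "K"}           # Arg and Lys, the basic residues named in the text
DEGREES_PER_RESIDUE = 100.0  # rotation per residue in an ideal alpha-helix


def basic_face_score(helix_seq):
    """Mean resultant length (0 to 1) of the angular positions of basic residues."""
    angles = [
        math.radians(i * DEGREES_PER_RESIDUE)
        for i, aa in enumerate(helix_seq.upper())
        if aa in BASIC
    ]
    if not angles:
        return 0.0
    resultant = sum(cmath.exp(1j * a) for a in angles)
    return abs(resultant) / len(angles)


if __name__ == "__main__":
    # Invented peptides: basic residues at positions 0, 3, 4, 7 share a face;
    # basic residues at positions 0, 1, 2, 5 are spread around the helix.
    print(basic_face_score("RAERKALRAEL"))
    print(basic_face_score("RKRAAKAEALA"))
```

Residues spaced three or four positions apart (i, i+3, i+4, i+7) fall on roughly the same face of an ideal helix, which is why the first invented peptide scores much higher than the second.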

Academic outcomes of this study include the publication of five annotated genomes in NCBI’s GenBank and the pending publication of a sixth, as well as the data and a draft for a research journal article to be submitted in the near future. I have also submitted an abstract to present the results of this study at the Utah Conference on Undergraduate Research this coming February, and I will submit an abstract for presentation at the American Society for Microbiology Intermountain Branch Meeting this coming April.

Discussion

This application of comparative genomics to locate the functionally critical region of a protein provides a valuable tool for the study of proteins. Typical studies that elucidate the functional region of a protein, and thus the mechanism of its function, require difficult, time-consuming, and costly procedures such as gene sequence mutation tests, NMR spectrometry, and X-ray crystallography of the proteins and the molecules upon which they act. Many laboratories lack the resources and equipment for these studies, whereas a genomics-based approach can be performed using computers in even the simplest of laboratories. Even laboratories capable of the alternatives could benefit from this approach by first identifying the critical regions of proteins through comparative genomics and then confirming those predictions experimentally, without spending unnecessary time and resources analyzing the entire protein product.

Conclusion

This study in comparative genomics demonstrates the value of this method in identifying the critical regions of proteins without the need for costly and difficult in vitro analysis. It also shows the necessity of continually adding to the pool of genetic data, which greatly improves the efficacy and validity of genomics research. Furthermore, it shows the value of continually developing bioinformatics software and resources that can analyze genetic data to identify trends and make accurate predictions about unknown systems.

Table 1
Phage Name        GenBank Accession Number
Adawi             KF279411
Bane1             KF279412
Bane2             KF279413
Fredward*         KF279414
PhrostyMug        KF279415
SargentShorty9    KF279416
* unpublished as of October 24, 2013

Sources

[1] Sam, M.D., et al., Regulation of directionality in bacteriophage lambda site-specific recombination: structure of the Xis protein. J Mol Biol, 2002. 324(4): p. 791-805.
[2] Hatfull, G.F., et al., Comparative genomic analysis of 60 Mycobacteriophage genomes: genome clustering, gene acquisition, and gene size. J Mol Biol, 2010. 397(1): p. 119-43.