Development of Software Program for Organizing DNA Sequence Data from GenBank

Heather Vernon and Dr. David McClellan, Integrative Biology

Nucleotide sequence data in GenBank (a government-sponsored repository located at http://www.ncbi.nlm.nih.gov/entrez/) are organized in a flat file format that can be accessed by searches. Obtaining a large volume of sequences from GenBank is tedious because the sequence data must be extracted from the flat file of every sequence. The goal of this project was to develop a Java program that uploads many sequences from GenBank in one file, extracts the gene names so that the user can group synonyms, and saves the data in either tab-delimited or FASTA format for easy import into other sequence manipulation programs. This program was seen as a major step toward streamlining the analysis process in our lab.

To extract sequences from a GenBank sequence file, I had to decide on a file format that I could process easily in Java and that contained all of the sequence identification information needed to organize the data appropriately. I decided on GBSeqXML, because I could use the tags to find the information that I needed without having to deal with the excess of tags that exist in the full GenBank XML format. To create the sequence import file, the user searches for sequences on GenBank, displays them all together in GBSeqXML, and saves them to a text file from the Internet browser. The speed of the upload into the program and the maximum upload file size depend primarily on the capacity of the computer. The program has no built-in file size restrictions, but I had to allocate additional virtual memory to Java so that it could handle large files.
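As an illustration of the tag-based extraction described above, the following is a minimal sketch, not the program's actual code. It assumes the GBSeqXML element names GBSeq, GBSeq_primary-accession, GBSeq_sequence, GBQualifier_name, and GBQualifier_value, which should be checked against an actual export. The class GbSeqXmlSketch and its SeqRecord holder are hypothetical names. The sketch uses Java's streaming StAX parser, a choice made here for illustration because it reads one element at a time instead of loading the whole file.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

/** Hypothetical sketch: pulls accession, gene names, and sequence from GBSeqXML. */
final class GbSeqXmlSketch {

    /** Hypothetical holder for the fields taken from one GBSeq record. */
    static final class SeqRecord {
        final String accession;
        final List<String> geneNames;
        final String sequence;

        SeqRecord(String accession, List<String> geneNames, String sequence) {
            this.accession = accession;
            this.geneNames = geneNames;
            this.sequence = sequence;
        }
    }

    /** Reads every GBSeq record from the input; malformed XML raises IllegalArgumentException. */
    static List<SeqRecord> read(Reader in) {
        List<SeqRecord> records = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            // Skip DTD processing so no network access is needed (assumes no custom entities).
            factory.setProperty(XMLInputFactory.SUPPORT_DTD, Boolean.FALSE);
            XMLStreamReader xml = factory.createXMLStreamReader(in);

            String accession = "";
            String sequence = "";
            String qualifierName = "";
            List<String> geneNames = new ArrayList<>();
            StringBuilder text = new StringBuilder();

            while (xml.hasNext()) {
                int event = xml.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    text.setLength(0);
                } else if (event == XMLStreamConstants.CHARACTERS
                        || event == XMLStreamConstants.CDATA) {
                    text.append(xml.getText());
                } else if (event == XMLStreamConstants.END_ELEMENT) {
                    String value = text.toString().trim();
                    switch (xml.getLocalName()) {
                        case "GBSeq_primary-accession":
                            accession = value;
                            break;
                        case "GBSeq_sequence":
                            sequence = value;
                            break;
                        case "GBQualifier_name":
                            qualifierName = value;
                            break;
                        case "GBQualifier_value":
                            // Keep only values of "gene" qualifiers, once per record.
                            if ("gene".equals(qualifierName) && !geneNames.contains(value)) {
                                geneNames.add(value);
                            }
                            break;
                        case "GBSeq":
                            // End of one record: store it and reset the per-record state.
                            records.add(new SeqRecord(accession, geneNames, sequence));
                            accession = "";
                            sequence = "";
                            qualifierName = "";
                            geneNames = new ArrayList<>();
                            break;
                        default:
                            break;
                    }
                    text.setLength(0);
                }
            }
            xml.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("Malformed GBSeqXML input", e);
        }
        return records;
    }
}
```

A caller would pass a reader over the saved file, for example GbSeqXmlSketch.read(new FileReader(path)). If a parser that loads the whole document is used instead, the JVM heap limit can be raised with the standard -Xmx option (for example -Xmx1g).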

Because GenBank has no enforced naming conventions for sequences, the name of a gene can vary greatly with the preferences of the submitter. For example, the gene cytochrome b can be found on GenBank under the gene names cyt b, cyt-b, and cytochrome b. Misspelled versions of gene names can also be found, especially when there are thousands of sequence entries for that gene. These "synonyms" of the gene name make grouping sequences painful because the sequences cannot simply be sorted by gene name. There is also no guarantee that all of the sequences in the search results are relevant to the desired gene: the gene name may just happen to appear in the comments section of the sequence record. Our program allows users to create groups of names and eliminate undesired sequences without having to pick through the sequence file by hand.
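The grouping of synonyms could be sketched as follows. This is an illustrative assumption about one way to do it, not the program's actual logic: names are first reduced to a key that ignores case, spaces, hyphens, underscores, and periods, and the user then assigns keys to a group label. The class GeneNameGroups and its methods are hypothetical.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: maps variant gene names to user-defined groups. */
final class GeneNameGroups {

    private final Map<String, String> groupByKey = new HashMap<>();

    /** Reduces a raw name to a comparison key, so "cyt b", "cyt-b", and "Cyt B" all give "cytb". */
    static String key(String rawName) {
        return rawName.toLowerCase(Locale.ROOT).replaceAll("[\\s\\-_.]+", "");
    }

    /** Records the user's decision that the given names all belong to one group. */
    void assign(String groupLabel, String... synonyms) {
        for (String synonym : synonyms) {
            groupByKey.put(key(synonym), groupLabel);
        }
    }

    /** Returns the group for a name, or empty if the user has not assigned it (a candidate for review or removal). */
    Optional<String> groupOf(String rawName) {
        return Optional.ofNullable(groupByKey.get(key(rawName)));
    }
}
```

Normalization alone merges cyt b and cyt-b, but it cannot know that cytochrome b is the same gene or that a misspelling belongs with it, which is why the user still assigns those names explicitly, for example assign("cytochrome b", "cyt b", "cytochrome b").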

Output files can be created from the entire dataset or from a subset based on user preferences, in either tab-delimited or FASTA format. The simplest output is all of the uploaded sequences in the desired format. Datasets can also be pared down by limiting the number of sequences output per gene name or per user-defined group. This gives datasets an easily controlled N for later analyses. The program chooses among the available sequences by selecting the longest (and presumably most complete) sequences first.
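The selection and output steps described above can be sketched as follows, under stated assumptions. The class DatasetExportSketch, the Entry holder, the FASTA header layout (identifier followed by group name), and the tab-delimited column order are all hypothetical choices made for illustration; the original program's exact output layout is not specified here.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: keeps the N longest sequences per group and formats them. */
final class DatasetExportSketch {

    /** Hypothetical holder for one sequence and the group it was assigned to. */
    static final class Entry {
        final String id;
        final String group;
        final String sequence;

        Entry(String id, String group, String sequence) {
            this.id = id;
            this.group = group;
            this.sequence = sequence;
        }
    }

    /** Keeps at most maxPerGroup entries per group, longest first; ties keep input order. */
    static List<Entry> longestPerGroup(List<Entry> entries, int maxPerGroup) {
        Map<String, List<Entry>> byGroup = new LinkedHashMap<>();
        for (Entry e : entries) {
            byGroup.computeIfAbsent(e.group, g -> new ArrayList<>()).add(e);
        }
        List<Entry> kept = new ArrayList<>();
        for (List<Entry> members : byGroup.values()) {
            members.sort(Comparator.comparingInt((Entry e) -> e.sequence.length()).reversed());
            int limit = Math.max(0, Math.min(maxPerGroup, members.size()));
            kept.addAll(members.subList(0, limit));
        }
        return kept;
    }

    /** Formats entries as FASTA: a '>' header line, then the sequence wrapped at lineWidth. */
    static String toFasta(List<Entry> entries, int lineWidth) {
        if (lineWidth <= 0) {
            throw new IllegalArgumentException("lineWidth must be positive");
        }
        StringBuilder out = new StringBuilder();
        for (Entry e : entries) {
            out.append('>').append(e.id).append(' ').append(e.group).append('\n');
            for (int i = 0; i < e.sequence.length(); i += lineWidth) {
                int end = Math.min(i + lineWidth, e.sequence.length());
                out.append(e.sequence, i, end).append('\n');
            }
        }
        return out.toString();
    }

    /** Formats entries as tab-delimited text with a header row, one sequence per line. */
    static String toTabDelimited(List<Entry> entries) {
        StringBuilder out = new StringBuilder("id\tgroup\tlength\tsequence\n");
        for (Entry e : entries) {
            out.append(e.id).append('\t').append(e.group).append('\t')
               .append(e.sequence.length()).append('\t').append(e.sequence).append('\n');
        }
        return out.toString();
    }
}
```

Sorting by length within each group before truncating is what gives a controlled N per group while favoring the most complete sequences.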

After I finished the preliminary version of this program, I tested it for feasibility in a sample sequence evolution analysis project. I downloaded all of the mammalian mitochondrial genes from GenBank, which formed a dataset of 107 genomes with 13 genes each. Next, I uploaded this 7.8 MB file into the new program. To verify that each GenBank record had each of the 13 genes, I output the sequences in tab-delimited format and opened the file in Excel. After verifying that I had complete datasets, I output each gene into a separate FASTA file. This completed the use of the new program. I then ran the sequences through two programs to put them in the correct reading frame without stop codons and convert them into amino acid sequences (more meaningful for alignments in an evolutionary context), align the amino acid sequences, and convert the sequences back into an aligned nucleotide format [1]. This entire process took 20 minutes, instead of the weeks and months that it previously took. After the sequences were aligned, the file format was converted to NEXUS with Seaview so that several evolutionary analysis programs could read the sequences. Ideal evolutionary models were determined with PAUP and ModelTest, gene trees were derived with MrBayes, and a consensus tree (the final product) was created with PAUP. The latter half of the analyses took about three weeks because of their complexity, but that was just a matter of waiting for the computer to process the information.

In short, the amount of time needed to organize and manipulate the sequence dataset for analysis in various programs was drastically reduced. Additional benefits are that the program can be run remotely on a high-capacity computer and that the resulting datasets are the largest possible and contain the most current data available on GenBank.

This program has several aspects that could be improved. First, the program could be updated to a database application with a graphical interface. Second, the versatility and efficiency of the program could be increased for compiling nucleotide data sets. Lastly, the program could incorporate direct GenBank database access or a local database equivalent to reduce the effort needed to upload the sequences into the program.

Brigham Young University

Journal of Undergraduate Research
