Ford, Donald
VelvetS: Software for Optimizing Velvet Input
Faculty Mentor: John Chaston, Plant & Wildlife Sciences
Introduction
Genome assembly is an important tool used to obtain more complete representations of entire genomes. Sequencing techniques, such as shotgun sequencing, are able to generate short pieces of a genome known as reads. Genome assemblers use these reads in order to produce larger genome sequences, known as contigs. This is done by computing how these reads overlap with each other. The ultimate goal is to create a single contig which represents an entire genome, though usually genome assemblers fall short of this goal.
Since it is difficult to create a single contig representing the entire genome, criteria are needed in order to determine what makes one genome assembly superior to another. The criteria that are used are these: number of contigs, max contig length, n50 and total number of reads used. The fewer contigs left at the end of an assembly the better, since the ultimate goal is to have a single contig representing the entire genome. Max contig length refers to the length of the longest generated contig. An assembly with more contigs may be better than an assembly with fewer contigs if those contigs are greater in length. n50 refers to the “median length-weighted contig length.”1 When sequencing a genome more reads are often generated than needed. This is to ensure that all parts of a genome are represented. However, once a certain threshold of sequence data has been hit, the utility of extra reads decreases. This leads to reads which are wasted, meaning they are not used in producing the final contigs. The total number of reads indicates how many of the initially submitted reads were used in creating the newly generated contigs, giving the user an idea of how much of their sequencing data was utilized. By looking at these different values, those familiar with genome assembly are able to determine the quality of a genome assembly.
One well known genome assembly tool, and the focus of this paper, is called Velvet.1 It is an open source software solution for genome assembly. There are a few factors that can aid in generating higher quality assemblies when using Velvet. One of the most important is the kmer length that is used during genome assembly. Kmer length refers to how many base pairs of a read are used when determining overlaps. Coverage cutoff is another input that can affect the quality of an assembly. It refers to how Velvet handles reads that are incorrectly identified as overlapping. The goal of this paper is to demonstrate how the quality of assemblies produced by Velvet can be optimized using wrapper software we developed known as VelvetS. VelvetS attempts to produce optimized genome assemblies by running Velvet with multiple kmer lengths and comparing their results. It also aims to reduce the amount of reads that are wasted when assembling a genome. It was built to make it easy to use for those less familiar with genome assembly.
Methods
In order to assess the quality of assemblies generated by VelvetS, it will be compared to another Velvet wrapper called VelvetOptimiser.2 VelvetOptimiser also chooses an optimal kmer length and coverage cutoff. At its most basic level, VelvetS follows the following workflow. First, VelvetS runs Velvet multiple times using different kmer lengths. It evaluates the optimal kmer length and coverage cutoff and then produces a final assembly using these optimized inputs (See figure 1). In order to select the optimum assembly, VelvetS defaults to selecting the assembly that utilized the greatest number of reads. The user may override this behavior, however, and choose the best assembly to use after inspecting it by hand. Coverage cutoff was determined by following the method outlined by the creators of Velvet.3
A major goal of VelvetS is to resolve the issue of having wasted reads. Instead of using all the reads in a single assembly we break the assembly process into a series of sub-assemblies (See Figure 2). It has been demonstrated that when a genome hits a certain threshold of coverage there is no further improvement in genome assembly.4 We calculate the coverage of a genome based on its estimated size, the length of its reads and the total number of reads. If coverage is greater than 200x (a threshold that ensures that more coverage will not be helpful) the reads are broken apart into separate files, one per each 200x of coverage (Example: 400x = 2 subassemblies, 600x = 3 subassemblies). If coverage is less than 200x, then the original read files are not broken apart and the workflow outlined in Figure 1 is followed. This method of breaking large read files into smaller subassemblies has previously been demonstrated as useful in producing more optimal assemblies.5
Results and Discussion
Results were generated using two different methods and can be found in Table 1. VelvetOptimiser was run once on each data set for comparison purposes. VelvetS was also run on each data set without breaking apart any read files. Testing is still in progress for runs done using the read breakup method and are marked as NA (some read files were too small to break apart and are indicated with a *). The chosen kmer length, number of contigs, max contig length, n50 and percent of reads used are reported for each assembly. The results in Table 1 demonstrate that VelvetOptimiser is superior to VelvetS in selecting optimized assemblies. It consistently produced fewer contigs and higher max values. It also produced higher n50 values. It can be concluded that choosing the best assembly based on the number of reads used is a less effective method, since often VelvetOptimiser used less reads but produced better assemblies.
Conclusion
It has been demonstrated that VelvetOptimiser is a very efficient tool in selecting optimized genome assemblies. It is still believed, however, that breaking apart read files can lead to even more optimal assemblies. Since VelvetOptimiser is an effective tool for optimizing Velvet output, VelvetS will be modified to use VelvetOptimiser in order to select optimized assemblies. VelvetOptimiser will be used to handle the work shown in Figure 1, while VelvetS will handle the breaking apart of reads and their reassembly back together depicted in Figure 2. Thus VelvetS will still fulfill the main goal of utilizing reads that are typically thrown out to produce more optimal genome assemblies.