Mark Kendell Clement and Dr. Sean Warnick, Computer Science
The Needleman-Wunsch algorithm is a classic example of dynamic programming, a technique that solves large problems by breaking them down into smaller subproblems. The simplest subproblems are solved first, and their solutions are then used to solve the larger ones. Over this past year, I studied the impact of computer science optimization techniques on the Needleman-Wunsch algorithm as applied to DNA sequence alignment.
Sequence alignment lies at the base of almost every genomic analysis technique. It attempts to align, or match, one sequence of DNA to another. Over time, and under evolutionary pressure, changes or mutations become part of the DNA of an individual, and eventually of an entire species. Sequence alignment allows researchers to compare different DNA sequences to uncover these mutations. An understanding of these mutations helps medical researchers discover how DNA mutations affect disorders. In the field of pharmacogenomics, knowledge of these mutations can help doctors choose which drugs to prescribe in order to avoid dangerous side effects while maintaining potency. However, the first step in any of these processes is alignment with the Needleman-Wunsch algorithm. Especially when sequences are long (one human chromosome is over 200 million base pairs long), optimization is very important for genomic research.
Under the direction of my mentor, Dr. Warnick, I began my research by implementing the standard Needleman-Wunsch alignment algorithm. After discussing computer science optimization techniques, I implemented these techniques and was interested in the results. Using the classic optimization techniques as a springboard, I also implemented several other optimization techniques. I developed an application to allow users to align sequences using the different techniques, while comparing accuracy, speed, and efficiency of the different methods.
As a complete analysis package, my application compared the original Needleman-Wunsch alignment against three optimized methods: the Pruner, Branch and Bound, and Max Gap. The differences between these optimization techniques, as well as their comparative advantages, are discussed below.
At a high level, the vanilla Needleman-Wunsch algorithm computes the best alignment by assuming that the best alignment of length k is built from the best alignment of length k-1 plus the best way to handle the kth item. The best alignment at each point is determined by a score computed from the number of matches, mismatches, and gaps in the alignment. Although there may be multiple best-scoring alignments, these alignments are assumed to be equal in quality. The accuracy of each optimization was computed by comparing the score of the alignment produced by the optimized algorithm against the score of the alignment produced by the vanilla Needleman-Wunsch algorithm.
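The recurrence described above can be sketched in Python. This is a minimal illustration rather than the implementation used in the study; the scoring values (match = +1, mismatch = -1, gap = -1) are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of strings a and b."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap          # a[:i] aligned against all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap          # b[:j] aligned against all gaps
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # pair a[i-1] with b[j-1]
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a
    return score[n][m]
```

For example, `needleman_wunsch("ACGT", "AGT")` returns 2, corresponding to the alignment ACGT over A-GT (three matches and one gap).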
The Pruner method built the alignment step by step: at each step, as the complete alignment was gradually assembled, any partial alignment whose score fell below the score of a naive alignment was not pursued. This resulted in an efficiency (runtime relative to the vanilla algorithm) of between 48% in the best case and 58% in the worst case, with 100% accuracy. In other words, this optimization computed the optimum alignment approximately twice as fast as the vanilla algorithm.
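The pruning idea can be illustrated as a depth-first search over alignments that abandons any branch whose best possible completion cannot beat a cheap naive baseline. This is a hypothetical sketch of the approach, not the study's code; the scoring constants and the choice of baseline are assumptions.

```python
MATCH, MISMATCH, GAP = 1, -1, -1   # assumed scoring values

def naive_score(a, b):
    """Baseline: pair characters position-by-position, gap out the tail."""
    paired = sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))
    return paired + GAP * abs(len(a) - len(b))

def pruner_align(a, b):
    """Optimal alignment score, pruning branches that cannot beat the baseline."""
    best = [naive_score(a, b)]      # the optimum can never be worse than this

    def search(i, j, score):
        ra, rb = len(a) - i, len(b) - j
        # Optimistic completion: every remaining character matches, and only
        # the unavoidable length difference is gapped.
        if score + MATCH * min(ra, rb) + GAP * abs(ra - rb) < best[0]:
            return                  # prune: branch cannot reach the baseline/best
        if ra == 0 and rb == 0:
            best[0] = max(best[0], score)
            return
        if ra and rb:
            search(i + 1, j + 1, score + (MATCH if a[i] == b[j] else MISMATCH))
        if ra:
            search(i + 1, j, score + GAP)   # gap in b
        if rb:
            search(i, j + 1, score + GAP)   # gap in a

    search(0, 0, 0)
    return best[0]
```

Because the optimistic bound never underestimates what a branch can achieve, the pruned search still returns the same score as the full dynamic program.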
The Branch and Bound optimization is a standard computer science technique: it discards candidate alignments whose scores fall below that of the best alignment found so far. When applied to the Needleman-Wunsch alignment problem, it also found the optimum alignment with 100% accuracy. It outperformed the Pruner optimization in speed as well, with an efficiency of between 33% and 45%, meaning that when aligning sequences from closely related species, the algorithm ran about three times as fast. I also experimented with a variation of Branch and Bound that I thought would be especially applicable to the DNA alignment problem: it incorporated the biological assumption that the two aligned sequences would be similar up to a certain threshold. Unfortunately, this variation did not yield any significant advantage.
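Branch and Bound differs from the Pruner mainly in where the bound comes from: instead of a fixed naive baseline, each branch is compared against the best complete alignment found so far, so the bound tightens as the search progresses. A hedged sketch under assumed scoring constants:

```python
def branch_and_bound_align(a, b, match=1, mismatch=-1, gap=-1):
    """Optimal alignment score via depth-first search with a best-so-far bound."""
    best = [float("-inf")]          # best complete alignment score found so far

    def search(i, j, score):
        ra, rb = len(a) - i, len(b) - j
        # Optimistic completion: all remaining characters match, and only the
        # unavoidable length difference is gapped.
        if score + match * min(ra, rb) + gap * abs(ra - rb) <= best[0]:
            return                  # bound: this branch cannot beat the current best
        if ra == 0 and rb == 0:
            best[0] = score         # new best complete alignment
            return
        if ra and rb:
            search(i + 1, j + 1, score + (match if a[i] == b[j] else mismatch))
        if ra:
            search(i + 1, j, score + gap)   # gap in b
        if rb:
            search(i, j + 1, score + gap)   # gap in a

    search(0, 0, 0)
    return best[0]
```

For similar sequences a good alignment is found early, so most of the remaining branches are cut off almost immediately, which matches the observation above that the speedup is largest for closely related species.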
The Max Gap alignment was likewise based on the assumption that the user aligning the sequences can estimate how closely they are related. Because of this assumption, the algorithm did not arrive at the best alignment every time: the user may have assumed the sequences were more similar than they actually were. As long as the user's estimate was generous enough, however, the algorithm produced the correct alignment; if the user underestimated the similarity, the algorithm returned a failed result. What this optimization lacked in accuracy, it made up for in speed. It ran with an efficiency of around 1% to 3%, depending on user input. Compared to the other optimizations, the Max Gap optimization allowed almost instantaneous alignment in linear time.
I learned a lot about optimization techniques, and was pleased with the efficiency results I was able to achieve. I was invited to present my research at the Utah Conference on Undergraduate Research. This was a great learning experience, and I learned the importance of simplifying research so that a general audience can understand it. I used these lessons as I prepared for the Spring Research Conference through the BYU College of Physical and Mathematical Sciences. Because I was able to make my research appealing to a general audience, I was selected to present in the general session, to which high school students and the general public were invited.
As I look back, the lessons I have learned about algorithm optimization, working with a mentor, doing research, and presenting results are invaluable. I am grateful to the Office of Research and Creative Activities and my advisor, Dr. Warnick, for their contributions to this research experience.