Colin Rogerson and Dr. W. Evan Johnson, Department of Statistics
Our research project originally had two goals: first, to identify the likely binding sites of the Estrogen Receptor transcription factor, and second, to identify likely co-factors that interact with Estrogen Receptor in the binding process. We planned to do this with a computational algorithm that scans DNA sequences and uses the position weight matrices of many transcription factors to statistically identify likely binding sites and co-factors.
This type of research is useful because it gives researchers a means of accurately predicting the binding behavior of specific transcription factors without expensive and time-consuming laboratory methods. The only requirement is access to the position weight matrix of the transcription factor of interest. This is one reason why the funds provided by the ORCA grant proved so helpful. We were able to take some of the funds and, in cooperation with a biology lab on campus, purchase a subscription to the TRANSFAC biological database. This proved to be an invaluable tool in our research. Before this, we were limited to the one hundred and twenty position weight matrices made available to the public through the JASPAR database. With the TRANSFAC subscription, we gained access to nearly a thousand.
Due to the reasonable success we experienced in our studies of Estrogen Receptor, my mentor, Dr. W. Evan Johnson, received an offer to continue his research projects at the Huntsman Cancer Institute in Salt Lake City. A few members of his lab (myself included) were offered internships there to further the research, and we accepted.
As our research commenced, I was assigned to the Susan Mango lab. Consequently, my studies shifted from the human genome to that of the nematode species C. elegans. C. elegans is an ideal model organism for genomic studies because it is easy to maintain, has a short lifespan, and is inexpensive to breed. By studying the behavior of transcription factors in this model organism, we hope to increase our understanding of transcription factors in humans as well.
We began our studies by applying the same method we had used in our Estrogen Receptor studies. We obtained C. elegans DNA sequences and scanned them for consensus binding sites using position weight matrices and our computational algorithm. The results were unremarkable and closely resembled those of previous studies that used earlier methods.
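The scanning step described above can be illustrated with a minimal sketch. It is not the algorithm used in this project; it assumes a simple log-odds position weight matrix built from base counts, a uniform background frequency of 0.25, and a fixed score threshold. The names `log_odds_matrix`, `scan`, and `toy_counts`, along with the toy motif and sequence, are hypothetical and chosen only for illustration.

```python
import math

BASES = "ACGT"

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Convert per-position base counts into log2 odds against a uniform background."""
    matrix = []
    for column in counts:
        total = sum(column[b] for b in BASES) + 4 * pseudocount
        matrix.append({
            b: math.log2(((column[b] + pseudocount) / total) / background)
            for b in BASES
        })
    return matrix

def scan(sequence, matrix, threshold):
    """Return (start, score) for every window scoring at or above the threshold."""
    width = len(matrix)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Skip windows containing ambiguous bases such as N.
        if any(base not in BASES for base in window):
            continue
        score = sum(matrix[i][base] for i, base in enumerate(window))
        if score >= threshold:
            hits.append((start, score))
    return hits

# Hypothetical 4-column count matrix that strongly prefers the motif TGAC.
toy_counts = [
    {"A": 0, "C": 0, "G": 0, "T": 10},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
]
toy_matrix = log_odds_matrix(toy_counts)

# Made-up sequence containing the motif at positions 2 and 8 (0-based).
hits = scan("AATGACCGTGACTT", toy_matrix, threshold=4.0)
for start, score in hits:
    print(start, round(score, 2))
```

A real scan would also cover the reverse strand and would likely choose the threshold per matrix; both are left out here to keep the sketch short.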
As we refined our algorithm, searching for ways to improve its accuracy in identifying significant transcription factors in C. elegans, the head of the lab at Huntsman pointed us to a factor she had found to carry significant weight in determining which binding sites matter: the level of conservation of the C. elegans genome as compared to other nematode species. For example, a consensus sequence found at the same chromosomal position in C. elegans and in other nematode species will probably have a greater impact on the expression of the gene than a consensus sequence found in C. elegans and in none of the other nematode genomes.
We spent a few months at Huntsman obtaining the conservation ‘scores’ for every nucleotide position in the C. elegans genome. The University of California, Santa Cruz (UCSC) Genome Browser provides what is known as a MultiZ alignment of six nematode species, including C. elegans. We used this data to incorporate the conservation level of the DNA sequences into our algorithm, making it far more accurate in its predictive power.
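This report does not spell out how the conservation scores enter the algorithm, so the following is only one plausible sketch. It assumes per-nucleotide conservation values between 0 and 1 and multiplies each positive binding-site score by the mean conservation across the site. The function name `conservation_weighted_hits`, the `min_conservation` cutoff, and all the numbers are hypothetical.

```python
def conservation_weighted_hits(hits, conservation, width, min_conservation=0.0):
    """Rescale each (start, score) hit by the mean conservation over its window.

    Assumes scores are positive, so that lower conservation always lowers
    the weighted score. Returns (start, weighted_score, mean_conservation).
    """
    weighted = []
    for start, score in hits:
        window = conservation[start:start + width]
        if len(window) < width:
            continue  # hit runs past the end of the conservation track
        mean_cons = sum(window) / width
        if mean_cons >= min_conservation:
            weighted.append((start, score * mean_cons, mean_cons))
    return weighted

# Made-up example: two equally scoring 4-base sites, only the first conserved.
toy_hits = [(2, 6.6), (8, 6.6)]
toy_conservation = [0.1] * 2 + [0.9] * 4 + [0.1] * 8

weighted = conservation_weighted_hits(toy_hits, toy_conservation, width=4)
for start, score, mean_cons in weighted:
    print(start, round(score, 2), round(mean_cons, 2))
```

In this toy case the two sites have identical matrix scores, and the conserved one ranks higher after weighting, which is the behavior the paragraph above describes.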
We had planned to then use this new computational method to identify useful information for the biologists at Huntsman, and thereby obtain publication. However, when we completed the algorithm and presented it to Susan Mango, the head of the lab, she insisted that we could publish a paper solely on our new analysis method.
Over the past semester, we improved our method and made it available to the public through a website. We call our new method Promatch. We have scanned the entire C. elegans genome for the transcription factors that have position weight matrices in TRANSFAC. Promatch takes as input a list of genes of interest, then scans the database to determine whether each transcription factor is over-represented in the submitted list of genes as compared to a list of randomly selected genes.
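The statistic Promatch uses for this comparison is not given here. As a stand-in, the sketch below uses a one-sided hypergeometric test, which gives the probability that a randomly selected gene list of the same size would contain at least as many genes with a predicted site. The functions `hypergeom_upper_tail` and `over_representation` and the toy gene names are hypothetical.

```python
from math import comb

def hypergeom_upper_tail(k, K, n, N):
    """P(X >= k) when n genes are drawn at random from N, of which K carry the site."""
    upper = min(K, n)
    total = comb(N, n)
    return sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1)) / total

def over_representation(submitted, genes_with_site, background):
    """Return (observed count, p-value) for one transcription factor."""
    background = set(background)
    submitted = set(submitted) & background
    with_site = set(genes_with_site) & background
    observed = len(submitted & with_site)
    p_value = hypergeom_upper_tail(
        observed, len(with_site), len(submitted), len(background)
    )
    return observed, p_value

# Made-up background of 100 genes, 10 of which carry a predicted site.
genome = [f"g{i}" for i in range(100)]
site_genes = genome[:10]

# Made-up submitted list: 5 of its 10 genes carry the site.
submitted_list = genome[:5] + genome[50:55]

observed, p_value = over_representation(submitted_list, site_genes, genome)
print(observed, p_value)
```

A small p-value suggests that the factor's sites occur in the submitted list more often than chance would predict. Testing many factors at once would also call for a multiple-testing correction, which this sketch omits.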
The paper is nearing completion and will be submitted to the journal Genome Research for publication.