Zachary Liechty and Joshua Udall, Plant and Wildlife Sciences
Many plants undergo polyploidization events throughout their history, meaning their genome doubles [1]. The goal of this project was to identify how these polyploidization events lead to changes in gene expression at the nucleotide level. Polyploidization events provide raw material for natural selection to act upon, allowing evolution to occur. With four copies of a gene instead of two, mutations can occur or expression levels can change without too great an effect on the plant's well-being. This project examines factors related to changes in gene expression between subgenomes after a polyploidization event in different cotton species. We first hypothesized that the expression bias was caused by insertions or deletions in the promoter region of the genes, meaning the copy of the gene on the less-expressed subgenome would have more insertions and deletions in its promoter. By understanding expression bias, we can better understand how evolution affects plants after polyploidization events.
Four accessions of tetraploid cotton (Gossypium tomentosum, two accessions of G. hirsutum, and a hybrid between the A and D diploid species) had previously been sequenced using whole genome shotgun sequencing (WGS). This method is good for large-scale drafts of a genome, but often does not give the precision necessary for analyzing insertions or deletions. Therefore, Sanger sequencing was used to resolve the promoter regions of interest more precisely.
Nineteen genes were selected that had an expression bias between the two copies of the genome. Primers were designed to flank the promoter region of each gene, and the region was amplified. The promoters were then ligated into a viral vector carrying ampicillin resistance and transformed into E. coli. Colonies were grown on ampicillin-containing plates for 24 hours, and bacterial colonies containing vectors were selected and sequenced using Sanger sequencing. Since the primers would amplify both copies in the tetraploid genome, this transformation process was used to ensure that the two subgenomes were separated for sequencing.
After Sanger sequencing, the results were first compared to the previously generated whole genome shotgun sequence. The comparisons revealed that, as suspected, much of the WGS data did not include insertions or deletions present in the Sanger sequences. However, the WGS data did capture some of the insertions and deletions. This led us to the conclusion that WGS can be used to identify insertions and deletions on a large scale in general terms, but will not produce data as exact as Sanger sequencing.
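The comparison described above can be illustrated with a minimal Python sketch. It assumes each insertion or deletion has already been reduced to a (position, type, length) tuple in shared promoter coordinates; the function name `indel_recall` and all values are hypothetical and are not data from this study.

```python
def indel_recall(sanger_indels, wgs_indels):
    """Return the fraction of Sanger-confirmed indels also found in the WGS calls.

    Both arguments are sets of (position, type, length) tuples.
    Returns None when there are no Sanger indels to compare against.
    """
    if not sanger_indels:
        return None
    shared = sanger_indels & wgs_indels
    return len(shared) / len(sanger_indels)


# Hypothetical indels for one promoter region (illustrative values only).
sanger = {(112, "del", 3), (240, "ins", 1), (388, "del", 12), (401, "del", 2)}
wgs = {(112, "del", 3), (401, "del", 2), (510, "ins", 4)}

recall = indel_recall(sanger, wgs)   # share of Sanger indels recovered by WGS
missed = sorted(sanger - wgs)        # Sanger indels absent from the WGS calls
print(recall, missed)
```

Exact tuple matching is a simplification; in real alignments the same indel can be reported at slightly different positions, so a tolerance window would likely be needed.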
The main use of the Sanger sequences was to confirm the presence of insertions and deletions in one of the subgenomes but not the other. The Sanger sequences were separated into the A and D subgenomes using homoeo-SNPs (single nucleotide polymorphisms that are unique to a subgenome). The comparison of 14 of those genes revealed 6 cases where the subgenome with more insertions and deletions was the lower-expressed subgenome, as expected. Three cases had the expression flipped, and 5 cases had no significant insertions or deletions in the sequenced region.
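One way to picture the homoeo-SNP separation step is the following Python sketch. It is only an illustration under stated assumptions: the clone sequences are already aligned to a common coordinate system, the diagnostic positions and bases are known in advance, and a simple majority vote decides the subgenome. The function `assign_subgenome`, the positions, and the sequences are all hypothetical.

```python
from collections import Counter


def assign_subgenome(read, homoeo_snps):
    """Assign an aligned read to subgenome 'A' or 'D' by majority vote.

    read: aligned sequence string.
    homoeo_snps: dict mapping 0-based alignment position -> (A_base, D_base).
    Returns 'A', 'D', or 'ambiguous' when the votes tie or no site is informative.
    """
    votes = Counter()
    for pos, (a_base, d_base) in homoeo_snps.items():
        if pos >= len(read):
            continue  # read does not cover this diagnostic site
        base = read[pos].upper()
        if base == a_base:
            votes["A"] += 1
        elif base == d_base:
            votes["D"] += 1
    if votes["A"] > votes["D"]:
        return "A"
    if votes["D"] > votes["A"]:
        return "D"
    return "ambiguous"


# Hypothetical diagnostic sites and clone sequences (illustrative only).
homoeo_snps = {2: ("A", "G"), 5: ("C", "T"), 8: ("T", "C")}
clones = {
    "clone1": "TTACGCAATG",
    "clone2": "TTGCGTAACG",
    "clone3": "TTACGTAANG",
}
assignments = {name: assign_subgenome(seq, homoeo_snps) for name, seq in clones.items()}
print(assignments)
```

Reads that carry a mix of A and D alleles are reported as ambiguous rather than forced into a subgenome, since such clones could reflect PCR recombination or sequencing error.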
This data suggests that insertions and deletions in the promoter region could be one cause of the expression bias, but probably not the only one. The expression bias could also be caused by insertions and deletions occurring downstream of the gene coding region, at other activation sites, or at splice regions within the gene. These alternatives are currently being investigated.
To examine this bias on a larger scale, WGS data was compared among 478 genes that were biased the same way in all four accessions of cotton. The promoter regions were extracted, and the number of insertions and deletions was compared between the two subgenomes. The WGS data was aligned to the subgenome and tetraploid genome references, and the number of deletions was calculated using Indelible (from the Bambam suite of bioinformatic tools developed by our lab). However, depending on which reference genome we used (the A diploid, the D diploid, or the tetraploid), the number of deletions was skewed in that genome's favor. We had assumed the counts would be approximately the same regardless of reference, but this was not the case. Adjusted for this bias, the data confirmed our hypothesis: the subgenome that had lower expression contained, on average, more deletions in the promoter regions of biased genes. Recognizing this skew in the references has helped us do other analyses more accurately as well.
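The idea of counting deletions within a promoter window can be sketched in Python as follows. This is not the Indelible tool and does not reproduce its method or the reference-bias adjustment; it is a simplified illustration that assumes each aligned read is given as a 0-based reference start position plus a SAM-style CIGAR string, and it collapses identical deletions seen in several reads into one event. The reads and window are hypothetical.

```python
import re

# SAM CIGAR operations; only M, D, N, = and X advance the reference position.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")


def deletion_events(alignments, win_start, win_end):
    """Collect unique deletions starting inside [win_start, win_end).

    alignments: iterable of (ref_start, cigar) pairs, 0-based reference coordinates.
    Returns a set of (ref_position, length) tuples, one per distinct deletion.
    """
    events = set()
    for ref_start, cigar in alignments:
        ref_pos = ref_start
        for length, op in CIGAR_OP.findall(cigar):
            length = int(length)
            if op == "D" and win_start <= ref_pos < win_end:
                events.add((ref_pos, length))
            if op in REF_CONSUMING:
                ref_pos += length
    return events


# Hypothetical reads assigned to each subgenome for one promoter window.
a_reads = [(100, "20M2D30M"), (105, "15M2D40M"), (90, "5M1I10M3D40M")]
d_reads = [(100, "50M"), (110, "10M1D30M")]

a_events = deletion_events(a_reads, 100, 160)
d_events = deletion_events(d_reads, 100, 160)
print(len(a_events), len(d_events))
```

Because deletions are called relative to whichever reference the reads were aligned to, counts like these inherit the reference skew described above, so any comparison between subgenomes would still need to be corrected for it.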
All of this data consists of observations, and once we have thoroughly investigated more potential causes of expression bias (the region after the gene, splice regions, etc.), we can move on to experimentation. Promoter regions from each subgenome will be fused to green or yellow fluorescent protein genes and inserted into Arabidopsis thaliana. Expression will be quantified based on the fluorescence of both colors, to confirm that the promoter region is what truly influences the expression bias.
In summary, both the WGS and Sanger data identified deletions in the promoter region of one subgenome as a potential cause of its lower expression relative to the other subgenome. The alignments to the different reference genomes are greatly skewed towards deletions in the opposite subgenome, and results must be adjusted accordingly. This information will be very useful for other studies done in the lab. To continue this experiment, the promoter regions will be transformed into A. thaliana, and expression will be quantified to confirm that the promoter region causes the effect. Once all the alternatives mentioned above have been quantified, we will have a much greater understanding of polyploidization, an important event in many plants' history.
[1] Polyploidization events can occur either by the genomic data of a cell replicating itself (autotetraploid) or through the merging of the genomic data of two related species (allotetraploid), creating two “subgenomes” in the species, one from each of the diploid genomes. All of the cotton species used in this experiment were allotetraploid between the A and D diploid genomes.