Brian Holt and Dr. Dennis Tolley, Department of Statistics
Per the original ORCA proposal, we have worked on estimating the relative amounts of compounds from gas chromatography-mass spectrometry (GC-MS) data using an asymmetric penalized likelihood function. The initial results of this project were presented at the CPMS Student Research Conference in March of this year [1]. The results consist of a simulation study in which we simulate co-eluting (overlapping) compounds and apply both basic regression techniques and the asymmetric penalty function to compare their performance. Under certain conditions the new penalty function has less bias, but overall it is unstable: many parameters must be optimized, which leads to large variation in the estimates. Further work might involve optimizing these parameters according to some selection criterion.
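A minimal sketch of this kind of simulation is given here for illustration. It is not the penalty function used in the study, whose exact form is not stated in this report. As assumptions, two co-eluting compounds are modeled as overlapping Gaussian peaks with known shapes, "basic regression" is taken to be ordinary least squares on the two peak amplitudes, and the asymmetric penalty is stood in for by an asymmetric squared-error loss that weights over-prediction more heavily. The names `gaussian_peak`, `weighted_fit`, `asymmetric_fit`, and the tuning parameter `over_weight` are hypothetical.

```python
import math
import random

def gaussian_peak(t, center, width):
    """Unit-height Gaussian peak shape evaluated at time t (assumed peak model)."""
    return math.exp(-0.5 * ((t - center) / width) ** 2)

def weighted_fit(y, g1, g2, w):
    """Weighted least squares for y ~ a1*g1 + a2*g2 via the 2x2 normal equations."""
    s11 = sum(wi * p * p for wi, p in zip(w, g1))
    s12 = sum(wi * p * q for wi, p, q in zip(w, g1, g2))
    s22 = sum(wi * q * q for wi, q in zip(w, g2))
    r1 = sum(wi * p * yi for wi, p, yi in zip(w, g1, y))
    r2 = sum(wi * q * yi for wi, q, yi in zip(w, g2, y))
    det = s11 * s22 - s12 * s12
    return ((s22 * r1 - s12 * r2) / det, (s11 * r2 - s12 * r1) / det)

def asymmetric_fit(y, g1, g2, over_weight=5.0, iterations=50):
    """Asymmetric squared-error fit by iteratively reweighted least squares.

    Points where the fitted curve lies above the data get weight `over_weight`;
    all other points get weight 1. This is an illustrative stand-in only.
    """
    a1, a2 = weighted_fit(y, g1, g2, [1.0] * len(y))
    for _ in range(iterations):
        w = [over_weight if a1 * p + a2 * q > yi else 1.0
             for p, q, yi in zip(g1, g2, y)]
        a1, a2 = weighted_fit(y, g1, g2, w)
    return a1, a2

# Two overlapping peaks with true relative amounts 2.0 and 1.0 (made-up values).
times = [0.05 * i for i in range(400)]
g1 = [gaussian_peak(t, 10.0, 0.5) for t in times]
g2 = [gaussian_peak(t, 10.8, 0.5) for t in times]
true_a1, true_a2 = 2.0, 1.0
clean = [true_a1 * p + true_a2 * q for p, q in zip(g1, g2)]

random.seed(0)
noisy = [c + random.gauss(0.0, 0.05) for c in clean]

ols_estimate = weighted_fit(noisy, g1, g2, [1.0] * len(noisy))
asym_estimate = asymmetric_fit(noisy, g1, g2)
print("ordinary least squares:", ols_estimate)
print("asymmetric loss:       ", asym_estimate)
```

Repeating the noisy fit over many seeds and over a grid of `over_weight` values would give bias and variance estimates for each method, which is one way to see the sensitivity to tuning parameters described above.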
Because of the instability of the asymmetric penalty function approach, we are investigating another important aspect of the problem: GC-MS biomarker identification and selection. Extensive work has been done to identify peaks in GC-MS output that are important for differentiating among genetically “close” species. Current work involves creating an automated algorithm that identifies all possible peaks in many samples and compares them to find the peak or combination of peaks that best differentiates the sample groups. An automated algorithm is needed because the volume of data is large and human eyes cannot always detect important differences in large chromatograms.
This project is broken into three tasks. The first task is to uniformly and automatically identify peaks that will be comparable across all samples. For example, if one chromatogram has three peaks at times 3, 8, and 12, then we must measure at times 3, 8, and 12 in all chromatograms for comparability. This has proven difficult because the chromatogram output is noisy, which leads to false identification of peaks. Using a nonparametric counting method, we obtain relatively stable results in peak selection. Figure 1 below demonstrates the difficulty in identifying useful peaks.
The second task is to apply supervised variable (feature) selection techniques to determine which peak or combination of peaks is most important for sample differentiation. This is an ongoing project, but standard statistical classification methods such as k-nearest neighbors, random forests, stochastic gradient boosting, and support vector machines (SVM) are being used to determine the best way to apply variable selection to GC-MS data. These nonparametric methods are often used in automated machine learning tasks such as this one, and each has strengths and weaknesses. The simplest is k-nearest neighbors, which is used mainly as a baseline for comparison, while random forests, boosting, and SVM are relatively new methods that have shown great success in contemporary problems.
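The first task can be illustrated with a small sketch. The report does not describe the nonparametric counting method in detail, so the version below is an assumption: each chromatogram votes for the time indices where it has a strict local maximum, and only indices that receive votes from a minimum fraction of chromatograms are kept. All chromatograms are then measured at those shared indices. The function names, the `min_fraction` threshold, and the toy data are hypothetical.

```python
def local_maxima(signal, half_window=1):
    """Indices whose value is the unique maximum within +/- half_window points."""
    peaks = []
    for i in range(half_window, len(signal) - half_window):
        window = signal[i - half_window:i + half_window + 1]
        if signal[i] == max(window) and window.count(signal[i]) == 1:
            peaks.append(i)
    return peaks

def consensus_peaks(chromatograms, min_fraction=0.6, half_window=1):
    """Count, per time index, how many chromatograms peak there; keep frequent ones."""
    counts = [0] * len(chromatograms[0])
    for signal in chromatograms:
        for i in local_maxima(signal, half_window):
            counts[i] += 1
    threshold = min_fraction * len(chromatograms)
    return [i for i, c in enumerate(counts) if c >= threshold]

def peak_table(chromatograms, peak_indices):
    """Measure every chromatogram at the same indices so samples are comparable."""
    return [[signal[i] for i in peak_indices] for signal in chromatograms]

def make_chromatogram(heights, spikes=None, length=18):
    """Toy signal: triangular peaks at times 3, 8, and 12 plus optional noise spikes."""
    signal = [0.0] * length
    for center, h in zip((3, 8, 12), heights):
        signal[center] = h
        signal[center - 1] = signal[center + 1] = h / 2
    for index, value in (spikes or {}).items():
        signal[index] = value
    return signal

chromatograms = [
    make_chromatogram((10.0, 6.0, 8.0)),
    make_chromatogram((9.0, 7.0, 5.0), spikes={6: 4.0}),   # spurious peak at time 6
    make_chromatogram((11.0, 5.0, 9.0), spikes={15: 2.0}),  # spurious peak at time 15
]

shared = consensus_peaks(chromatograms)
table = peak_table(chromatograms, shared)
print(shared)
print(table)
```

In this toy example the spurious spikes each appear in only one chromatogram and so fall below the vote threshold. The sketch matches peaks by exact index, which real data would not support if retention times drift between runs; some alignment or tolerance window would be needed in that case.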
The third task is to verify each potential biomarker obtained from the variable selection process against the analytical chemistry literature or with professional analytical chemists. Even if these potential biomarkers are verified by experts, they must also be applied to test data to ascertain their robustness in a prediction setting. Some work has been done on the second and third tasks of identifying and verifying biomarkers, but it is not automated and is therefore subject to human error and oversight. The goal is to simulate the intelligence of a professional analytical chemist in a fast, simple, and effective way that avoids human error. Work will continue on these tasks at least until graduation, planned for April 2013.
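The selection and test-data steps of the second and third tasks can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the project: candidate peaks are ranked one at a time by the leave-one-out accuracy of a k-nearest-neighbors classifier (the baseline method named above), and the top-ranked peak is then checked on held-out samples. The function names, the species labels, and all peak intensities are made up for the example.

```python
def project(rows, features):
    """Keep only the selected peak columns of each sample."""
    return [[row[j] for j in features] for row in rows]

def knn_predict(train_x, train_y, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    by_distance = sorted(
        zip(train_x, train_y),
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )
    votes = {}
    for _, label in by_distance[:k]:
        votes[label] = votes.get(label, 0) + 1
    return max(sorted(votes), key=votes.get)

def holdout_accuracy(train_x, train_y, test_x, test_y, features, k=3):
    """Fraction of test samples classified correctly using only `features`."""
    train_proj = project(train_x, features)
    test_proj = project(test_x, features)
    hits = sum(knn_predict(train_proj, train_y, q, k) == label
               for q, label in zip(test_proj, test_y))
    return hits / len(test_y)

def leave_one_out_accuracy(x, y, features, k=3):
    """Classify each sample from all the others and average the results."""
    total = 0.0
    for i in range(len(x)):
        total += holdout_accuracy(x[:i] + x[i + 1:], y[:i] + y[i + 1:],
                                  [x[i]], [y[i]], features, k)
    return total / len(x)

# Rows are samples, columns are intensities at three shared peak times (toy data).
train_x = [
    [10.0, 2.0, 5.0], [9.0, 2.5, 7.0], [11.0, 1.5, 6.0], [10.5, 2.2, 5.5],
    [10.2, 6.0, 6.5], [9.5, 6.5, 5.2], [10.8, 5.5, 7.1], [9.8, 6.2, 5.8],
]
train_y = ["A", "A", "A", "A", "B", "B", "B", "B"]
test_x = [[10.1, 2.1, 6.8], [9.7, 5.9, 5.1], [10.6, 1.8, 5.9], [10.0, 6.4, 6.9]]
test_y = ["A", "B", "A", "B"]

# Rank single peaks by leave-one-out accuracy on the training samples only.
scores = [leave_one_out_accuracy(train_x, train_y, [j]) for j in range(3)]
best_peak = max(range(3), key=lambda j: scores[j])

# Check the selected peak on samples that played no part in the selection.
test_accuracy = holdout_accuracy(train_x, train_y, test_x, test_y, [best_peak])
print(best_peak, scores, test_accuracy)
```

Keeping the test samples out of the ranking step matters: a peak chosen and evaluated on the same samples would look more robust than it is. Combinations of peaks could be scored the same way by passing more than one index in `features`.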