Increasing the Accuracy of Molecular Biomarkers via Evidence-based Algorithm Selection

Stephen R. Piccolo, Biology

This is a final report for a Mentoring Environment Grant that Brigham Young University awarded to me in 2016. Below is a summary of the project that this grant enabled students in my research lab to perform, as well as information about how the funds were used.

Research Project

In making medical decisions, physicians need observable criteria that they can use to make accurate diagnoses, determine optimal treatments, and estimate a patient’s prognosis. The promise of precision medicine is that molecular-level observations can more accurately predict such information than traditional observations. I worked with undergraduate students at Brigham Young University to develop ways to improve the accuracy of such predictions by optimizing computer algorithms to handle the complexity of large, molecular data sets.

We collected 50 gene-expression datasets from the public domain. Each dataset contained 10,000+ gene-expression measurements for at least 50 patients, most of whom had developed some form of cancer. Each dataset characterized at least two distinct categories of patient (for example, patients who responded well to a given chemotherapy treatment and those who did not). We used 50 different classification algorithms, developed by machine-learning researchers, to evaluate our ability to distinguish between patients in these groups based on 1) their gene-expression profiles, 2) their clinical and demographic characteristics, or 3) both. Our goals were to evaluate what levels of accuracy we could attain using these algorithms and to identify which algorithms perform particularly well at distinguishing such groups. We found that some algorithms, especially kernel-based methods and resampling-based methods, performed consistently better, on average, than alternative methods. As might be expected, we also found that some types of groups were easier to classify than others. For example, it was easier to distinguish between individuals who did or did not harbor a certain type of (non-gene-expression) molecular marker than it was to distinguish between individuals who would survive for a relatively short or long period of time.
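The shape of such a benchmark can be illustrated with a minimal sketch. It is not the code used in this project and does not use ShinyLearner. It assumes synthetic data in place of the real datasets, and it substitutes two simple stand-in classifiers (a majority-class baseline and a nearest-centroid method) for the 50 algorithms we evaluated. All function names are hypothetical. The sketch compares each classifier on expression features, clinical features, and both, using k-fold cross-validation.

```python
import random
from statistics import mean


def make_patients(n=60, n_genes=20, seed=0):
    """Synthetic stand-in for one dataset: (expression, clinical, label) per patient."""
    rng = random.Random(seed)
    patients = []
    for i in range(n):
        label = i % 2  # two balanced patient groups
        # Assumption: only the first three "genes" carry signal; the rest are noise.
        expression = [rng.gauss(1.5 * label if g < 3 else 0.0, 1.0) for g in range(n_genes)]
        clinical = [rng.gauss(0.5 * label, 1.0)]  # one weakly informative variable
        patients.append((expression, clinical, label))
    rng.shuffle(patients)
    return patients


def majority_class(train_x, train_y, test_x):
    """Baseline: always predict the most common training label."""
    most_common = max(set(train_y), key=train_y.count)
    return [most_common] * len(test_x)


def nearest_centroid(train_x, train_y, test_x):
    """Predict the class whose mean feature vector is closest (squared Euclidean)."""
    centroids = {}
    for label in set(train_y):
        rows = [x for x, y in zip(train_x, train_y) if y == label]
        centroids[label] = [mean(column) for column in zip(*rows)]

    def dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    return [min(centroids, key=lambda c: dist(x, centroids[c])) for x in test_x]


def cross_validate(classifier, xs, ys, k=5):
    """Mean accuracy over k folds; every sample is held out exactly once."""
    accuracies = []
    for fold in range(k):
        test_idx = set(range(fold, len(xs), k))
        train_x = [x for i, x in enumerate(xs) if i not in test_idx]
        train_y = [y for i, y in enumerate(ys) if i not in test_idx]
        test_x = [xs[i] for i in sorted(test_idx)]
        test_y = [ys[i] for i in sorted(test_idx)]
        predictions = classifier(train_x, train_y, test_x)
        correct = sum(p == t for p, t in zip(predictions, test_y))
        accuracies.append(correct / len(test_y))
    return mean(accuracies)


def benchmark(patients):
    """Score every (feature set, classifier) pair on the same patients."""
    ys = [label for _, _, label in patients]
    feature_sets = {
        "expression": [e for e, _, _ in patients],
        "clinical": [c for _, c, _ in patients],
        "both": [e + c for e, c, _ in patients],
    }
    classifiers = {"majority_class": majority_class, "nearest_centroid": nearest_centroid}
    return {
        (features, name): cross_validate(clf, xs, ys)
        for features, xs in feature_sets.items()
        for name, clf in classifiers.items()
    }


results = benchmark(make_patients())
for (features, algorithm), accuracy in sorted(results.items()):
    print(f"{features:<12}{algorithm:<18}{accuracy:.2f}")
```

Holding the patients and folds fixed across every feature set and classifier means that differences in accuracy reflect the algorithm or the input data rather than the split. A real benchmark would also repeat the procedure across many datasets and random splits before drawing conclusions.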

We used cloud-computing servers, BYU’s Fulton Supercomputing Lab, and computers in the Piccolo Lab to execute this large-scale analysis. In doing so, students gained experience working with real-world research data, an understanding of machine-learning algorithms and how to apply them, experience with high-performance computing environments, and practice doing research and reporting on it. Three research papers describing these efforts are in draft phase and will soon be submitted to peer-reviewed journals. In addition, an open-source software tool called ShinyLearner has been developed and is available in the public domain. These projects have also opened new avenues of discovery for future projects and grant proposals that the Piccolo Lab will pursue.

Mentoring Approach

A key to my success as a researcher at BYU is drawing on undergraduate students’ bioinformatics skills. Accordingly, I used 75% of the funds from this grant to pay undergraduate student salaries. I used the remaining funds for conference travel and supplies.

As a mentor, I aim to strike a balance between providing careful oversight and fostering independence. Except in unusual circumstances, I meet with each of my students, one on one, at least once per week. At the beginning of a semester, we schedule a recurring appointment so we can meet to discuss the student’s project. These meetings have been incredibly effective at enabling students to make steady progress, helping them to work through any barriers they are facing, and fostering relationships of trust with the students. I believe these meetings are responsible, in part, for low turnover rates in my lab. In addition to this close mentoring, I prioritize students’ development as independent researchers. Each student is given responsibility for a distinct, individual project—or sometimes for a distinct aspect of a larger project. Accordingly, students gain experience taking ownership of their own work and developing traits that will help them succeed post-graduation, whether that be in additional schooling, in employment, or at home. Before students join my lab, I emphasize that their primary objective should be to produce tangible research outputs—especially papers and presentations. Although research experience has ephemeral value on its own, tangible outputs can be placed on a résumé and make students considerably more competitive for graduate school or employment.

Students Who Benefitted from This Grant

The following list indicates which students benefitted from the funds that were provided by this grant, as well as a brief description of the role(s) they played:

Anna Guyer – Helped compile initial datasets and performed initial benchmark analysis.
Nathan Golightly – Refined methods for biocuration, collected initial datasets, and performed key aspects of the benchmark analysis.
Avery Bell – Further refined methods for biocuration, prepared figures illustrating results of benchmark analysis, and developed user-friendly tools to enable other researchers to perform analyses similar to ours.
James Lee – Helped to develop ShinyLearner, an open-source software tool for performing large-scale machine-learning benchmarks at the command line.
Alyssa Parker – Developed an open-source software tool to curate biomedical data.
Erica Suh – Refined the functionality of ShinyLearner.

Deliverables

Golightly, N. P., Bell, A., Bischoff, A. I., Hollingsworth, P. D., & Piccolo, S. R. (2018). Curated compendium of human transcriptional biomarker data. Scientific Data, 5, 180066. https://doi.org/10.1038/sdata.2018.66
ShinyLearner: https://github.com/srp33/ShinyLearner
Anticipated: Paper published in peer-reviewed journal about ShinyLearner
Anticipated: Paper published in peer-reviewed journal about benchmark results
Anticipated: Paper published in peer-reviewed journal about GEODE and Good Nomen, tools that are under development for curating gene-expression datasets