Clarissa Farmer and E. Shannon Tass, Statistics
Genetic diagnosis is becoming both more popular and more accurate. However, many genetic diseases have complex genetic effects and are still not fully understood. Transthyretin amyloidosis (ATTR; also known as familial or hereditary amyloidosis) is a terminal genetic disease. It is caused by unstable transthyretin (TTR) proteins that fold improperly and then deteriorate. The fragmented proteins are deposited outside the cell and build up in the tissues over time, forming insoluble oligomers. The oligomers continue to grow into amyloid fibrils, which adversely affect many organs in the body and eventually cause them to fail.
To diagnose ATTR accurately, doctors need to perform tissue and bone marrow biopsies, and both must be positive for amyloid fibrils. Once both are positive, patients must undergo further testing to confirm the source of the fibrils by ruling out all other possible diseases and causes. The main current treatment for ATTR is tafamidis, a drug that stabilizes the correctly folded TTR protein and slows the progression of amyloid fibril formation. Another option is organ transplantation, and usually both are recommended. The most severe mutation is Val30Met (V30M), because it leads to death more quickly. Unfortunately, it is also the most common mutation.
Genetic diagnosis is becoming more reliable and popular because it enables precision medicine. It has other advantages as well, including earlier diagnosis, lower overall costs, and better-informed decisions. Because ATTR is so hard to diagnose and current treatments are rudimentary, we wanted to better understand the genetics of this disease.
The data we used came from a study by Kurian et al. (2016), which we will refer to as the Portugal study. Kurian et al. found a sex-independent molecular signature relating symptomatic patients and their response to tafamidis treatment. Our research aimed to further validate the Portugal study by using machine learning techniques to compare carriers who did not yet show symptoms with healthy controls.
Several studies have discussed the efficacy of the Random Forest algorithm for analyzing microarray data (Diaz 2006; Wu 2008; Moorthy 2011; Ram 2017). Random Forest suits microarray data for several reasons: it works well when there are many more variables than observations, it identifies the most important variables in the model, and it stays robust even when most of the variables contribute only random noise. These are all important considerations when working with microarray data, because overfitting to random noise can give inaccurate results.
We assessed the ability of gene expression levels to predict whether an individual has the disease but is not showing physical symptoms (Asymptomatic) or is healthy (Control).
The dataset included 309 patients, each classified as asymptomatic (V30M carrier), symptomatic, treated with tafamidis, or a healthy control (age- and sex-matched). The data contained each patient’s gender and gene expression levels for 20,273 genes. We combined this expression data with metadata from the Ensembl Biomart website to link the genes in our dataset to specific chromosomes. The sample sizes for the groups are: Asymptomatic (N=87), Symptomatic (N=96), Treated (N=46), and Control (N=80).
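The linking step described above can be illustrated with a minimal sketch. It assumes the annotation has been exported as rows holding a gene identifier and a chromosome; the gene names, values, and field names below are hypothetical placeholders, not the study's data or the Biomart export format.

```python
# Hypothetical toy records: gene IDs, expression values, and chromosome
# labels are made up for illustration only.
expression = {
    "GENE_A": [5.1, 4.8, 6.0],
    "GENE_B": [2.3, 2.9, 2.1],
    "GENE_C": [7.7, 7.4, 8.0],
}
annotation_rows = [
    {"gene_id": "GENE_A", "chromosome": "18"},
    {"gene_id": "GENE_B", "chromosome": "X"},
]


def attach_chromosomes(expression, annotation_rows):
    """Pair each gene's expression values with its chromosome, if annotated."""
    chrom_by_gene = {row["gene_id"]: row["chromosome"] for row in annotation_rows}
    return {
        gene: {"chromosome": chrom_by_gene.get(gene), "values": values}
        for gene, values in expression.items()
    }


linked = attach_chromosomes(expression, annotation_rows)
```

Genes missing from the annotation keep a chromosome of `None` rather than being dropped, so the number of genes stays the same after the join.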
We looked only at the asymptomatic patients and the healthy controls. We built two Random Forest models that used the gene expression levels to predict whether a patient was asymptomatic or a control. To evaluate the models on data they had not seen, we split our data into a training set and a test set (50/50). Using the training data, we fit the first Random Forest model. We then tested the model’s ability to classify patients using the test data. For the second model, we randomized the type labels (asymptomatic or control) so that they were no longer associated with the expression levels. Comparing the two models shows whether the first model performed better than guessing, and therefore whether there is an association between gene expression levels and type.
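The split and the label randomization can be sketched with the standard library alone. This is an illustration under stated assumptions, not the procedure actually used: the source does not say whether the split was stratified by group or how the randomization was seeded, and the function names are hypothetical. Only the group sizes (87 asymptomatic, 80 control) come from the text.

```python
import random


def stratified_half_split(labels, seed=0):
    """Return (train_idx, test_idx), splitting each class roughly 50/50.

    Stratifying by class is an assumption made for this sketch.
    """
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        half = len(idx) // 2
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    return sorted(train_idx), sorted(test_idx)


def permute_labels(labels, seed=0):
    """Return a shuffled copy of the labels, breaking any link to the features."""
    rng = random.Random(seed)
    shuffled = list(labels)
    rng.shuffle(shuffled)
    return shuffled


# Group sizes taken from the text; the ordering of patients is arbitrary here.
labels = ["Asymptomatic"] * 87 + ["Control"] * 80
train_idx, test_idx = stratified_half_split(labels, seed=1)
null_labels = permute_labels(labels, seed=2)
```

Shuffling keeps the class proportions intact while removing any real relationship between expression and type, so a model fit to `null_labels` gives a baseline for what chance-level performance looks like on this dataset.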
We found that our Random Forest model had a sensitivity of 68.7% and a specificity of 53.8%. Using the randomized type model, the sensitivity was 46.0% and the specificity was 37.4%.
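For reference, the two metrics can be computed from true and predicted labels as in the sketch below, which treats Asymptomatic as the positive class to match the figure captions. The function name is hypothetical, and the toy labels are for illustration only; they are not the study's predictions.

```python
def sensitivity_specificity(y_true, y_pred, positive="Asymptomatic"):
    """Return (sensitivity, specificity) with `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    # Sensitivity: share of truly asymptomatic patients predicted asymptomatic.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    # Specificity: share of true controls predicted as controls.
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Toy labels for illustration only.
y_true = ["Asymptomatic", "Asymptomatic", "Asymptomatic", "Control", "Control"]
y_pred = ["Asymptomatic", "Control", "Asymptomatic", "Control", "Asymptomatic"]
sens, spec = sensitivity_specificity(y_true, y_pred)
```

In this toy case two of the three asymptomatic patients and one of the two controls are classified correctly, giving a sensitivity of 2/3 and a specificity of 1/2.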
This study was limited because it was only an exploratory analysis. We did not have an independent dataset on which to validate the model; however, we have 14 other microarray datasets for various cancer types, which we are currently assessing with the same methods.
This type of analysis is one of the best ways to analyze microarray data, and we can use it to better understand ATTR. These results suggest that the genetics behind ATTR can be worked out and applied clinically. Early diagnosis and precision medicine are possible for this disease and could save countless lives.
Figure 1 – The sensitivity of the two Random Forest models, that is, the rate at which asymptomatic patients were correctly predicted as asymptomatic.
Figure 2 – The specificity of the two Random Forest models, that is, the rate at which controls were correctly predicted as controls.