Frodsham, Scott
A Phenome Wide Association Study of Multiple Sclerosis and Comorbidities
Faculty Mentor: Davis, Mary, Department of Microbiology and Molecular Biology
Introduction
Genome-wide association studies (GWAS) have identified relationships between many different
genes and diseases. A GWAS scans the whole genomes of many individuals and then
associates genetic variants with the diseases those individuals have. In contrast, a phenome-wide
association study (PheWAS) looks at the phenotypes of many individuals and associates those
phenotypes with one or more genetic variants. Electronic medical records (EMRs) linked to DNA
biobanks provide both clinical and genetic data for patients. This study uses one such EMR-linked
DNA biobank, BioVU, from Vanderbilt University.
Study Population
BioVU is a resource of over 180,000 leftover blood samples from outpatients, collected and
used to extract DNA. Each sample is linked to the individual’s clinical data through a
“synthetic derivative” of their de-identified EMRs. Patients in BioVU are representative of all
patients who come to Vanderbilt University Medical Center (VUMC): they come from diverse
regions of the country, have varied ethnicities and health statuses, and are of all ages. We
limited our study to patients 18 years or older. EMR usage at Vanderbilt dates back to 1997, so
we have over 10 years’ worth of clinical information for many patients. The Vanderbilt MS Clinic
was established in 1994, and up to 30 patients are seen each day. The quantity and quality of
data made available to us by BioVU are sufficient to support statistical analysis.
Data of insufficient quality were identified and removed using PLINK.
Methods
All samples were genotyped on the ImmunoChip. Patients diagnosed with MS (1,003) and 106
single nucleotide polymorphisms (SNPs) associated with risk of MS were extracted from our
parent dataset. Patients were identified as having a comorbidity based on the presence of at
least one ICD-9 billing code. Logistic regression was performed with the first three principal
components included to adjust for population structure. The genome-wide significance threshold
of p < 5 x 10^-8 was used to correct for multiple testing. Because MS occurs more frequently in
Caucasian populations and because a variety of ethnicities were present within our study
population, a covariate file was included in the analysis to account for ethnicity.
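The per-SNP regression described above can be written out as a sketch. The symbols are
assumptions introduced here for illustration and are not taken from the study: y_i indicates
whether patient i carries the comorbidity code, g_i is the number of copies of the tested allele
(additive coding is assumed), PC_ik are the first three principal components, and c_i stands for
the ethnicity covariate.

```latex
\operatorname{logit}\,\Pr(y_i = 1)
  = \beta_0 + \beta_1 g_i + \sum_{k=1}^{3} \gamma_k\,\mathrm{PC}_{ik} + \delta\, c_i,
\qquad
\mathrm{OR} = e^{\beta_1}
```

Under this form, a SNP would be called significant when the test of beta_1 = 0 gives
p < 5 x 10^-8, and the odds ratio reported for a SNP is the exponential of its fitted beta_1.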
Initially, we used ICD-9 billing codes and regular expressions to identify individuals diagnosed
with MS and rheumatoid arthritis (RA), a common comorbidity. We tested whether any variants
known to be risk factors for MS were also associated with co-occurring RA. While no SNPs in
the analysis passed the significance threshold, the number of SNPs close to the threshold
warranted further investigation across a wider range of phenotypes observed in MS patients. By
studying a wider variety of phenotypes, we were able to identify diseases associated with genetic
variants that correlate only weakly with an increased risk of developing MS. PheWAS analysis
was performed using a package for the statistical computing environment R. The study
population consists of 1,003 individuals diagnosed with MS, the primary phenotype of interest.
ICD-9 billing codes contained within the EMRs identified the comorbidities, or secondary
phenotypes, analyzed in this study. Using these secondary phenotypes, the PheWAS was used
to identify potentially significant genetic variants within this population of MS patients. We
separated the most significant secondary phenotypes and displayed them as Manhattan plots.
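The ICD-9 and regular-expression screening mentioned at the start of this section might look
roughly like the sketch below. The study does not list the codes or patterns it used, so the
choices here are assumptions: 340 is the ICD-9-CM code for multiple sclerosis and 714.0 the code
for rheumatoid arthritis. The record layout, the patient IDs, and the function names are
hypothetical.

```python
import re

# Assumed ICD-9-CM patterns (the study does not state the exact codes it used):
# 340 is multiple sclerosis; 714.0 is rheumatoid arthritis.
MS_PATTERN = re.compile(r"340")
RA_PATTERN = re.compile(r"714\.0")


def has_code(codes, pattern):
    """Return True if at least one billing code matches the whole pattern."""
    return any(pattern.fullmatch(code.strip()) for code in codes)


def classify(records):
    """Map each patient ID to a (has_ms, has_ra) pair based on billing codes."""
    return {
        patient_id: (has_code(codes, MS_PATTERN), has_code(codes, RA_PATTERN))
        for patient_id, codes in records.items()
    }


# Hypothetical de-identified records: patient ID -> list of ICD-9 codes.
example_records = {
    "P001": ["340", "714.0", "250.00"],
    "P002": ["340"],
    "P003": ["3400", "714.9"],  # neither code is an exact match
}

if __name__ == "__main__":
    for patient_id, (ms, ra) in classify(example_records).items():
        print(patient_id, "MS" if ms else "-", "RA" if ra else "-")
```

Matching the whole code rather than a prefix keeps an unrelated code that merely starts with
the same digits from being counted. A single matching code is enough to flag a patient, in line
with the "at least one ICD-9 billing code" rule.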
Results and discussion
None of the p-values of the SNPs identified by the PheWAS reached the threshold for
genome-wide significance (5 x 10^-8). However, 18 SNPs, spanning 10 different phenotypes,
showed suggestive associations, each with a p-value smaller than 8 x 10^-6. The
ICD-9 billing codes associated with the most significant SNPs were 792, 452, and 81. The
phenotypes that correspond to these codes are “nonspecific abnormal findings in other body
substances”, “portal vein thrombosis”, and “other typhus”, respectively. ICD-9 code 792 yielded
the SNP X1kg_13_98874890_A (p = 6.02 x 10^-7, OR = 2.98). ICD-9 code 452 yielded the SNP
rs193772_A (p = 7.18 x 10^-7, OR = 2.65). ICD-9 code 81 yielded the SNP
Imm_11_118254528_T (p = 8.93 x 10^-7, OR = 3.71). The remaining 15 suggestive SNPs
represent a wide range of phenotypes. Only one of these phenotypes, diabetes, is a comorbidity
of high prevalence among MS patients. The SNPs corresponding to the diabetes phenotype did
not, however, reach genome-wide significance.
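To make the comparison against the two cutoffs concrete, the short sketch below checks the
three reported top results against the genome-wide threshold and the 8 x 10^-6 bound. The
p-values and odds ratios are copied from the text above. The label "suggestive" and the function
names are assumptions chosen for illustration, not terms from the analysis software.

```python
GENOME_WIDE = 5e-8  # genome-wide significance threshold used in the study
SUGGESTIVE = 8e-6   # bound below which all 18 reported SNPs fall

# (ICD-9 code, SNP, p-value, odds ratio) as reported in the text.
REPORTED = [
    ("792", "X1kg_13_98874890_A", 6.02e-7, 2.98),
    ("452", "rs193772_A", 7.18e-7, 2.65),
    ("81", "Imm_11_118254528_T", 8.93e-7, 3.71),
]


def label(p_value):
    """Classify a p-value against the two cutoffs (labels are illustrative)."""
    if p_value < GENOME_WIDE:
        return "genome-wide significant"
    if p_value < SUGGESTIVE:
        return "suggestive"
    return "not significant"


def fold_above_threshold(p_value):
    """How many times larger the p-value is than the genome-wide threshold."""
    return p_value / GENOME_WIDE


if __name__ == "__main__":
    for code, snp, p, odds_ratio in sorted(REPORTED, key=lambda row: row[2]):
        print(
            f"ICD-9 {code:>3}  {snp:<20} p={p:.2e}  OR={odds_ratio:.2f}  "
            f"{label(p)} ({fold_above_threshold(p):.1f}x the threshold)"
        )
```

All three results fall below the 8 x 10^-6 bound but above the genome-wide threshold. Even the
smallest p-value, 6.02 x 10^-7, is roughly 12 times the 5 x 10^-8 cutoff.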
Conclusion
When compared against the published literature, the most significant phenotypes identified in
this study are not strongly associated with the MS disease course or pathways. Further, none of
the SNPs identified by the PheWAS analysis had p-values significant enough to suggest that
genetic variants associated with these phenotypes increase the risk of developing comorbidities
commonly associated with MS. We therefore cannot say that genetic variants known to increase
the likelihood of MS also increase the likelihood of any other comorbidity. A larger cohort and
more powerful analytical tools may, in the future, give a clearer picture of the genetic
relationship between MS and its comorbidities.