Noel Ellison and Dr. Bruce Collings, Statistics
Tuberculosis (TB) is a contagious bacterial infection caused by Mycobacterium tuberculosis. While some people consider TB to be a “dead” disease that was eradicated years ago, it is, in fact, very much alive in the world today. It is estimated that more than two billion people are infected with TB bacilli;¹ this is approximately one third of the world’s population. There are many factors associated with the development of TB. Previous research links socioeconomic factors as well as HIV-TB co-infection with the TB epidemic. Countries that are poorer and have higher levels of HIV tend to have more cases of tuberculosis. The first question posed here is whether the percentage of a country’s population that smokes is useful in predicting the number of incident TB cases for that country: do high smoking levels in a country affect the rate of TB? The second question concerns the public health infrastructure of a country: does how well a country treats its current TB cases have any effect on its incidence of TB?
All of the data necessary for this analysis could not be found in one location, so it was necessary to compile information from three data sets. The WHO report Global Tuberculosis Control from 2011 was used to obtain the estimated number of incident cases of TB, HIV-TB cases, treatment outcomes, and population for each country in 2008. A smear-positive case is a patient with active TB whose sputum contains TB organisms. Treatment outcomes for smear-positive cases as well as re-treatment cases were considered. A re-treatment case is a patient who has already been treated for TB once before; this patient either defaulted (failed to complete treatment) or was treated unsuccessfully. Treatment outcomes are represented as the proportion of successes or failures among new (incident) cases of TB. HIV-TB co-infection is represented in the same way. Overall smoking percentages for each country were taken from the WHO Report on the Global Tobacco Epidemic (2008). Gross national income (GNI) per capita for each country was taken from the website http://www.nationmaster.com.
Many analyses were performed prior to writing this report. As expected, there is a lot of variation in the number of incident TB cases per country. For this analysis, only countries with at least 4,500 incident cases of TB were included. In addition, six countries in this group were anomalous and produced large scaled residuals (greater than 200). Models were fit to the adjusted data set, which excluded countries with fewer than 4,500 incident cases of TB as well as these six outlying observations.
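The two exclusion rules described above can be sketched in a few lines of Python. This is only an illustration: the record layout, the field names `country` and `incident_cases`, and the example values are hypothetical and are not taken from the WHO report, and the list of outlying countries is assumed to have been identified beforehand from the scaled residuals.

```python
MIN_INCIDENT_CASES = 4500  # threshold stated in the text


def select_countries(rows, outliers=()):
    """Keep rows at or above the case threshold that are not flagged as outliers."""
    excluded = set(outliers)
    return [
        row for row in rows
        if row["incident_cases"] >= MIN_INCIDENT_CASES
        and row["country"] not in excluded
    ]


# Hypothetical records, not values from the WHO report.
example_rows = [
    {"country": "A", "incident_cases": 120000},
    {"country": "B", "incident_cases": 3200},
    {"country": "C", "incident_cases": 4500},
    {"country": "D", "incident_cases": 98000},
]

# "D" stands in for a country flagged by its scaled residual.
kept = select_countries(example_rows, outliers=["D"])
print([row["country"] for row in kept])
```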
Several of the treatment outcomes were highly correlated; any two factors with a correlation of 0.6 or greater will cause problems with collinearity. For this reason, only one of each pair of treatment outcomes with a correlation of 0.6 or greater was included. The remaining factors are as follows: smear-positive completed, died, failed, defaulted, and re-treatment completed. These factors are considered in addition to HIV, smoking percent, and GNI.
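One way to apply a screening rule like the one above is a greedy pass that keeps a factor only if its correlation with every factor already kept is below the cutoff. The sketch below assumes the cutoff applies to the absolute value of the Pearson correlation and that factors are considered in the order given; the text does not say how the retained member of each pair was chosen, and the column names and values are hypothetical.

```python
import math

THRESHOLD = 0.6  # cutoff stated in the text


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen(columns, threshold=THRESHOLD):
    """Greedily keep columns whose |r| with every kept column is below threshold."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical treatment-outcome proportions (scaled), not values from the data set.
example_columns = {
    "outcome_a": [1, 2, 3, 4, 5],
    "outcome_b": [2, 4, 6, 8, 11],  # nearly a multiple of outcome_a
    "outcome_c": [5, 1, 4, 2, 3],
}
print(screen(example_columns))
```

The result of a greedy pass depends on the order of the columns, so in practice the order would be chosen on subject-matter grounds.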
The number of incident cases of TB can take only whole-number values and can therefore be treated as count data. A Poisson regression (a generalized linear model) was used to model these counts, with the log of population included as an offset term. The offset is a known constant for each country, which is easily incorporated into the estimation procedure, and it accounts for the effect of population on the number of incident TB cases. Poisson regression assumes that the variance of the distribution is equal to its mean; for incident TB cases, however, the estimated variance is much larger than the mean. A quasi-Poisson model was therefore fit to account for this overdispersion. The coefficients and summary of the selected model are shown below.
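As an aside on the model structure described above, the following is a minimal sketch of a Poisson regression with a log-population offset, followed by the Pearson dispersion estimate that a quasi-Poisson fit uses. It is not the authors’ code (the text does not say which software was used). For brevity it assumes a single hypothetical covariate `x` plus an intercept, and the data values are invented for illustration.

```python
import math


def fit_poisson_offset(x, y, offset, n_iter=50, tol=1e-10):
    """Fit log(mu_i) = offset_i + b0 + b1 * x_i by Newton-Raphson (Fisher scoring)."""
    # Start from the intercept-only solution so the first step is stable.
    b0 = math.log(sum(y) / sum(math.exp(o) for o in offset))
    b1 = 0.0
    for _ in range(n_iter):
        mu = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
        # Score vector (gradient of the Poisson log-likelihood).
        g0 = sum(yi - mi for yi, mi in zip(y, mu))
        g1 = sum((yi - mi) * xi for yi, mi, xi in zip(y, mu, x))
        # Fisher information matrix (2 x 2, symmetric).
        h00 = sum(mu)
        h01 = sum(mi * xi for mi, xi in zip(mu, x))
        h11 = sum(mi * xi * xi for mi, xi in zip(mu, x))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * d = g.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1


def pearson_dispersion(x, y, offset, b0, b1):
    """Pearson chi-square divided by residual degrees of freedom (n - 2 here)."""
    mu = [math.exp(o + b0 + b1 * xi) for o, xi in zip(offset, x)]
    pearson = sum((yi - mi) ** 2 / mi for yi, mi in zip(y, mu))
    return pearson / (len(y) - 2)


# Hypothetical data: populations, one covariate, and incident case counts.
population = [1000, 2000, 1500, 3000]
x = [0.0, 1.0, 2.0, 3.0]
cases = [7, 18, 18, 50]
offset = [math.log(p) for p in population]  # log population as the offset

b0, b1 = fit_poisson_offset(x, cases, offset)
phi = pearson_dispersion(x, cases, offset, b0, b1)
print(b0, b1, phi)
```

Under a quasi-Poisson fit the coefficient estimates are the same as in the Poisson fit; only the standard errors change, being multiplied by the square root of the dispersion estimate. A dispersion well above 1 is the signal of overdispersion mentioned in the text.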
A χ² test indicates that this reduced model fits as adequately as the model containing all of the variables mentioned above. While the percent of smear-positive cases that died falls just short of statistical significance, the other factors are all significant. When this factor is dropped from the model, the proportion of re-treated cases that completed treatment also becomes non-significant. This may be due to unknown interactions among the factors. The residual plot and normal Q-Q plot from this model show that the fit is adequate.
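The text does not say which statistic the χ² test was based on. A common choice for comparing nested generalized linear models is the difference in deviance, divided by the dispersion estimate when the model is quasi-Poisson, referred to a χ² distribution with degrees of freedom equal to the number of dropped terms. The sketch below assumes that form; the deviance values in the example are hypothetical, and for overdispersed models an F-test is a frequently used alternative.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    z = x / 2.0
    # Closed forms: Q(1, z) = exp(-z) and Q(1/2, z) = erfc(sqrt(z)).
    if df % 2 == 0:
        a, q = 1.0, math.exp(-z)
    else:
        a, q = 0.5, math.erfc(math.sqrt(z))
    # Recurrence Q(a + 1, z) = Q(a, z) + z**a * exp(-z) / Gamma(a + 1).
    while a < df / 2.0:
        q += z ** a * math.exp(-z) / math.gamma(a + 1.0)
        a += 1.0
    return q


def deviance_test(dev_reduced, dev_full, df_diff, dispersion=1.0):
    """Scaled deviance-difference statistic and its chi-square p-value."""
    stat = (dev_reduced - dev_full) / dispersion
    if stat < 0:
        raise ValueError("reduced model cannot have lower deviance than full model")
    return stat, chi2_sf(stat, df_diff)


# Hypothetical deviances and dispersion, for illustration only.
stat, p_value = deviance_test(dev_reduced=5400.0, dev_full=5100.0,
                              df_diff=3, dispersion=120.0)
print(stat, p_value)
```

A large p-value means the dropped terms do not improve the fit detectably, which supports the reduced model without proving the two models equivalent.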
In the fitted model, the negative coefficients on GNI and treatment outcomes indicate that an increase in these factors is associated with a decrease in the log of incident TB cases. A positive coefficient indicates that an increase in the factor is associated with an increase in the log of incident cases of TB for a country. In particular, the higher the proportion of smear-positive (active) TB cases that die, the lower the log of incident cases of TB for that country. The higher the percent of re-treatment cases that complete treatment, the fewer the incident cases of TB for that country. The higher the proportion of HIV-TB co-infected cases of TB, the higher the number of incident cases of TB. The opposite is true of gross national income: the higher the income per capita for a country, the lower the number of incident TB cases for that country.
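Because the coefficients act on the log of the expected count, they are easier to read after exponentiation: a one-unit increase in a factor multiplies the expected number of incident cases, for a fixed population, by exp(β). The short sketch below shows the conversion. The coefficient value used is hypothetical and is not one of the fitted coefficients from this analysis.

```python
import math


def rate_ratio(beta, delta=1.0):
    """Multiplicative change in the expected count for a change of delta in a factor."""
    return math.exp(beta * delta)


def percent_change(beta, delta=1.0):
    """The same change expressed as a percentage of the original expected count."""
    return 100.0 * (rate_ratio(beta, delta) - 1.0)


# Hypothetical coefficient: a negative value implies a decrease in expected cases.
example_beta = -0.5
print(rate_ratio(example_beta), percent_change(example_beta))
```

A negative coefficient gives a ratio below 1 and a positive coefficient a ratio above 1, which matches the direction of the effects described above.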
The model was fairly accurate in that it always predicted a number of incident cases with the correct number of digits, that is, the correct order of magnitude. Because the TB incidence numbers in the data set are themselves estimates and therefore inexact, it is difficult to evaluate the true accuracy of the model. Further model reduction indicated that HIV-TB co-infection and GNI per capita were statistically significant factors in predicting TB incidence, while treatment outcomes and smoking were not. However, it cannot be concluded that these factors have no effect on the spread of TB. The correlation between variables further complicates the model. To accurately analyze the effect that treatment outcomes and smoking have on TB, we must look at individual patients within a country. It would be more useful to look at the percent of those infected with TB who smoke instead of the percent of the total population who smoke; this would more accurately depict how smoking and TB are related. The variation inherent in multinational data also lowers accuracy. Future analysis should examine the effects of smoking and treatment outcomes within a single country, looking at individual patients and outcomes.
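The accuracy criterion mentioned above (same number of digits in the predicted and observed counts) can be written down directly, which also makes clear how coarse it is. The sketch below is an illustration with hypothetical numbers; the function names are not from the original analysis.

```python
def digit_count(n):
    """Number of decimal digits in n after rounding to the nearest integer."""
    return len(str(abs(int(round(n)))))


def same_digit_count(observed, predicted):
    """True when the observed and predicted counts have the same number of digits."""
    return digit_count(observed) == digit_count(predicted)


def relative_error(observed, predicted):
    """Absolute prediction error as a fraction of the observed count."""
    return abs(predicted - observed) / observed


# Hypothetical observed and predicted counts, for illustration only.
print(same_digit_count(12000, 95000), relative_error(12000, 95000))
print(same_digit_count(9500, 12000), relative_error(9500, 12000))
```

Two counts can share a digit count while differing several-fold, and can differ in digit count while being fairly close, so a relative-error summary is a useful complement to this criterion.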