Brad Larsen and Dr. James McDonald, Economics
Qualitative response (QR) models are used by economists, biometricians, epidemiologists, statisticians, and others to estimate the effect of certain variables on a binary (“yes” or “no”) response. For example, economists use QR models to answer the question, “What factors influence a woman’s decision to work?” For years, biometricians have used QR models to answer questions such as, “What drug dosage will help a patient survive?” or, more recently, “What level of magnetic radiation will result in childhood leukemia?” By answering these questions, QR models provide policy-makers, businesses, doctors, and others with valuable information for making decisions.
Although these models help to predict outcomes, their validity depends on certain distributional assumptions. The most common QR models are the logit and the probit. The logit model is based on the logistic distribution, and the probit model assumes a normal distribution. In some cases, the logit or probit provides a good model for the data. However, in other cases, a more flexible distribution, such as the exponential generalized beta of the second kind (EGB2), may be more accurate. For example, a biometrician might use a logit model to predict the proper dose of a drug to help a patient survive. If the data were actually distributed as an EGB2, however, the logit results could contain more prediction errors (such as patients receiving too much or not enough of the drug) than would a model that used the EGB2 as the underlying distribution. Likelihood ratio tests and other goodness-of-fit tests can show the improved fit of more flexible distributions over the normal and logistic.
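To illustrate the kind of estimation involved, the sketch below fits a logit model by maximum likelihood (Newton-Raphson) and computes a likelihood ratio statistic against an intercept-only model. This is a Python illustration on simulated data, not our Matlab program or any dataset from the study:

```python
# Illustrative sketch only: a logit QR model fit by Newton-Raphson on
# simulated data (Python, not the project's Matlab code).
import numpy as np

def fit_logit(X, y, tol=1e-8, max_iter=50):
    """Maximum-likelihood estimates for the logit model P(y=1|x) = 1/(1+exp(-x'b))."""
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))   # predicted probabilities
        grad = X.T @ (y - p)                    # score vector
        w = p * (1.0 - p)                       # logistic variance weights
        hess = X.T @ (X * w[:, None])           # information matrix
        step = np.linalg.solve(hess, grad)      # Newton-Raphson step
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta

def log_likelihood(beta, X, y):
    xb = X @ beta
    # sum_i [ y_i * x_i'b - log(1 + exp(x_i'b)) ], computed stably
    return float(np.sum(y * xb - np.logaddexp(0.0, xb)))

# Simulated example: true intercept -0.5, true slope 1.0
rng = np.random.default_rng(0)
n = 500
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
y = (rng.uniform(size=n) < 1.0 / (1.0 + np.exp(-(-0.5 + 1.0 * x)))).astype(float)

beta_hat = fit_logit(X, y)
ll_full = log_likelihood(beta_hat, X, y)
ll_null = log_likelihood(fit_logit(X[:, :1], y), X[:, :1], y)
lr_stat = 2.0 * (ll_full - ll_null)  # chi-square(1) under the null hypothesis
```

A likelihood ratio test compares this statistic to a chi-square critical value; the same logic extends to testing the logit against a more flexible distribution, such as the EGB2, that nests it.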
Our project consisted of designing a user-friendly program, written in Matlab (a matrix computing language), that allows the user to estimate QR models using a variety of distributions, or generalized qualitative response (GQR) models. My faculty mentor, James McDonald of the Economics Department, researches the benefits of flexible distributions. Dr. McDonald had a previous version of the GQR program written in FORTRAN (an older programming language). I used this older program, as well as a skeleton version of the Matlab program, to design our new GQR program.
The program, now nearly complete, allows users to provide a dataset and estimate a variety of GQR models with such distributions as the EGB2 and Skewed Generalized T. The program returns various statistics that help the user determine the goodness-of-fit of a specific model. For instance, the program returns a prediction matrix, which displays the number of “yes” and “no” outcomes in the data and the percent of outcomes that the model would predict to be “yes” and “no.” With this matrix, the user can determine the percent of outcomes that the model predicts correctly.
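The prediction matrix logic can be sketched as follows. This is a Python illustration with made-up numbers, not the program's Matlab code, and the 0.5 cutoff is an assumed classification rule:

```python
import numpy as np

def prediction_matrix(y, p_hat, cutoff=0.5):
    """2x2 table of actual (rows) vs. predicted (columns) outcomes,
    plus the percent of outcomes predicted correctly."""
    pred = (p_hat >= cutoff).astype(int)     # predict "yes" when p_hat >= cutoff
    table = np.zeros((2, 2), dtype=int)
    for actual, predicted in zip(y.astype(int), pred):
        table[actual, predicted] += 1
    pct_correct = 100.0 * np.trace(table) / len(y)  # diagonal cells are correct
    return table, pct_correct

# Made-up observed outcomes and fitted probabilities
y = np.array([1, 1, 0, 0, 1])
p_hat = np.array([0.9, 0.4, 0.2, 0.6, 0.7])
table, pct = prediction_matrix(y, p_hat)   # pct is 60.0 here
```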
The program also allows the user to estimate “hold-out” samples. Hold-out sampling is performed by estimating the model with a certain percent of the data held out. For example, ten percent of the data is held out and the model is estimated with the other ninety percent. The parameters of the estimated model are then used to predict the outcomes for the observations that were held out, and the number of correct predictions is determined. This process is repeated ten times, holding out a different ten percent each time, and the average percent correctly predicted is computed. In this way, the user can determine how well the model would predict outcomes other than those observed in the data.
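The ten-fold hold-out procedure can be sketched generically. This Python illustration plugs in a trivial stand-in “model” (predict the training sample's majority outcome) on simulated data, whereas the actual program re-estimates the chosen GQR model on each ninety-percent subsample:

```python
import numpy as np

def holdout_accuracy(X, y, fit_fn, predict_fn, n_folds=10, seed=0):
    """Average percent correctly predicted across n_folds hold-out rounds.
    Each round holds out ~1/n_folds of the observations, fits the model on
    the rest, and scores its predictions on the held-out observations."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    scores = []
    for hold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, hold)
        params = fit_fn(X[train], y[train])
        pred = predict_fn(params, X[hold])
        scores.append(100.0 * np.mean(pred == y[hold]))
    return float(np.mean(scores))

# Stand-in "model": always predict the majority outcome in the training data
fit_majority = lambda X, y: float(y.mean() >= 0.5)
predict_majority = lambda majority, X: np.full(len(X), majority)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = (rng.uniform(size=100) < 0.7).astype(float)
avg_pct = holdout_accuracy(X, y, fit_majority, predict_majority)
```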
To apply our research and determine the effectiveness of GQR models, we tested several binary datasets, such as a simple dataset containing students’ GRE exam scores and information on whether or not the students were accepted into graduate school. Our most extensive application, however, involved a study conducted jointly with professors from the Marriott School, including Steve Albrecht, Conan Albrecht, and James Hansen. The study incorporated data on management fraud: financial ratios of various real-world and simulated firms, along with a binary variable showing whether or not each firm was known to have committed management fraud. The purpose of the study was thus to answer the question, “If I know something about a particular company, such as how much profit the company is making and how well its stock is performing, can I predict whether or not the company is actually committing fraud?”
The actual estimation was very time-consuming because of the size of the dataset, and we found the Fulton Supercomputers invaluable in speeding up the process. Our study found that GQR models could predict fraud with 15 percent greater accuracy than traditional logit and probit models. The results have been written up in a paper which is currently under revision for Management Science. In addition, I prepared a poster and presented the project results at the Mary Lou Fulton Undergraduate Research Conference in April 2007.
We are still experimenting with several features of the program and with further extensions of the research. For example, the user can choose between various estimates of standard errors as well as various assumed forms of variance in the data (termed “heteroskedasticity” by econometricians). The programming of these features was quite time-consuming, as we worked to make the code as simple and fast as possible. We found that the various estimates of standard errors do not differ much from one another. We are still experimenting with various heteroskedasticity assumptions. Another feature with which we are still experimenting is choice-based sampling, which is useful when the user knows that a particular dataset may have a different percent of “yes” and “no” outcomes than the population from which it was drawn.
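One standard correction for choice-based sampling is to weight each observation's log-likelihood contribution by the ratio of its outcome's population share to its sample share (the Manski-Lerman weighted exogenous sampling approach). A Python sketch with made-up shares, not figures from our fraud data:

```python
def choice_based_weights(y, pop_share_yes, sample_share_yes):
    """Weight each observation by (population share) / (sample share) of its
    outcome, so a sample over-representing "yes" outcomes is pulled back
    toward the population proportions."""
    w_yes = pop_share_yes / sample_share_yes
    w_no = (1.0 - pop_share_yes) / (1.0 - sample_share_yes)
    return [w_yes if yi == 1 else w_no for yi in y]

# Hypothetical shares: 5 percent "yes" in the population, 50 percent in the sample
weights = choice_based_weights([1, 0, 0], pop_share_yes=0.05, sample_share_yes=0.5)
# "yes" observations get weight 0.1; "no" observations get weight 1.9
```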
This project has been exciting and educational. The project has forced me to go beyond the pre-programmed models in statistical packages such as Stata or SAS because I have had to program the models from scratch. This process has taught me to understand GQR models much more extensively than I would have otherwise. The results of our fraud application are interesting and could lead to more accurate models for fraud investigators. I am eager to continue the research and grateful for the help provided by the ORCA grant.