David Boyack Dahl and Dr. Scott D. Grimshaw, Statistics
Classification is the assignment of objects to categories using a decision rule based on observed characteristics. Several classification techniques are available to form the decision rule. Each builds the decision rule by modeling the relationship between the actual categories and observed characteristics. The decision rule can then be used to classify other objects for which the true category is unknown. Three popular classification techniques are logistic regression, discriminant analysis, and classification trees. While all three techniques enjoy some success, the literature has not uniformly accepted any one of them.
Classification techniques can be evaluated according to the following four criteria: interpretability, predictability, robustness, and ease of use. An interpretable technique provides the ability to understand the role that specified characteristics play in the decision rule. A technique with high predictive accuracy almost always correctly classifies an object given observed characteristics. A robust technique provides reliable results under a variety of situations. Finally, easy-to-use classification techniques are those whose methodology is intuitive and accessible in the software.
The health insurance industry provides an example of classification. Each U.S. citizen falls into one of two health insurance categories: insured or uninsured. Using a classification technique, a decision rule can be formed based on demographic information. Logistic regression, discriminant analysis, and classification trees were used in the context of health insurance coverage.
Logistic regression is an adaptation of linear regression that describes how the probability of health insurance coverage is related to a linear combination of the demographic variables. The decision rule is: If the estimated probability of having health insurance for a given observation is greater than 0.50, then the predicted classification for that observation is insured. Otherwise, the predicted classification is uninsured.
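The logistic regression decision rule can be sketched as follows. The coefficients and the demographic variables (age and an employment indicator) are hypothetical, chosen for illustration only; a fitted model would supply its own values.

```python
import math

def predict_insured(x, beta0, betas):
    """Classify one observation with a fitted logistic regression.

    x      : list of demographic variable values (hypothetical)
    beta0  : fitted intercept
    betas  : fitted coefficients, one per variable
    """
    linear = beta0 + sum(b * xi for b, xi in zip(betas, x))
    prob = 1.0 / (1.0 + math.exp(-linear))  # inverse-logit gives the probability
    return "insured" if prob > 0.50 else "uninsured"

# Hypothetical coefficients: age effect 0.08, employment effect 0.5
print(predict_insured([35.0, 1.0], beta0=-2.0, betas=[0.08, 0.5]))  # insured
print(predict_insured([10.0, 0.0], beta0=-2.0, betas=[0.08, 0.5]))  # uninsured
```

The 0.50 cutoff simply asks whether the insured category is more likely than not for that observation.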
Discriminant analysis forms a linear function of the demographic variables that is used to split the data into insured and uninsured groups. The linear function is chosen to best separate the two groups. The decision rule is: If the demographic variables of an observation yield a high value of the discriminant function, then the predicted classification of the observation is insured. Otherwise, the predicted classification is uninsured.
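The resulting rule is just a weighted sum compared to a cutoff. The weights and cutoff below are hypothetical stand-ins for the values discriminant analysis would estimate from the training data.

```python
def discriminant_rule(x, weights, cutoff):
    """Apply a fitted linear discriminant function to one observation.

    weights : fitted discriminant coefficients (hypothetical here)
    cutoff  : score threshold separating the two groups (hypothetical)
    """
    score = sum(w * xi for w, xi in zip(weights, x))  # linear discriminant score
    return "insured" if score > cutoff else "uninsured"

# Hypothetical weights and cutoff, for illustration only
print(discriminant_rule([3.0, 2.0], weights=[1.0, 0.5], cutoff=3.5))  # insured
print(discriminant_rule([1.0, 1.0], weights=[1.0, 0.5], cutoff=3.5))  # uninsured
```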
Classification trees form successive decisions which partition the data into homogeneous terminal nodes. Initially, all of the observations are together in one node. Then, the demographic variable and cut point that best partition the node into two more homogeneous nodes are selected. If the observed value of a variable is greater than the cut point, the observation goes into one node; otherwise it goes into the other node. These nodes are repeatedly split in the same manner to form a large tree that is pruned using cross-validation to obtain the final decision rule. The decision rule is: If the insured category dominates a terminal node, then all observations that fall in that terminal node are predicted to be insured. Otherwise, the predicted classification for observations in that terminal node is uninsured.
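A pruned tree's decision rule reads as nested comparisons of single variables against cut points. The variables (income, age) and cut points below are hypothetical; an actual tree would select them from the training data.

```python
def tree_rule(age, income):
    """Classify one observation with a hypothetical pruned classification tree.

    Each split compares one demographic variable to a cut point;
    the terminal nodes carry the predicted category.
    """
    if income > 20000:       # root split on income
        return "insured"     # terminal node dominated by insured
    elif age > 60:           # second split on the low-income branch
        return "insured"     # terminal node dominated by insured
    else:
        return "uninsured"   # terminal node dominated by uninsured

print(tree_rule(age=30, income=25000))  # insured
print(tree_rule(age=65, income=10000))  # insured
print(tree_rule(age=30, income=10000))  # uninsured
```

Because the rule is a sequence of plain yes/no questions, it can be applied by hand without any computation beyond comparisons.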
A subset of the March 1995 Current Population Survey was used for this research. From the original CPS dataset, three subset datasets were created. The natural training dataset contained 700 randomly selected observations, of which 86% had health insurance. The even training dataset also contained 700 observations, but was randomly sampled in such a way that 50% had health insurance. Finally, the remaining 96,374 observations, of which 86% had health insurance, made up the validation dataset.
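The even training dataset can be drawn by sampling each coverage group separately so the two categories appear in equal numbers. A minimal sketch, using stdlib sampling on hypothetical lists of insured and uninsured records:

```python
import random

def even_training_sample(insured, uninsured, n=700, seed=1):
    """Draw an 'even' training sample: half insured, half uninsured.

    insured, uninsured : lists of observations, split by actual coverage
    (A 'natural' sample would instead be random.sample(insured + uninsured, n).)
    """
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    half = n // 2
    return rng.sample(insured, half) + rng.sample(uninsured, half)

# Hypothetical record IDs: 1,000 insured and 200 uninsured
insured = list(range(1000))
uninsured = list(range(1000, 1200))
sample = even_training_sample(insured, uninsured, n=100)
print(len(sample))                          # 100
print(sum(s < 1000 for s in sample))        # 50 insured in the sample
```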
Among the three classification techniques, classification trees are the most interpretable because their successive decisions are easy for all to understand. Logistic regression and discriminant analysis require an understanding of linear relationships and probability distributions that is familiar to most statisticians but not to those with less statistical training.
The best illustration of the predictability model evaluation criterion is given by examining the misclassification rates of the three techniques. Each model was fit using the demographic information from the even training dataset and predicted classifications were obtained for each classification technique. Misclassification rates were calculated by comparing the predicted and actual classifications. Surprisingly, the misclassification rates were all very similar. Classification trees had the lowest misclassification rate (29.76%), but discriminant analysis and logistic regression were not far behind (30.49% and 31.02%, respectively). While the misclassification rates are statistically different, the slight differences may not be of practical importance in many situations.
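The misclassification rate described above is simply the fraction of observations whose predicted category disagrees with the actual category. A minimal sketch, using small made-up category vectors:

```python
def misclassification_rate(actual, predicted):
    """Fraction of observations whose predicted class differs from the actual class."""
    wrong = sum(a != p for a, p in zip(actual, predicted))
    return wrong / len(actual)

# Hypothetical actual and predicted classifications for four observations
actual    = ["insured", "insured", "uninsured", "insured"]
predicted = ["insured", "uninsured", "uninsured", "insured"]
print(misclassification_rate(actual, predicted))  # 0.25
```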
The best illustration of the robustness model evaluation criterion is given by examining the misclassification rates of the three techniques, fit with the natural training dataset and verified with the validation dataset. The same procedure as described above was performed, except that the model was fit with the natural training dataset. The misclassification rates greatly changed. At first glance it appeared that logistic regression and classification trees performed the best because their misclassification rates were near 14%. Upon further inspection, it was discovered that these two classification models were, without exception, predicting the insured category, regardless of the values of the demographic variables. Since 14% of the observations in the validation dataset were uninsured, the misclassification rate was 14%. Clearly such a situation is unacceptable, because the model is doing no better than the naive estimator of always predicting insured. Discriminant analysis was the only classification technique that did not exhibit this gross bias towards a false-positive classification and hence is the clear winner for the robustness criterion.
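The naive-estimator comparison can be made concrete: a rule that always predicts the majority category is wrong exactly as often as the minority category occurs. A sketch with a made-up sample matching the 86%/14% split:

```python
def naive_rate(actual, majority="insured"):
    """Misclassification rate of always predicting the majority class.

    This equals the proportion of the minority class in the data,
    so a rule matching it has learned nothing from the variables.
    """
    return sum(a != majority for a in actual) / len(actual)

# 86 insured and 14 uninsured, mirroring the validation dataset's proportions
actual = ["insured"] * 86 + ["uninsured"] * 14
print(naive_rate(actual))  # 0.14
```

Any candidate model should beat this baseline before its misclassification rate is taken as evidence of predictive skill.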
No clear winner was found for the ease-of-use model evaluation criterion. Logistic regression and discriminant analysis are easily implemented in SAS. However, they are not as easily applied without a computer to perform the necessary calculations. Classification trees have the opposite problem: They cannot yet be performed in SAS, although once they are generated, they are very easy to use.