Brian H. Boyer and Dr. James B. McDonald, Economics
Regression analysis is a technique routinely used by researchers in many disciplines to fit some type of mathematical model to observed data. A basic two-dimensional linear regression model is mathematically expressed as yᵢ = α + βxᵢ + εᵢ for i = 1, …, n, where y₁, …, yₙ is an observed sample of n data points on the dependent variable y; x₁, …, xₙ is an observed sample of n data points on an explanatory variable, x; and the parameters α and β define the true linear relationship between x and y. The variable ε represents a random disturbance term that is assumed to be generated by some probability distribution with zero mean. Some estimation technique is applied to the observed data, x and y, to obtain estimates of α and β, designated a and b respectively. For a given sample of x's, we can imagine collecting several different samples of y's that would each produce slightly different estimates of α and β. Hence, a and b can be considered variables that fluctuate over a given space on the number line (random variables). An estimation technique is efficient if it produces estimates, a and b, that have the smallest possible variance.
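This repeated-sampling thought experiment is easy to reproduce in simulation. Below is a minimal Python/NumPy sketch, not taken from the study itself: the x's are held fixed, fresh samples of y are drawn, and least squares is refit each time so the fluctuation of the slope estimate b can be observed. All parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch (illustrative values, not the paper's): hold the x's
# fixed, repeatedly draw new samples of y, refit least squares, and
# watch the slope estimate b fluctuate across samples.
rng = np.random.default_rng(0)
n, alpha, beta = 50, 1.0, 2.0        # assumed true intercept and slope
x = rng.uniform(0, 10, n)            # a given sample of x's

slopes = []
for _ in range(1000):
    eps = rng.normal(0, 1, n)        # zero-mean disturbance term
    y = alpha + beta * x + eps
    b, a = np.polyfit(x, y, 1)       # least squares slope b and intercept a
    slopes.append(b)

print("mean of b:", round(float(np.mean(slopes)), 3))     # near the true beta = 2
print("variance of b:", round(float(np.var(slopes)), 5))  # efficiency = small spread
```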
The most popular estimation technique, known as least squares (LS), is efficient if, among other necessary requirements, the error terms are normally distributed. Real data, however, are often replete with outliers that result from data contamination (mistakes in collecting and recording the data), model mis-specification, or error probability distributions that are truly non-normal. Whatever their source, outliers invalidate the traditional assumption of normal errors and cause LS to become inefficient. Unfortunately, many researchers try to “clean” data by merely discarding those observations which lie far from the LS line. Since LS often hides, or masks, outlying observations, this cleaning procedure often results in deleting “good” data points rather than outliers.
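Masking is straightforward to demonstrate. In the hypothetical example below (illustrative values, not the study's data), a single high-leverage outlier drags the least squares line toward itself, so the outlier's own residual ends up smaller than the residuals of several clean points; discarding the points farthest from the LS line would therefore delete good data.

```python
import numpy as np

# Hypothetical illustration of masking: one high-leverage outlier tilts
# the least squares line toward itself, leaving its own residual modest
# while "good" points acquire the largest residuals.
rng = np.random.default_rng(1)
x = np.append(rng.uniform(0, 10, 29), 100.0)   # last point: leverage outlier
y = 1.0 + 2.0 * x + rng.normal(0, 1, 30)
y[-1] = 0.0                                    # contaminated response

b, a = np.polyfit(x, y, 1)
resid = np.abs(y - (a + b * x))
print("outlier's |residual|   :", round(float(resid[-1]), 1))
print("largest good |residual|:", round(float(resid[:-1].max()), 1))
```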
Two fairly recent proposals of outlier-resistant estimation techniques include partially adaptive estimation (McDonald and Newey, 1988) and least median of squares (LMS) (Rousseeuw and Leroy, 1987). Partially adaptive estimators “glean” information from outliers to provide very efficient estimates of regression models with error distributions that are truly non-normal. LMS, in comparison, is more analogous to a “slash and burn” method: in essence, it determines the linear pattern established by the majority of the data while ignoring all other observations.
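To make the contrast concrete, here is a rough sketch of the resampling version of LMS in the spirit of Rousseeuw and Leroy (1987): fit exact lines through random pairs of points and keep the line whose squared residuals have the smallest median. The function name and trial count are illustrative choices, not the authors' algorithm verbatim.

```python
import numpy as np

# Rough LMS sketch: the candidate line minimizing the MEDIAN of the
# squared residuals fits the majority pattern and effectively ignores
# the rest of the observations.
def lms_line(x, y, trials=500, seed=2):
    rng = np.random.default_rng(seed)
    best_crit, best_a, best_b = np.inf, 0.0, 0.0
    for _ in range(trials):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue                        # skip vertical candidate lines
        b = (y[j] - y[i]) / (x[j] - x[i])   # exact line through the pair
        a = y[i] - b * x[i]
        crit = np.median((y - a - b * x) ** 2)
        if crit < best_crit:
            best_crit, best_a, best_b = crit, a, b
    return best_a, best_b                   # intercept, slope
```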
The objective of my research was (1) to compare how efficiently three partially adaptive estimators perform relative to LMS and a weighted version of LMS (RLMS), and (2) to compare the effect of data contamination on these estimation techniques. I first generated several data samples with various error distributions and fit a regression line to each sample using the partially adaptive estimators, LS, LMS, and RLMS. I then compared the spread of each estimator’s b to gain a better understanding of how efficient these estimation techniques are relative to one another. When the errors were normal, LS was the most efficient, as expected, but its improvement over the most efficient partially adaptive estimator was only 3%. On the other hand, partially adaptive techniques outperformed LS in every non-normal case by at least 14% and as much as 82%. The improvement in efficiency of RLMS over LMS for normal errors was an impressive 49%. However, partially adaptive techniques outperformed RLMS for every error distribution considered by at least 8% and as much as 55%.
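This efficiency comparison can be sketched as a small Monte Carlo experiment. The fragment below compares only LS with the lms_line sketch above, using Student-t errors as a stand-in for one fat-tailed case; the partially adaptive estimators (BT, GT, EGB2) require maximum-likelihood fits omitted here, and none of these numbers are the study's results.

```python
import numpy as np

# Sketch of the efficiency experiment: simulate many samples with
# fat-tailed (Student-t, 3 d.f.) errors, fit each estimator, and
# compare the spread of the slope estimates b. Uses lms_line from
# the sketch above.
rng = np.random.default_rng(3)
n, alpha, beta = 50, 1.0, 2.0
x = rng.uniform(0, 10, n)                         # same x's for every sample

b_ls, b_lms = [], []
for _ in range(500):
    y = alpha + beta * x + rng.standard_t(3, n)   # non-normal errors
    b_ls.append(np.polyfit(x, y, 1)[0])
    b_lms.append(lms_line(x, y)[1])

# the estimator whose b has the smaller variance is the more efficient
print("var(b), LS :", round(float(np.var(b_ls)), 5))
print("var(b), LMS:", round(float(np.var(b_lms)), 5))
```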
Next, I generated a data sample with normal errors, and then gradually contaminated the sample with “dirty” data to determine how much contamination each estimator could tolerate. Breakdown plots were then constructed, which plot each estimator’s b as a function of the percentage of contamination in the data set. RLMS was found to resist up to 50% data contamination, as was found by Rousseeuw and Leroy (1987). In comparison, the BT, GT, and EGB2 partially adaptive estimators were found to resist 14%, 11%, and 10% contamination respectively, while LS could barely tolerate 1% contamination.
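A breakdown experiment can be sketched along the same lines. The fragment below contaminates a clean sample in 5% steps and tracks the LS slope; repeating it with each robust estimator in place of np.polyfit would trace out the breakdown plots described above. The contaminating point (20, 0) is an arbitrary illustrative choice.

```python
import numpy as np

# Sketch of a breakdown experiment: start from a clean sample with
# normal errors, replace a growing fraction of points with "dirty"
# data at an arbitrary location, and record the slope at each level.
rng = np.random.default_rng(4)
n, alpha, beta = 100, 1.0, 2.0
x = rng.uniform(0, 10, n)
y = alpha + beta * x + rng.normal(0, 1, n)      # clean, normal errors

for pct in range(0, 55, 5):
    m = n * pct // 100
    xc, yc = x.copy(), y.copy()
    xc[:m], yc[:m] = 20.0, 0.0                  # contaminating cluster
    b = np.polyfit(xc, yc, 1)[0]                # swap in a robust fit here
    print(f"{pct:2d}% dirty -> slope b = {b: .2f}")   # true beta is 2
```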
In summary, when the error term of a linear regression model deviates even slightly from normal, least squares becomes inefficient and, further, tends to mask outlying observations. Of the alternative estimation procedures considered in this paper, partially adaptive estimators appear to provide the most efficient estimates, and they perform well even when the errors are extremely fat-tailed or skewed. However, RLMS was found to be the most effective at resisting data contamination and, hence, is considered least likely to mask the presence of outliers.
References
- McDonald, J. B., and W. K. Newey (1988). “Partially adaptive estimation of regression models via the generalized t distribution,” Econometric Theory, 4, 428-57.
- Rousseeuw, P. J., and A. M. Leroy (1987). Robust Regression and Outlier Detection. New York: John Wiley and Sons.