Patrick A. Turley
Every mathematical model in the social sciences rests on assumptions that may or may not be justified. These may concern the behavior of the subjects of study (e.g., that people act in their own best interests) or the model itself (e.g., that one variable affects another linearly). A standard assumption in most regression models is that the model’s error is normally distributed, an assumption that does not hold in most cases. This research examined the more flexible g-and-h distribution as an alternative to the normal distribution.
The standard method for estimating regression models is Ordinary Least Squares (OLS), which, as mentioned above, assumes that the errors of the model are normally distributed. The normal distribution has a flexible mean and variance, but its skewness (a measure of asymmetry) and kurtosis (a measure of tail thickness) are fixed at 0 and 3, respectively. Many data sets exhibit a wide range of values for skewness and kurtosis, which leaves room for improvement over OLS models.
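The fixed skewness and kurtosis of the normal distribution can be checked numerically. The following is a minimal sketch, not part of the original research code (which was written in Matlab); the function name `sample_skewness_kurtosis`, the sample sizes, and the lognormal comparison are illustrative assumptions.

```python
import random
import statistics


def sample_skewness_kurtosis(data):
    """Return (skewness, kurtosis) computed from central moment ratios."""
    n = len(data)
    mean = statistics.fmean(data)
    m2 = sum((x - mean) ** 2 for x in data) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in data) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in data) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


rng = random.Random(0)  # fixed seed so the run is reproducible
normal_draws = [rng.gauss(0.0, 1.0) for _ in range(50_000)]
# A lognormal sample stands in for skewed, heavy-tailed data (an assumption).
lognormal_draws = [rng.lognormvariate(0.0, 0.5) for _ in range(50_000)]

print("normal:   ", sample_skewness_kurtosis(normal_draws))
print("lognormal:", sample_skewness_kurtosis(lognormal_draws))
```

For the normal sample the two values should land near 0 and 3, while the lognormal sample shows clearly positive skewness and a kurtosis above 3.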
Much research has sought to address this problem by relaxing the assumption of normality, using semiparametric methods and flexible distributions that can accommodate different values of skewness and kurtosis. The g-and-h distribution, developed by Tukey (1977), has not been carefully examined even though it is one of the most flexible distributions used in statistical modeling. This is likely because its density cannot be calculated directly: the distribution is defined as a transformation of the normal distribution that cannot be inverted analytically, and it therefore has no closed-form pdf.
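To illustrate why the density must be evaluated numerically, the sketch below assumes the usual form of the g-and-h transformation from the literature, X = (exp(gZ) - 1)/g * exp(hZ^2/2) with Z standard normal (location 0 and scale 1; the report itself does not write out the formula). It inverts the transformation by bisection and applies the change-of-variables formula. All function names are hypothetical, and the sketch only handles h >= 0, where the transformation is strictly increasing.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def gh_transform(z, g, h):
    """Assumed g-and-h transform of a standard normal value z."""
    tail = math.exp(h * z * z / 2.0)
    if g == 0.0:
        return z * tail  # limit of (exp(g z) - 1) / g as g -> 0
    return math.expm1(g * z) / g * tail


def gh_transform_slope(z, g, h):
    """Derivative of gh_transform with respect to z."""
    tail = math.exp(h * z * z / 2.0)
    if g == 0.0:
        return tail * (1.0 + h * z * z)
    return tail * (math.exp(g * z) + h * z * math.expm1(g * z) / g)


def gh_inverse(x, g, h, lo=-10.0, hi=10.0, tol=1e-12):
    """Solve gh_transform(z) = x for z by bisection on [lo, hi]."""
    if h < 0.0:
        raise ValueError("this sketch assumes h >= 0 (monotone transform)")
    if not gh_transform(lo, g, h) <= x <= gh_transform(hi, g, h):
        raise ValueError("x is outside the bracketed range of the transform")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gh_transform(mid, g, h) < x:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gh_pdf(x, g, h):
    """Numerical pdf: normal density at z divided by the transform's slope."""
    z = gh_inverse(x, g, h)
    return STD_NORMAL.pdf(z) / gh_transform_slope(z, g, h)


print(gh_pdf(1.0, g=0.5, h=0.1))
```

With g = 0 and h = 0 the transformation is the identity, so `gh_pdf` reduces to the standard normal density, which gives a quick sanity check. The bracket of plus or minus 10 is an arbitrary choice for illustration.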
Traditionally, the parameters of the g-and-h have been estimated with a method based on order statistics developed by Hoaglin (1985); cf. Dutta and Babbel (2005). Preliminary simulations, however, have shown that this method is not very efficient; in fact, order-statistic estimates are not even consistent over a portion of the domain. My research examined the use of numerical methods to estimate the parameters by method of moments and by maximum likelihood. These methods were then compared with one another through Monte Carlo simulations and applied in a regression analysis of financial data from the Center for Research in Security Prices database.
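The idea behind a quantile-based estimate of g can be written out in a few lines. This derivation assumes the usual parameterization X = A + B(e^{gZ} - 1)/g · e^{hZ^2/2}, which the report does not state explicitly, and it is meant as an illustration of the approach rather than a reproduction of Hoaglin's full procedure. Here x_p is the p-th quantile of X and z_p the p-th quantile of the standard normal, with p < 0.5.

```latex
\begin{align*}
x_p &= A + B\,\frac{e^{g z_p} - 1}{g}\,e^{h z_p^2/2},
  \qquad z_{1-p} = -z_p, \qquad x_{0.5} = A, \\[4pt]
\frac{x_{1-p} - x_{0.5}}{x_{0.5} - x_p}
  &= \frac{e^{-g z_p} - 1}{1 - e^{g z_p}}
   = e^{-g z_p}, \\[4pt]
g &= -\frac{1}{z_p}\,
     \ln\!\left(\frac{x_{1-p} - x_{0.5}}{x_{0.5} - x_p}\right).
\end{align*}
```

The scale B and the factor involving h cancel in the ratio, so replacing the population quantiles with sample quantiles gives an estimate of g for each chosen p.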
To conduct this research, I wrote several programs in Matlab. Some numerically evaluate the characteristics of the g-and-h distribution; others compare the relative performance of the various methods of estimating its parameters and compare the g-and-h with other, more common distributions at modeling data. The bulk of my time was spent writing these programs and running them on the university’s supercomputer. The results of the Monte Carlo simulations suggest that when a data set is in fact distributed as a g-and-h, maximum likelihood estimation slightly outperforms both method of moments estimation and quantile estimation. The stock returns application suggests that both maximum likelihood and quantile estimation outperform method of moments in modeling empirical data when the sample skewness and kurtosis are outside a certain range. When compared with other, simpler distributions, the g-and-h performed as well as, but no better than, several of the more flexible distributions tested.
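The general shape of such a Monte Carlo comparison can be sketched in a few lines of Python. This is not the Matlab code used in the research: it covers only a simple quantile-based estimate of g, assumes the usual g-and-h transformation with location 0 and scale 1, and uses hypothetical function names, parameter values, and replication counts.

```python
import math
import random
from statistics import NormalDist, fmean

STD_NORMAL = NormalDist()


def gh_draw(rng, g, h):
    """One draw from the assumed g-and-h form (location 0, scale 1, g != 0)."""
    z = rng.gauss(0.0, 1.0)
    return math.expm1(g * z) / g * math.exp(h * z * z / 2.0)


def empirical_quantile(sorted_xs, p):
    """Sample quantile by linear interpolation between order statistics."""
    pos = p * (len(sorted_xs) - 1)
    lower = int(math.floor(pos))
    upper = min(lower + 1, len(sorted_xs) - 1)
    frac = pos - lower
    return sorted_xs[lower] * (1.0 - frac) + sorted_xs[upper] * frac


def quantile_estimate_g(xs, p):
    """Estimate g from the ratio of upper to lower half-spreads at level p."""
    xs = sorted(xs)
    z_p = STD_NORMAL.inv_cdf(p)  # negative for p < 0.5
    lo, med, hi = (empirical_quantile(xs, q) for q in (p, 0.5, 1.0 - p))
    return -math.log((hi - med) / (med - lo)) / z_p


def monte_carlo_rmse(true_g, true_h, n, reps, p, seed):
    """Root mean squared error of the estimate over `reps` simulated samples."""
    rng = random.Random(seed)
    squared_errors = []
    for _ in range(reps):
        sample = [gh_draw(rng, true_g, true_h) for _ in range(n)]
        squared_errors.append((quantile_estimate_g(sample, p) - true_g) ** 2)
    return math.sqrt(fmean(squared_errors))


for n in (100, 400, 1600):
    rmse = monte_carlo_rmse(0.5, 0.1, n, reps=200, p=0.1, seed=42)
    print(n, round(rmse, 4))
```

A full comparison would add method of moments and maximum likelihood estimators inside the same replication loop and repeat the exercise over a grid of (g, h) values. A single parameter setting such as this one says nothing about how the estimators behave elsewhere in the domain.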
The greatest difficulty we encountered arose early in our study of the g-and-h distribution, when we realized that a large portion of its flexibility stems from its capacity to produce U-shaped pdfs. In all of the applications that my mentor and I discussed, we could not think of a reasonable use for a distribution of this shape. When the g-and-h distribution was restricted to traditionally shaped distributions, its flexibility was reduced to that of the inverse hyperbolic sine distribution, another flexible distribution considered in our research. Given that the g-and-h distribution is considerably more difficult to use computationally, and that any flexibility advantage is lost when it is restricted to practical parameterizations, my mentor and I concluded that the g-and-h distribution is not likely to have much practical significance.
Over the course of the last year, I have had opportunities to present this research at several venues. Last March I presented it at the Utah Academy of Sciences, Arts, and Letters Conference hosted by Brigham Young University. The following April, I prepared and presented a poster at the Mary Lou Fulton Mentored Learning Conference and received first place in my discipline. In August, my mentor and I presented our research at the Joint Statistical Meetings in Washington, D.C. We have also written a paper that has been submitted for publication to the Journal of the American Statistical Association.
Sources
- Dutta, K.K. and D.F. Babbel (2005). Extracting Probabilistic Information from the Prices of Interest Rate Options: Tests of Distributional Assumptions. The Journal of Business 78(3), 841-870.
- Hoaglin, D.C. (1985). In D.C. Hoaglin, F. Mosteller, and J.W. Tukey, eds., Exploring Data Tables, Trends, and Shapes, New York, NY: John Wiley and Sons, Inc, 417-460.
- Tukey, J.W. (1977). Exploratory Data Analysis. Reading, MA: Addison-Wesley.