Carla Johnston and Dr. James McDonald, Economics Department
Introduction
Does happiness depend on income? What puts people at risk of becoming “heavy smokers”? Do gender and wage affect job promotion? The answers to these varied questions have one thing in common: they employ grouped or categorical data. Happiness is often reported on scales of 1 to 10 (Winkelmann 2005). Tobacco users and cigarette smokers are asked if they are “non-users,” “light users,” or “heavy users” (Harris and Zhao 2007). In some professions, such as the British nursing field, careers are assigned ranks from one to six (Pudney and Shields 2000).
Categorization often cannot be avoided when collecting data. The nature of this categorical data should be taken into account when seeking causal relationships between the categorical variable of interest, such as happiness, and explanatory variables, such as income and education. In the past, statisticians and academics have used regression techniques called the ordered probit and the ordered logit to estimate relationships between explanatory variables and grouped data. Although these methods are a better option than the standard ordinary least squares technique, both make very restrictive assumptions about the distribution of the grouped data. Unfortunately, these theoretical assumptions often do not hold in actual data, which leaves an inconsistency between the data and the model. This inconsistency is similar to buying an 11-year-old nephew a size six shoe under the assumption that all 11-year-olds share the same, average shoe size. To your chagrin, your nephew has size eight feet and the shoes suddenly lose much of their forecasted usefulness. The purpose of this project is to remove this inconsistency from previous regression models, or at least to reduce it; we are trying to find the right size shoe for each set of data in order to make estimation results more reliable. We do this by incorporating more information about the data distribution when estimating the relationships of interest.
Methodology
We first had to write an estimation program that incorporated more information about the data. We wrote the program in Python, a high-level programming language. Incorporating this additional information meant that our estimation procedures had to be more complex than the standard ordered probit and ordered logit procedures. We first tested our hypothesis on simulated data. The results were promising: incorporating more information about the data led to a better estimate of the independent variable’s effect on the dependent variable. For our real-world application we used the World Values Survey data, waves 1-5. The first wave of survey data began in 1981 and the last one ended in 2009. The whole dataset contained over 100,000 observations, but we restricted our analysis to a random sample of 10,000 observations.
Our dependent variable was life satisfaction, which is measured on a scale of 1 to 10, with 1 indicating “very dissatisfied” and 10 indicating “extremely satisfied.” We wanted to measure the effects of income, gender, marital status, age, education and religious activity on life satisfaction. We ran four separate analyses on the data. The first two analyses were the ordered probit and another similar, basic estimation procedure. The last two analyses were methods we had programmed in Python that allowed for more complexity in the estimation framework.
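To make the two basic procedures concrete, the following is a minimal sketch of an ordered-response likelihood, not the program used in this project. It assumes a latent variable equal to beta times x plus an error term, with the observed category set by where the latent value falls among fixed cutpoints. Normal errors give the ordered probit, and Laplace errors stand in for the second basic procedure named in the Table 1 caption. All function names, the single regressor, the cutpoints and the grid search are illustrative assumptions.

```python
import math
import random


def normal_cdf(z):
    """Standard normal CDF (the error distribution of the ordered probit)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def laplace_cdf(z):
    """Standard Laplace CDF (location 0, scale 1)."""
    return 0.5 * math.exp(z) if z < 0 else 1.0 - 0.5 * math.exp(-z)


def category_prob(j, xb, cutpoints, cdf):
    """P(y = j | x) for categories j = 0, ..., len(cutpoints)."""
    upper = 1.0 if j == len(cutpoints) else cdf(cutpoints[j] - xb)
    lower = 0.0 if j == 0 else cdf(cutpoints[j - 1] - xb)
    return upper - lower


def log_likelihood(beta, cutpoints, xs, ys, cdf):
    """Sum of log category probabilities over all observations."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = category_prob(y, beta * x, cutpoints, cdf)
        total += math.log(max(p, 1e-300))  # guard against log(0)
    return total


def simulate(n, beta, cutpoints, seed=0):
    """Simulate grouped data from a latent variable with normal errors."""
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)
        latent = beta * x + rng.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(sum(latent > c for c in cutpoints))  # category index
    return xs, ys


# Hypothetical setup: one regressor, four categories, true beta = 1.
cuts = [-1.0, 0.0, 1.0]
xs, ys = simulate(2000, 1.0, cuts)

# Crude grid search over beta with the cutpoints held at their true values
# (a simplification; a full estimator would also estimate the cutpoints).
grid = [i / 20.0 for i in range(0, 41)]
for name, cdf in (("normal", normal_cdf), ("laplace", laplace_cdf)):
    best = max(grid, key=lambda b: log_likelihood(b, cuts, xs, ys, cdf))
    print(name, "beta_hat =", best)
```

Swapping the `cdf` argument is the only change needed to move from one error distribution to another, which is why the sketch passes it as a parameter. More flexible distributions would add shape or skewness parameters to that function.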
Results
The two more complicated estimation procedures yielded better results than the basic procedures. A statistical test of the improvement rejected the hypothesis that the more complicated procedures made no significant improvement. The results from the four procedures are shown in Table 1.
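The report does not name the statistical test it used. One common choice for comparing a basic model with a more flexible one that contains it as a special case is the likelihood ratio test, sketched below under that assumption. The function names and the log-likelihood values in the usage line are hypothetical placeholders, not results from this project.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    # Closed forms for df = 1 and df = 2, then a recurrence in steps of 2.
    if df % 2 == 1:
        q, k = math.erfc(math.sqrt(half)), 1
    else:
        q, k = math.exp(-half), 2
    while k < df:
        q += half ** (k / 2.0) * math.exp(-half) / math.gamma(k / 2.0 + 1.0)
        k += 2
    return q


def likelihood_ratio_test(loglik_restricted, loglik_flexible, extra_params):
    """Return the LR statistic and its chi-square p-value.

    Assumes the restricted model is nested in the flexible one and that
    extra_params counts the additional free parameters.
    """
    stat = 2.0 * (loglik_flexible - loglik_restricted)
    return stat, chi2_sf(stat, extra_params)


# Made-up log-likelihoods, for illustration only.
stat, p_value = likelihood_ratio_test(-2105.0, -2098.0, extra_params=1)
print("LR statistic:", stat, "p-value:", p_value)
```

A small p-value would reject the hypothesis that the extra parameters bring no improvement, which matches the kind of conclusion reported above.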
Table 1: Normal and Laplace refer to the basic estimation procedures. SLaplace and SGED refer to the more complicated estimation procedures. In a loose sense, the number attached to each variable translates to the effect that increasing that variable (for example, age) would have on the life satisfaction measure.
Discussion
From our results it appears that allowing for more complexity in the estimation framework does yield improved results. Interestingly, this empirical application also suggests that once a high enough level of complexity is attained, still more involved estimation procedures do not yield significant further improvement. Thus, beyond a certain threshold, more involved and complex estimation procedures may not be worth the added computation time.