Ben Saville and Dr. Lara Wolfson, Statistics
A valid assessment of school performance in Utah (or any other state) depends on hard data. For student performance, that information is readily available from test scores; for the job burden of teachers in the Utah educational system, very little hard data is available. I proposed to design a sample survey to address issues on which the state of Utah is “data deficient”, a gap that impairs the state’s ability to make well-informed decisions about improving working conditions for teachers. Issues to be addressed include the amount of money teachers spend out of pocket on classroom materials, the amount of parent involvement in classrooms, convenient housing options for teachers, and the number of hours teachers work outside the classroom.
My goal for the project is to design a sample frame that will help answer these questions and provide insight into other issues regarding Utah education. My efforts will be focused on the statistical aspects and theory of sampling; I will leave the actual development of questions to the Utah State Office of Education.
The basic idea of sampling is to select a random sample of teachers to question and use the results to estimate characteristics of the whole population of Utah teachers. Although a sample survey will not identify the exact characteristics of the population, one can quantify the accuracy of the results with a “margin of error”: a bound on how far a sample estimate is likely to fall from the true population value at a stated level of confidence. The design I will use is a multistage stratified sample that groups homogeneous units together at different levels of the design. This design incorporates samples at multiple levels and several types of grouping, which will allow more precise comparisons between counties, school districts, and schools.
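To make the margin of error concrete, here is a minimal sketch in Python. It assumes a simple random sample and a survey question answered as a proportion, which is simpler than the multistage stratified design described above; the function name, the sample size of 400, and the population of 20,000 teachers are hypothetical values chosen only for illustration.

```python
import math

def margin_of_error(p_hat, n, N=None, z=1.96):
    """Approximate margin of error for a sample proportion.

    Assumes a simple random sample (an illustrative simplification).
    p_hat: sample proportion, n: sample size,
    N: population size (optional, enables the finite population correction),
    z: normal critical value (1.96 corresponds to about 95% confidence).
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    if N is not None:
        # Finite population correction: sampling a sizable share of a
        # finite population reduces the standard error.
        se *= math.sqrt((N - n) / (N - 1))
    return z * se

# Hypothetical numbers: 400 respondents, half of whom answer "yes".
moe_infinite = margin_of_error(0.5, 400)
moe_finite = margin_of_error(0.5, 400, N=20000)
print(round(moe_infinite, 4), round(moe_finite, 4))
```

With these hypothetical inputs the margin of error is a little under five percentage points. A stratified or clustered design would need a design-specific variance formula in place of the simple one used here.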
There are three main parts of this study. The first and most time-consuming task is to develop two large databases containing information on possible influential factors. The second is to determine the stratifications and groupings necessary to make valid comparisons. Third, I need to calculate the sample size required, and write the actual SAS code to determine the number of teachers to be sampled from various schools across the state.
The two databases are built at different levels: one at the school district level, and the other at the individual school level. Creating the databases was a frustrating and challenging task, as I vastly underestimated the amount of time and effort needed to collect the necessary data. This process took the majority of my time this semester. I believe the project was worthwhile for me based on this part of the study alone. In academic settings, “clean” data are usually handed to students studying statistics, meaning the data are ready to be analyzed with statistical software. Students gain little experience gathering data or dealing with “messy” data. In the real world (e.g., in industry jobs), however, “clean” data are rarely handed to statisticians. Having had experience collecting and organizing a database, I am better prepared to be a successful statistician in industry.
All of the data were collected via the internet. Charter schools, private schools, and special education schools were excluded from this study. A great deal of information was collected from the 2001 Common Core of Data (CCD) files, a database managed by the National Center for Education Statistics. Other sources were the Financial and Statistical Data Files (Annual Statistical and Financial Report 2000-2001), the Education Finance Statistics Center, and Proximity’s Census 2000 Summary Demographics Table. The demographic data are based on Census data from various years, ranging from 1990 to 2000. When all the data were gathered, I had over 50 datasets with hundreds of different variables. I painstakingly went through each dataset and flagged the variables that I thought might be influential in my study. I then merged all the important variables into the two databases. The school district database contains 114 variables, and the school database contains 46 variables. Each of these variables has been documented so that I can trace its source, including the website and the file that was downloaded. Examples of the variables included are student enrollments (by grade, gender, and ethnicity), number of teachers, median household income, minority percentages, expenditures per pupil, and average teacher salaries.
Presently, I am examining the data to determine the best ways to stratify the sample. One stratification will likely be rural vs. urban schools. This will allow me to compare teachers in these two groups, despite the fact that there are many more teachers in urban areas than in rural areas. Other possible stratification variables include boundaries of school districts, grade levels of teachers, student-teacher ratios, student growth rate, community household income levels, and ethnicity percentages.
Once I decide on the actual sampling design, I will use the laws of probability to develop a sampling scheme that accurately reflects the population. Because the sample will be stratified on several factors, teachers in different groups will have different chances of being selected, and I must take that into account for the results to reflect the population. I will use a technique called “sample weights” to address this: each sampled teacher is weighted by the inverse of his or her probability of selection, so that groups sampled at a higher or lower rate than their share of the population count in proportion to the teachers they represent.
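As a rough illustration of stratification and sample weights, the following Python sketch draws a simple random sample within each stratum and attaches a weight to each sampled teacher. It is not the project's SAS code; the rural/urban stratum sizes, the allocation, and the function name are all hypothetical. The rural stratum is deliberately sampled at a higher rate so that it yields enough respondents for comparison.

```python
import random

def stratified_sample(frame, allocation, seed=0):
    """Draw a simple random sample within each stratum.

    frame: dict mapping stratum name -> list of teacher IDs (the sampling frame)
    allocation: dict mapping stratum name -> number of teachers to sample
    Returns a list of (teacher_id, stratum, weight) tuples, where the weight
    is the inverse of the selection probability n_h / N_h.
    """
    rng = random.Random(seed)
    sample = []
    for stratum, units in frame.items():
        n_h = allocation[stratum]
        weight = len(units) / n_h  # each sampled teacher represents this many teachers
        for unit in rng.sample(units, n_h):
            sample.append((unit, stratum, weight))
    return sample

# Hypothetical frame: 900 urban teachers and 100 rural teachers.
frame = {
    "urban": [f"U{i}" for i in range(900)],
    "rural": [f"R{i}" for i in range(100)],
}
# Hypothetical allocation that oversamples the rural stratum.
allocation = {"urban": 90, "rural": 50}

sample = stratified_sample(frame, allocation)
estimated_total = sum(weight for _, _, weight in sample)
print(len(sample), estimated_total)
```

Each urban teacher in the sample carries a weight of 10 and each rural teacher a weight of 2, so the weights sum to the 1,000 teachers in the frame even though rural teachers make up more than a third of the sample.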
One of the most difficult problems with surveys is non-response. The non-response rate is the percentage of people who do not give feedback when contacted. In order to obtain the desired amount of feedback, I must compensate for non-response by statistically estimating the projected rate of non-response and then taking a correspondingly larger sample. If I do not compensate for non-response, I will not acquire enough data from the different demographic regions of the state to make effective comparisons and draw conclusions.
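The adjustment for non-response can be sketched as follows. This Python example assumes a simple random sample and a proportion-type question, and the 5-point margin of error and 60% response rate are hypothetical planning values, not figures from this study. It first computes the number of completed responses needed, then divides by the expected response rate to get the number of teachers to contact.

```python
import math

def required_sample_size(moe, p=0.5, z=1.96, N=None):
    """Completed responses needed to estimate a proportion within +/- moe.

    Assumes simple random sampling; p=0.5 is the most conservative choice.
    If the population size N is given, a finite population correction is applied.
    """
    n0 = z**2 * p * (1 - p) / moe**2
    if N is not None:
        n0 = n0 / (1 + (n0 - 1) / N)
    return math.ceil(n0)

def inflate_for_nonresponse(n_needed, response_rate):
    """Number of people to contact so that about n_needed are expected to respond."""
    return math.ceil(n_needed / response_rate)

# Hypothetical planning values: +/- 5 points, 60% expected response rate.
n_needed = required_sample_size(0.05)
n_to_contact = inflate_for_nonresponse(n_needed, 0.6)
print(n_needed, n_to_contact)
```

In a stratified design the same inflation would be applied within each stratum, using a response rate estimated for that stratum where one is available.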
Thus far I have enjoyed the challenge of developing a sample survey. There are many aspects to consider in designing such a survey, more than I had originally suspected. This project has helped me develop new skills and gain a deeper understanding (and appreciation) of sample surveys. I will continue to work on the remaining portions of this study, and hope to employ a successful and accurate sample survey that will benefit the Utah State Office of Education and Utah teachers.