Gibson, Kaitlin
Identifying Risk Factors for Interstate Crashes using Spatial Statistics
Faculty Mentor: Matthew Heaton, BYU Department of Statistics
Introduction
The goal of systemic highway safety improvement is to identify road characteristics,
called risk factors, associated with a higher prevalence of crashes, so that the roads can
be modified to avoid these characteristics. However, the statistical analyses underlying
these improvement projects generally use methods that are not appropriate for the
data. Our project implements novel, state-of-the-art spatial statistical methodology
that is appropriate for the data to identify these risk factors.
Methodology
To accomplish these goals, we modeled data from the Highway Safety Information
System, a roadway database maintained by the Federal Highway Administration,
specifically the data from five interstate highways in Washington State in 2012. This
dataset contains information on the characteristics of the roads (speed limits, number of
lanes, etc.), as well as the locations of crashes. Since the response variable of interest
is the location of each crash on the road, we could not use standard regression
techniques. We instead treated each location as a realization from a nonhomogeneous
spatial point process, whose intensity surface can be understood as an underlying function
that controls how many crashes occur at each location in our data. Since the intensity surface that
governs the point process is difficult to estimate in its true continuous form, we instead
split the roads up into one-mile segments and made the intensity surface discrete,
which makes estimation much simpler. We then sought to estimate λk, the value of the
discretized intensity surface on segment k, which is the expected number of crashes for
that segment. To control for traffic (areas with more traffic tend to have more crashes), we
defined λk = Ekμk, where Ek is the number of crashes expected on segment k based on daily traffic.
Thus μk = λk/Ek can be interpreted as the “relative risk” of segment k. If 0 < μk < 1, segment k had
fewer crashes than expected given its traffic, and if μk > 1, segment k had more than
expected. Since Ek can be calculated from our data, we need only estimate μk in order
to estimate λk. Because μk is constrained to be greater than 0, we performed
Gaussian process regression (a regression method which incorporates spatial
correlation) on the natural logarithm of μ, the vector of the μk’s for each road, as follows:
log(μ) ~ N(Xβ,Σ)
where X is a matrix containing the road characteristics for each segment and β is a
vector of the effects of these characteristics on the relative risk. Σ is the spatial
correlation matrix based on driving distance between road segments. It is important to
include this spatial piece, since road segments close to each other tend to be similar
and thus not independent of each other. Including Σ allows us to use that dependence
to make better estimates. We used Bayesian methods to obtain draws of the posterior
distributions of the β’s and the μk’s.
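As a rough illustration of the discretized model described above, the following Python sketch counts crashes in one-mile segments, computes an expected count Ek, forms the crude ratio of observed to expected crashes, and simulates counts from log(μ) ~ N(Xβ,Σ) with Poisson counts of mean Ekμk. It is not the code used in this project. The mileposts, traffic volumes, crash locations, design matrix, and β values are made up. Allocating total crashes in proportion to daily traffic is only one plausible way to compute Ek, and the exponential correlation function and its range parameter are assumptions, since the exact form of Σ is not given here. On a single road, driving distance is taken to be the difference in mileposts; distances between different roads would require the road network.

```python
import numpy as np

def segment_counts(crash_mileposts, n_segments):
    """Count crashes falling in each one-mile segment [k, k+1)."""
    idx = np.floor(np.asarray(crash_mileposts)).astype(int)
    return np.bincount(idx, minlength=n_segments)

def expected_counts(counts, daily_traffic):
    """Assumed form of E_k: total crashes allocated in proportion to traffic."""
    daily_traffic = np.asarray(daily_traffic, dtype=float)
    return counts.sum() * daily_traffic / daily_traffic.sum()

def exponential_correlation(positions, range_miles):
    """Assumed form of Sigma: correlation decays with driving distance."""
    positions = np.asarray(positions, dtype=float)
    dist = np.abs(positions[:, None] - positions[None, :])
    return np.exp(-dist / range_miles)

def simulate_counts(E, X, beta, Sigma, rng):
    """Draw log(mu) ~ N(X beta, Sigma), then crash counts ~ Poisson(E * mu)."""
    log_mu = rng.multivariate_normal(X @ beta, Sigma)
    return rng.poisson(E * np.exp(log_mu))

# Made-up data for one hypothetical road with six one-mile segments.
crashes = [0.2, 0.7, 2.1, 2.5, 2.9, 3.3, 3.4, 5.8]    # crash mileposts
traffic = [40000, 42000, 55000, 60000, 30000, 28000]  # daily traffic
midpoints = np.arange(6) + 0.5                         # segment midpoints

counts = segment_counts(crashes, n_segments=6)
E = expected_counts(counts, traffic)
crude_relative_risk = counts / E  # observed / expected, with no spatial smoothing
Sigma = exponential_correlation(midpoints, range_miles=2.0)

# Hypothetical design matrix: an intercept plus one road characteristic.
X = np.column_stack([np.ones(6), [0.1, 0.3, 0.9, 0.8, 0.2, 0.1]])
beta = np.array([0.0, 0.5])
simulated = simulate_counts(E, X, beta, Sigma, np.random.default_rng(0))
```

The crude ratio is noisy on short segments with few crashes, which is one reason to model log(μ) jointly: the spatial correlation lets neighboring segments inform each other's estimates.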
Results
The above table shows the posterior means of the β’s, which are estimates of the
effects of each characteristic, as well as the probability that the absolute value of each
effect is greater than 0. If the probability is high (close to 1), the characteristic most
likely has a true impact on the relative risk. The β’s can be interpreted most easily by
their signs. For example, since the estimated effect of average curvature is positive,
the relative risk of a segment increases, on average, as its curvature increases, and there are more
crashes than expected given traffic. For left shoulder width, the posterior mean is
negative, so segments with wide left shoulders tend to have fewer crashes than expected
given traffic. The categorical variables (terrain, median, and speed limit) can be
interpreted relative to their baseline levels (level terrain, soil median, and 70 MPH). Since
rolling terrain has a positive posterior mean, segments with rolling terrain tend to
have higher relative risk than those with level terrain.
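The summaries discussed above can be computed directly from posterior draws. The sketch below is illustrative only: the draws are made up, the function name is hypothetical, and reading the reported probability as the posterior probability that an effect has the same sign as its posterior mean is an assumption. Because the model is on the log scale, exp(β) gives the multiplicative change in relative risk for a one-unit increase in a characteristic.

```python
import numpy as np

def summarize_draws(draws):
    """Summarize posterior draws of shape (n_draws, n_coefficients)."""
    draws = np.asarray(draws, dtype=float)
    post_mean = draws.mean(axis=0)
    p_positive = (draws > 0).mean(axis=0)
    p_negative = (draws < 0).mean(axis=0)
    # Assumed reading: probability that the effect lies on its dominant side of 0.
    p_sign = np.maximum(p_positive, p_negative)
    # Multiplier on relative risk, evaluated at the posterior mean.
    multiplier = np.exp(post_mean)
    return post_mean, p_sign, multiplier

# Made-up draws for two hypothetical coefficients (four draws each).
draws = [[0.2, -0.1],
         [0.4, -0.3],
         [0.1,  0.2],
         [0.3, -0.2]]
post_mean, p_sign, multiplier = summarize_draws(draws)
```

In this toy example every draw of the first coefficient is positive, while one of the four draws of the second coefficient is positive, so the second effect's sign is less certain.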
Discussion
Although we emphasize that our results are associations rather than causal
relationships, the roadway factors which our model identifies as being associated with
higher occurrence of crashes than expected are high levels of curvature, rolling or
mountainous terrain (compared to level terrain), and speed limits of 65 MPH or lower
(compared to a baseline of 70 MPH). Factors associated with a lower occurrence of
crashes than expected are wide shoulders, wide lanes, and non-soil medians (such as
concrete barriers). These results make sense intuitively, except that lower speed limits
appear to be associated with more crashes. We hypothesize that some other underlying
factor confounded with lower speed limits, such as pavement type or an environmental
factor, is the true cause of this association.
Conclusion
In this project, we implemented a framework to use spatial statistical methods over a
roadway network to identify factors associated with high crash counts. We did so
using a nonhomogeneous Poisson process that incorporates spatial correlation
based on driving distance, an approach that is appropriate for the data.