Gibson, Kaitlin

## Identifying Risk Factors for Interstate Crashes using Spatial Statistics

Faculty Mentor: Matthew Heaton, BYU Department of Statistics

### Introduction

The goal of systemic highway safety improvement is to identify road characteristics, called risk factors, that are associated with a higher prevalence of crashes, so that roads can be modified to avoid these characteristics. However, the statistical analyses on which these improvement projects are based generally use methods that are not appropriate for the data. Our project implements novel, state-of-the-art spatial statistical methodology that is appropriate for identifying these risk factors.

### Methodology

To accomplish these goals, we modeled data from the Highway Safety Information System, a roadway database maintained by the Federal Highway Administration, specifically the data from five interstate highways in Washington State in 2012. This dataset contains information on the characteristics of the roads (speed limits, number of lanes, etc.), as well as the locations of crashes. Since the response variable of interest is the location of each crash on the road, we could not use standard regression techniques. We instead treated each location as a realization from a nonhomogeneous spatial point process, which can be understood as being driven by an underlying function that controls how many crashes occur at each location in our data. Since the intensity surface that governs the point process is difficult to estimate in its true continuous form, we instead split the roads into one-mile segments and made the intensity surface discrete, which makes estimation much simpler. We then sought to estimate λk, the parameter that controls the intensity surface and is the expected number of crashes for segment k. First, to control for traffic (areas with more traffic tend to have more crashes), we defined λk = Ekμk, where Ek is the number of crashes expected on segment k based on daily traffic.
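The discretization and traffic adjustment described above can be illustrated with a minimal sketch. The function names, the toy mileposts, and the traffic volumes are hypothetical, and the rule used for the expected counts (splitting the total number of crashes across segments in proportion to daily traffic) is an assumption made for illustration; the report does not state exactly how Ek was computed.

```python
import numpy as np

def segment_counts(crash_mileposts, road_length_miles):
    """Count crashes in consecutive one-mile segments [0, 1), [1, 2), ..."""
    n_segments = int(np.ceil(road_length_miles))
    edges = np.arange(n_segments + 1)  # 0, 1, ..., n_segments
    counts, _ = np.histogram(crash_mileposts, bins=edges)
    return counts

def expected_counts(counts, daily_traffic):
    """ASSUMPTION: allocate the total crash count in proportion to traffic."""
    daily_traffic = np.asarray(daily_traffic, dtype=float)
    return counts.sum() * daily_traffic / daily_traffic.sum()

# Toy data (hypothetical): six crashes on a four-mile road.
crash_mileposts = np.array([0.2, 0.7, 1.5, 3.1, 3.3, 3.9])
daily_traffic = np.array([20000, 20000, 10000, 10000])

counts = segment_counts(crash_mileposts, road_length_miles=4.0)
expected = expected_counts(counts, daily_traffic)

# Ratio of observed to expected crashes per segment (a crude, unsmoothed
# summary; the model in this report estimates it with spatial smoothing).
observed_over_expected = counts / expected
print(counts, expected, observed_over_expected)
```

A ratio above 1 flags a segment with more crashes than its traffic alone would suggest, and a ratio below 1 flags one with fewer.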

Now, μk can be interpreted as the “relative risk” of segment k. If 0 < μk < 1, segment k had fewer crashes than expected given its traffic, and if μk > 1, segment k had more than expected. Since Ek can be calculated from our data, we need only estimate μk in order to estimate λk. Because μk is constrained to be greater than 0, we performed Gaussian process regression (a regression method that incorporates spatial correlation) on the natural logarithm of μ, the vector of the μk’s for each road, as follows:

log(μ) ~ N(Xβ,Σ)

where X is a matrix containing the road characteristics for each segment and β is a vector of the effects of these characteristics on the (log) relative risk. Σ is the spatial correlation matrix based on driving distance between road segments. It is important to include this spatial component, since road segments close to each other tend to be similar and thus not independent of each other. Including Σ allows us to use that dependence to make better estimates. We used Bayesian methods to obtain draws from the posterior distributions of the β’s and the μk’s.
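To make the role of Σ concrete, the sketch below simulates relative risks from a model of the form log(μ) ~ N(Xβ, Σ) on a single road. Several pieces are assumptions chosen for illustration and are not taken from the report: the exponential covariance function, its variance and range values, the single made-up covariate, and the β values. On one road, driving distance between one-mile segments is taken to be the difference in mileposts.

```python
import numpy as np

def exponential_covariance(distances, sigma2=0.25, range_miles=3.0):
    """ASSUMED form: covariance decays exponentially with driving distance."""
    return sigma2 * np.exp(-distances / range_miles)

rng = np.random.default_rng(0)

# One road with 50 one-mile segments; distance = difference in mileposts.
n_segments = 50
mileposts = np.arange(n_segments, dtype=float)
distances = np.abs(mileposts[:, None] - mileposts[None, :])

# Hypothetical design matrix: intercept plus one standardized covariate.
X = np.column_stack([np.ones(n_segments), rng.normal(size=n_segments)])
beta = np.array([0.0, 0.3])  # illustrative effects, not estimates

Sigma = exponential_covariance(distances)

# Draw log relative risks jointly, then exponentiate so every mu_k > 0.
log_mu = rng.multivariate_normal(X @ beta, Sigma)
mu = np.exp(log_mu)
print(mu[:5])
```

Because nearby segments have large covariance, neighboring values of μk in a simulated draw tend to move together, which is the dependence the model exploits when estimating the relative risks.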

### Results

The above table shows the posterior means of the β’s, which are estimates of the effects of each characteristic, as well as the probability that the absolute value of each effect is greater than 0. If the probability is high (close to 1), the characteristic most likely has a true impact on the relative risk. The β’s can be interpreted most easily by their signs. For example, since the estimated effect of average curvature is positive, the relative risk of a segment increases, on average, as its average curvature increases, and there are more crashes than expected given traffic. For left shoulder width, the posterior mean is negative, so segments with wide left shoulders tend to have fewer crashes than expected given traffic. The categorical variables (terrain, median, and speed limit) can be interpreted relative to their baseline levels (level terrain, soil median, and 70 MPH). Since rolling terrain has a positive posterior mean, segments with rolling terrain tend to have higher relative risk than those with level terrain.
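Summaries of this kind can be computed directly from posterior draws. The sketch below is one plausible reading, not the report's confirmed calculation: it takes the reported probability to be the share of draws whose sign matches the sign of the posterior mean. The function name and the simulated draws are hypothetical stand-ins for real MCMC output.

```python
import numpy as np

def summarize_draws(draws):
    """Posterior mean and P(effect has the same sign as its posterior mean).

    draws: array of shape (n_draws, n_effects).
    ASSUMPTION: this sign-based probability is an interpretation of the
    probability reported in the table, not a confirmed definition.
    """
    draws = np.asarray(draws, dtype=float)
    post_mean = draws.mean(axis=0)
    prob_positive = (draws > 0).mean(axis=0)
    prob_same_sign = np.where(post_mean >= 0, prob_positive, 1 - prob_positive)
    return post_mean, prob_same_sign

# Simulated stand-in for posterior draws of three effects.
rng = np.random.default_rng(1)
fake_draws = np.column_stack([
    rng.normal(0.4, 0.1, size=5000),   # clearly positive effect
    rng.normal(-0.2, 0.1, size=5000),  # clearly negative effect
    rng.normal(0.0, 0.1, size=5000),   # no clear effect
])

post_mean, prob = summarize_draws(fake_draws)
for m, p in zip(post_mean, prob):
    print(f"posterior mean {m:+.3f}, probability of that sign {p:.3f}")
```

Under this reading, a probability near 1 means almost all posterior draws agree on the direction of the effect, while a probability near 0.5 means the direction is essentially undetermined.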

### Discussion

Although we emphasize that our results are associations rather than causal relationships, the roadway factors that our model identifies as being associated with a higher occurrence of crashes than expected are high levels of curvature, rolling or mountainous terrain (compared to level terrain), and speed limits of 65 MPH or lower (compared to a baseline of 70 MPH). Factors associated with a lower occurrence of crashes than expected are wide shoulders, wide lanes, and non-soil medians (such as concrete barriers). These results make sense intuitively, except that lower speed limits appear to be associated with more crashes. We hypothesize that some other underlying factor confounded with lower speed limits, such as pavement type or an environmental factor, is the true cause of this association.

### Conclusion

In this project, we implemented a framework for using spatial statistical methods over a roadway network to identify factors associated with high crash counts. We accomplished this using a nonhomogeneous Poisson process that incorporates spatial correlation based on driving distance, which is an appropriate method for the data.