To gain more experience with Hadley Wickham’s tidyverse, I’m going to look at a dataset of US traffic fatalities occurring in 2015, as compiled by the National Highway Traffic Safety Administration (NHTSA). I found this dataset on the Kaggle website.
The file accident.csv used in this post contains 52 different fields pertaining to 32,166 fatal automobile accidents. Data field specifications can be found in the Analytical User’s Manual.
Let’s begin with something very simple: the number of fatal accidents by day of the week. Ex ante, I expect that most such accidents would occur on Friday and Saturday, which seem to be the days with the heaviest traffic — especially in the evening hours. Here’s my code to generate a bar chart:
library("tidyverse") accidents <- read_csv("accident.csv") ggplot(data=accidents) + geom_bar(mapping=aes(x=DAY_WEEK),color="deepskyblue2",fill="deepskyblue2") + scale_x_continuous(breaks = 1:7,labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")) + labs(x="Day of Week",y="Number of Accidents") + ggtitle("2015 Fatal Traffic Accidents")
Of course, ggplot produces a perfectly serviceable bar chart with just the following:
ggplot(data=accidents) + geom_bar(mapping=aes(x=DAY_WEEK))
…but I couldn’t resist tweaking it. I wanted to see the days of the week written out on the x-axis rather than the numbers 1 to 7. I also added a title and changed the default bar colors. Here’s the plot:
At first I thought there was a mistake somewhere, because it seemed odd that there would be almost as many fatal accidents on Sunday as on Saturday. Sunday is a pretty quiet day in my experience, with perhaps the lightest traffic of any day of the week. But of course, Saturday night continues into Sunday morning, and that may be the effect we’re seeing.
To investigate further, I made a heat map of accidents by day of the week and hour of the day. I used the RColorBrewer package to define the palette for the heat map because I don’t like the default values. A few additional tweaks got the figure looking the way I wanted it to:
library(RColorBrewer) hm.palette <- colorRampPalette(brewer.pal(9, 'YlOrRd'), space='Lab') dayTime <- accidents %>% filter(HOUR != 99) %>% group_by(DAY_WEEK,HOUR) %>% summarize(totl = n()) ggplot(dayTime,aes(x=HOUR,y=DAY_WEEK,fill=totl)) + geom_tile() + scale_fill_gradientn(colors = hm.palette(100)) + scale_y_continuous(breaks = 1:7,labels = c("Sun","Mon","Tue","Wed","Thu","Fri","Sat")) + labs(x="Hour of Day",y="Day of Week")
Here is the result:
What does this figure show? In general, more fatalities occurred between the hours of 3 PM and 9 PM, due I would imagine to the higher number of cars being driven during these hours, and reduced visibility as the sun goes down. On the other hand, the hours between 11 PM and 4 AM are quite safe, at least on Monday through Thursday. Although there is complete darkness at this time, fewer vehicles are out on the roads.
Incidentally, the hours of the week with the most fatal accidents in 2015 were Sunday morning at 2 AM and Saturday evening at 8 PM. Drivers take heed!
One thing that surprised me was the large number of fatal accidents occurring on Sunday evenings. So the larger number of accidents on Sunday overall is not just due to the continuation of Saturday night. I’m usually at home having dinner on Sunday evening so I’m not sure why this is. Are these people coming home from weekend trips?