Now that retrosheet.org has posted complete data for the 2016 Major League Baseball season, I was curious to see how the number of games won by each team correlates with its run differential, i.e. the difference between the total number of runs scored and the total number of runs allowed over the season.

The file GL2016.txt, available here, contains a plethora of data: 161 fields for each game played during the regular season. Here I’m interested in only four items: the Visiting Team and the number of runs they scored, and the Home Team and the number of runs they scored. After reading in the data (using *read_csv*), I used the *select* function from *dplyr* to choose only those four fields:

    # yet another baseball visualization
    require(tidyverse)
    s2016 <- read_csv("GL2016.txt", col_names = FALSE)
    s2016 <- s2016 %>%
      select(VisTeam = X4, HomeTeam = X7,
             VisScore = X10, HomeScore = X11)

The data frame s2016 looks like this:

    > s2016
    # A tibble: 2,428 × 4
       VisTeam HomeTeam VisScore HomeScore
    1      NYN      KCA        3         4
    2      TOR      TBA        5         3
    3      SLN      PIT        1         4
    4      CHN      ANA        9         0
    5      MIN      BAL        2         3
    6      CHA      OAK        4         3
    7      TOR      TBA        5         3
    8      SEA      TEX        2         3
    9      COL      ARI       10         5
    10     WAS      ATL        4         3
    # ... with 2,418 more rows

My objective then was to build a new data frame containing the total run differential and the total number of games won for each team. I did this in two stages: first for home games, and then for road games. Consider the home games first:

    HomeGames <- s2016 %>%
      group_by(HomeTeam) %>%
      summarize(RS_H = sum(HomeScore),
                RA_H = sum(VisScore),
                W_H  = sum(HomeScore > VisScore)) %>%
      rename(Team = HomeTeam)

The above script chains together three different *dplyr* verbs using the pipe. After *group*ing by HomeTeam, I sum up the number of runs scored, runs allowed, and the number of wins for each team. The suffix _H means that each of these sums is for home games only.

With the runs counted up this way there’s no longer a need to maintain the distinction between the home and visiting teams, so I renamed the HomeTeam column as Team. This will be useful later on.
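To see what the home-game aggregation does, here is a minimal sketch that runs the same pipeline on just the ten games shown in the printout above. The names `first10` and `home10` are mine, introduced only for this illustration. Note that `sum()` applied to a logical vector such as `HomeScore > VisScore` counts the `TRUE` values, which is how the wins get tallied.

```r
library(dplyr)

# First ten games of the 2016 log, copied from the s2016 printout
# (first10 and home10 are illustrative names, not part of the original script)
first10 <- tibble(
  VisTeam   = c("NYN", "TOR", "SLN", "CHN", "MIN", "CHA", "TOR", "SEA", "COL", "WAS"),
  HomeTeam  = c("KCA", "TBA", "PIT", "ANA", "BAL", "OAK", "TBA", "TEX", "ARI", "ATL"),
  VisScore  = c(3, 5, 1, 9, 2, 4, 5, 2, 10, 4),
  HomeScore = c(4, 3, 4, 0, 3, 3, 3, 3, 5, 3)
)

# Same verbs as in the full-season script: group, summarize, rename
home10 <- first10 %>%
  group_by(HomeTeam) %>%
  summarize(RS_H = sum(HomeScore),             # runs scored at home
            RA_H = sum(VisScore),              # runs allowed at home
            W_H  = sum(HomeScore > VisScore))  %>%  # TRUE counts as 1
  rename(Team = HomeTeam)

# Tampa Bay hosted two of these ten games and lost both, 5-3
filter(home10, Team == "TBA")
```

Tampa Bay is the only team with two home games in this sample, so its row shows the grouping at work: 6 runs scored, 10 allowed, and 0 wins.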

I followed the same procedure for road games:

    RoadGames <- s2016 %>%
      group_by(VisTeam) %>%
      summarize(RS_R = sum(VisScore),
                RA_R = sum(HomeScore),
                W_R  = sum(VisScore > HomeScore)) %>%
      rename(Team = VisTeam)

The final data comes together with a full join of HomeGames and RoadGames on the shared Team column, followed by some manipulations using *mutate*:

    # join on the shared Team column, stated explicitly
    AllGames <- full_join(HomeGames, RoadGames, by = "Team") %>%
      mutate(Diff = RS_H + RS_R - RA_H - RA_R,
             Wins = W_H + W_R) %>%
      arrange(desc(Diff)) %>%
      select(Team, Diff, Wins)
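A quick way to gain confidence in a table like this is to check two invariants: every run scored by one team is allowed by another, so the differentials must sum to zero, and every decided game has exactly one winner, so total wins must equal the number of games minus any ties. The sketch below runs the home/road/join steps on a made-up three-team schedule; the teams, scores, and the names `toy`, `home`, `road`, and `all_toy` are all hypothetical.

```r
library(dplyr)

# Hypothetical six-game schedule among three made-up teams (one tie)
toy <- tibble(
  VisTeam   = c("AAA", "BBB", "CCC", "AAA", "BBB", "CCC"),
  HomeTeam  = c("BBB", "CCC", "AAA", "CCC", "AAA", "BBB"),
  VisScore  = c(3, 5, 1, 6, 2, 3),
  HomeScore = c(4, 2, 1, 0, 7, 8)
)

home <- toy %>%
  group_by(HomeTeam) %>%
  summarize(RS_H = sum(HomeScore), RA_H = sum(VisScore),
            W_H = sum(HomeScore > VisScore)) %>%
  rename(Team = HomeTeam)

road <- toy %>%
  group_by(VisTeam) %>%
  summarize(RS_R = sum(VisScore), RA_R = sum(HomeScore),
            W_R = sum(VisScore > HomeScore)) %>%
  rename(Team = VisTeam)

all_toy <- full_join(home, road, by = "Team") %>%
  mutate(Diff = RS_H + RS_R - RA_H - RA_R,
         Wins = W_H + W_R) %>%
  arrange(desc(Diff)) %>%
  select(Team, Diff, Wins)

# Invariant 1: run differentials cancel out across the league
sum(all_toy$Diff) == 0

# Invariant 2: total wins equal games played minus ties
n_ties <- sum(toy$VisScore == toy$HomeScore)
sum(all_toy$Wins) == nrow(toy) - n_ties
```

The same two checks can be run against the real data frames by substituting AllGames for `all_toy` and s2016 for `toy`.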

I needed the regression equation for a linear model relating the number of wins to the run differential so that I could place it on the figure I wanted to make:

    lmod <- lm(Wins ~ Diff, data = AllGames)
    lm_coef <- round(coef(lmod), 3)
    mylab <- paste("Linear Fit: Wins = ", lm_coef[2],
                   "*Run_Diff + ", lm_coef[1], sep = "")

The script for the figure is:

    ggplot(AllGames, aes(x = Diff, y = Wins)) +
      geom_point(size = 3, color = "blue") +
      geom_smooth(method = "lm", color = "black") +
      ggtitle("2016 Major League Baseball Season") +
      xlab("Run_Diff = Runs Scored - Runs Allowed") +
      ylab("Games Won") +
      geom_label(aes(label = mylab, x = -100, y = 105), size = 4.5,
                 label.padding = unit(0.5, "lines"))

which produces the following plot:

The regression coefficients make sense: a team that scores as many runs as its opponents (Run_Diff = 0) will win half its games on average, or 81 out of 162. Likewise, there’s an adage in sabermetrics that 10 runs translate to one additional win; 10 times the coefficient of Run_Diff is 0.93, which is pretty close to 1.

In his excellent book Analyzing Baseball Data with R, Jim Albert obtains the following regression based on data from the 2001 through 2011 seasons:

*WPct* = 0.000623 * *Run_Diff* + 0.499992

To be comparable with the equation shown on the plot, we have to multiply both terms by 162, the number of games in a season. That gives:

*Wins* = 0.101 * *Run_Diff* + 80.999
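The conversion is simple enough to check in R. This is a small sketch using only the two coefficients quoted from the book and the 162-game schedule; the variable names are my own.

```r
# Coefficients of the winning-percentage regression quoted above
wpct_slope     <- 0.000623
wpct_intercept <- 0.499992
games          <- 162        # games per team in a full season

# Wins = WPct * games, so both coefficients scale by the same factor
wins_slope     <- wpct_slope * games
wins_intercept <- wpct_intercept * games
round(c(slope = wins_slope, intercept = wins_intercept), 3)

# The reciprocal of the slope is the number of runs "worth" one win
runs_per_win <- 1 / wins_slope
round(runs_per_win, 1)
```

The reciprocal of the slope comes out at roughly 9.9 runs per win, in line with the 10-runs-per-win adage.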

…which is not far from my result for 2016.

The Chicago Cubs are represented by the point at the upper right on the plot. During the regular season they scored 252 more runs than their opponents and won 103 games. The regression line, however, predicts 104 wins. One possible reason for the difference is that the Cubs recorded a rare tie game with the Pittsburgh Pirates on September 29, 2016 when play was suspended in the fifth inning due to rain. MLB decided to call the game a tie since it had no effect on the standings by that point. Aside from that game the Cubs were 14-4 against the Pirates on the season, so it’s very likely that Chicago would have won had the game been played to completion.