World Series Statistics


Ever since the Chicago Cubs’ victory in game 7 of the 2016 World Series — arguably one of the most riveting and entertaining World Series in baseball history — I’ve been wondering what the probability is of winning the Series after falling behind three games to one as the Cubs did.

Those unfamiliar with the American game of baseball need not worry; the rules are unimportant for this discussion. It’s sufficient to know that the World Series is a contest in which two teams play a series of baseball games, and the first team to win four games wins the series. A game always ends in a win for one team or the other; there can be no ties. Note that this means there can be no more than seven games in the World Series.

For the purposes of this analysis I’m going to represent a historical World Series as a string of binary digits, where a 1 corresponds to a win by the team that eventually wins the series, and a 0 corresponds to a win by the other team. So for example the string 0100111 represents a series in which the eventual series winner won games 2, 5, 6 and 7 and lost games 1, 3 and 4 (as Chicago did in 2016). Likewise the string 01111 represents a series in which the winning team lost the first game but won games 2, 3, 4 and 5. We’ll call this string the form of the series.
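To make the encoding concrete, here is a minimal base-R sketch of how a list of game winners could be turned into a form. The helper name `series_form` and the team abbreviations are my own illustrative choices, not part of the tabulation described in this post. It relies on the fact that the team winning the last game played is always the series winner.

```r
# Hypothetical helper (an illustration only): encode a series as its form.
# `winners` is a character vector naming the winner of each game, in order.
series_form <- function(winners) {
  # The series ends when a team gets its fourth win, so the winner of the
  # final game played is the eventual series winner.
  champion <- winners[length(winners)]
  # 1 where the eventual series winner won the game, 0 otherwise.
  paste(as.integer(winners == champion), collapse = "")
}

# 2016: Cleveland (CLE) won games 1, 3 and 4; Chicago (CHC) won the rest.
series_form(c("CLE", "CHC", "CLE", "CLE", "CHC", "CHC", "CHC"))
# [1] "0100111"
```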

I’ll skip over how I tabulated these (it involved the data files on retrosheet.org) but here is a table presenting the frequency of each form, for each of the 107 best-of-seven World Series played between 1905 and 2015:

      form freq
1     1111   21
2    11101    3
3    11011    8
4    10111   10
5    01111    4
6   111001    0
7   110101    1
8   101101    3
9   011101    7
10  110011    4
11  101011    1
12  011011    0
13  100111    2
14  010111    3
15  001111    3
16 1110001    0
17 1101001    1
18 1011001    2
19 0111001    0
20 1100101    1
21 1010101    3
22 0110101    2
23 1001101    3
24 0101101    2
25 0011101    4
26 1100011    3
27 1010011    3
28 0110011    3
29 1001011    0
30 0101011    4
31 0011011    1
32 1000111    0
33 0100111    3
34 0010111    2
35 0001111    0
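
As a sanity check on the table, its 35 rows should be every form that can occur: a string of 4 to 7 digits that ends in 1 and contains exactly four 1s. The base-R sketch below enumerates them under that assumption. It is only an illustration, not the way the table was tabulated, and `all_forms` is a name chosen for the example.

```r
# Enumerate every possible form: length 4 to 7, ending in 1, with exactly
# four 1s (so the losing team never has more than three wins).
all_forms <- unlist(lapply(4:7, function(n) {
  # Choose which 3 of the first n - 1 games the eventual winner also won.
  apply(combn(n - 1, 3), 2, function(pos) {
    games <- rep(0L, n)
    games[c(pos, n)] <- 1L  # the final game is always won by the series winner
    paste(games, collapse = "")
  })
}))

length(all_forms)
# [1] 35

# Number of possible forms for series lasting 4, 5, 6 and 7 games.
sapply(4:7, function(n) sum(nchar(all_forms) == n))
# [1]  1  4 10 20
```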

Given the table above (and assuming the data is stored in a tibble called table) we can easily find the probability of going on to win the Series after being down three games to one. Note that the code below assumes you have loaded magrittr, dplyr, and stringr from the tidyverse collection:

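library(magrittr)
library(dplyr)
library(stringr)

# Wins by the eventual Series winner in the first four games of each form
wins_after_4 <- table$form %>%
  str_sub(1, 4) %>%
  str_count("1")

# Series in which the eventual winner led 3-1 (so the loser was down 3-1)
win2 <- table %>%
  filter(wins_after_4 == 3) %>%
  summarize(n = sum(freq)) %>%
  pull(n)

# Series in which the eventual winner was itself down 3-1
win1 <- table %>%
  filter(wins_after_4 == 1) %>%
  summarize(n = sum(freq)) %>%
  pull(n)

# Share of teams down 3-1 that came back to win the Series
prob <- win1 / (win1 + win2)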

This gives the probability of winning as 5 / (5 + 39), or 11.4%. In other words, given the historical record, 44 teams fell behind in the Series by 3 games to 1 and only 5 of them went on to win it. So the Cubs did overcome some very difficult odds to win the Series!
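
For anyone who wants to double-check the figure without the tidyverse, here is a base-R sketch of the same calculation. The frequencies are copied by hand from the table above, and the variable names (`forms`, `freq`, `came_back`, `held_on`) are illustrative choices rather than anything from the original analysis.

```r
# Forms and frequencies copied from the table above, in the same row order.
forms <- c("1111", "11101", "11011", "10111", "01111",
           "111001", "110101", "101101", "011101", "110011",
           "101011", "011011", "100111", "010111", "001111",
           "1110001", "1101001", "1011001", "0111001", "1100101",
           "1010101", "0110101", "1001101", "0101101", "0011101",
           "1100011", "1010011", "0110011", "1001011", "0101011",
           "0011011", "1000111", "0100111", "0010111", "0001111")
freq <- c(21, 3, 8, 10, 4,
          0, 1, 3, 7, 4,
          1, 0, 2, 3, 3,
          0, 1, 2, 0, 1,
          3, 2, 3, 2, 4,
          3, 3, 3, 0, 4,
          1, 0, 3, 2, 0)

# Wins by the eventual series winner in the first four games of each form.
wins_after_4 <- sapply(strsplit(substr(forms, 1, 4), ""),
                       function(g) sum(g == "1"))

came_back <- sum(freq[wins_after_4 == 1])  # eventual winner trailed 3-1
held_on   <- sum(freq[wins_after_4 == 3])  # eventual winner led 3-1

c(came_back = came_back, held_on = held_on)
# came_back   held_on
#         5        39

came_back / (came_back + held_on)
# [1] 0.1136364
```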
