Over a period of several months, a friend and I played 189 games of Words With Friends, a Scrabble-like game popular on Facebook. I kept track of our scores, and the resulting dataset (which I make available here) provides a couple of insights into the game.
The structure of the file itself is very simple: one line per game, with each line containing my score followed by my opponent’s score. There is no need to record the game number, since it is the same as the row number, which is added automatically. I read the file in using the readr package:
# readr is loaded as part of the tidyverse
library(tidyverse)
scores <- read_csv("gameset1.csv")
Here’s what the data looks like:
> scores
# A tibble: 189 × 2
      Me  Opp1
 1   360   313
 2   365   388
 3   458   349
 4   378   419
 5   440   348
 6   388   353
 7   358   376
 8   332   379
 9   362   325
10   353   326
# ... with 179 more rows
(Note that if you print scores in R, the print routine for the tibble will also provide the type of each column. But since each type [int for these columns] is enclosed in angle brackets, WordPress apparently treats it as an HTML tag and so does not display it.)
If the fundamental unit of observation is the game, then the data is about as tidy as it can get. However, to compare the respective distributions of our scores using ggplot, I needed all the scores to be in the same column. This gave me the chance to use tidyr::gather for the first time. Also I had to add in a column for game number after all:
sep_scores <- scores %>%
  mutate(game = as.integer(row.names(scores))) %>%
  gather(player, points, -game) %>%
  arrange(game)
The data frame sep_scores then looks like this:
> sep_scores
# A tibble: 378 × 3
    game player points
 1     1     Me    360
 2     1   Opp1    313
 3     2     Me    365
 4     2   Opp1    388
 5     3     Me    458
 6     3   Opp1    349
 7     4     Me    378
 8     4   Opp1    419
 9     5     Me    440
10     5   Opp1    348
# ... with 368 more rows
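As an aside, tidyr 1.0.0 and later mark gather as superseded in favour of pivot_longer. Here is a minimal sketch of the same reshaping with the newer function. It uses only the first ten games printed above as a stand-in for the full file, and the names sample_scores and sep_sample are mine, not part of the original analysis:

```r
library(dplyr)
library(tidyr)

# First ten games from the printed tibble; a stand-in for gameset1.csv
sample_scores <- data.frame(
  Me   = c(360, 365, 458, 378, 440, 388, 358, 332, 362, 353),
  Opp1 = c(313, 388, 349, 419, 348, 353, 376, 379, 325, 326)
)

# Add the game number, then stack the two score columns into one
sep_sample <- sample_scores %>%
  mutate(game = row_number()) %>%
  pivot_longer(cols = -game, names_to = "player", values_to = "points") %>%
  arrange(game)

print(sep_sample)
```

The result has one row per player per game, in the same layout as sep_scores.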
The following code generates the paired density plot:
ggplot(sep_scores, aes(points, fill = player)) +
  geom_density(alpha = 0.65, bw = 25) +
  labs(title = "Density Plot of Scores", x = "Points per game")
I was surprised at how normal my scores looked (in the Gaussian sense). In fact, they passed a Shapiro-Wilk normality test with p = 0.9. Strictly speaking, a p-value that high only means we can’t reject the hypothesis that my scores are drawn from a normal distribution. The mean of my scores is 391 and the standard deviation is 50.2.
My friend’s scores were a bit more skewed. They failed a Shapiro-Wilk test, with p = 0.003 (meaning there is sufficient evidence to reject the null hypothesis that the scores are drawn from a normal distribution). My friend’s scores had a mean of 342 and a standard deviation of 43.6.
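For readers who want to run the same kind of test, here is a minimal sketch using base R's shapiro.test. It is run on only the first ten games printed earlier (sample_scores is my own stand-in name), so its p-values will not match the 0.9 and 0.003 reported for the full 189 games:

```r
# First ten games from the printed tibble; a stand-in for the full dataset
sample_scores <- data.frame(
  Me   = c(360, 365, 458, 378, 440, 388, 358, 332, 362, 353),
  Opp1 = c(313, 388, 349, 419, 348, 353, 376, 379, 325, 326)
)

# Null hypothesis in each case: the scores are drawn from a normal distribution
sw_me  <- shapiro.test(sample_scores$Me)
sw_opp <- shapiro.test(sample_scores$Opp1)

# A small p-value is evidence against normality; a large one is merely inconclusive
print(sw_me$p.value)
print(sw_opp$p.value)
```

On the full data the calls would presumably be shapiro.test(scores$Me) and shapiro.test(scores$Opp1).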
Since the player with the higher score wins the game, what really matters is the difference between the scores. To generate a density plot of this difference, I added a new column delta to scores:
scores <- scores %>% mutate(delta = Me - Opp1)
ggplot(scores, aes(delta)) +
  geom_density(fill = "gold1", alpha = 1/2, bw = 25) +
  labs(title = "Density Plot of Winning Margin", x = "Winning Margin")
Although the distribution appears bimodal, it does pass a Shapiro-Wilk test for normality, with p = 0.25. The mean is 49.3 and the standard deviation is 77.4. This means I beat my friend by an average of 49 points per game. But the large standard deviation means he wins about 25% of the games.
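The figure of about 25% can be sanity-checked from the two summary numbers alone. This is a rough sketch that assumes the margin is exactly normal with the reported mean and standard deviation, and that ignores ties; the variable names are mine:

```r
# Reported summary statistics of the winning margin (Me - Opp1)
mean_delta <- 49.3
sd_delta   <- 77.4

# Under a normal model, my friend wins whenever the margin falls below zero
p_friend_wins <- pnorm(0, mean = mean_delta, sd = sd_delta)
print(round(p_friend_wins, 2))
```

The normal model gives a probability of roughly 0.26, in line with the share quoted above.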
The figure above looked strangely familiar. Where had I seen it before? Of course:
It’s not a hat, it’s a boa constrictor swallowing an elephant! [The reference is to Le Petit Prince, a book fondly remembered by many students who studied French in high school.]
My next question was whether our scores were correlated. There are three possibilities here. First, there could be no correlation at all; our scores in each game are just random draws from our respective score distributions.
The second possibility is that the scores are positively correlated. It could be that when one player does well, the other player rises to the challenge and does well too.
The third possibility is that the scores are negatively correlated. A possible explanation here is that the J, Q, X and Z tiles are worth a lot of points, so the more one player gets, the fewer the other player gets.
Let’s look at a plot of my opponent’s score vs. my score for each game, along with the regression line:
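# Assumed definition: regression coefficients, [1] = intercept, [2] = slope
mod_coef <- coef(lm(Opp1 ~ Me, data = scores))
ggplot(scores, mapping = aes(Me, Opp1)) +
  geom_point(color = ifelse(scores$delta > 0, "blue", "red"), size = 2.5, shape = 19) +
  geom_abline(intercept = mod_coef[1], slope = mod_coef[2], size = 1) +
  labs(title = "Scores in 189 games", x = "My Score", y = "Opponent's Score") +
  coord_cartesian(ylim = c(100, 600))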
Points in blue represent games I won, and points in red are games I lost. It looks like the correlation is negative, meaning the higher my score is, the lower my opponent’s score (and vice versa). So the game is to some extent “zero sum”.
Let’s look at the details of the linear regression:
summary(lm(Opp1 ~ Me, data=scores))
Call:
lm(formula = Opp1 ~ Me, data = scores)

Residuals:
     Min       1Q   Median       3Q      Max
-106.915  -27.256   -1.494   23.824  131.806

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 464.98353   23.36229  19.903  < 2e-16 ***
Me           -0.31417    0.05921  -5.306 3.15e-07 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 40.72 on 187 degrees of freedom
Multiple R-squared:  0.1308,  Adjusted R-squared:  0.1262
F-statistic: 28.15 on 1 and 187 DF,  p-value: 3.154e-07
The R-squared is only 0.1308, so my score explains about 13% of the variance in my opponent’s score, but the regression is statistically significant (p = 3.15e-07). The slope of -0.31417 means that for every ten points I get, my opponent’s score is about 3 points lower on average.
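The numbers reported so far can be cross-checked against each other using two standard identities: in a simple regression the slope equals r times sd(y) / sd(x), and the variance of a difference is the sum of the variances minus twice the covariance. This sketch uses only the rounded figures quoted in this post (the variable names are mine), so small discrepancies are expected:

```r
# Rounded summary statistics quoted in the post
sd_me  <- 50.2
sd_opp <- 43.6
r_sq   <- 0.1308

# The slope is negative, so the correlation is the negative root of R-squared
r <- -sqrt(r_sq)

# Slope of Opp1 on Me implied by the correlation and the two standard deviations
implied_slope <- r * sd_opp / sd_me

# Standard deviation of the margin Me - Opp1 implied by the same numbers
implied_sd_delta <- sqrt(sd_me^2 + sd_opp^2 - 2 * r * sd_me * sd_opp)

print(round(implied_slope, 3))
print(round(implied_sd_delta, 1))
```

Both implied values land close to the fitted slope of -0.314 and the margin's standard deviation of 77.4. The negative correlation is therefore also what makes the winning margin so variable.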
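So it seems that, to a first approximation, the pair of scores in each game is a draw from a bivariate normal distribution with negative correlation. The fit is not exact, since my friend’s scores on their own failed the normality test.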
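To see what that model would imply, here is a small simulation sketch. It assumes the rounded means, standard deviations and R-squared quoted above are the true parameters, and it uses mvrnorm from the MASS package. The seed, sample size and variable names are arbitrary choices of mine:

```r
library(MASS)

# Parameters taken from the rounded figures in the post (an assumption)
mu     <- c(Me = 391, Opp1 = 342)
sd_me  <- 50.2
sd_opp <- 43.6
r      <- -sqrt(0.1308)

# Covariance matrix of the bivariate normal
cov_mo <- r * sd_me * sd_opp
Sigma  <- matrix(c(sd_me^2, cov_mo,
                   cov_mo,  sd_opp^2), nrow = 2)

set.seed(1)
n   <- 100000
sim <- mvrnorm(n, mu = mu, Sigma = Sigma)

# Column 1 is my simulated score, column 2 is my opponent's
share_friend_wins <- mean(sim[, 2] > sim[, 1])
print(share_friend_wins)
```

If the model is reasonable, the simulated share of games my friend wins should come out near the roughly one-in-four figure mentioned earlier.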