In this post I’m going to take a look at some data from Berkeley Earth that I obtained from this page on kaggle.com. It’s important to understand the content of these files: Berkeley Earth has produced time series of monthly average temperatures over long periods of time for various land surface locations around the globe. They use a technique called kriging that allows them to combine data from multiple nearby sites to produce more accurate estimates of the actual temperature over time at a given location. The important point is that these adjusted temperatures do not necessarily correspond to *measured* temperatures at any particular location. The statistical procedure they use is intended to correct for the numerous factors that affect the accuracy of site temperature measurements: urban heat island effects, and changes in weather station location, instrumentation, time of day at which temperatures are measured, etc. A more complete discussion of this topic can be found here.

Continue reading “Climate Change by US State”

# 2016 Baseball Outliers

Below I’ve repeated the figure I generated in the last post, but this time with the teams labeled:

Why are the Texas Rangers (TEX) such outliers? Given that they scored only 8 more runs than their opponents, one would expect the team to win only 81 games. Yet they won 95 times in 2016. Why is this?

It comes down to their performance in one-run games. Texas was 36 and 11 (.766) in one-run games last year, a modern-day record for Major League Baseball. But it’s a rather dubious record. As Max Marchi and Jim Albert state in *Analyzing Baseball Data With R*:

Winning a disproportionate number of close games is sometimes attributed to luck. However, teams with certain attributes may be more likely to systematically win contests decided by a narrow margin. For example, teams with top quality closers tend to preserve small leads …

Of course the Dallas News tried to put a better face on it at the end of the season when Texas won the AL West, but the team was eventually swept three games to none by the Toronto Blue Jays in the division series.
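The post does not say how the expectation of roughly 81 wins was derived. One common way to turn run totals into expected wins is Bill James's Pythagorean expectation, sketched below. The function name `pythag_wins` and the season totals of 700 and 692 runs are made up for illustration; only the +8 differential comes from the text.

```r
# Pythagorean expectation: expected wins from runs scored and runs allowed.
# An exponent of 2 is the classic choice; this is an assumed method,
# not necessarily the one used in the post.
pythag_wins <- function(runs_scored, runs_allowed, games = 162, exponent = 2) {
  win_pct <- runs_scored^exponent /
    (runs_scored^exponent + runs_allowed^exponent)
  games * win_pct
}

# Hypothetical season totals, chosen only so that the differential is +8.
expected <- pythag_wins(runs_scored = 700, runs_allowed = 692)
print(round(expected, 1))
```

With these illustrative totals the estimate comes out just under 82 wins, which is in line with the post's figure and far below the 95 games Texas actually won.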

# 2016 Baseball Season

Now that retrosheet.org has posted complete data for the 2016 Major League Baseball season, I was curious to see how the number of games won by each team correlates with its run differential, i.e. the difference between the total number of runs scored and the total number of runs allowed over the season.

The file GL2016.txt, available here, contains a plethora of data: 161 fields for each game played during the regular season. Here I’m interested in only four items: the Visiting Team and the number of runs they scored, and the Home Team and the number of runs they scored. After reading in the data (using *read_csv*), I used the *select* function from *dplyr* to choose only those four fields:

    # yet another baseball visualization
    require(tidyverse)

    s2016 <- read_csv("GL2016.txt", col_names = FALSE)
    s2016 <- s2016 %>%
      select(VisTeam = X4, HomeTeam = X7,
             VisScore = X10, HomeScore = X11)
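As a sketch of where these four fields lead, wins and run differential per team can be tallied by stacking the visiting and home views of each game. The example uses base R so it runs without extra packages, and the `toy` data frame (team codes and scores) is made up; only the four column names match the code above.

```r
# Made-up games with the same four columns as s2016.
toy <- data.frame(
  VisTeam   = c("AAA", "BBB", "CCC", "AAA"),
  HomeTeam  = c("BBB", "CCC", "AAA", "CCC"),
  VisScore  = c(5, 2, 6, 1),
  HomeScore = c(3, 4, 7, 4)
)

# One row per team per game: first the visitors' view, then the home view.
long <- rbind(
  data.frame(Team = toy$VisTeam,  RunsFor = toy$VisScore,
             RunsAgainst = toy$HomeScore),
  data.frame(Team = toy$HomeTeam, RunsFor = toy$HomeScore,
             RunsAgainst = toy$VisScore)
)
long$Win  <- long$RunsFor > long$RunsAgainst
long$Diff <- long$RunsFor - long$RunsAgainst

# Sum wins and run differential by team (tapply orders teams alphabetically).
wins     <- tapply(long$Win,  long$Team, sum)
run_diff <- tapply(long$Diff, long$Team, sum)
team_summary <- data.frame(Team    = names(wins),
                           Wins    = as.vector(wins),
                           RunDiff = as.vector(run_diff))
print(team_summary)
```

Because every run scored by one team is allowed by another, the `RunDiff` column always sums to zero across the league, which makes a handy sanity check.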

The data frame s2016 looks like this:

# Words With Friends Scores

Over a period of several months, a friend and I played 189 games of Words With Friends, a Scrabble-like game popular on Facebook. I kept track of our scores, and the resulting dataset (which I make available here) provides a couple of insights into the game.

The structure of the file itself is very simple: one line per game, with each line containing my score followed by my opponent’s score. There is no need to record the game number, since it is equivalent to the row number, which is added automatically. Using the readr package:

    require("tidyverse")
    scores <- read_csv("gameset1.csv")

Here’s what the data looks like:

    > scores
    # A tibble: 189 × 2
          Me  Opp1
     1   360   313
     2   365   388
     3   458   349
     4   378   419
     5   440   348
     6   388   353
     7   358   376
     8   332   379
     9   362   325
    10   353   326
    # ... with 179 more rows

(Note that if you print *scores* in R, the print routine for the tibble will also provide the type of each column. But since each type [*int* for these columns] is enclosed in angle brackets, WordPress apparently treats it as an HTML tag and so does not display it.)
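As a small illustration of the kind of summary this layout supports, the sketch below uses only the ten rows printed above, not the full 189-game file, so the numbers say nothing about the whole dataset. The names `scores10`, `margin`, `games_won` and `mean_margin` are introduced here for the example.

```r
# The ten rows shown in the printed tibble, typed in by hand.
scores10 <- data.frame(
  Me   = c(360, 365, 458, 378, 440, 388, 358, 332, 362, 353),
  Opp1 = c(313, 388, 349, 419, 348, 353, 376, 379, 325, 326)
)

# Positive margin means I won the game.
margin      <- scores10$Me - scores10$Opp1
games_won   <- sum(margin > 0)
mean_margin <- mean(margin)

print(games_won)     # games won out of these ten
print(mean_margin)   # average winning (or losing) margin in points
```

On these ten games the first player wins six, with an average margin of 21.8 points.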

# 2015 Traffic Fatalities

To gain more experience with Hadley Wickham’s tidyverse, I’m going to look at a dataset of US traffic fatalities occurring in 2015, as compiled by the National Highway Traffic Safety Administration (NHTSA). I found this dataset on the Kaggle website.

The file **accident.csv** used in this post contains 52 different fields pertaining to 32,166 fatal automobile accidents. Data field specifications can be found in the Analytical User’s Manual.

Let’s begin with something very simple: the number of fatal accidents by day of the week. Ex ante, I expect that most such accidents would occur on Friday and Saturday, which seem to be the days with the heaviest traffic — especially in the evening hours. Here’s my code to generate a bar chart:

    library("tidyverse")

    accidents <- read_csv("accident.csv")

    ggplot(data = accidents) +
      geom_bar(mapping = aes(x = DAY_WEEK),
               color = "deepskyblue2", fill = "deepskyblue2") +
      scale_x_continuous(breaks = 1:7,
                         labels = c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")) +
      labs(x = "Day of Week", y = "Number of Accidents") +
      ggtitle("2015 Fatal Traffic Accidents")

Of course, ggplot produces a perfectly serviceable bar chart with just the following:

    ggplot(data = accidents) + geom_bar(mapping = aes(x = DAY_WEEK))

…but I couldn’t resist tweaking it. I wanted to see the days of the week written out on the x-axis rather than the numbers 1 to 7. I also added a title and changed the default bar colors. Here’s the plot:
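The counts behind the bars can also be checked directly by converting the numeric day code to a labelled factor before tabulating. This is a sketch in base R on made-up codes; it assumes, as the axis labels in the code above do, that 1 is Sunday and 7 is Saturday. The vector `day_codes` is hypothetical and stands in for the `DAY_WEEK` column.

```r
# Hypothetical day-of-week codes standing in for accidents$DAY_WEEK.
day_codes  <- c(1, 7, 6, 7, 2, 6, 7, 1)
day_labels <- c("Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

# Fixing levels to 1:7 keeps days with zero accidents in the table.
day_factor <- factor(day_codes, levels = 1:7, labels = day_labels)
counts     <- table(day_factor)
print(counts)
```

Mapping a factor like `day_factor` to the x aesthetic would give a discrete axis with the day names already in place, as an alternative to relabelling a continuous scale.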

# Gambling Problem (part 2)

The last post concerned a game based on a slot machine that generates random numbers between 0 and 999. The player keeps pulling the handle to generate a sequence of these numbers, and the machine keeps track of them. The game ends when any number shows up a second time. That is, as long as all the numbers generated are unique, the player keeps pulling the handle. Simulation of the game showed that the expected number of pulls until the game ends is about 40.3. In this post I’ll develop an analytical solution.

Of course it’s not possible to win on the first pull, but what is the probability of winning on the second pull? The machine has already generated a number between 0 and 999. The probability of generating that same number again is $\frac{1}{1000}$, so $P(2) = \frac{1}{1000}$, where $P(k)$ denotes the probability of winning on pull $k$.

To win on the third pull, the numbers generated on the first two pulls must be different. After the first number is generated, the probability of generating a different number on the second pull is $\frac{999}{1000}$. The probability that the third number will be one of the two already generated is $\frac{2}{1000}$, so $P(3) = \frac{999}{1000} \cdot \frac{2}{1000}$.

To win on the fourth pull, the numbers generated on the first three pulls must all be different. The probability of that happening is $\frac{999}{1000} \cdot \frac{998}{1000}$. The probability that the fourth number is the same as one of the first three is $\frac{3}{1000}$. So $P(4) = \frac{999}{1000} \cdot \frac{998}{1000} \cdot \frac{3}{1000}$.

Now we can write the general expression for the probability of winning on the $k^{\text{th}}$ pull:
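Following the pattern of the three cases above, one form this expression can take is $P(k) = \frac{k-1}{1000} \prod_{i=0}^{k-2} \frac{1000-i}{1000}$ for $k = 2, \dots, 1001$, where $P(k)$ is notation assumed here for the probability that the game ends on pull $k$. The short R check below is a sketch under that assumption; the variable names are my own. It evaluates the formula for every possible $k$ and sums $k \cdot P(k)$ to get the expected number of pulls.

```r
n <- 1000            # number of distinct values the machine can show
k <- 2:(n + 1)       # the game can end on pull 2 at the earliest, n + 1 at the latest

# p_distinct[j] = probability that the first j numbers are all different.
p_distinct <- cumprod((n - 0:(n - 1)) / n)

# P(k): first k - 1 numbers all different, then pull k repeats one of them.
p_k <- p_distinct[k - 1] * (k - 1) / n

total_prob     <- sum(p_k)       # should be 1 up to rounding
expected_pulls <- sum(k * p_k)   # expected number of pulls
print(total_prob)
print(expected_pulls)
```

The probabilities sum to 1, and the expected value agrees with the simulated figure of about 40.3 quoted at the top of the post.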

# Gambling Problem (part 1)

I’m not a gambler myself, but I do enjoy probability and statistics problems associated with games of chance. Here’s one I came up with recently:

A manufacturer has invented a new game of chance based on a three-reel slot machine. Instead of fruit, each reel contains the digits from 0 through 9, so that pulling the handle generates a number between 000 and 999.

The game is played as follows: a player inserts a coin and pulls the handle repeatedly to generate a series of random numbers. The machine keeps track of the numbers generated on each pull. As long as all the numbers are different, the player continues to pull the handle. The game ends when any number comes up a second time. The machine then pays out a dollar for each time the handle was pulled.

For example, here are the numbers generated in a representative game:

{94, 845, 913, 994, 96, 269, 377, 913}

Since the number generated on the 8th pull (913) is a repeat of the number generated on the third pull, the game ends there and the machine pays out 8 dollars.

On average, how many times can a player expect to pull the handle before the game ends?

First I’ll write a program to simulate this game, and in the next post I’ll develop an analytical solution. The solution begins after the fold.
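A minimal simulation sketch in R follows. It is not the author's program: the function names, the seed and the number of simulated games are all assumptions made for illustration. It relies on the fact that 1,001 draws from 1,000 possible values must contain a repeat, so each game can be generated in one call and scanned for its first duplicate.

```r
# Pull on which the first repeat occurs in a sequence of draws.
pulls_until_repeat <- function(draws) which(duplicated(draws))[1]

# Simulate one game: n_values + 1 draws guarantee at least one repeat.
simulate_game <- function(n_values = 1000) {
  draws <- sample(0:(n_values - 1), n_values + 1, replace = TRUE)
  pulls_until_repeat(draws)
}

# The example game from the text ends on pull 8.
example_game <- c(94, 845, 913, 994, 96, 269, 377, 913)
print(pulls_until_repeat(example_game))

set.seed(2016)                              # arbitrary seed, for reproducibility
pulls <- replicate(20000, simulate_game())  # 20,000 simulated games
print(mean(pulls))                          # estimate of the expected number of pulls
```

With this many games the sample mean should land within a few tenths of the true expectation; more games tighten the estimate further.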
