Berkeley Earth vs. NOAA/NCDC

In honor of Earth Day I’m continuing my analysis of temperature data for the United States. We’ve already looked at the state-level data produced by Berkeley Earth, and used it to develop a map of the rise in annual mean temperature (AMT) for the lower 48 US states over the past century. NOAA/NCDC also provides time series of monthly mean temperatures for the individual US states (the data are available near the bottom of this page in the file climdiv-tmpcst-v1.0.0-20170404). We can use these data to produce a similar map of the AMT rise for each state and compare it with a map developed using the Berkeley data.

To estimate statewide rise in AMT over a given time interval, I use loess to generate a smooth curve through the annual temperatures, then take the difference between the smoothed temperatures at the endpoints of the interval.
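The snippet below is a minimal sketch of that procedure on a synthetic series, not the real state data. The trend, the noise level, and the helper name `rise_between` are assumptions made for illustration; the function actually used for the maps appears in the full listing later on.

```r
# Synthetic annual mean temperatures (deg F): a linear trend plus noise.
# These numbers are made up for illustration only.
set.seed(1)
yr <- 1895:2016
amt <- 51 + 0.02 * (yr - 1895) + rnorm(length(yr), sd = 1)

# Smoothed rise between two years: fit loess, then take the difference
# between the fitted values at the two endpoints of the interval.
# rise_between is a hypothetical helper name.
rise_between <- function(av, yr, from, to) {
  fit <- loess(av ~ yr)
  sm <- predict(fit, data.frame(yr = c(from, to)))
  unname(sm[2] - sm[1])
}

rise <- rise_between(amt, yr, 1920, 2010)
rise
```

Because the synthetic trend is 0.02 °F per year, the result should land in the neighborhood of 1.8 °F over these 90 years, with the exact value depending on the noise.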

As an example, below is a plot of AMT for the state of New Jersey over time, as calculated from the NOAA/NCDC monthly data. The blue curve is the loess fit, and the gray envelope corresponds to the standard error bounds.

An estimate of the rise in annual mean temperature between 1920 and 2010 is then 54 – 51 = 3 °F. This is just an estimate because the choice of how to smooth out the year-to-year variations is arbitrary, and there are a number of ways to accomplish it. For example, we could have taken a running average of the AMT over the previous 5 years. Or the previous 10 years. The loess algorithm contains a similar parameter that controls the bandwidth of the smoothing process.
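To illustrate how much the choice of smoother and bandwidth matters, here is a rough sketch comparing trailing running means with loess fits at two values of `span`, the loess argument that sets the fraction of the data used in each local fit. The series is synthetic, and the window lengths and span values are arbitrary choices for illustration.

```r
# Synthetic annual series (deg F); values are invented for illustration.
set.seed(2)
yr <- 1895:2016
amt <- 51 + 0.02 * (yr - 1895) + rnorm(length(yr), sd = 1)

# Trailing running means over 5 and 10 years (current year included).
# stats::filter is called explicitly because dplyr masks filter().
run5  <- stats::filter(amt, rep(1 / 5, 5), sides = 1)
run10 <- stats::filter(amt, rep(1 / 10, 10), sides = 1)

# loess with a narrow and a wide smoothing window (the default is 0.75).
narrow <- predict(loess(amt ~ yr, span = 0.3))
wide   <- predict(loess(amt ~ yr, span = 0.9))

# Year-to-year wiggle left in each curve: a smaller value means smoother.
roughness <- function(x) sd(diff(x), na.rm = TRUE)
rough <- sapply(list(run5 = run5, run10 = run10,
                     narrow = narrow, wide = wide), roughness)
rough
```

A longer window or a larger span leaves less year-to-year wiggle in the curve, which in turn changes the smoothed values read off at the endpoints of an interval.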

In the figure above, note that there is more uncertainty in the estimate for the years at the beginning and end of the dataset. The reason is that to estimate the smoothed AMT at any point in time, the loess algorithm uses the data before and after that point (within a window of a certain width). For the initial point, the estimate is based only on the subsequent data points. Likewise for the last point, the estimate is based only on the preceding data points. Estimates for these two points therefore rest on data from one side only, rather than on a window centered on the point, so their uncertainty is higher.

To reduce the uncertainty in comparing the two datasets, I decided to calculate the rise in mean annual temperature for each state from 1901 through 2000. Both of these endpoints are well away from the beginning and end of the time series for all states, for both the Berkeley Earth and the NOAA datasets.
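The edge effect can be checked directly by asking `predict()` for standard errors. The sketch below does this on a synthetic series spanning 1895 to 2016; that range and the noise level are assumptions for illustration, so the numbers say nothing about the real datasets.

```r
# Synthetic annual series (deg F); values are invented for illustration.
set.seed(3)
yr <- 1895:2016
amt <- 51 + 0.02 * (yr - 1895) + rnorm(length(yr), sd = 1)

fit <- loess(amt ~ yr)

# Standard error of the smoothed value at the two ends of the series,
# at the interval endpoints 1901 and 2000, and at a mid-series year.
check <- c(1895, 1901, 1955, 2000, 2016)
pred <- predict(fit, data.frame(yr = check), se = TRUE)

se_at <- setNames(pred$se.fit, check)
round(se_at, 2)
```

The standard errors at 1895 and 2016 come out larger than the one at 1955. How far 1901 and 2000 sit from that edge effect can be read off the same output.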

The code, though long, is relatively straightforward and mostly involves cleaning and processing the data. The Berkeley Earth and NOAA data are in different forms, so the cleaning process is somewhat different for each one.

library(tidyverse)
library(lubridate)
library(ggmap)
library(maps)
library(mapdata)
library(stringr)
library(RColorBrewer)

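# Estimate the rise in smoothed temperature from 1901 to 2000 for one state.
# av: annual mean temperatures; yr: the corresponding years;
# deg: units of av, "C" or "F". The result is in degrees F.
delT <- function(av,yr,deg) {
  ySer <- first(yr):last(yr)   # assumes one value per consecutive year
  smoothed <- loess(av ~ ySer)
  newT <- predict(smoothed,data.frame(ySer=c(1901,2000)),se=FALSE)
  corr <- ifelse(deg == "C",1.8,1)   # a difference of 1 °C equals 1.8 °F
  degF <- corr*(newT[2] - newT[1])
  return(degF)
}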

setwd("C:/Users/.../T_history")
regYr <- c(31,28,31,30,31,30,31,31,30,31,30,31)
lpYr <- regYr; lpYr[2] <- lpYr[2] + 1

# "Reds" has at most 9 colors; colorRampPalette interpolates between them
myPalette <- colorRampPalette(brewer.pal(9, "Reds"))

sNames <- read_csv("StateNames.csv",col_names=TRUE)

#First process the Berkeley Earth data
statesB <- read_csv("GlobalLandTemperaturesByState.csv") %>%
  mutate(yr = year(dt),mon = month(dt)) %>%
  filter(Country == "United States",
    yr<2013,State != "Alaska",State != "Hawaii",
    State != "District Of Columbia") %>%
  mutate(State = replace(State,State == "Georgia (State)","Georgia")) %>%
  rename(Tmon = AverageTemperature) %>%
  select(dt,Tmon,State,yr,mon)

sumdat <- statesB %>% group_by(State) %>%
  summarise(earliest = ifelse(any(is.na(Tmon)),1+max(yr[is.na(Tmon)]),
    ifelse(month(min(dt)) == 1,min(yr),1+min(yr))))

statesB <- left_join(statesB,sumdat) %>%
  filter(yr>=earliest) %>%
  select(State,yr,mon,Tmon) %>%
  group_by(State,yr) %>%
  summarise(av=ifelse(leap_year(first(yr)),
    weighted.mean(Tmon,lpYr),
    weighted.mean(Tmon,regYr))) %>%
  group_by(State) %>%
  summarise(Berkeley = delT(av,yr,"C")) %>%
  rename(region=State) %>%
  mutate(region = tolower(region))

# Now process the NCDC data
statesN <- read_csv("climdiv-states.csv",col_names=TRUE,
    col_types = cols(col_character(),col_double(),col_double(),
      col_double(),col_double(),col_double(),col_double(),col_double(),
      col_double(),col_double(),col_double(),col_double(),col_double())) %>%
  mutate(id = as.integer(str_sub(code,1,3)),
    yr = as.integer(str_sub(code,str_length(code)-3,str_length(code))),
    leap = leap_year(yr)) %>%
  filter(yr<=2016) %>%
  rowwise() %>%
  mutate(Tmean = ifelse(leap,
    weighted.mean(c(jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec),lpYr),
    weighted.mean(c(jan,feb,mar,apr,may,jun,jul,aug,sep,oct,nov,dec),regYr))) %>%
  group_by(id) %>%
  summarise(NCDC = delT(Tmean,yr,"F")) %>%
  left_join(sNames) %>%
  rename(region = state) %>%
  select(region,NCDC)

tRise <- left_join(statesB,statesN) %>%
  gather(dSource,Tdiff,-region)

lo <- min(statesN$NCDC,statesB$Berkeley)
hi <- max(statesN$NCDC,statesB$Berkeley)
sc <- scale_fill_gradientn(colours = myPalette(16), limits=c(lo, hi))
us <- map_data("state")
gg1 <- ggplot() +
  geom_map(data=us, map=us,
    aes(x=long, y=lat, map_id=region),
    fill="#ffffff", color="#ffffff", size=0.15) +
  geom_map(data=tRise, map=us,
    aes(fill=Tdiff, map_id=region),
    color="#ffffff", size=0.15) + sc +
  coord_map("albers", lat0 = 39, lat1 = 45) +
  theme(panel.border = element_blank()) +
  theme(panel.background = element_blank()) +
  theme(axis.ticks = element_blank()) +
  theme(axis.text = element_blank()) +
  theme(plot.title = element_text(hjust = 0.5)) +
  labs(x=NULL,y=NULL,
    title = "Increase in Statewide Mean Annual Temperature, 1901-2000 (°F)") +
  facet_grid(dSource ~ .)
gg1

This code produces the figure below, which compares the statewide rise in AMT over the 20th century as estimated using the Berkeley Earth and NOAA datasets. It is evident that there are some differences between the two. My impression is that the map based on the Berkeley Earth data looks somewhat smoother, in that it shows a gradual change between smaller temperature rise in the south, and larger temperature rise in the north. The map based on the NOAA data shows a similar differentiation between north and south, but there are some outliers: the state of New Jersey, for example, appears to have a much larger rise in temperature than neighboring states.
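One way to put numbers on that visual impression would be to tabulate the gap between the two estimates for each state. The sketch below assumes a data frame in the same long layout as `tRise` (columns `region`, `dSource`, `Tdiff`); the regions and values in `tRiseDemo` are placeholders invented for illustration, not results from either dataset.

```r
# Placeholder values in the same long layout as tRise. The regions and
# numbers are invented; they are not results.
tRiseDemo <- data.frame(
  region  = rep(c("state a", "state b", "state c"), times = 2),
  dSource = rep(c("Berkeley", "NCDC"), each = 3),
  Tdiff   = c(1.0, 1.5, 2.0, 1.1, 2.6, 1.9)
)

# One row per region, one column per data source
# (Tdiff.Berkeley and Tdiff.NCDC).
wide <- reshape(tRiseDemo, idvar = "region", timevar = "dSource",
                direction = "wide")

# Gap between the two estimates, largest disagreement first.
wide$gap <- wide$Tdiff.NCDC - wide$Tdiff.Berkeley
wide <- wide[order(-abs(wide$gap)), ]
wide
```

Applied to the real `tRise`, the rows at the top of such a table would be the states where the two sources disagree most.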
