Scatterplots and Variants

PH345: Winter 2025

Phil Boonstra

Edward R Tufte (1942-)

American political scientist, statistician, and professor emeritus at Yale University

‘Godfather’ of data visualization and visual presentation of information

Author of The Visual Display of Quantitative Information (2001)

Photo by Keegan Peterzell - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=40367115

Nine principles of graphical excellence

  1. Show the data
  2. Induce the viewer to think about the substance
  3. Be integrated with the statistical and verbal descriptions of data
  4. Avoid distorting what the data say
  5. Present many numbers in small space
  6. Make large data sets coherent
  7. Reveal the data at several levels of detail, from broad overview to fine structure
  8. Encourage comparison between different pieces of data
  9. Serve a clear purpose: description, exploration, tabulation or decoration

Graphical excellence

1. Show the data

Three datasets:

Statistic   Dataset 1   Dataset 2   Dataset 3
n                 142         142         142
mean.x           54.3        54.3        54.3
sd.x             16.8        16.8        16.8
mean.y           47.8        47.8        47.8
sd.y             26.9        26.9        26.9
cor.xy          -0.07       -0.06       -0.07

All datasets have nearly equal summary statistics:

  • number of observations
  • mean of x and y
  • standard deviation of x and y
  • correlation between x and y (and nearly the same regression of y on x)

Dataset Intercept Slope
away 53.43 -0.10
bullseye 53.81 -0.11
circle 53.80 -0.11
dino 53.45 -0.10
dots 53.10 -0.10
h_lines 53.21 -0.10
high_lines 53.81 -0.11
slant_down 53.85 -0.11
slant_up 53.81 -0.11
star 53.33 -0.10
v_lines 53.89 -0.11
wide_lines 53.63 -0.11
x_shape 53.55 -0.11
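
Summary tables like the ones above can be recomputed directly. The sketch below assumes the datasauRus package is installed. The slides do not name it, but the dataset labels in the regression table (dino, star, x_shape, and so on) match its datasaurus_dozen data. The grouping is done in base R, and the column names in summaries are my own choice.

```r
# Assumption: the datasauRus package is installed (not named in the slides)
library(datasauRus)

# One data frame per dataset label (away, bullseye, ..., x_shape)
by_dataset <- split(datasaurus_dozen, datasaurus_dozen$dataset)

# Summary statistics plus the least-squares fit of y on x, one row per dataset
summaries <- do.call(rbind, lapply(by_dataset, function(d) {
  fit <- lm(y ~ x, data = d)
  data.frame(
    n         = nrow(d),
    mean.x    = mean(d$x),
    sd.x      = sd(d$x),
    mean.y    = mean(d$y),
    sd.y      = sd(d$y),
    cor.xy    = cor(d$x, d$y),
    intercept = coef(fit)[[1]],
    slope     = coef(fit)[[2]]
  )
}))

round(summaries, 2)
```

The rows of summaries are nearly indistinguishable; only a plot of y against x for each dataset shows how different the datasets are.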

Graphical excellence

2. Induce the viewer to think about the substance

3. Be integrated with the statistical and verbal descriptions of data

Scatterplots are the simplest bivariate plots

Steps:

  1. Start with a set of paired numbers \((x_i, y_i)\), where \(i\) indexes the pairs: \((x_1, y_1)\) is the first pair, \((x_2, y_2)\) is the second pair, etc.

  2. Place each pair as a point on a Cartesian coordinate system. The labeling reflects the convention that \(x_i\) goes on the x-axis and \(y_i\) goes on the y-axis
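
The two steps can be sketched in a few lines of ggplot2. This is a minimal illustration, not one of the course examples: it borrows R's built-in cars data (speed and stopping distance) as a stand-in set of pairs, and pairs_df and p are names I made up.

```r
library(ggplot2)

# Step 1: a set of paired numbers (x_i, y_i), one pair per row.
# cars is a built-in R dataset, used here only as a stand-in.
pairs_df <- data.frame(x = cars$speed, y = cars$dist)

# Step 2: place each pair on Cartesian axes, x_i horizontal and y_i vertical
p <- ggplot(pairs_df) +
  geom_point(aes(x = x, y = y))

p
```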

Ex 1: Lung cancer, cigarettes

Lung-cancer deaths per million in 1950 (\(y\)) against annual per-capita cigarette consumption in 1930 (\(x\)) for 11 countries.

Figure 7 (Doll, 1955)

Graphical excellence

4. Avoid distorting what the data say

Scatterplots imply a relationship

So don’t create a scatterplot if you don’t want to imply a relationship.

https://www.tylervigen.com/spurious-correlations

Poor choice of scale

Figures 3.6 (top) and 3.5 (bottom), Wilke (2019)

Dependent vs Independent Variables

Two main types of scatterplots:

  1. \(x\) and \(y\) are both uncontrolled. Goal is to show whether they are co-varying

  2. \(x\) is a controlled or “independent” variable, e.g. time, age, dose, or an experimentally controlled variable. Goal is to show how \(y\) changes with \(x\)

William Playfair (1759-1823)

Scottish engineer, economist, proto-government-spy, and many other things

When he wasn’t blackmailing lords and being sued for libel, William Playfair invented the pie chart, the bar graph, and the line graph

Cara Giaimo, 2016

Ex 2: Playfair’s graph of prices, wages, and British monarchs

Never at any former time was wheat so cheap, in proportion to mechanical labor, as it is in the present time (Playfair)

Figure 3, Friendly and Denis (2005); Originally from Tufte, p34

Modernization Attempt 1

“Pure” scatterplot

Direct scatterplot of wheat price and wage, connected by consecutive years

Figure 5, Friendly and Denis (2005)

Modernization Attempt 2

Create new variable

It is now very easy to see Playfair’s claim about the inflation-adjusted price of wheat. Statistical graphics should reveal data (Tufte, p13)

Figure 5, Friendly and Denis (2005)
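
A rough sketch of the "create new variable" idea follows. It assumes the HistData package is installed; its Wheat data frame holds Playfair's series with columns Year, Wheat (price in shillings per quarter), and Wages (weekly wage in shillings). The derived column name LaborCost and the plot styling are my own choices, not a reproduction of the published figure.

```r
library(ggplot2)

# Assumption: the HistData package is installed (not named in the slides)
data(Wheat, package = "HistData")

# Wages are missing for the last few years, so keep complete rows only
wheat <- subset(Wheat, !is.na(Wages))

# New variable: weeks of labor needed to buy one quarter of wheat
wheat$LaborCost <- wheat$Wheat / wheat$Wages

p_ratio <- ggplot(wheat) +
  geom_line(aes(x = Year, y = LaborCost)) +
  geom_point(aes(x = Year, y = LaborCost), size = 1) +
  labs(x = NULL, y = "Labor cost of wheat (weeks per quarter)")

p_ratio
```

Plotting the ratio puts the quantity Playfair was arguing about on a single axis, instead of asking the viewer to compare two series by eye.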

Graphical excellence

5. Present many numbers in small space

6. Make large data sets coherent

7. Reveal the data at several levels of detail, from broad overview to fine structure

Ex 3: Temperature anomalies

Land and ocean anomalies from 1850 to 2024 with respect to the 1901-2000 average

Separate data for northern and southern hemispheres
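
An anomaly is a departure from a reference average. The sketch below shows that arithmetic only; NOAA's actual processing is more involved, and the NOAA series used in these slides already comes as anomalies. The function name compute_anomaly is hypothetical. Because the built-in nhtemp series (yearly mean temperature in New Haven, 1912 to 1971) does not cover 1901-2000, the example uses the whole record as its baseline.

```r
# Hypothetical helper: subtract the mean over the baseline years
compute_anomaly <- function(temp, year, baseline = 1901:2000) {
  in_baseline <- year %in% baseline
  stopifnot(any(in_baseline))  # the baseline must overlap the record
  temp - mean(temp[in_baseline])
}

# nhtemp is a built-in yearly time series; extract its values and years
nh_temp <- as.numeric(nhtemp)
nh_year <- as.numeric(time(nhtemp))

# Baseline here is the full 1912-1971 record, not 1901-2000
nh_anomaly <- compute_anomaly(nh_temp, nh_year, baseline = 1912:1971)

head(data.frame(Year = nh_year, Anomaly = nh_anomaly))
```

By construction, anomalies average to zero over the baseline period, which is why zero is a meaningful reference line in the plots that follow.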

Northern hemisphere

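library(tidyverse)  # provides ggplot2, dplyr, readr, tidyr, and forcats

# temps_northern is the object name I chose
ggplot(temps_northern) +
  # One point per year
  geom_point(aes(x = Year, y = Anomaly), size = 1) +
  # Drop the x-axis title and shrink the default padding at either end
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  # Large text so labels are readable on slides
  theme(text = element_text(size = 24))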

Average temperature anomalies in the northern hemisphere over time

https://www.ncei.noaa.gov/access/monitoring/climate-at-a-glance/global/time-series

Connected lines

ggplot(temps_northern) +
  geom_point(aes(x = Year, y = Anomaly), size = 1) + 
  geom_line(aes(x = Year, y = Anomaly)) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24))

Emphasis on interyear variability

Smooth interpolation

ggplot(temps_northern) +
  geom_point(aes(x = Year, y = Anomaly), size = 1) + 
  geom_smooth(aes(x = Year, y = Anomaly), method = "gam", se = FALSE) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24))

Emphasis on trend

Bars

ggplot(temps_northern) +
  geom_col(aes(x = Year, y = Anomaly), position = "dodge") + 
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24))

Emphasis on positive vs negative deviation

Ribbon

ggplot(temps_northern) +
  geom_ribbon(aes(x = Year, ymin = 0, ymax = Anomaly), alpha = 0.6) + 
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24))

Emphasis on positive vs negative deviation, also on time spent above or below

Graphical excellence

8. Encourage comparison between different pieces of data

Northern, Southern hemispheres

temps <-
  bind_rows(
    bind_cols(hemi = "Southern", temps_southern), 
    bind_cols(hemi = "Northern", temps_northern)
  )

ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) + 
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24), 
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))

ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) + 
  geom_step(aes(x = Year, y = Anomaly, color = hemi), direction = "mid") +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24), 
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))

ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) + 
  geom_smooth(aes(x = Year, y = Anomaly, color = hemi), method = "gam", se = FALSE) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24), 
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))

ggplot(temps) +
  geom_col(aes(x = Year, y = Anomaly, fill = hemi), position = "dodge") + 
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_fill_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24), 
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))

ggplot(temps) +
  geom_ribbon(aes(x = Year, ymin = 0, ymax = Anomaly, fill = hemi), alpha = 0.6) + 
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_fill_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) + 
  theme(text = element_text(size = 24), 
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))

Graphical excellence

9. Serve a clear purpose: description, exploration, tabulation or decoration

Ex 4: Twitter followers

Description: Total number of followers each Twitter Blue account had as of early November 2022, plotted against the date the account joined Twitter

Scan and commentary from https://junkcharts.typepad.com/junk_charts/2022/12/the-blue-mist.html; original article at https://www.nytimes.com/interactive/2022/11/23/technology/twitter-elon-musk-twitter-blue-check-verification.html

What is the purpose? Hard to know whether blue users have a lot of followers; would expect time bias but there doesn’t seem to be one (no funnel effect); why are some points labeled?

Code Together Task

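No Spice: Make the Northern hemisphere temperature anomaly scatterplot (slide 24)

Weak Sauce: Make one of the single-hemisphere variants: connected lines, smooth, bars, or ribbon (slides 25-28)

Medium Spice: Make one of the Northern vs Southern hemisphere comparison plots (slides 30-34)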

Dim Mak: Make the plot on the next slide

Detroit Temperatures, 1965-2024

detroit_temps <- 
  bind_cols(city = "Detroit",
            # Your path will be different than mine: this is the
            # relative path with respect to my slides
            # I'm using skip = 4 to skip the first 4 lines
            read_csv("../../CityTemps/detroit.csv", skip = 4)) %>%
  # One temperature is -99. Presumably this is a flag for missingness
  filter(Value != -99) %>%
  # Split up the date variable
  separate_wider_position(cols = Date, widths = c("Year" = 4, "Month"= 2)) %>%
  mutate(Year = factor(Year), 
         MonthNum = as.numeric(Month),
         # Map integers to month abbreviations, with levels in calendar order
         # (month.abb is base R's built-in vector "Jan", ..., "Dec")
         MonthFct = factor(month.abb[MonthNum], levels = month.abb, ordered = TRUE))

detroit_temps_avg <-
  detroit_temps %>%
  group_by(MonthFct) %>%
  summarize(Value = mean(Value))

ggplot(detroit_temps,
       aes(x = MonthFct, y = Value)) + 
  geom_line(aes(group = Year), alpha = 0.35) + 
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(data = filter(detroit_temps, Year == "2024"),  aes(group = 1), color = "darkred", linewidth = 2) + 
  geom_line(data = detroit_temps_avg,  aes(group = 1), linetype = "dashed", linewidth = 2) +
  labs(x = NULL, y = "Temp (F)") +
  theme(text = element_text(size = 24))

Faint lines = Year; Dark red line = 2024; Dashed line = Average

References

Doll, R., 1955. Etiology of lung cancer. In Advances in cancer research (Vol. 3, pp. 1-50).

Friendly, M. and Denis, D., 2005. The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences, 41(2), pp.103-130.

Tufte, E.R., 2001. The visual display of quantitative information (2nd ed.). Graphics Press.

Wilke, C.O., 2019. Fundamentals of data visualization. O'Reilly Media.