PH345: Winter 2025
Edward Tufte: American political scientist, statistician, and professor emeritus at Yale University
‘Godfather’ of data visualization and visual presentation of information
Author of The Visual Display of Quantitative Information (2001)
Photo by Keegan Peterzell - Own work, CC BY-SA 4.0, https://commons.wikimedia.org/w/index.php?curid=40367115
1. Show the data
Three datasets:
Statistic | Dataset 1 | Dataset 2 | Dataset 3 |
---|---|---|---|
n | 142 | 142 | 142 |
mean.x | 54.3 | 54.3 | 54.3 |
sd.x | 16.8 | 16.8 | 16.8 |
mean.y | 47.8 | 47.8 | 47.8 |
sd.y | 26.9 | 26.9 | 26.9 |
cor.xy | -0.07 | -0.06 | -0.07 |
All datasets have nearly equal summary statistics:
Dataset | Intercept | Slope |
---|---|---|
away | 53.43 | -0.10 |
bullseye | 53.81 | -0.11 |
circle | 53.80 | -0.11 |
dino | 53.45 | -0.10 |
dots | 53.10 | -0.10 |
h_lines | 53.21 | -0.10 |
high_lines | 53.81 | -0.11 |
slant_down | 53.85 | -0.11 |
slant_up | 53.81 | -0.11 |
star | 53.33 | -0.10 |
v_lines | 53.89 | -0.11 |
wide_lines | 53.63 | -0.11 |
x_shape | 53.55 | -0.11 |
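The same check is easy to reproduce. The sketch below is a stand-in, not the datasets tabulated above: it uses Anscombe's quartet, which ships with base R as `anscombe`, because it makes the same point and needs no extra packages. The name `quartet_stats` is introduced here for illustration only.

```r
# Anscombe's quartet (datasets::anscombe) has columns x1..x4 and y1..y4.
# It is used here as a stand-in for the datasets in the tables above.
quartet_stats <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # simple linear regression of y on x
  data.frame(
    dataset   = i,
    mean.x    = mean(x),
    sd.x      = sd(x),
    mean.y    = mean(y),
    sd.y      = sd(y),
    cor.xy    = cor(x, y),
    intercept = unname(coef(fit)[1]),
    slope     = unname(coef(fit)[2])
  )
}))

# One row per dataset; the rows are nearly identical after rounding
print(round(quartet_stats, 2))
```

All four fitted lines come out at roughly intercept 3.0 and slope 0.5, yet the four scatterplots look nothing alike, which is why the first principle is to show the data.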
2. Induce the viewer to think about the substance
3. Be integrated with the statistical and verbal descriptions of data
Steps:
Start with a set of paired numbers \((x_i, y_i)\) where \(i\) indexes pairs, e.g. \((x_1, y_1)\) is the first pair, \((x_2, y_2)\) is the second pair, etc.
Place the points on a Cartesian coordinate system. The labeling reflects the assumption that \(x_i\) goes on the x-axis and \(y_i\) goes on the y-axis
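The two steps above map directly onto ggplot2, which the later slides use. This is a minimal sketch with made-up pairs: `pairs_df` and its values are invented for illustration and are not data from these slides.

```r
library(ggplot2)

# Step 1: a set of paired numbers (x_i, y_i); row i holds pair i.
# These values are invented purely for illustration.
pairs_df <- data.frame(
  x = c(1, 2, 3, 4, 5),
  y = c(2.1, 3.9, 6.2, 7.8, 10.1)
)

# Step 2: place each pair on Cartesian axes,
# with x_i on the x-axis and y_i on the y-axis.
p <- ggplot(pairs_df, aes(x = x, y = y)) +
  geom_point(size = 3) +
  labs(x = "x", y = "y")

p  # printing the object draws the plot
```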
Lung-cancer deaths per million in 1950 (\(y\)) against annual per-capita cigarette consumption in 1930 (\(x\)) for 11 countries.
Figure 7 (Doll, 1955)
4. Avoid distorting what the data say
So don’t create a scatterplot if you don’t want to imply a relationship.
Figures 3.6 (top) and 3.5 (bottom), Wilke (2019)
Two main types of scatterplots:
\(x\) and \(y\) are both uncontrolled. Goal is to show whether they are co-varying
\(x\) is controlled or “independent” variable, e.g. time, age, dose, or an experimentally controlled variable.
Scottish engineer, economist, proto-government-spy, and many other things
When he wasn’t blackmailing lords and being sued for libel, William Playfair invented the pie chart, the bar graph, and the line graph
Cara Giaimo, 2016
Never at any former time was wheat so cheap, in proportion to mechanical labor, as it is in the present time (Playfair)
Figure 3, Friendly and Denis (2005); Originally from Tufte, p34
Direct scatterplot of wheat price and wage, connected by consecutive years
Figure 5, Friendly and Denis (2005)
Now very easy to see Playfair’s claim about inflation-adjusted price of wheat. Statistical graphics should reveal data (Tufte, p13)
Figure 5, Friendly and Denis (2005)
5. Present many numbers in small space
6. Make large data sets coherent
7. Reveal the data at several levels of detail, from broad overview to fine structure
Land and ocean anomalies from 1850 to 2024 with respect to the 1901-2000 average
Separate data for northern and southern hemispheres
Average temperature anomalies in the northern hemisphere over time
Emphasis on interyear variability
library(tidyverse)  # provides ggplot2, dplyr, tidyr, readr, and forcats used on these slides
ggplot(temps_northern) +
  geom_point(aes(x = Year, y = Anomaly), size = 1) +
  geom_smooth(aes(x = Year, y = Anomaly), method = "gam", se = FALSE) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24))
Emphasis on trend
Emphasis on positive vs negative deviation
Emphasis on positive vs negative deviation, also on time spent above or below
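One way to get the "positive vs negative deviation" emphasis is to draw each year's anomaly as a bar from zero and color it by sign. The sketch below runs on simulated anomalies: `sim_temps`, `Sign`, and `p_dev` are hypothetical names and the values are invented, not the temperature series used in these slides, so only the plotting pattern carries over.

```r
library(ggplot2)

# Simulated stand-in for temps_northern: a slow upward trend plus noise.
# Column names Year and Anomaly mirror the slides; the values are made up.
set.seed(345)
sim_temps <- data.frame(Year = 1850:2024)
sim_temps$Anomaly <- 0.008 * (sim_temps$Year - 1950) +
  rnorm(nrow(sim_temps), sd = 0.15)

# Flag the sign of each deviation so it can drive the fill color.
sim_temps$Sign <- ifelse(sim_temps$Anomaly >= 0, "Above average", "Below average")

p_dev <- ggplot(sim_temps) +
  # Bars start at zero, so the direction of each deviation is immediately visible
  geom_col(aes(x = Year, y = Anomaly, fill = Sign)) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_fill_manual(name = NULL,
                    values = c("Above average" = "#D95F02",
                               "Below average" = "#1B9E77")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6)

p_dev
```

Swapping `geom_col()` for `geom_ribbon(aes(x = Year, ymin = 0, ymax = Anomaly))` gives a filled area instead of bars, which puts more weight on how long the series stays above or below zero.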
8. Encourage comparison between different pieces of data
temps <-
  bind_rows(
    bind_cols(hemi = "Southern", temps_southern),
    bind_cols(hemi = "Northern", temps_northern)
  )
ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24),
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))
ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) +
  geom_step(aes(x = Year, y = Anomaly, color = hemi), direction = "mid") +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24),
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))
ggplot(temps) +
  geom_point(aes(x = Year, y = Anomaly, color = hemi), size = 1) +
  geom_smooth(aes(x = Year, y = Anomaly, color = hemi), method = "gam", se = FALSE) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_color_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24),
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))
ggplot(temps) +
  geom_col(aes(x = Year, y = Anomaly, fill = hemi), position = "dodge") +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_fill_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24),
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))
ggplot(temps) +
  geom_ribbon(aes(x = Year, ymin = 0, ymax = Anomaly, fill = hemi), alpha = 0.6) +
  scale_x_continuous(name = NULL, expand = expansion(0.01)) +
  scale_fill_manual(name = NULL, values = c("#1B9E77", "#D95F02")) +
  scale_y_continuous(name = "Temperature Anomaly (C)", n.breaks = 6) +
  theme(text = element_text(size = 24),
        legend.position = "inside",
        legend.position.inside = c(0.15, 0.7))
9. Serve a clear purpose: description, exploration, tabulation, or decoration
Description: Total number of followers each Twitter Blue account had as of early November 2022, plotted against the date the account joined Twitter
Scan and commentary from https://junkcharts.typepad.com/junk_charts/2022/12/the-blue-mist.html; original article at https://www.nytimes.com/interactive/2022/11/23/technology/twitter-elon-musk-twitter-blue-check-verification.html
What is the purpose? Hard to know whether blue users have a lot of followers; would expect time bias but there doesn’t seem to be one (no funnel effect); why are some points labeled?
No Spice: Make the Global Temperature Anomaly plots on slide 24
Weak Sauce: Make one of the plots on slides 25-28
Medium Spice: Make one of the plots on slides 30-34
Dim Mak: Make the plot on the next slide
detroit_temps <-
  bind_cols(city = "Detroit",
            # Your path will be different than mine: this is the
            # relative path with respect to my slides
            # I'm using skip = 4 to skip the first 4 lines
            read_csv("../../CityTemps/detroit.csv", skip = 4)) %>%
  # One temperature is -99. Presumably this is a flag for missingness
  filter(Value != -99) %>%
  # Split up the date variable
  separate_wider_position(cols = Date, widths = c("Year" = 4, "Month" = 2)) %>%
  mutate(Year = factor(Year),
         MonthNum = as.numeric(Month),
         # Map integers to month abbreviations (month.abb is built into base R):
         MonthFct = month.abb[MonthNum] %>% fct_inorder(ordered = TRUE))
# Average temperature for each calendar month across all years
detroit_temps_avg <-
  detroit_temps %>%
  group_by(MonthFct) %>%
  summarize(Value = mean(Value))
ggplot(detroit_temps,
       aes(x = MonthFct, y = Value)) +
  # One faint line per year
  geom_line(aes(group = Year), alpha = 0.35) +
  # linewidth replaces size for lines as of ggplot2 3.4.0
  geom_line(data = filter(detroit_temps, Year == "2024"), aes(group = 1), color = "darkred", linewidth = 2) +
  geom_line(data = detroit_temps_avg, aes(group = 1), linetype = "dashed", linewidth = 2) +
  labs(x = NULL, y = "Temp (F)") +
  theme(text = element_text(size = 24))
Faint lines = Year; Dark red line = 2024; Dashed line = Average
Doll, R., 1955. Etiology of lung cancer. In Advances in cancer research (Vol. 3, pp. 1-50).
Friendly, M. and Denis, D., 2005. The early origins and development of the scatterplot. Journal of the History of the Behavioral Sciences, 41(2), pp.103-130.
Tufte, E.R., 2001. The visual display of quantitative information.