The economist’s data analysis skillset.
Comparing numerical values across entities
> Cross-sectional data: many entities, one point in time
> Numerical variables: values you can do math with (age, income, consumption)
> Key question: How is this variable distributed?
Choose based on sample size and what you want to see
| Tool | Best for | Shows |
|---|---|---|
| Histogram | Many observations | Shape of distribution |
| Boxplot + Stripplot | Fewer observations | Quartiles + individual values |
Use when you have many observations
Q. Which age group has the most Starbucks customers?
> the bin sizes aren’t even, making it hard to interpret
Q. Which age group has the most Starbucks customers?
> but what if we want to distinguish between a 55 year old and a 60 year old?
Q. Which age group has the most Starbucks customers?
> what if we take this even further?
> what if we compare 44 year olds to 45 year olds?
Q. Do 44 or 45 year olds spend more at Starbucks?
> we can go too far, introducing statistical noise. how do we fix the problem?
> increase the sample size or the bin width!
Q. Which age group has the most Starbucks customers?
> a larger sample has less noise!
Q. Which age group has the most Starbucks customers?
> larger bins also have less noise!
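A minimal sketch of the bin-width trade-off, in Python using only the standard library. The ages are synthetic (drawn from a normal distribution with an assumed mean of 55 and SD of 17), not the Starbucks data, and `bin_counts` is a hypothetical helper written for this illustration.

```python
import random
from collections import Counter


def bin_counts(values, bin_width):
    """Count values per bin; each bin is labeled by its lower edge."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    return dict(sorted(counts.items()))


# Synthetic ages for illustration only; these are NOT the Starbucks data.
rng = random.Random(0)
ages = [min(90, max(18, round(rng.gauss(55, 17)))) for _ in range(300)]

narrow = bin_counts(ages, 1)   # one bar per year of age
wide = bin_counts(ages, 10)    # one bar per decade of age

print(f"1-year bins: {len(narrow)} bars, tallest has {max(narrow.values())} customers")
print(f"10-year bins: {len(wide)} bars, tallest has {max(wide.values())} customers")

# Text version of the wide-bin histogram: bar length encodes the count.
for start, count in wide.items():
    print(f"{start:>2}-{start + 9}: {'#' * (count // 5)}")
```

With only a handful of customers in each 1-year bin, neighboring bars differ mostly by chance. Pooling into 10-year bins averages that chance variation out, at the cost of detail within each decade.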
Two numbers that summarize a histogram
Q. What is the average age of Starbucks customers?
> Mean ≈ 56 years; SD ≈ 17 years
> “The average customer is about 56; ages typically vary by about 17 years from that average”
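To make the two summary numbers concrete, here is a small sketch using Python's `statistics` module. The ages are made up for illustration, so the result will not match the 56 and 17 reported above.

```python
import statistics

# Made-up ages for illustration only; these are NOT the Starbucks data.
ages = [23, 35, 41, 56, 58, 62, 70, 79]

mean_age = statistics.mean(ages)   # center: sum of ages / number of ages
sd_age = statistics.stdev(ages)    # spread: sample SD (divides by n - 1)

print(f"Mean = {mean_age:.1f} years, SD = {sd_age:.1f} years")
```

`statistics.stdev` computes the sample standard deviation. `statistics.pstdev` divides by n instead and gives a slightly smaller number, so it helps to say which one is being reported.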
… use the right summary tool for the variable type
What we just did
| Step | Action |
|---|---|
| SELECT | All Starbucks customers |
| TRANSFORM | Count customers within each age bin |
| ENCODE | Bin → x-position; Count → bar height |
> TRANSFORM for histograms = count within bins
Q. Which age group among those making $40k or less has the most Starbucks customers?
Let’s use the data to examine whether customers between 45 and 55 years old are the largest age group among customers making $40k or less.
Data: Starbucks_Customer_Profiles_40k.csv
Q. Which age group among those making $40k or less has the most Starbucks customers?
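One way to approach this exercise, sketched with Python's standard library. The column name `age` is an assumption (the slides do not list the file's columns), the file is assumed to already contain only customers making $40k or less, and both function names are hypothetical.

```python
import csv
from collections import Counter


def age_bin_counts(rows, age_column="age", bin_width=10):
    """TRANSFORM: count customers within each age bin (labeled by lower edge)."""
    counts = Counter(
        int(float(row[age_column]) // bin_width) * bin_width for row in rows
    )
    return dict(sorted(counts.items()))


def age_histogram_from_csv(path, age_column="age", bin_width=10):
    """SELECT every row of the CSV file, then TRANSFORM ages into bin counts."""
    with open(path, newline="") as f:
        # The column name is an assumption; adjust it to match the file.
        return age_bin_counts(csv.DictReader(f), age_column, bin_width)


def print_histogram(counts, bin_width=10):
    """ENCODE: bin -> row position; count -> bar length."""
    for start, count in counts.items():
        print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * count}")
```

Calling `print_histogram(age_histogram_from_csv("Starbucks_Customer_Profiles_40k.csv"))` would then print one text bar per 10-year age bin. The longest bar marks the age group with the most customers.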

Summarize the distribution with two numbers
> Mean tells us the center; SD tells us the spread
“The average customer is about 48 years old; ages typically vary by about 18 years from that average.”
Histograms need many data points to show shape
> With few observations, histogram bins become noisy or empty
> We need a different tool: boxplots + stripplots
Q. Which countries drank an average amount of coffee?
> histogram bins make it impossible to see exact values or quartiles
Q. Which countries drank the most coffee in 1999?
> as we’ll see, boxplots can tell us about quartiles
> but boxplots are still pretty unclear for our question
Q. Which countries drank the most coffee in 1999?
> here we can see the datapoints directly with the boxplot
> each point represents a country’s coffee consumption
Q. Which countries drank the most coffee in 1999?
> each element of the boxplot marks one of five summary values: min, Q1, median, Q3, max
Which countries consumed more than 8 kg per capita?
> we can highlight the relevant subsets of the data
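Highlighting a subset amounts to filtering on a condition. A minimal sketch, assuming placeholder country names and made-up values (these are not the 1999 figures):

```python
# Hypothetical kg-per-capita values for placeholder countries; NOT the 1999 data.
consumption = {
    "Country A": 11.2,
    "Country B": 8.4,
    "Country C": 8.0,
    "Country D": 5.1,
    "Country E": 2.3,
}

THRESHOLD_KG = 8

# Keep only countries strictly above the threshold ("more than 8 kg").
highlighted = {name: kg for name, kg in consumption.items() if kg > THRESHOLD_KG}

# Mark the highlighted countries in a ranked list, highest consumption first.
for name, kg in sorted(consumption.items(), key=lambda item: item[1], reverse=True):
    marker = "*" if name in highlighted else " "
    print(f"{marker} {name}: {kg} kg")
```

Note that a country at exactly 8 kg is not highlighted, because the question asks for more than 8 kg.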
Which country consumed the most coffee per capita?
> we can read exact values off the plot at each quartile
> Finland consumed the most coffee per capita in 1999
Which country consumed the least coffee per capita?
> Russia consumed the least coffee per capita in 1999
How about the median?
> the US!
Which country consumed more coffee per capita than exactly 25% of countries?
> Slovakia!
Which country consumed more coffee per capita than exactly 75% of countries?
> Netherlands
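The questions above all come down to ranking the countries and reading off specific positions. A sketch under stated assumptions: the country names and values are placeholders, `five_number_countries` is a hypothetical helper, and it only handles the case where the number of countries is 4k + 1, so that each quartile lands exactly on one country. Whether the 1999 data has that property is not stated in the slides.

```python
def five_number_countries(consumption):
    """Return the countries sitting at min, Q1, median, Q3, and max.

    Assumes the number of countries is 4k + 1, so each quartile lands
    exactly on one observation and no interpolation is needed.
    """
    ranked = sorted(consumption, key=consumption.get)  # lowest to highest
    n = len(ranked)
    if (n - 1) % 4 != 0:
        raise ValueError("needs 4k + 1 countries for quartiles to land on a country")
    step = (n - 1) // 4
    labels = ["min", "Q1", "median", "Q3", "max"]
    return {label: ranked[i * step] for i, label in enumerate(labels)}


# Hypothetical kg-per-capita values for placeholder countries; NOT the 1999 data.
consumption = {
    "Country A": 4.6, "Country B": 12.0, "Country C": 0.9,
    "Country D": 7.2, "Country E": 3.1, "Country F": 9.5,
    "Country G": 2.4, "Country H": 5.8, "Country I": 4.0,
}

summary = five_number_countries(consumption)
for label, country in summary.items():
    print(f"{label:>6}: {country} ({consumption[country]} kg)")
```

With nine countries, the Q1 country has two of the other eight below it, which is exactly 25%. The Q3 country has six of eight below it, which is exactly 75%.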
Boxplots show quartiles; stripplots show the data.
What we just did
| Step | Action |
|---|---|
| SELECT | All coffee-importing countries in 1999 |
| TRANSFORM | Calculate quartiles (min, Q1, median, Q3, max) |
| ENCODE | Quartile → box position; Value → point position |
> TRANSFORM for boxplots = calculate quartiles
Show the distribution of coffee consumption per capita in 2019.
Let’s use a boxplot and stripplot to examine the distribution of coffee consumption per capita among coffee-importing countries in 2019.
Data: Coffee_Per_Cap_2019.csv
Show the distribution of coffee consumption per capita in 2019.

Calculate the five-number summary
> These five numbers define the boxplot: min, Q1, median, Q3, max
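A minimal sketch of the five-number summary using Python's `statistics` module. The values are made up for illustration and are not the 2019 data. In practice they would come from the consumption column of the data file, whose name is not given in the slides. `five_number_summary` is a hypothetical helper.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    # method="inclusive" treats the data's min and max as the 0% and 100% points
    # and interpolates between neighboring observations when needed.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Made-up kg-per-capita values for illustration only; NOT the 2019 data.
kg_per_capita = [1.2, 2.5, 3.1, 4.4, 4.9, 6.0, 7.3, 9.8]

low, q1, median, q3, high = five_number_summary(kg_per_capita)
print(f"min={low}, Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}, max={high}")
```

Different tools use slightly different conventions for quartiles. For example, the default `method="exclusive"` in `statistics.quantiles` can give different Q1 and Q3 values on small samples, so small discrepancies between a hand calculation and a plotted box are expected.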
What this unit adds to your toolkit
| Block | New in 1.2 |
|---|---|
| Variables | Numerical |
| Structures | Cross-section |
| Operations | Bin, Mean, SD, Quartiles |
| Visualizations | Histogram, Boxplot, Stripplot |
> Next: let’s add a time dimension