concept_1

ECON 0150 | Economic Data Analysis

The economist’s data analysis pipeline.

Part 1.2 | Summarizing Numerical Variables

Summarizing Numerical Variables

… use the appropriate summary tool for the variable type

Numerical Variables: Histograms

Q. Which age group has the most Starbucks customers?

> the bin sizes aren’t even, making it hard to interpret

Numerical Variables: Histograms

Q. Which age group has the most Starbucks customers?

> the bin sizes aren’t even, making it hard to interpret

Histograms: Use equal sized bins

Q. Which age group has the most Starbucks customers?

> but what if we want to distinguish between a 55 year old and a 60 year old?

Histograms: Use narrow enough bins

Q. Which age group has the most Starbucks customers?

> what if we take this even further?

> what if we compare 44 year olds to 45 year olds?

Histograms: Avoid visualizing noise

Q. Do 44 or 45 year olds spend more at Starbucks?

> we can go too far, introducing statistical noise. how do we fix the problem?

> increase the sample size or the bin width!

Histograms: Balance resolution vs noise

Q. Which age group has the most Starbucks customers?

> larger sample has less noise!

Histograms: Balance resolution vs noise

Q. Which age group has the most Starbucks customers?

> larger bins also has less noise!

Histograms: Summary

… use the right summary tool for the variable type

Use histograms to visualize continuous variables.
Make histograms with equally sized bins.
Histograms with bins that are too narrow increase statistical noise, which can obscure underlying relationships.

Exercise 1.2: Histograms

Q. Which age group among those making $40k or less has the most Starbucks customers?

Lets use the data to examine whether customers between 45 - 55 years old spend the most among customers making less than $40k.

Data: Starbucks_Customer_Profiles_40k.csv

Exercise 1.2: Histograms

Q. Which age group among those making $40k or less has the most Starbucks customers?

# Histogram with 5 year bins
sns.histplot(customers, x='age', bins=range(20,100,5))

# Save Figure
plt.savefig('exercise_1_2_1.png')

Numerical Variables: Histograms

Q. Which countries drank an average amount of coffee?

> histogram bins make it impossible to see the exact values

Numerical Variables: Histograms

Q. Which countries drank the most coffee in 1999?

> again, histograms make it difficult to see statistical measures

Numerical Variables: Boxplots

Q. Which countries drank the most coffee in 1999?

> as we’ll see, boxplots can tell us about quartiles

> but boxplots are still pretty unclear for our question

Boxplots + Stripplots

Q. Which countries drank the most coffee in 1999?

> here we can see the datapoints directly with the boxplot

> each point represents a country’s coffee consumption

Boxplots + Stripplots

Q. Which countries drank the most coffee in 1999?

> each element of the boxplot represents one of these five quartiles

Boxplots + Stripplots

Which countries consumed more than 8 kg per capita?

Boxplots + Stripplots

Which countries consumed more than 8 kg per capita?

> we can highlight the relevant subsets of the data

Boxplots + Stripplots

Which country consumed the most coffee per capita?

> we can find the exact values according to quartiles

Boxplots + Stripplots

Which country consumed the most coffee per capita?

> we can find the exact values according to quartiles

> Finland consumed the most coffee per capita in 1999

Boxplots + Stripplots

Which country consumed the least coffee per capita?

Boxplots + Stripplots

Which country consumed the least coffee per capita?

> Russia consumed the least coffee per capita in 1999

Boxplots + Stripplots

How about the median?

Boxplots + Stripplots

How about the median?

> the US!

Boxplots + Stripplots

Which country consumes more than exactly 25% of countries?

Boxplots + Stripplots

Which country consumes more than exactly 25% of countries?

> Slovakia!

Boxplots + Stripplots

Which country consumes more than exactly 75% of countries?

Boxplots + Stripplots

Which country consumes more than exactly 75% of countries?

> Netherlands

Boxplots + Stripplots

Boxplots show quartiles; stripplots show the data.

Boxplots + Stripplots: Summary

Boxplots show quartiles; stripplots show the data.

Boxplots make it easy to show the quartiles.
Stripplots can show the distribution of the data.
We can highlight subsets of the data.

Exercise 1.2: Boxplots + Stripplots

Show the distribution of coffee consumption per capita in 2019.

Lets use a boxplot and stripplot to examine the distribution of coffee consumption per capita among coffee-importing countries in 2019.

Data: Coffee_Per_Cap_2019.csv

Exercise 1.2: Boxplots + Stripplots

Show the distribution of coffee consumption per capita in 2019.

# Boxplot with no outliers
sns.boxplot(coffee, x='Coffee_2019', whis=(0,100))

# Stripplot
sns.stripplot(coffee, x='Coffee_2019')

# Save Figure
plt.savefig('exercise_1_2_2.png')