concept_1

ECON 0150 | Economic Data Analysis

The economist’s data analysis pipeline.

Part 1.4 | Panel Data

Panel Data: Long Format vs Wide Format

Panel data comes in one of two formats.

Panel Data in Long Format uses lists each entry as a row, using a column (eg. Shop) to record the group.

	Hours	Shop
transaction_id
7	12	Shop A
11	15	Shop A
19	14	Shop A
32	16	Shop A
33	19	Shop A

Panel Data: Long Format vs Wide Format

Panel data comes in one of two formats.

Panel Data in Wide Format uses lists each group as a row, using a column to record each entry.

	1999	2019
Code
AUT	8.430589	7.925747
BGR	2.652661	3.638313
HRV	4.480790	5.623266
CYP	3.477888	5.615070
CZE	3.255587	4.739563

Hiring a Barista

Use Coffee_Sales_Receips.csv to help inform where to hire a barista.

You manage three coffee shops and are considering where to hire a new barista.
You have a dataset containing information about the transactions taking place at all three coffee shops throughout the day.
Lets consider how to use this data to inform our decision.

Hiring a Barista

Q. Which coffee shop is the busiest?

Hiring a Barista: Bar Graphs Compare Shops

Q. Which coffee shop is the busiest?

> a bar chart makes it easy to compare shops’ busyness

Hiring a Barista

Q. What time of day is the busiest?

Hiring a Barista: Histograms Can Compare Times

Q. What time of day is the busiest?

> a histogram makes it easy to compare transactions by time of day

> does this mean the morning shift at Shop A is the busiest?

Hiring a Barista

Q. Which shift is the busiest?

Hiring a Barista: Transactions by Shop

Q. Which shift is the busiest?

> an overlaid histogram can show all three groups

> does this show the data clearly?

Hiring a Barista: Transactions by Shop

Q. Which shift is the busiest?

> instead, lets use a line graph

Part 1.4 | Panel Data Using Line Graphs

Summary

Categorical variables and continuous variables can give us different views of the same data.
We can visualize both views one the same graph.
Line graphs help simplify the visualization of multiple categories.

Exercise 1.4 | Coffee Shop Transactions

Use Coffee_Sales_Receips.csv to help inform where to hire a barista.

# Load Dataset
sales = pd.read_csv(file_path + 'Coffee_Sales_Reciepts.csv')
sales.head()

	Hours	Shop
0	12	Shop A
1	15	Shop A
2	14	Shop A
3	16	Shop A
4	19	Shop A

> you’ll see a few more columns in your dataset

> this is Long-Format Panel Data: transactions are all in the same column

Exercise 1.4 | Bar Chart

Use Coffee_Sales_Receips.csv to help inform where to hire a barista.

# Bar graph
sns.countplot(sales, x='Shop', hue='Shop')

Exercise 1.4 | Histogram

Use Coffee_Sales_Receips.csv to help inform where to hire a barista.

# Create a histogram
sns.histplot(sales, x='Hours', bins=range(0,24,1))

Exercise 1.4 | Multi-Histogram

Use Coffee_Sales_Receips.csv to help inform where to hire a barista.

# Create a multi-histogram
sns.histplot(sales, x='Hours', hue='Shop', bins=range(0,24,1))

Exercise 1.4 | Count Hourly by Shop

Q. Which shift is the busiest?

# Create hourly counts by shop
hourly_counts = sales.groupby(['Shop', 'Hours']).size().reset_index(name='Count')

	Shop	Hours	Count
0	Shop A	7	1383
1	Shop A	8	1632
2	Shop A	9	1693
3	Shop A	10	1711
4	Shop A	11	1136

# Multi-Lineplot
sns.lineplot(hourly_counts, x='Hours', y='Hours', hue='Shop')

Exercise 1.4 | Multiple Line Graph

Q. Which shift is the busiest?

# Multiple-Line Graph
sns.histplot(sales, x='Hours', hue='Shop', bins=range(0,24,1), element='poly')

Exercise 1.4 | Multiple Line Graph

Q. Which shift is the busiest?

# Multiple-Line Graph
sns.histplot(sales, x='Hours', hue='Shop', bins=range(0,24,1), element='poly', fill=False)

Panel Data: Coffee Consumption Per Capita

Is the world drinking more coffee?

Lets examine whether the world is drinking more coffee today than in the 1990s.

Data: Coffee_Per_Cap.csv

Panel Data: Coffee Consumption Per Capita

Is the world drinking more coffee?

> compared to what…?

Panel Data: Coffee Consumption Per Capita

Is the world drinking more coffee?

> this is still pretty unclear: histograms aren’t great for comparison

> lets use a multi-boxplot

Panel Data: Multi-Boxplots

Is the world drinking more coffee?

> this is better: it looks like the distribution is shifted higher!

> lets examine the years in between to see how the distribution evolved

Panel Data: Multi-Boxplots

Is the world drinking more coffee?

> lets ask some smaller more focussed questions

Panel Data: Multi-Boxplots

Which years show at least half consuming less than 5 kg per cap?

Panel Data: Multi-Boxplots

Which years show at least half consuming less than 5 kg per cap?

> focus on the medians

Panel Data: Multi-Boxplots

Which years show at least half consuming less than 5 kg per cap?

> … when the median is above 5 kg per cap

Panel Data: Multi-Boxplots

Which years saw the largest jump in the median?

> … a little difficult to see

Panel Data: Multi-Boxplots

Which years saw the largest jump in the median?

> … a little difficult to see

Panel Data: Multi-Boxplots

Is the country with the lowest consumption consuming more today?

Panel Data: Multi-Boxplots

Is the country with the lowest consumption consuming more today?

> focus on the minimums

> yes!

Panel Data: Multi-Boxplots

What patterns do we observe about the maximums?

> same with the maximums

Panel Data: Multi-Boxplots

Which years did more than 25% consume less than 5 kg?

Panel Data: Multi-Boxplots

Which years did more than 25% consume less than 5 kg?

> look at the 25%

Panel Data: Multi-Boxplots

Which years did more than 25% consume less than 5 kg?

> look at the 25% and compare it to 5 kg per cap

Panel Data: Multi-Boxplots

Which years did more than 25% consume less than 5 kg?

> all of them

Panel Data: Multi-Boxplots

Which year saw the greatest difference between any two countries?

> look at the range

Panel Data: Multi-Boxplots

Which year saw the greatest difference between any two countries?

> look at the range

Panel Data: Multi-Boxplots

Which year saw the greatest difference between any two countries?

> look at the range and select the largest

Exercise 1.4 | Multi-Boxplots

Is the world drinking more coffee?

We’re going to use a set of boxplots to visually compare across years the distributions of coffee consumption per capital among coffee importing countries.

Data: Coffee_Per_Cap.csv

	Code	1999	2009	2019
0	AUT	8.430589	6.371562	7.925747
2	BGR	2.652661	3.296419	3.638313
3	HRV	4.480790	5.100831	5.623266
4	CYP	3.477888	4.050500	5.615070
5	CZE	3.255587	3.016104	4.739563

> this is Wide-Format Panel Data: each year is in a separate column

Exercise 1.4 | Multi-Boxplots

Is the world drinking more coffee?

With wide-format panel data seaborn looks a little different.

# Wide Format Multi-Boxplot
sns.boxplot(percap[['1999','2004','2009','2014','2019']], orient='h', whis=(0, 100))

Panel Data: Multi-Boxplots

In which year did most countries increase their coffee consumption?

> not visible in the figure!

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

> also not visible with this figure!

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

> better, but this figure still doesn’t let us keep track of countries between years…

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

> a scatter plot can visualize changes between two points in time

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

> a 45 degree line shows all the possible points with no change

Panel Data: Relationships Between Years

How many countries increased their coffee consumption between 1999 and 2019?

> points above the line increased

Panel Data: Relationships Between Years

How many countries decreased their coffee consumption between 1999 and 2019?

> points below the line decreased

Panel Data: Relationships Between Years

Does the data confirm that the world is drinking more coffee?

> we can use colors to visualize both increases and decreases

Exercise 1.4 | Scatterplots

Is the world drinking more coffee?

We’re going to use a scatterplot to visually examine how countries’ coffee consumption changed between 1999 and 2019.

Data: Coffee_Per_Cap.csv

Exercise 1.4 | Scatterplots

Is the world drinking more coffee?

# Wide Format Scatterplot
sns.scatterplot(percap, x='1999', y='2019')

Exercise 1.4 | Scatterplots

Is the world drinking more coffee?

# Wide Format Scatterplot
sns.scatterplot(percap, x='1999', y='2019')

Part 1.4 | Panel Data Using Scatterplots

Summary

Multi-Boxplots can help visualize changes in the distribution, but cannot track individual changes.
Scatterplots can show how repeated observations change through time within a single unit.
A 45 degree line and colors can help visually communicate changes.