The economist’s data analysis skillset.
Each row represents an observation of an entity at one point in time
| Code | Year | Consumption |
|---|---|---|
| AUT | 1990 | 10.47 |
| AUT | 1991 | 10.07 |
| AUT | 1992 | 9.27 |
| AUT | 1993 | 10.13 |
| AUT | 1994 | 8.21 |
| … | … | … |
> this is data on coffee consumption per capita, 34 countries, 1990 to 2019
Are countries drinking more coffee in 2019 than in 1999?
> how can we use this data to answer our question?
> it’s challenging… the information we need is spread across different rows
Are countries drinking more coffee?
> readable, but not great for answering our question
Is the world drinking more coffee?
> compared to what…?
> this is still pretty unclear: histograms aren’t great for comparison
> let's use a multi-boxplot
> this is better: it looks like the distribution is shifted higher!
> let's examine the years in between to see how the distribution evolved
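A multi-boxplot draws one box per year on a shared axis. The sketch below is a minimal illustration, not the code behind the slide's figure: it uses only the five countries and five years shown in the wide table later in this unit, and the variable names (`values`, `groups`, `box`) are assumptions.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
# The full dataset has 34 countries, so the real figure looks different.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

# One array of consumption values per year, in year order.
groups = [df.loc[df["Year"] == y, "Consumption"].to_numpy() for y in years]

fig, ax = plt.subplots()
box = ax.boxplot(groups, positions=list(range(len(years))))  # one box per year
ax.set_xticks(list(range(len(years))))
ax.set_xticklabels(years)
ax.set_xlabel("Year")
ax.set_ylabel("Coffee consumption (kg per capita)")
```

In a notebook the figure appears automatically; in a script, call `plt.show()` or `fig.savefig(...)`.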
> let's ask some smaller, more focussed questions
Which years show at least half consuming less than 5 kg per cap?
> focus on the medians
> … the years when the median is below 5 kg per cap
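The median check can also be done numerically. This is a minimal sketch on the five-country excerpt from the wide table later in this unit (not the full 34-country data), so the years it returns are only illustrative; the names `medians` and `below_5` are assumptions.

```python
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

# Median consumption per year: half of the countries sit at or below it.
medians = df.groupby("Year")["Consumption"].median()

# Years in which the median is under 5 kg per capita.
below_5 = medians[medians < 5].index.tolist()
print(below_5)
```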
Which years saw the largest jump in the median?
> … a little difficult to see
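When a jump is hard to read off the figure, differencing the medians makes it explicit. A minimal sketch, assuming the five-country excerpt from the wide table later in this unit; there the years are five apart, so each "jump" spans five years rather than one.

```python
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

medians = df.groupby("Year")["Consumption"].median()

# Change in the median relative to the previous year in the data.
# The first year has no predecessor, so its value is NaN.
jumps = medians.diff()

# Year with the largest increase; idxmax skips the NaN.
largest_year = jumps.idxmax()
print(largest_year, round(jumps.loc[largest_year], 2))
```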
Is the country with the lowest consumption consuming more today?
> focus on the minimums
> yes!
What patterns do we observe about the maximums?
> same with the maximums
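The whisker ends can be tabulated directly. A minimal sketch on the five-country excerpt from the wide table later in this unit; `extremes` and `lowest_rose` are hypothetical names. Note that the minimum in each year may belong to a different country.

```python
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

# Lowest and highest consumption in each year, side by side.
extremes = df.groupby("Year")["Consumption"].agg(["min", "max"])
print(extremes)

# Is the lowest value in 2019 higher than the lowest value in 1999?
lowest_rose = extremes.loc[2019, "min"] > extremes.loc[1999, "min"]
print(lowest_rose)
```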
Which years did more than 25% consume less than 5 kg?
> look at the 25th percentile (the bottom edge of each box) and compare it to 5 kg per cap
> all of them
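The same comparison in code: compute the 25th percentile per year and keep the years where it falls below 5. A minimal sketch on the five-country excerpt from the wide table later in this unit; pandas interpolates quantiles linearly by default, which can differ slightly from how a given plotting library draws the box edge.

```python
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

# 25th percentile (first quartile) of consumption in each year.
q1 = df.groupby("Year")["Consumption"].quantile(0.25)

# Years where the first quartile is under 5 kg per capita.
years_q1_below_5 = q1[q1 < 5].index.tolist()
print(years_q1_below_5)
```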
Which year saw the greatest difference between any two countries?
> look at the range and select the largest
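The range is the maximum minus the minimum within each year. A minimal sketch on the five-country excerpt from the wide table later in this unit, so the year it picks is illustrative only; `spread` and `widest_year` are assumed names.

```python
import pandas as pd

# Five-country excerpt copied from the wide table later in this unit.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])  # long format

by_year = df.groupby("Year")["Consumption"]

# Range per year: gap between the highest and lowest consuming country.
spread = by_year.max() - by_year.min()

# Year with the widest gap.
widest_year = spread.idxmax()
print(widest_year, round(spread.loc[widest_year], 2))
```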
In which year did most countries increase their coffee consumption?
> not visible in the figure!
How many countries increased their coffee consumption between 1999 and 2019?
> also not visible with this figure!
> better, but this figure still doesn’t let us keep track of countries between years…
> we need a way to compare each country’s consumption in 1999 vs 2019
> but in long format, this is hard: the data is spread across different rows
What if each year was its own column?
| Code | 1999 | 2004 | 2009 | 2014 | 2019 |
|---|---|---|---|---|---|
| AUT | 8.43 | 7.31 | 6.37 | 7.97 | 7.93 |
| BGR | 2.65 | 2.83 | 3.30 | 3.12 | 3.64 |
| HRV | 4.48 | 5.16 | 5.10 | 5.21 | 5.62 |
| CYP | 3.48 | 3.53 | 4.05 | 4.13 | 5.62 |
| CZE | 3.26 | 3.56 | 3.02 | 5.69 | 4.74 |
| … | … | … | … | … | … |
> now comparing 1999 vs 2019 is just comparing two columns!
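One way to get from long to wide format in pandas is `DataFrame.pivot`. A minimal sketch, assuming a long-format frame with the columns `Code`, `Year`, and `Consumption` and using only the five countries shown in the table above.

```python
import pandas as pd

# Long-format version of the five rows shown in the wide table above.
years = [1999, 2004, 2009, 2014, 2019]
values = {
    "AUT": [8.43, 7.31, 6.37, 7.97, 7.93],
    "BGR": [2.65, 2.83, 3.30, 3.12, 3.64],
    "HRV": [4.48, 5.16, 5.10, 5.21, 5.62],
    "CYP": [3.48, 3.53, 4.05, 4.13, 5.62],
    "CZE": [3.26, 3.56, 3.02, 5.69, 4.74],
}
rows = [(code, year, kg) for code, kgs in values.items() for year, kg in zip(years, kgs)]
df = pd.DataFrame(rows, columns=["Code", "Year", "Consumption"])

# One row per country, one column per year.
wide = df.pivot(index="Code", columns="Year", values="Consumption")
print(wide)

# Comparing two years is now a comparison of two columns.
print(wide[2019] > wide[1999])
```

`pivot` sorts the country codes alphabetically, so the row order differs from the table above. `DataFrame.melt` goes the other way, from wide back to long.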
Panel data can be stored in two ways
> long format: each observation is a row
> wide format: each year is a column
How many countries increased their coffee consumption between 1999 and 2019?
> a scatter plot can visualize changes between two points in time
> a 45° line marks every point where consumption did not change
> points above the line increased
How many countries decreased their coffee consumption between 1999 and 2019?
> points below the line decreased
Does the data confirm that the world is drinking more coffee?
> we can use colors to visualize both increases and decreases
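The scatterplot described above can be sketched as follows. This is an illustration under assumptions, not the slide's actual figure code: 1999 is placed on the x-axis and 2019 on the y-axis (so points above the line increased), the column names are the strings `"1999"` and `"2019"`, the colors are arbitrary, and only the five countries from the wide table are plotted.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Five-country excerpt from the wide table in this unit (1999 and 2019 only).
wide = pd.DataFrame(
    {"1999": [8.43, 2.65, 4.48, 3.48, 3.26],
     "2019": [7.93, 3.64, 5.62, 5.62, 4.74]},
    index=["AUT", "BGR", "HRV", "CYP", "CZE"],
)

# True where 2019 consumption exceeds 1999 consumption.
increased = wide["2019"] > wide["1999"]
colors = increased.map({True: "tab:green", False: "tab:red"}).tolist()

fig, ax = plt.subplots()
ax.scatter(wide["1999"], wide["2019"], c=colors)

# 45° line: points on it have identical consumption in both years.
top = wide.to_numpy().max() + 1
ax.plot([0, top], [0, top], color="gray", linestyle="--")

ax.set_xlabel("Consumption in 1999 (kg per capita)")
ax.set_ylabel("Consumption in 2019 (kg per capita)")
```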
Is the world drinking more coffee?
We’re going to use a scatterplot to visually examine how countries’ coffee consumption changed between 1999 and 2019.
Data file: Coffee_Per_Cap.csv
How many countries increased vs decreased?
> we can see visually that most points are above the 45° line
> but how do we count exactly how many?
How many countries increased vs decreased?
| Code | 1999 | 2019 | change |
|---|---|---|---|
| AUT | 8.43 | 7.93 | -0.50 |
| BGR | 2.65 | 3.64 | 0.99 |
| HRV | 4.48 | 5.62 | 1.14 |
| … | … | … | … |
> len() to count the filtered rows
> positive change = increased; negative change = decreased
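Putting the two notes together: compute the change column, filter on its sign, and count the rows with `len()`. A minimal sketch on the five-country excerpt from the wide table in this unit; the string column names `"1999"` and `"2019"` are an assumption about how the file loads, and the counts apply to this excerpt only.

```python
import pandas as pd

# Five-country excerpt from the wide table in this unit (1999 and 2019 only).
wide = pd.DataFrame(
    {"Code": ["AUT", "BGR", "HRV", "CYP", "CZE"],
     "1999": [8.43, 2.65, 4.48, 3.48, 3.26],
     "2019": [7.93, 3.64, 5.62, 5.62, 4.74]}
)

# Change in consumption between the two years.
wide["change"] = wide["2019"] - wide["1999"]

# Filter: keep only the rows that satisfy the condition, then count them.
n_increased = len(wide[wide["change"] > 0])
n_decreased = len(wide[wide["change"] < 0])
print(n_increased, n_decreased)
```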
How many countries increased their coffee consumption?
> df[df['col'] > 0] keeps only the rows where the condition holds, and len() counts them
What this unit adds to your toolkit
| Block | Part 1.5 |
|---|---|
| Variables | Numerical |
| Structures | Panel (wide format) |
| Operations | Filter |
| Visualizations | Multi-boxplot, Scatterplot with 45° line |
How your toolkit grew
| Block | 1.1 | 1.2 | 1.3 | 1.4 | 1.5 |
|---|---|---|---|---|---|
| Variables | Categorical | + Numerical | | | |
| Structures | Cross-section | | + Timeseries | + Panel (long) | + Panel (wide) |
| Operations | Count | + Bin, Mean, SD, Quartiles | + Real price transform | + Groupby | + Filter |
| Visualizations | Bar, Pie | + Histogram, Boxplot | + Line plot, Multi-boxplot | + Multi-line, Facets | + Scatterplot w/ 45° |
Everything you know
| Block | Part 1 |
|---|---|
| Variables | Categorical (binary, nominal, ordinal), Numerical |
| Structures | Cross-section, Timeseries, Panel (long & wide) |
| Operations | Count, Bin, Mean, SD, Quartiles, Real price transform, Groupby, Filter |
| Visualizations | Bar chart, Pie chart, Histogram, Boxplot, Stripplot, Line plot, Multi-boxplot, Multi-line, Facets, Scatterplot w/ 45° line |