The economist’s data analysis skillset.
Let’s summarize what we know about four datasets.
> same means, same standard deviations, same correlation between x and y…
> are these datasets the same?
Are these the same datasets?
Let’s take a look at the data in a Notebook!
Are these the same datasets?
> very different! summarizing variables isn’t enough
Summary statistics can hide important patterns
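The slides do not name the four datasets. A minimal sketch, assuming they resemble Anscombe's quartet (a classic example with exactly these properties), shows how to check that the summaries agree. The helper names `pearson` and `summarize` are illustrative, not from the slides.

```python
from statistics import mean, stdev

# Anscombe's quartet, assumed here as a stand-in for the four datasets.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def summarize(xs, ys):
    """Means, sample standard deviations, and correlation, rounded to 2 places."""
    return {
        "mean_x": round(mean(xs), 2), "mean_y": round(mean(ys), 2),
        "sd_x": round(stdev(xs), 2), "sd_y": round(stdev(ys), 2),
        "corr": round(pearson(xs, ys), 2),
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}
for name, stats in SUMMARIES.items():
    print(name, stats)
```

The printed rows should look nearly identical, which is the point: only a plot of each dataset reveals how different they are.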
Q. Is there a relationship between GDP and coffee production?
> maybe, but it’s hard to see
> let’s use a two-dimensional graph
Q. Is there a relationship between GDP and coffee production?
> two dimensions are nice, but the positions of the points don’t yet show a meaningful relationship
Q. Is there a relationship between GDP and coffee production?
> a scatterplot effectively visualizes cross-sectional data with two dimensions
Which countries have a GDP above $2 trillion?
> look at the horizontal axis and select all that are greater than 2
Which countries have a production above ½ billion kg?
> and we can use either axis
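Reading a threshold off an axis is the visual version of a filter. A small sketch, using made-up placeholder countries and values (not the data behind the slides) and a hypothetical helper `above`:

```python
# Hypothetical values for illustration only (not the slide's data):
# GDP in trillions of dollars, coffee production in billions of kg.
COUNTRIES = [
    {"name": "Country A", "gdp": 3.1, "production": 0.10},
    {"name": "Country B", "gdp": 1.8, "production": 2.60},
    {"name": "Country C", "gdp": 0.3, "production": 0.70},
    {"name": "Country D", "gdp": 2.4, "production": 0.30},
]


def above(rows, column, threshold):
    """Return the names of rows whose `column` value exceeds `threshold`."""
    return [row["name"] for row in rows if row[column] > threshold]


# Horizontal-axis question: GDP above $2 trillion.
print(above(COUNTRIES, "gdp", 2))
# Vertical-axis question: production above 1/2 billion kg.
print(above(COUNTRIES, "production", 0.5))
```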
Which countries produce less coffee per dollar than Brazil?
> we can also compare BETWEEN data points
Which countries produce less coffee per dollar than Brazil?
> separating lines can help make comparisons between ratios
Which countries produce more coffee per dollar than Brazil?
> separating lines can help make comparisons between ratios
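The separating line is a ray from the origin through the reference country: every point on it has the same coffee-per-dollar ratio. Points below the ray produce less per dollar and points above produce more. A sketch with made-up numbers, where "Reference" stands in for Brazil and `split_by_ratio` is a hypothetical helper:

```python
# Hypothetical (gdp, production) pairs; "Reference" stands in for Brazil.
POINTS = {
    "Reference": (2.0, 3.0),
    "Country A": (4.0, 1.0),
    "Country B": (0.5, 1.5),
    "Country C": (1.0, 1.0),
}


def split_by_ratio(points, reference):
    """Split countries by whether they fall below or above the ray
    from the origin through the reference point."""
    ref_gdp, ref_prod = points[reference]
    slope = ref_prod / ref_gdp  # coffee per dollar for the reference
    below, above = [], []
    for name, (gdp, prod) in points.items():
        if name == reference:
            continue
        # On the ray, production equals slope * gdp.
        # Points exactly on the ray are grouped with "above" here.
        (below if prod < slope * gdp else above).append(name)
    return below, above


less, more = split_by_ratio(POINTS, "Reference")
print("less coffee per dollar:", less)
print("more coffee per dollar:", more)
```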
Do the GDPs of the upper or lower pair differ by a larger amount?
> use distances along the horizontal axis to measure differences
Which is larger: the ratio of GDPs of the upper or lower pair?
> this question is difficult to answer with this scale
Which is larger: the ratio of GDPs of the upper or lower pair?
> a log scale makes RATIOS easier to visualize: each tick is 10x larger
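Why ratios become easy to compare: on a base-10 log axis, the distance between two values is the difference of their logs, which depends only on their ratio. A short sketch with made-up GDP pairs (the function names are illustrative):

```python
import math


def linear_gap(a, b):
    """Distance between a and b on a linear axis."""
    return abs(b - a)


def log_gap(a, b):
    """Distance between a and b on a base-10 log axis, in decades."""
    return abs(math.log10(b) - math.log10(a))


# Hypothetical GDP pairs (same units); each pair has a 10x ratio.
lower_pair = (0.01, 0.1)
upper_pair = (1.0, 10.0)

# On a linear axis the upper pair looks far more spread out...
print(linear_gap(*lower_pair), linear_gap(*upper_pair))
# ...but on a log axis both pairs are one decade apart.
print(log_gap(*lower_pair), log_gap(*upper_pair))
```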
Which country produces the second highest output of coffee?
> a log scale also makes it easier to see SCALING
Which country produces the second highest output of coffee?
> scaling the vertical axis in logs clarifies both small and large variation
Linear vs Log Scale: Same data, different views
> log scales reveal patterns hidden by outliers in linear scale
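One way to produce the side-by-side comparison is sketched below, assuming matplotlib is available (as in a typical notebook). The function name and axis labels are illustrative, and the inputs must be positive for the log panel.

```python
def plot_linear_vs_log(gdp, production):
    """Draw the same points on linear axes (left) and log-log axes (right)."""
    # Imported inside the function; matplotlib is an assumed dependency.
    import matplotlib.pyplot as plt

    fig, (ax_lin, ax_log) = plt.subplots(1, 2, figsize=(10, 4))

    ax_lin.scatter(gdp, production)
    ax_lin.set_title("Linear scale")

    ax_log.scatter(gdp, production)
    ax_log.set_xscale("log")  # same data, only the axis scale changes
    ax_log.set_yscale("log")
    ax_log.set_title("Log scale")

    for ax in (ax_lin, ax_log):
        ax.set_xlabel("GDP")
        ax.set_ylabel("Coffee production")
    return fig
```

In a notebook, calling `plot_linear_vs_log(gdp, production)` with two equal-length lists of positive numbers displays both panels.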
So far we’ve encoded two variables using position
We can use SIZE to encode a third numerical variable
> our standard scatterplot with position encoding
Each point’s SIZE now represents population
> larger bubbles = larger population
Indonesia and Brazil stand out — large countries with high production
> we can now see three variables at once: GDP, production, AND population
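A sketch of size encoding, assuming matplotlib. Its `s` argument is a marker area, so the hypothetical helper `marker_areas` scales area (not radius) in proportion to population. The `max_area` default is an arbitrary choice.

```python
def marker_areas(values, max_area=600.0):
    """Scale values so the largest gets `max_area` (in points squared)
    and the others get proportionally smaller areas."""
    largest = max(values)
    return [max_area * v / largest for v in values]


def plot_bubbles(gdp, production, population):
    """Scatterplot with position for GDP and production, size for population."""
    # Imported inside the function; matplotlib is an assumed dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter(gdp, production, s=marker_areas(population), alpha=0.6)
    ax.set_xlabel("GDP")
    ax.set_ylabel("Coffee production")
    return fig


# Made-up populations: the largest gets the full area, the rest scale down.
print(marker_areas([50, 100, 200]))
```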
We can also use COLOR to encode a third numerical variable
Each point’s COLOR now represents population
> darker points = larger population
Color makes it easy to spot high-population countries
> Brazil, Indonesia, and Ethiopia stand out as darker points
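Color encoding can be sketched the same way, again assuming matplotlib. The "Blues" colormap is one possible choice of a sequential map in which larger values are drawn darker; the slides do not say which colormap they use.

```python
def plot_colored(gdp, production, population):
    """Scatterplot with position for GDP and production, color for population."""
    # Imported inside the function; matplotlib is an assumed dependency.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    # Sequential colormap: larger population values map to darker blue.
    points = ax.scatter(gdp, production, c=population, cmap="Blues",
                        edgecolors="gray")
    fig.colorbar(points, ax=ax, label="Population")
    ax.set_xlabel("GDP")
    ax.set_ylabel("Coffee production")
    return fig
```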
Visualizing GDP and Coffee Production Relationships
Was the relationship between coffee production and GDP different in 1980?
Beans_GDP_1980.csv
How does GDP relate to coffee production?
Alternative: use log scale without transforming
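The two options can be sketched side by side, assuming matplotlib: transform the values with log10 and plot on linear axes, or keep the raw values and switch the axes to log scale. The point pattern is the same; only the tick labels differ (exponents versus original units). Function names are illustrative.

```python
import math


def log10_all(values):
    """Transform the data itself; the axis then shows exponents."""
    return [math.log10(v) for v in values]


def plot_two_ways(gdp, production):
    """Left: log10-transformed data on linear axes. Right: raw data on log axes."""
    # Imported inside the function; matplotlib is an assumed dependency.
    import matplotlib.pyplot as plt

    fig, (ax_t, ax_s) = plt.subplots(1, 2, figsize=(10, 4))

    # Option 1: transform the values, keep linear axes.
    ax_t.scatter(log10_all(gdp), log10_all(production))
    ax_t.set_xlabel("log10(GDP)")
    ax_t.set_ylabel("log10(production)")

    # Option 2: keep the raw values, switch the axes to log scale.
    ax_s.scatter(gdp, production)
    ax_s.set_xscale("log")
    ax_s.set_yscale("log")
    ax_s.set_xlabel("GDP")
    ax_s.set_ylabel("Production")
    return fig
```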
How does GDP relate to coffee production?
How do the two commodity prices relate to each other?
> difficult to tell because of the axis scale
In which years did oil and coffee prices move in opposite directions?
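Read off a line plot, "opposite directions" means one series rose from the previous year while the other fell. A sketch of that check, with made-up prices and a hypothetical helper `opposite_years`:

```python
def opposite_years(years, price_a, price_b):
    """Years in which the two series changed in opposite directions
    relative to the previous year."""
    result = []
    for i in range(1, len(years)):
        change_a = price_a[i] - price_a[i - 1]
        change_b = price_b[i] - price_b[i - 1]
        if change_a * change_b < 0:  # one rose while the other fell
            result.append(years[i])
    return result


# Hypothetical prices for illustration only.
years = [2001, 2002, 2003, 2004]
oil = [20.0, 25.0, 24.0, 30.0]
coffee = [1.0, 0.9, 0.8, 1.1]
print(opposite_years(years, oil, coffee))
```

Years in which either price is unchanged are not counted, since the product of the changes is zero.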
But are the two prices positively or negatively related to each other?
> this is difficult to see with just a Multi-Lineplot…
But are the two prices positively or negatively related to each other?
> a Scatterplot can show the relationship between two variables through time
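The direction of the cloud in such a scatterplot corresponds to the sign of the correlation between the two series. A minimal sketch using Pearson correlation as one possible summary (the slides do not specify a measure; `relation` is a hypothetical helper):

```python
def relation(xs, ys):
    """Return 'positive', 'negative', or 'none' based on the sign of the
    Pearson correlation between xs and ys."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The covariance numerator carries the sign of the correlation.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if cov > 0:
        return "positive"
    if cov < 0:
        return "negative"
    return "none"


# Made-up series: the first pair moves together, the second moves oppositely.
print(relation([1, 2, 3], [2, 4, 6]))
print(relation([1, 2, 3], [3, 2, 1]))
```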
Does the price of oil determine the price of coffee?
> a Scatterplot can only show association, not causation :(
Visualizing Coffee Prices and Oil Prices
We’re going to use a scatterplot to visually examine the relationship between coffee prices and oil prices.
Coffee_Oil.csv
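A sketch of reading two columns for the scatterplot with the standard library. The column names `oil` and `coffee` and the sample rows are assumptions, since the layout of Coffee_Oil.csv is not shown here; substitute the names from the file's header row.

```python
import csv
import io


def read_columns(handle, x_col, y_col):
    """Read two numeric columns from an open CSV file, skipping rows
    where either cell is blank."""
    xs, ys = [], []
    for row in csv.DictReader(handle):
        if row[x_col] and row[y_col]:
            xs.append(float(row[x_col]))
            ys.append(float(row[y_col]))
    return xs, ys


# Stand-in for Coffee_Oil.csv; column names and values are hypothetical.
sample = io.StringIO("year,oil,coffee\n2001,20.0,1.0\n2002,25.0,0.9\n2003,,0.8\n")
oil, coffee = read_columns(sample, "oil", "coffee")
print(list(zip(oil, coffee)))
```

With the real file, the same function can be called as `read_columns(open("Coffee_Oil.csv", newline=""), x_col, y_col)`, and the two lists passed to a scatterplot.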