The economist’s data analysis skillset.
Exploring real questions with real data
Lets start working through a project together:
> this is the kind of analysis you’ll do for your final project
We have two separate datasets that need to be connected
Dataset 1: Presidential election results by county
Dataset 2: Median household income by county
Q. What do county-level vote shares look like?
First understand what we’re working with.
Q. How did county vote shares change between 2020 and 2024?
Second, lets look at the relationship between Democratic Share in 2020 and 2024.
# Scatterplot comparing elections
sns.scatterplot(elections, x='2020', y='2024', alpha=0.5)
# Add 45-degree line
plt.plot([0,1], [0,1], 'r--', alpha=0.5)> points above the line shifted more Democratic
> points below the line shifted more Republican
> but what explains these shifts?
Q. Does county income relate to voting shifts?
To answer this, we need to:
> but how do we connect two separate datasets?
Combining datasets based on common identifiers
The Key: Find a common column that uniquely identifies observations
> example: Allegheny County, PA = 42003
Different ways to combine datasets
| Merge Type | Description | Example |
|---|---|---|
| 1:1 | Each row in A matches exactly one row in B | County → County |
| 1:m | One row in A matches multiple rows in B | State → Counties |
| m:1 | Multiple rows in A match one row in B | Counties → State |
> our county merge is 1:1 - each county appears once in each dataset
Combining our datasets
# Merge datasets on county FIPS
data = pd.merge(elections,
income,
left_on='county_fips',
right_on='county_fips',
how='inner')Merge options:
inner: Keep only counties in both datasetsleft: Keep all counties from elections dataright: Keep all counties from income dataouter: Keep all counties from either dataset> we use ‘inner’ to focus on counties with complete data
Creating our analysis variable
> now we can explore the relationship with income
Q. Does county income relate to voting shifts?
# Scatterplot of income vs vote shift
sns.scatterplot(data,
x='median_income',
y='dem_shift',
alpha=0.3)
# Add horizontal line at zero
plt.axhline(y=0, color='r', linestyle='--', alpha=0.5)
plt.xlabel('Median Household Income ($)')
plt.ylabel('Change in Democratic Vote Share (2024-2016)')> what patterns do you see?
Watch out for these problems
Merging allows us to answer richer questions
Part 2: Data Operations Practice
Which products remain after filtering?
| Product_ID | Category | Price | In_Stock |
|---|---|---|---|
| P001 | Electronics | 299 | True |
| P002 | Clothing | 49 | False |
| P003 | Electronics | 89 | True |
| P004 | Food | 12 | True |
| P005 | Clothing | 79 | True |
Filter: (Price < 100) AND (In_Stock == True)
Answer: P003, P004
Track data through multiple transformations
| Sale_ID | Store | Amount |
|---|---|---|
| S001 | North | 120 |
| S002 | South | 80 |
| S003 | North | 150 |
| S004 | South | 90 |
| S005 | North | 100 |
Operations:
Answer: North, 125
What cleaning is needed for each entry?
| Response_ID | Duration |
|---|---|
| R001 | “5 minutes” |
| R002 | “180” |
| R003 | “about 3 min” |
| R004 | “N/A” |
For each entry, select ALL that apply:
R001: [Extract number] [Remove text] [Convert type] [Handle missing] [Already clean]
R002: [Extract number] [Remove text] [Convert type] [Handle missing] [Already clean]
R003: [Extract number] [Remove text] [Convert type] [Handle missing] [Already clean]
R004: [Extract number] [Remove text] [Convert type] [Handle missing] [Already clean]
Construct the correct boolean logic
Goal: Find all employees who:
Use these components to construct a filter:
Answer: (1 OR 3) AND 2 AND 4
Why transform data?
Scenario: Comparing test scores across different schools where class sizes vary dramatically (10-50 students)
You have:
Which transformation makes schools comparable?
Answer: c) Creates average score per student
What will the grouped data look like?
| Order_ID | Customer | Amount | Region |
|---|---|---|---|
| O001 | Alice | 50 | East |
| O002 | Bob | 30 | West |
| O003 | Alice | 70 | East |
| O004 | Charlie | 40 | East |
| O005 | Bob | 60 | West |
We’ve Grouped by Customer then Summed by Amount.
How many rows in output? _____
What’s the sum for Bob? _____
Which customer has highest total? _____
Answers: 3 rows, 90, Alice (120)
Key concepts to remember
Filtering:
Grouping:
Key concepts to remember
Transformations:
Data Cleaning: