| | new | Approximately how many miles away from Pittsburgh is your hometown? |
|---|---|---|
| 0 | 400.0 | 400 |
| 1 | 16.0 | 16 |
| 2 | 300.0 | 300 |
| 3 | 300.0 | 300 |
| 4 | 400.0 | 400 |

The economist’s data analysis skillset.
Q. Are students who live further away older?
Let’s examine age and distance from Pittsburgh.

> the birthday data is stored as text: “08/15/2005”
> we need to extract the year to calculate age
Extracting useful information from text
What we have: “08/15/2005”
What we need: 2005
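
One way to go from the text "08/15/2005" to the number 2005 is sketched below. The sample values are made up for illustration, and the month/day/year format is an assumption based on the example above.

```python
import pandas as pd

# Hypothetical sample of birthday answers stored as text.
birthdays = pd.Series(["08/15/2005", "01/02/2004", "not sure"])

# Parse month/day/year text; entries that do not match become NaT (missing).
parsed = pd.to_datetime(birthdays, format="%m/%d/%Y", errors="coerce")

# .dt.year returns the year as a number (NaN where parsing failed).
birth_year = parsed.dt.year
print(birth_year)
```

Parsing the whole date first, rather than slicing the last four characters, also flags entries that are not dates at all.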

> lots of different formats!
Answers can be in many creative forms…
> computers can’t do math with text
Converting text to numbers
We can convert text to numbers, forcing errors to become NA.
> entries like “very far” become NA
> entries like “500” become 500.0
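
A minimal sketch of this conversion with `pd.to_numeric`, using a few made-up answers rather than the real survey column:

```python
import pandas as pd

# Hypothetical answers, all stored as text.
answers = pd.Series(["500", "16", "very far", "350 miles"])

# errors="coerce" turns anything that is not a plain number into NaN
# instead of raising an error.
distance = pd.to_numeric(answers, errors="coerce")
print(distance)
```

Note that "350 miles" also becomes NaN: the conversion only succeeds when the entire entry is a number.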
What happened to the non-numeric entries?

| | new | Approximately how many miles away from Pittsburgh is your hometown? |
|---|---|---|
| 0 | 400.0 | 400 |
| 1 | 16.0 | 16 |
| 2 | 300.0 | 300 |
| 3 | 300.0 | 300 |
| 4 | 400.0 | 400 |

What happened to the non-numeric entries?

| | new | Approximately how many miles away from Pittsburgh is your hometown? |
|---|---|---|
| 6 | NaN | 176 miles away |
| 17 | NaN | 0 (it’s Pittsburgh) |
| 18 | NaN | 400-450ish miles |
| 22 | NaN | 350 miles |
| 23 | NaN | 240 miles |
> they all became NaN ("Not a Number", which pandas uses to mark missing values)
> we need to decide what to do with them
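
Before deciding, it helps to list exactly which answers failed to convert. The sketch below uses made-up answers; the variable names are illustrative only.

```python
import pandas as pd

# Hypothetical raw answers stored as text.
raw = pd.Series(["400", "176 miles away", "16", "350 miles"])
converted = pd.to_numeric(raw, errors="coerce")

# Keep the original text wherever conversion produced NaN.
# raw.notna() excludes answers that were blank to begin with.
failed = raw[converted.isna() & raw.notna()]
print(failed.tolist())
```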
Two main approaches
Once the problematic values have become NaN, there are generally two options.
Option 1: Drop the missing values
Option 2: Replace with a value
> for distance, dropping makes sense - we can’t guess locations
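
Both options can be sketched in pandas as follows. The data are made up, and filling with the median is only one possible choice of replacement value, shown here for comparison.

```python
import pandas as pd

# Hypothetical cleaned distances with two missing values.
df = pd.DataFrame({"Distance_Clean": [400.0, None, 16.0, None, 300.0]})

# Option 1: drop the rows where the distance is missing.
dropped = df.dropna(subset=["Distance_Clean"])

# Option 2: replace missing distances with a value
# (here the median of the observed distances, as an illustration).
filled = df["Distance_Clean"].fillna(df["Distance_Clean"].median())

print(len(dropped))
print(filled.tolist())
```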
Q. Are students who live further away older?
> as expected, there does not seem to be much of a relationship
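
One quick way to check for a relationship is a correlation coefficient. The sketch below uses made-up values, so its result says nothing about the actual survey. The column name `Birth_Year` is hypothetical, and computing age as 2025 minus birth year is an assumption based on the survey term.

```python
import pandas as pd

# Hypothetical cleaned data; not the real survey responses.
df = pd.DataFrame({
    "Birth_Year": [2005, 2004, 2006, 2005, 2003],
    "Distance_Clean": [400.0, 16.0, 300.0, 7293.0, 250.0],
})

# Approximate age in the survey year (assumed to be 2025).
df["Age"] = 2025 - df["Birth_Year"]

# Pearson correlation: values near 0 suggest little linear relationship.
r = df["Age"].corr(df["Distance_Clean"])
print(round(r, 2))
```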
Some common data cleaning operations
Let’s find the median birth year and the mean hometown distance from Pittsburgh.
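
Once both columns are numeric, the two summary statistics are one line each. This is a sketch on made-up data; `Birth_Year` is a hypothetical column name.

```python
import pandas as pd

# Hypothetical cleaned data with one missing distance.
df = pd.DataFrame({
    "Birth_Year": [2005, 2004, 2005, 2006],
    "Distance_Clean": [400.0, 16.0, None, 300.0],
})

median_birth_year = df["Birth_Year"].median()
# mean() skips NaN by default, so the missing distance is ignored.
mean_distance = df["Distance_Clean"].mean()

print(median_birth_year, mean_distance)
```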
Fall_2025_Survey_raw.csv
Extract year from birthday text
Convert distance text to numbers
Two approaches to NAs
Replace with a value:
Convert distance text to numbers
# Replace non-numeric
replacements = {
'400-450ish miles ': 400,
'live in pittsburgh': 0,
'176 miles away': 176,
'0 (it’s Pittsburgh)': 0,
'350 miles': 350,
'240 miles': 240,
'388 miles': 388,
'17 miles': 17,
'300 miles': 300,
'7293 mi': 7293,
'4 miles ': 4,
'27 miles': 27,
'255 (near Philly)': 255,
'4,000': 4000,
'650 miles': 650,
'250 miles': 250,
'318 mi': 318,
'300 mi': 300,
'1000+': 1000,
'305 miles': 305
}
data['Distance_Clean'] = data['Approximately how many miles away from Pittsburgh is your hometown?'].replace(replacements)
data['Distance_Clean'] = pd.to_numeric(data['Distance_Clean'])
Check that it worked