ECON 0150 | Economic Data Analysis

The economist’s data analysis skillset.


Part 2.1 | Data Cleaning

Data Cleaning

Q. Are students who live further away older?

Data Cleaning

Q. Are students who live further away older?

Let’s examine age and distance from Pittsburgh.

Data Cleaning

Q. Are students who live further away older?

Let’s examine age and distance from Pittsburgh.

> the birthday data is stored as text: “08/15/2005”

> we need to extract the year to calculate age

String Parsing

Extracting useful information from text

What we have: “08/15/2005”

What we need: 2005

Distance from Pittsburgh

Q. Are students who live further away older?

> lots of different formats!

Distance from Pittsburgh

Answers can be in many creative forms…

  • “0 miles”
  • “~500”
  • “about 1000”
  • “2.5 hours”
  • “very far”

> computers can’t do math with text

Type Conversion

Converting text to numbers

We can convert text to numbers, forcing errors to become NA.

> entries like “very far” become NA

> entries like “500” become 500.0

Missing Values

What happened to the non-numeric entries?

new Approximately how many miles away from Pittsburgh is your hometown?
0 400.0 400
1 16.0 16
2 300.0 300
3 300.0 300
4 400.0 400

Missing Values

What happened to the non-numeric entries?

new Approximately how many miles away from Pittsburgh is your hometown?
6 NaN 176 miles away
17 NaN 0 (it’s Pittsburgh)
18 NaN 400-450ish miles
22 NaN 350 miles
23 NaN 240 miles

> they all became NaN (Not Available)

> we need to decide what to do with them

Handling Missing Values

Two main approaches

After replacing problematic values, there are generally two options.

Option 1: Drop the missing values

  • Removes entire rows with NA
  • Reduces sample size
  • Simple and clean

Option 2: Replace with a value

  • Fill with 0, mean, or median
  • Keeps sample size
  • May introduce bias

> for distance, dropping makes sense - we can’t guess locations

After Cleaning

Q. Are students who live further away older?

> as expected, there does not seem to be much of a relationship

Summary

Some common data cleaning operations



  • String Parsing: Extract information from text
  • Type Conversion: Change text to numbers
  • Missing Values: Drop or replace NAs

Exercise 2.1 | Data Cleaning



Let’s find the median birthyear and the mean hometown distance from Pittsburgh.

  • Data: Fall_2025_Survey_raw.csv

Exercise 2.1 | Birthday to Birthyear

Extract year from birthday text

# Convert birthday to datetime
survey['Birthday'] = pd.to_datetime(survey['When is your birthday?'])
# Extract year from date
survey['Birthyear'] = survey['Birthday'].dt.year
sns.histplot(survey, x='Birthyear')

Exercise 2.1 | Distance Conversion (Simple)

Convert distance text to numbers

# Convert to numeric, errors become NA
survey['Distance'] = pd.to_numeric(survey['Approximately how many miles away from Pittsburgh is your hometown?'], errors='coerce')
# Check how many became NA
survey['Distance'].isna().sum()

Exercise 2.1 | Handle Missing Values

Two approaches to NAs

Drop missing values:

# Remove rows where distance is NA
survey_dropped = survey.dropna()

Replace with a value:

# Replace NA with median distance
median_dist = survey['Distance'].median()
survey['Distance_median'] = survey['Distance'].fillna(median_dist)
# Or replace with 0
survey['Distance_zero'] = survey['Distance'].fillna(0)

Exercise 2.1 | Distance Conversion (Replace)

Convert distance text to numbers

# Replace non-numeric
replacements = {
    '400-450ish miles ': 400, 
    'live in pittsburgh': 0,
    '176 miles away': 176,
    '0 (it’s Pittsburgh)': 0,
    '350 miles': 350,
    '240 miles': 240,
    '388 miles': 388,
    '17 miles': 17,
    '300 miles': 300,
    '7293 mi': 7293,
    '4 miles ': 4,
    '27 miles': 27,
    '255 (near Philly)': 255,
    '4,000': 4000,
    '650 miles': 650,
    '250 miles': 250,
    '318 mi': 318,
    '300 mi': 300,
    '1000+': 1000,
    '305 miles': 305
}
data['Distance_Clean'] = data['Approximately how many miles away from Pittsburgh is your hometown?'].replace(replacements)
data['Distance_Clean'] = pd.to_numeric(data['Distance_Clean'])
# Check how many became NA
survey['Distance_Clean'].isna().sum()

Exercise 2.1 | Scatterplot

Check that it worked

# Create scatterplot
sns.scatterplot(data, x='Distance', y='age')