
The economist’s data analysis skillset.
Which hospital would you choose?
Hospital A has a higher survival rate, so it seems like the obvious choice.
Hospital A looks better…
But is this the whole story?
What if we break down survival rate by case severity?
Hospital B is better in both groups!
Mild cases: 98% survival at B vs. 95% at A
Severe cases: 60% survival at B vs. 50% at A
Why does this happen?
Hospital A treats mostly mild cases, while Hospital B treats mostly severe cases, so A's overall rate is dominated by patients who were likely to survive anyway.
This is Simpson's paradox: a trend in aggregated data reverses when the data are split into groups.
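The reversal in the hospital example can be reproduced with a few lines of arithmetic. The sketch below is only illustrative: the patient counts are hypothetical (the slides give survival rates, not counts) and were chosen so that the within-group rates match the 95%/98% and 50%/60% figures above.

```python
# Hypothetical patient counts (NOT from the slides). Only the within-group
# survival rates match the figures quoted in the text.
# Each entry is (patients, survivors).
cases = {
    "A": {"mild": (900, 855), "severe": (100, 50)},
    "B": {"mild": (100, 98), "severe": (900, 540)},
}


def group_rate(hospital, severity):
    """Survival rate within one severity group of one hospital."""
    patients, survivors = cases[hospital][severity]
    return survivors / patients


def overall_rate(hospital):
    """Survival rate pooled over all severity groups of one hospital."""
    total_patients = sum(p for p, _ in cases[hospital].values())
    total_survivors = sum(s for _, s in cases[hospital].values())
    return total_survivors / total_patients


for hospital in cases:
    print(
        f"Hospital {hospital}: "
        f"overall {overall_rate(hospital):.1%}, "
        f"mild {group_rate(hospital, 'mild'):.1%}, "
        f"severe {group_rate(hospital, 'severe'):.1%}"
    )
```

With these assumed counts, Hospital A wins on the pooled rate while Hospital B wins within each severity group, because the pooled rate weights each group by how many patients of that type the hospital treats.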
Why we need to control for other variables
Let's use the general linear model to test for differences in wages by gender.
Questions:
The simplest model with just a gender indicator.

\[\text{Wage} = \beta_0 + \beta_1 \times \text{Male} + \varepsilon\]
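To see what this model estimates, take the expected wage in each group. This assumes the error term has mean zero within each gender, which is the usual exogeneity assumption and is stated here as an assumption rather than taken from the slides:

\[E[\text{Wage} \mid \text{Male} = 0] = \beta_0, \qquad E[\text{Wage} \mid \text{Male} = 1] = \beta_0 + \beta_1\]

So β₁ is the raw difference in average wages between men and women, with nothing else held fixed.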
Implementing the basic gender gap model
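import statsmodels.formula.api as smf
# `data` is assumed to be a pandas DataFrame with INCLOG10 and MALE columns
# Fit the model with just the male indicator
model1 = smf.ols('INCLOG10 ~ MALE', data=data).fit()
# Print only the coefficient table
print(model1.summary().tables[1])

Adding education as a control variable.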

\[\text{Wage} = \beta_0 + \beta_1 \times \text{Education} + \beta_2 \times \text{Male} + \varepsilon\]
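> β₀ is the baseline wage for women (Male = 0) with no post-middle-school education (Education = 0)
> β₁ is the return to education: the change in wage per additional unit of education
> β₂ is the gender wage gap at a given level of education, added to the intercept for men only
> The model assumes parallel lines: the same return to education (β₁) for everyone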
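The effect of adding the control can be checked on a toy data set. The sketch below is an illustration under stated assumptions, not the course analysis: the data are synthetic and noise-free, the coefficients (0.10 per unit of education, 0.05 for men) are made up, and it uses NumPy's least-squares solver instead of statsmodels so that it runs without the course data. The helper `ols` is a hypothetical name introduced here.

```python
import numpy as np

# Synthetic, noise-free data (an assumption for illustration only).
# The men in this toy sample have one more unit of education on average.
edu = np.array([0, 1, 2, 3, 1, 2, 3, 4], dtype=float)
male = np.array([0, 0, 0, 0, 1, 1, 1, 1], dtype=float)
wage = 1.0 + 0.10 * edu + 0.05 * male  # made-up "true" coefficients


def ols(y, *regressors):
    """Least-squares coefficients, with an intercept prepended."""
    X = np.column_stack([np.ones_like(y), *regressors])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


b_simple = ols(wage, male)        # [intercept, male]
b_control = ols(wage, edu, male)  # [intercept, edu, male]

print(f"Gap without control:  {b_simple[1]:.3f}")
print(f"Gap with education:   {b_control[2]:.3f}")
print(f"Return to education:  {b_control[1]:.3f}")
```

In this toy sample the uncontrolled gap mixes the gender effect with the education difference between the groups, while the controlled model separates the two, which is the same logic as splitting the hospital data by case severity.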
Implementing the gender fixed effect model
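import statsmodels.formula.api as smf
# `data` is assumed to be a pandas DataFrame with INCLOG10, EDU and MALE columns
# Fit the model with education and the male indicator
model2 = smf.ols('INCLOG10 ~ EDU + MALE', data=data).fit()
# Print only the coefficient table
print(model2.summary().tables[1])

What we learned in Part 5.1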