
The economist’s data analysis skillset.
We’ve built three models. Can we trust them?

Which model do you think offers better predictions?

Model 1 has predictions close to the average data for all predictor values.
Model 2 will offer inaccurate predictions for large values of the predictor!
Our test results are only valid when the model assumptions are valid.
Assumption violations affect our inferences
If assumptions are violated:
We use residuals (\(\varepsilon\)) and estimates (\(\hat{y}\)) to check whether a model is correctly ‘specified’.

Residuals (\(\varepsilon\)): the error of each prediction; how far off the model is.
Estimates (\(\hat{y}\)): the model’s predicted value for each observation.
Use a residual plot to visualize the error for the model.
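A minimal sketch of the quantities a residual plot displays, using only the Python standard library. The data, the seed, and the helper `fit_line` are hypothetical and exist only for illustration; in practice a regression library would do the fitting.

```python
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

# Hypothetical data: a truly linear relationship plus noise.
rng = random.Random(0)
x = [i / 10 for i in range(100)]
y = [2.0 + 1.5 * xi + rng.gauss(0, 1) for xi in x]

b0, b1 = fit_line(x, y)
fitted = [b0 + b1 * xi for xi in x]                 # estimates (y-hat)
residuals = [yi - fi for yi, fi in zip(y, fitted)]  # observed minus predicted

# A residual plot puts the fitted values on the horizontal axis
# and the residuals on the vertical axis.
residual_plot_points = list(zip(fitted, residuals))
mean_residual = sum(residuals) / len(residuals)
print(f"slope = {b1:.3f}, mean residual = {mean_residual:.2e}")
```

With an intercept in the model, the residuals average to zero by construction, so the plot is informative through its shape and spread rather than its level.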
The error term should be unrelated to the fitted value.
Q. Which model appears to violate the Linearity Assumption?

Model 1 is equally wrong everywhere. Model 2 has larger errors at the extremes.
A non-linear relationship will produce a curved pattern in the residuals.

The linear model misses the true curvature, leading to systematic errors.
Transform variables to make the relationship linear
Adding a squared term or applying a log transformation can fix the problem.
Instead of
\[\text{income} = \beta_0 + \beta_1 \text{age} + \varepsilon\]
we could use
\[\text{income} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{age}^2 + \varepsilon\]
It’s also common to log transform either the \(x\) or \(y\) variable.
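A rough sketch of why a log transformation can help, assuming hypothetical data in which \(y\) grows exponentially with \(x\). The helpers `fit_line`, `corr`, and `leftover_curve` are illustrative names, not part of any library. The check correlates the residuals with the squared, centered predictor: a value far from zero suggests the straight line is missing a curve.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def corr(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar, b_bar = sum(a) / n, sum(b) / n
    cov = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    var_a = sum((ai - a_bar) ** 2 for ai in a)
    var_b = sum((bi - b_bar) ** 2 for bi in b)
    return cov / math.sqrt(var_a * var_b)

def leftover_curve(x, y):
    """Fit a line, then correlate its residuals with squared, centered x."""
    b0, b1 = fit_line(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    x_bar = sum(x) / len(x)
    return corr(residuals, [(xi - x_bar) ** 2 for xi in x])

# Hypothetical data: log(y) is linear in x, so y itself is curved in x.
rng = random.Random(1)
x = [i / 10 for i in range(100)]
y = [math.exp(1.0 + 0.3 * xi + rng.gauss(0, 0.1)) for xi in x]
log_y = [math.log(yi) for yi in y]

curve_raw = leftover_curve(x, y)      # line fitted to y
curve_log = leftover_curve(x, log_y)  # line fitted to log(y)
print(f"leftover curvature, y on x:      {curve_raw:.2f}")
print(f"leftover curvature, log(y) on x: {curve_log:.2f}")
```

The same diagnostic could be applied after adding a squared term; that case needs a multiple-regression fit, which this sketch leaves out.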
Residuals should be spread out the same everywhere.
Q. Which one of these figures shows homoskedasticity?

Model 1 is equally wrong everywhere. Model 2 has errors that grow with the fitted value.
The spread of residuals should not change across values of X.

The spread of points increases as education increases!
Robust standard errors give more accurate measures of uncertainty
Use robust standard errors, which adjust for the changing spread and give more reliable p-values.
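One way to see what the adjustment does is to compute both kinds of standard error by hand for a one-predictor model. This is a sketch under stated assumptions: the function name `slope_standard_errors` and the simulated data are hypothetical, and the robust variant shown is the heteroskedasticity-consistent estimator usually labeled HC1. Statistical software may default to a different variant.

```python
import math
import random

def slope_standard_errors(x, y):
    """Return (slope, classical SE, HC1 robust SE) for y = b0 + b1 * x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

    # Classical SE: assumes one common error variance for every observation.
    sigma2 = sum(e ** 2 for e in residuals) / (n - 2)
    se_classical = math.sqrt(sigma2 / sxx)

    # Robust SE: weights each squared residual by its own (x - x_bar)^2,
    # then applies the HC1 small-sample factor n / (n - 2).
    meat = sum((xi - x_bar) ** 2 * e ** 2 for xi, e in zip(x, residuals))
    se_robust = math.sqrt(meat / sxx ** 2 * n / (n - 2))
    return b1, se_classical, se_robust

# Hypothetical data: the noise standard deviation grows with x.
rng = random.Random(3)
x = [i / 20 for i in range(200)]
y = [1.0 + 0.5 * xi + rng.gauss(0, 0.5 * xi) for xi in x]

slope, se_classical, se_robust = slope_standard_errors(x, y)
print(f"slope = {slope:.3f}")
print(f"classical SE = {se_classical:.4f}, robust SE = {se_robust:.4f}")
```

The slope estimate is identical in both cases; only the measure of uncertainty around it changes.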
Each error should be unrelated to previous errors.
Q. Which of these residual lag plots shows independence?

Model 1 has no (meaningful) relationship between consecutive residuals.
Model 2 shows a positive relationship (autocorrelation).
Use a lagged residual plot to check for autocorrelation.
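The lagged residual plot pairs each residual with the one before it, and the correlation of those pairs summarizes what the plot shows. A small sketch, assuming two hypothetical residual series (one independent, one built so that each value carries over 80% of the previous one); `lag1_correlation` is an illustrative helper, not a library function.

```python
import math
import random

def lag1_correlation(residuals):
    """Pearson correlation between each residual and the one before it."""
    current, previous = residuals[1:], residuals[:-1]
    n = len(current)
    c_bar, p_bar = sum(current) / n, sum(previous) / n
    cov = sum((c - c_bar) * (p - p_bar) for c, p in zip(current, previous))
    var_c = sum((c - c_bar) ** 2 for c in current)
    var_p = sum((p - p_bar) ** 2 for p in previous)
    return cov / math.sqrt(var_c * var_p)

rng = random.Random(2)
n = 200

# Hypothetical residuals with no memory of earlier values.
independent = [rng.gauss(0, 1) for _ in range(n)]

# Hypothetical residuals where each value depends on the previous one.
autocorrelated = [0.0] * n
for t in range(1, n):
    autocorrelated[t] = 0.8 * autocorrelated[t - 1] + rng.gauss(0, 1)

print(f"independent series:    {lag1_correlation(independent):.2f}")
print(f"autocorrelated series: {lag1_correlation(autocorrelated):.2f}")
```

A value near zero is consistent with independence, while a clearly positive value points to the kind of autocorrelation Model 2 shows.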
Residuals should be normally distributed.
By the central limit theorem (CLT), we can still use the GLM when the residuals are not normal, so long as the sample is large.

Extending the GLM framework
Next Up:
Later:
> all built on the same statistical foundation