
The economist’s data analysis skillset.
… a flexible approach to running many statistical tests.
The Linear Model: \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
OLS Estimation: Minimizes the sum of squared residuals, \(\sum_{i=1}^n \hat{\varepsilon}_i^2\)
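As a minimal sketch of what minimizing the sum of squared residuals means in practice, the closed-form least-squares solution can be computed directly; the data and numbers below are simulated purely for illustration.

```python
import numpy as np

# Simulated data (made-up numbers): y = 2 + 0.5*x + noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=100)

# Design matrix with an intercept column; lstsq finds the betas that
# minimize the sum of squared residuals
X = np.column_stack([np.ones_like(x), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # roughly [2.0, 0.5]
```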
A one-sample t-test is an intercept-only (“horizontal line”) model.

\[\text{Temperature} = \beta_0 + \varepsilon\]
> the intercept \(\beta_0\) is the estimated mean temperature
> the p-value is the probability of seeing an estimate at least as extreme as \(\hat{\beta}_0\) if the null is true (see the code sketch below)
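A minimal sketch of this equivalence in Python, assuming simulated temperature readings (the numbers and column name are placeholders): an intercept-only regression and a one-sample t-test give the same estimate and p-value.

```python
import numpy as np
import pandas as pd
import scipy.stats as stats
import statsmodels.formula.api as smf

# Simulated temperature readings, purely for illustration
rng = np.random.default_rng(1)
temps = pd.DataFrame({"temperature": rng.normal(21, 3, size=50)})

# Intercept-only ("horizontal line") model: Temperature = beta0 + error
flat_model = smf.ols("temperature ~ 1", data=temps).fit()
print(flat_model.params["Intercept"], flat_model.pvalues["Intercept"])

# The classic one-sample t-test against 0 matches the intercept's t-test
print(stats.ttest_1samp(temps["temperature"], popmean=0))
```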
A regression is a test of relationships.

\[\text{WaitTime} = \beta_0 + \beta_1 \text{MinutesAfterOpening} + \varepsilon\]
> the intercept parameter \(\beta_0\) is the estimated wait time when MinutesAfterOpening is 0
> the slope parameter \(\beta_1\) is the estimated change in y for a 1 unit change in x
> the p-value is the probability of seeing an estimate at least as extreme as \(\hat{\beta}_0\) or \(\hat{\beta}_1\) if the null is true (see the code sketch below)
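A sketch of fitting this model with statsmodels; the data frame, column names, and numbers are assumptions made up for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated wait times: about 5 minutes at opening, rising through the day
rng = np.random.default_rng(2)
minutes = rng.uniform(0, 120, size=200)
df_wait = pd.DataFrame({
    "minutes_after_opening": minutes,
    "wait_time": 5 + 0.08 * minutes + rng.normal(0, 2, size=200),
})

# WaitTime = beta0 + beta1 * MinutesAfterOpening + error
wait_model = smf.ols("wait_time ~ minutes_after_opening", data=df_wait).fit()
print(wait_model.summary())  # coefficients, standard errors, p-values
```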
Which model do you think offers better predictions?

> our model will offer inaccurate predictions if some assumptions aren’t met
Our test results are only valid when the model assumptions hold.
Assumption violations affect our inferences.
If assumptions are violated, our coefficient estimates and hypothesis tests can be misleading.
> to check whether the model is correctly specified, we can calculate the residuals and the model predictions
… we can directly examine the error of the model.

> these are the residuals \(\hat{\varepsilon}\), our estimates of the errors \(\varepsilon\)
… we can directly examine the predictions of the model.

> this is \(\hat{y}\), the model prediction
A Residual Plot directly visualizes the error for each model estimate.
The error term should be unrelated to the fitted value.

> the left figure shows that the model is equally wrong everywhere
> the right figure shows that the model is a good fit at only some values
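A sketch of drawing such a residual plot, reusing the fitted `wait_model` from the earlier wait-time example: residuals against fitted values, with a reference line at zero.

```python
import matplotlib.pyplot as plt

residuals = wait_model.resid      # estimated errors: observed minus predicted
fitted = wait_model.fittedvalues  # y-hat: the model's prediction for each row

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red", linestyle="--")  # the model is exactly right at zero
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()
# A well-specified model shows a formless cloud around zero:
# equally wrong everywhere, with no visible pattern.
```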
A non-linear relationship will produce a systematic, curved pattern in the residuals.

> linear model misses curvature, leading to systematic errors
Transform variables to make the relationship linear
Adding a square term or performing a log transformation can fix the problem.
instead of
\[\text{income} = \beta_0 + \beta_1 \text{age} + \varepsilon\]
we could use
\[\text{income} = \beta_0 + \beta_1 \text{age} + \beta_2 \text{age}^2 + \varepsilon\]
It’s also common to log transform either the \(x\) or \(y\) variable.
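A sketch of both fixes with statsmodels formulas, on simulated income and age data (the variable names and numbers are assumptions for illustration); `I()` protects the arithmetic inside the formula, and `np.log` transforms the outcome.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data: income rises with age but at a decreasing rate
rng = np.random.default_rng(3)
age = rng.uniform(20, 65, size=300)
income = 10_000 + 2_000 * age - 15 * age**2 + rng.normal(0, 5_000, size=300)
df_income = pd.DataFrame({"income": income, "age": age})

# income = b0 + b1*age + b2*age^2 + error
quadratic = smf.ols("income ~ age + I(age**2)", data=df_income).fit()

# log(income) = b0 + b1*age + error (slope is roughly a percent change per year)
logged = smf.ols("np.log(income) ~ age", data=df_income).fit()

print(quadratic.params)
print(logged.params)
```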
Residuals should be spread out the same everywhere.
Which one of these figures shows homoskedasticity?

> the left figure shows constant variability (homoskedasticity)
> the right figure shows increasing variability (heteroskedasticity)
> residual plots should show that the model is equally wrong everywhere
The spread of residuals should not change across values of X.

> the spread of points increases as education increases
> PhD wages vary more than high school wages
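Beyond the visual check, one formal option is the Breusch-Pagan test, which asks whether the residual spread depends on the regressors; a sketch reusing the fitted `wait_model` from the earlier example:

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Null hypothesis: residual variance is constant (homoskedasticity)
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(
    wait_model.resid, wait_model.model.exog
)
print(lm_pvalue)  # a small p-value suggests heteroskedasticity
```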
Robust standard errors give more accurate measures of uncertainty
Robust Standard Errors adjust for the changing spread in our data.
Use robust standard errors to give more accurate hypothesis tests.
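A sketch of requesting heteroskedasticity-consistent standard errors in statsmodels by refitting the wait-time model with a robust covariance estimator; HC3 is one common choice among the HC0 to HC3 variants.

```python
import statsmodels.formula.api as smf

# Same coefficients as before; only the standard errors (and p-values) change
robust_model = smf.ols(
    "wait_time ~ minutes_after_opening", data=df_wait
).fit(cov_type="HC3")
print(robust_model.summary())
```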
Residuals should be normally distributed.
By the Central Limit Theorem (CLT), we can still use the GLM without this assumption so long as the sample is large.
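A quick sketch of checking this assumption visually with a Q-Q plot of the residuals, again reusing `wait_model`: points close to the reference line suggest roughly normal residuals.

```python
import matplotlib.pyplot as plt
import statsmodels.api as sm

sm.qqplot(wait_model.resid, line="s")  # "s" draws a line through the quartiles
plt.show()
```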

Observations are independent of each other
We’ll return to this assumption in Part 4.4 | Timeseries.
Extending the GLM framework
Next Up:
Later:
> all built on the same statistical foundation