
The economist’s data analysis workflow.
… a flexible approach to run many statistical tests.
The Linear Model: \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
OLS Estimation: Picks \(\beta_0\) and \(\beta_1\) to minimize the sum of squared errors \(\sum_{i=1}^n \varepsilon_i^2\)
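A minimal sketch of what the OLS criterion does in the one-predictor case, using NumPy on synthetic data. The data, seed, and variable names are assumptions made for illustration and are not from the slides. The least-squares solution is compared with the textbook closed form (slope = covariance of x and y divided by the variance of x).

```python
import numpy as np

# Synthetic data, assumed for illustration only
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 2.0 + 0.5 * x + rng.normal(0, 1, size=50)

# Design matrix: a column of ones (intercept) and the predictor
X = np.column_stack([np.ones_like(x), x])

# Least squares: choose (b0, b1) to minimize the sum of squared residuals
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
b0, b1 = beta_hat

# Closed-form check for the one-predictor case
b1_closed = np.cov(x, y, ddof=1)[0, 1] / np.var(x, ddof=1)
b0_closed = y.mean() - b1_closed * x.mean()

# Residuals and the minimized criterion
residuals = y - X @ beta_hat
ssr = np.sum(residuals**2)
print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSR = {ssr:.3f}")
```

Any other choice of intercept and slope gives a larger sum of squared residuals on the same data.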
A one-sample t-test is a horizontal line model.

\[Temperature = \beta_0 + \varepsilon\]
> the intercept \(\beta_0\) is the estimated mean temperature
> the p-value is the probability of seeing an estimate of \(\beta_0\) at least as extreme as ours if the null is true
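The equivalence between the intercept-only model and the one-sample t-test can be checked numerically. The sketch below uses synthetic temperatures and a hypothetical null value `mu0`, neither of which comes from the slides, and compares the hand-computed t-statistic with `scipy.stats.ttest_1samp`.

```python
import numpy as np
from scipy import stats

# Synthetic temperatures and a hypothetical null value (both assumptions)
rng = np.random.default_rng(1)
temperature = rng.normal(21.0, 2.0, size=40)
mu0 = 20.0

n = temperature.size

# In the intercept-only model, the OLS estimate of beta_0 is the sample mean
b0 = temperature.mean()
resid = temperature - b0

# Standard error of the intercept: residual standard deviation / sqrt(n)
se_b0 = np.sqrt(np.sum(resid**2) / (n - 1) / n)

# t-statistic and two-sided p-value for H0: beta_0 = mu0
t_stat = (b0 - mu0) / se_b0
p_value = 2 * stats.t.sf(abs(t_stat), df=n - 1)

# Reference: the classical one-sample t-test
t_ref, p_ref = stats.ttest_1samp(temperature, popmean=mu0)
print(f"model: t = {t_stat:.4f}, p = {p_value:.4f}")
print(f"ttest: t = {t_ref:.4f}, p = {p_ref:.4f}")
```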
A regression is a test of relationships.

\[\text{WaitTime} = \beta_0 + \beta_1 \text{MinutesAfterOpening} + \varepsilon\]
> the intercept parameter \(\beta_0\) is the estimated wait time at 0 on the horizontal axis (at opening)
> the slope parameter \(\beta_1\) is the estimated change in y for a 1 unit change in x
> the p-value is the probability of seeing an estimate of the parameter (\(\beta_0\) or \(\beta_1\)) at least as extreme as ours if the null is true
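As a rough illustration of where the slope's p-value comes from, the sketch below computes the coefficients, their standard errors, and two-sided p-values by hand, then compares the slope results with `scipy.stats.linregress`. The wait-time data are simulated and the variable names are assumptions. The null hypothesis is that each parameter equals zero.

```python
import numpy as np
from scipy import stats

# Simulated wait times (assumed for illustration, not the slide data)
rng = np.random.default_rng(2)
minutes = rng.uniform(0, 480, size=60)
wait = 5.0 + 0.02 * minutes + rng.normal(0, 3, size=60)

# OLS estimates via the normal equations
X = np.column_stack([np.ones_like(minutes), minutes])
beta_hat = np.linalg.solve(X.T @ X, X.T @ wait)

# Residual variance with n - k degrees of freedom
resid = wait - X @ beta_hat
n, k = X.shape
sigma2 = resid @ resid / (n - k)

# Standard errors, t-statistics, and two-sided p-values (H0: parameter = 0)
se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
t_stats = beta_hat / se
p_values = 2 * stats.t.sf(np.abs(t_stats), df=n - k)

# Reference implementation for the simple regression case
ref = stats.linregress(minutes, wait)
print(f"slope = {beta_hat[1]:.4f}, se = {se[1]:.4f}, p = {p_values[1]:.2e}")
```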
Q. Is temperature lower in neighborhoods with more green space?

Q. Does temperature change as we move out on the horizontal axis?
\[Temperature = \beta_0 + \beta_1 \cdot HighGreen + \varepsilon\]
> the GLM performs a t-test on \(\beta_1\): is the difference in mean temperature between the two groups significant?
How would we interpret \(\beta_0\) here?
> \(\beta_0\) is the mean temperature in low green space cities (\(x=0\)): 22.03°C
How would we interpret \(\beta_1\) here?
> Cities with high green space (\(x=1\)) have a mean temperature that differs by \(\beta_1\) (lower when \(\beta_1 < 0\))
> i.e., a one-unit increase in \(x\) (from 0 to 1) changes mean temperature by \(\beta_1\)

> p-value on \(\beta_1\): probability of a slope at least as extreme as the estimated \(\beta_1\) under the null distribution
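The dummy-variable interpretation can be verified directly. In the sketch below (simulated temperatures, assumed group sizes and names), the intercept equals the mean of the \(x=0\) group, the slope equals the difference in group means, and the t-test on the slope matches the classical equal-variance two-sample t-test from `scipy.stats.ttest_ind`.

```python
import numpy as np
from scipy import stats

# Simulated temperatures for two groups (assumption, not the slide data)
rng = np.random.default_rng(3)
low = rng.normal(22.0, 1.5, size=30)   # HighGreen = 0
high = rng.normal(21.0, 1.5, size=30)  # HighGreen = 1

temperature = np.concatenate([low, high])
high_green = np.concatenate([np.zeros(30), np.ones(30)])

# Regress temperature on an intercept and the dummy
X = np.column_stack([np.ones_like(high_green), high_green])
beta_hat, *_ = np.linalg.lstsq(X, temperature, rcond=None)
b0, b1 = beta_hat

# t-test on the slope (H0: beta_1 = 0)
resid = temperature - X @ beta_hat
n, k = X.shape
sigma2 = resid @ resid / (n - k)
se_b1 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
t_b1 = b1 / se_b1
p_b1 = 2 * stats.t.sf(abs(t_b1), df=n - k)

# Reference: two-sample t-test assuming equal variances
t_ref, p_ref = stats.ttest_ind(high, low, equal_var=True)
print(f"b0 = {b0:.3f} (mean of x=0 group: {low.mean():.3f})")
print(f"b1 = {b1:.3f} (difference in means: {high.mean() - low.mean():.3f})")
print(f"t = {t_b1:.4f} vs {t_ref:.4f}, p = {p_b1:.4f} vs {p_ref:.4f}")
```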
Do low-income neighborhoods face higher pollution levels?
Step 1: Summarize the data

Step 2: Build a model
\[Pollution = \beta_0 + \beta_1 \cdot LowIncome + \varepsilon\]
Step 3: Estimate the model

\(\beta_0\) = Mean pollution in high-income areas (23.9)
\(\beta_1\) = Additional pollution in low-income areas (15.9)
Step 4: Check the residuals
Step 5: Interpret and communicate the findings

> A significant positive \(\beta_1\) suggests that environmental quality differs between neighborhoods, with higher pollution in low-income areas
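The five steps can be sketched end to end as follows. The data are simulated and are not the data behind the estimates quoted above. The sample size, noise level, and variable names are assumptions chosen only to make the example run.

```python
import numpy as np
from scipy import stats

# Simulated data only: NOT the data behind the estimates quoted above
rng = np.random.default_rng(4)
n = 200
low_income = rng.integers(0, 2, size=n)  # dummy: 1 = low-income area
pollution = 24.0 + 16.0 * low_income + rng.normal(0, 8, size=n)

# Step 1: summarize the data (group means)
mean_high = pollution[low_income == 0].mean()
mean_low = pollution[low_income == 1].mean()

# Step 2: build the model  Pollution = b0 + b1 * LowIncome + error
X = np.column_stack([np.ones(n), low_income])

# Step 3: estimate the model by OLS
beta_hat, *_ = np.linalg.lstsq(X, pollution, rcond=None)

# Step 4: check the residuals (mean near zero, similar spread across groups)
resid = pollution - X @ beta_hat
resid_sd_by_group = [resid[low_income == g].std(ddof=1) for g in (0, 1)]

# Step 5: interpret, using a t-test on beta_1 (H0: beta_1 = 0)
df = n - X.shape[1]
sigma2 = resid @ resid / df
se_b1 = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[1, 1])
p_b1 = 2 * stats.t.sf(abs(beta_hat[1] / se_b1), df=df)

print(f"group means: high-income {mean_high:.2f}, low-income {mean_low:.2f}")
print(f"b0 = {beta_hat[0]:.2f}, b1 = {beta_hat[1]:.2f}, p(b1) = {p_b1:.2e}")
print(f"residual sd by group: {resid_sd_by_group[0]:.2f}, {resid_sd_by_group[1]:.2f}")
```

With a single dummy predictor, the estimates in Step 3 reproduce the group means from Step 1, so the summary table is a useful sanity check on the fitted model.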
GLM’s unified framework for testing statistical models
One-Sample T-Test: Continuous outcome variable (\(y\)) with only an intercept
\[y = \beta_0 + \varepsilon\]
Relationships: Continuous outcome variable (\(y\)) with a continuous predictor (\(x\))
\[y = \beta_0 + \beta_1 x + \varepsilon\]
Two-Sample T-Test: Continuous outcome variable (\(y\)) with a dummy (\(Group\))
\[y = \beta_0 + \beta_1 \cdot Group + \varepsilon\]
Multiple Regression: Adding control variables to isolate relationships
> all use the same OLS framework and interpretation of coefficients and p-values
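To make the "same framework" point concrete, the sketch below runs all four models through one function. `fit_ols` is a hypothetical helper written for this example, not a library function, and the data are simulated. Each p-value tests the null that the corresponding coefficient is zero.

```python
import numpy as np
from scipy import stats

def fit_ols(y, *predictors):
    """Hypothetical helper: OLS estimates, standard errors, two-sided p-values."""
    y = np.asarray(y, dtype=float)
    # An intercept column is always included; predictors are optional
    X = np.column_stack([np.ones(y.size), *predictors])
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    df = y.size - X.shape[1]
    cov = (resid @ resid / df) * np.linalg.inv(X.T @ X)
    se = np.sqrt(np.diag(cov))
    p = 2 * stats.t.sf(np.abs(beta / se), df=df)
    return beta, se, p

# Simulated data (assumption, for illustration only)
rng = np.random.default_rng(5)
n = 100
x = rng.normal(size=n)
group = rng.integers(0, 2, size=n).astype(float)
control = rng.normal(size=n)
y = 1.0 + 0.5 * x + 2.0 * group + 0.3 * control + rng.normal(size=n)

one_sample = fit_ols(y)                      # y = b0 + e
relationship = fit_ols(y, x)                 # y = b0 + b1*x + e
two_sample = fit_ols(y, group)               # y = b0 + b1*Group + e
multiple = fit_ols(y, x, group, control)     # adds control variables

for name, (beta, se, p) in [("one-sample", one_sample),
                            ("relationship", relationship),
                            ("two-sample", two_sample),
                            ("multiple", multiple)]:
    print(name, np.round(beta, 3), np.round(p, 4))
```

Only the columns of the design matrix change from one model to the next. The estimation and the t-tests on the coefficients are identical.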