
The economist’s data analysis pipeline.
What is the value of an additional square foot of living area?
House prices depend on:
Pittsburgh housing data with 10,000 homes.
Variables:
- SALEPRICE: Sale price of the home
- LOGPRICE: Log of sale price
- FINISHEDLIVINGAREA: Square footage of finished living space
- YEARBLT: Year the home was built
Let's predict price using Living Area and Year Built.

Both living area and year built have positive relationships with price.
The predictor variables Living Area and Year Built are themselves correlated.

Larger homes tend to be built more recently.
So maybe larger homes are more valuable simply because they’re newer.
We can model multiple predictor variables simultaneously.
Extending the best-fitting line to multiple dimensions
Single Variable:
\[\text{LogPrice} = \beta_0 + \beta_1 \times \text{LivingArea} + \epsilon\]
Multiple Variables:
\[\text{LogPrice} = \beta_0 + \beta_1 \times \text{LivingArea} + \beta_2 \times \text{YearBuilt} + \epsilon\]
Interpretation:
What is the value of an additional square foot?
Without controlling for year built:
When controlling for year built:
This is what we call a control variable.
We can adjust for multiple variables simultaneously.
Model 1: Living Area Only
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept             10.2029      0.031    328.476      0.000      10.142      10.264
FINISHEDLIVINGAREA     0.0005   1.65e-05     33.260      0.000       0.001       0.001
======================================================================================
Model 3: Living Area + Year Built Control
======================================================================================
                         coef    std err          t      P>|t|      [0.025      0.975]
--------------------------------------------------------------------------------------
Intercept            -17.9337      0.741    -24.213      0.000     -19.386     -16.482
FINISHEDLIVINGAREA     0.0005   1.56e-05     29.095      0.000       0.000       0.000
YEARBLT                0.0145      0.000     38.018      0.000       0.014       0.015
======================================================================================
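Compare coefficients with and without the control. Both print as 0.0005 after rounding, but the t-statistics and standard errors imply about 0.00055 without the control and about 0.00045 with it.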

Without control: the size effect was inflated by the age effect
With control: we get the size effect among houses built in the same year
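To make the idea of "controlling" concrete, here is a minimal sketch on simulated data, not the Pittsburgh sample. Every number and name in it (the assumed effects, `simple_ols`, `residualize`) is hypothetical and chosen only for illustration. It uses the partialling-out view of multiple regression: remove the year-built component from both log price and living area, then regress one leftover on the other. That slope equals the living-area coefficient a multiple regression would report.

```python
import random


def mean(values):
    return sum(values) / len(values)


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def residualize(x, y):
    """Return the part of y that a straight line in x cannot explain."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


# Synthetic data: assumed values, NOT estimates from the lecture's dataset.
random.seed(0)
TRUE_AREA_EFFECT = 0.0005  # assumed effect of 1 sq ft on log price
TRUE_YEAR_EFFECT = 0.015   # assumed effect of 1 year on log price

n = 5000
area = [random.gauss(1800, 500) for _ in range(n)]
# Larger homes tend to be newer: year built rises with living area.
year = [1950 + 0.02 * (a - 1800) + random.gauss(0, 20) for a in area]
log_price = [
    -18 + TRUE_AREA_EFFECT * a + TRUE_YEAR_EFFECT * y + random.gauss(0, 0.3)
    for a, y in zip(area, year)
]

# Without control: regress log price on living area alone.
_, slope_without = simple_ols(area, log_price)

# With control: strip the year-built component out of both variables,
# then regress what is left of log price on what is left of living area.
area_given_year = residualize(year, area)
price_given_year = residualize(year, log_price)
_, slope_with = simple_ols(area_given_year, price_given_year)

print(f"without control: {slope_without:.6f}")
print(f"with control:    {slope_with:.6f}")
```

On this simulated data the uncontrolled slope should land well above the assumed 0.0005, while the controlled slope should land close to it. That is the same pattern as in the two models above.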
Checking if our model improved

Residuals look somewhat more random with the control.
For log outcomes, coefficients represent proportional changes.
For small coefficients (< 0.1):
\[\beta \times 100 \approx \text{percentage change}\]
Interpretation (holding year built constant):
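- Each additional sq ft is associated with a price about 0.045% higher
- 100 additional sq ft is associated with a price about 4.55% higher
Interpretation (holding living area constant):
- Each year newer is associated with a price about 1.454% higher
- A house 10 years newer is priced about 14.54% higher by this approximation, which is rough here because 10 × 0.01454 exceeds 0.1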
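The percentages above use the shortcut \(\beta \times 100\). In a log-linear model the exact proportional change for a change of \(\Delta x\) is \(e^{\beta \Delta x} - 1\). The sketch below compares the two. The coefficient values are backed out of the percentages quoted above, and the function names are made up for illustration.

```python
import math

# Coefficients implied by the percentages quoted above (0.045% and 1.454%).
BETA_AREA = 0.000455  # per square foot
BETA_YEAR = 0.01454   # per year


def approx_pct(beta, delta):
    """Shortcut used on the slide: 100 * beta * delta."""
    return 100 * beta * delta


def exact_pct(beta, delta):
    """Exact change implied by a log outcome: 100 * (exp(beta * delta) - 1)."""
    return 100 * math.expm1(beta * delta)


cases = [
    ("1 sq ft", BETA_AREA, 1),
    ("100 sq ft", BETA_AREA, 100),
    ("1 year", BETA_YEAR, 1),
    ("10 years", BETA_YEAR, 10),
]
for label, beta, delta in cases:
    print(
        f"{label:>10}: approx {approx_pct(beta, delta):7.3f}%"
        f"  exact {exact_pct(beta, delta):7.3f}%"
    )
```

For one square foot or one year the two numbers are nearly identical. For a ten-year difference the exact figure is about 15.7% rather than 14.54%, so the shortcut is best kept for small changes.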
A control variable lets us compare observations that are similar on that variable.
Without control (simple regression):
\[\text{LogPrice} = \beta_0 + \beta_1 \times \text{LivingArea} + \epsilon\]
With control (multiple regression):
\[\text{LogPrice} = \beta_0 + \beta_1 \times \text{LivingArea} + \beta_2 \times \text{YearBuilt} + \epsilon\]
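The two regressions are linked by a standard least-squares identity, often called the omitted-variable formula. The notation here is introduced only for this illustration: \(\tilde{\beta}_1\) is the LivingArea coefficient estimated without the control, \(\hat{\beta}_1\) and \(\hat{\beta}_2\) are the estimates with the control, and \(\hat{\delta}_1\) is the slope from an auxiliary regression of YearBuilt on LivingArea.

\[\text{YearBuilt} = \delta_0 + \delta_1 \times \text{LivingArea} + u\]

\[\tilde{\beta}_1 = \hat{\beta}_1 + \hat{\beta}_2 \times \hat{\delta}_1\]

When \(\hat{\beta}_2 > 0\) (newer homes sell for more) and \(\hat{\delta}_1 > 0\) (larger homes tend to be newer), the coefficient without the control is larger than the coefficient with it.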
Main ideas about numerical control variables
Let's find the value of an extra square foot in the Pittsburgh housing market.
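import matplotlib.pyplot as plt

# model_area and model_both are the fitted regression results for
# Model 1 (living area only) and Model 3 (living area + year built).

# Create residual plots for both models
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Model 1: Without control
residuals_area = model_area.resid
predictions_area = model_area.fittedvalues
ax1.scatter(predictions_area, residuals_area, alpha=0.5, color='#4C72B0')
ax1.axhline(y=0, color='red', linestyle='--', linewidth=2)  # zero-residual reference
ax1.set_title('Model 1: Without Year Built Control')
ax1.set_xlabel('Fitted values')
ax1.set_ylabel('Residuals')

# Model 3: With control
residuals_both = model_both.resid
predictions_both = model_both.fittedvalues
ax2.scatter(predictions_both, residuals_both, alpha=0.5, color='#4C72B0')
ax2.axhline(y=0, color='red', linestyle='--', linewidth=2)  # zero-residual reference
ax2.set_title('Model 3: With Year Built Control')
ax2.set_xlabel('Fitted values')

plt.tight_layout()
plt.show()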