
The economist’s data analysis skillset.
How close are the sample mean (\(\bar{x}\)) and the population mean (\(\mu\))?
The sample mean is approximately normally distributed around the true mean (\(\mu\)) when the sample is large enough.
The standard deviation of the sample means is the standard error: \[SE = \frac{\sigma}{\sqrt{n}}\]
> as the sample size grows, the variability in sample means gets smaller
\(\mu = 12\) and \(\sigma = 2.5\)
Smaller Sample Size: \(n = 30\)

Larger Sample Size: \(n = 200\)
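A rough simulation can make the comparison between the two sample sizes concrete. The sketch below assumes the population itself is normal with \(\mu = 12\) and \(\sigma = 2.5\) (the slides only give the mean and standard deviation); the function name `sd_of_sample_means` and the number of repetitions are illustrative choices.

```python
import random
import statistics
from math import sqrt

MU, SIGMA = 12.0, 2.5  # population values from the slide


def sd_of_sample_means(n, reps=2000, seed=0):
    """Draw `reps` samples of size n and return the SD of their means."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))
        for _ in range(reps)
    ]
    return statistics.stdev(means)


for n in (30, 200):
    simulated = sd_of_sample_means(n)
    theoretical = SIGMA / sqrt(n)  # SE = sigma / sqrt(n)
    print(f"n={n}: simulated SD={simulated:.3f}, sigma/sqrt(n)={theoretical:.3f}")
```

The simulated spread of the sample means should land close to \(\sigma/\sqrt{n}\) in both cases, and be noticeably smaller for \(n = 200\).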

The standard error (SE) measures the precision of the estimate.
With \(n\) independent observations, each has a variance of \(\sigma^2\).
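\[VAR(a + b) = VAR(a) + VAR(b) \quad \text{for independent } a \text{ and } b\]
\[VAR\Big(\frac{a}{n}\Big) = \frac{VAR\big(a\big)}{n^2}\]
\[VAR(\bar{x}) = VAR\Big(\frac{x_1 + \dots + x_n}{n}\Big) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]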
Therefore the standard error is \(\sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\).
If we know \(\sigma = 2.5\), we can calculate probabilities.
What’s the probability \(\bar{x}\) is within one standard error of \(\mu = 12\) with \(n=30\)?

> \(P(\mu - SE \leq \bar{x} \leq \mu + SE) \approx 0.68\)
> so 68% of the time \(\bar{x}\) will fall within \([\mu - \frac{\sigma}{\sqrt{n}}, \mu + \frac{\sigma}{\sqrt{n}}]\)
> we call \([\mu - \frac{\sigma}{\sqrt{n}}, \mu + \frac{\sigma}{\sqrt{n}}]\) a 68% confidence interval
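The one-standard-error probability can be checked numerically. This is a minimal sketch using the slide's values (\(\mu = 12\), \(\sigma = 2.5\), \(n = 30\)) and assuming \(\bar{x}\) is exactly normal; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 12.0, 2.5, 30
se = sigma / sqrt(n)  # standard error of the mean

# Sampling distribution of the sample mean: Normal(mu, se)
xbar_dist = NormalDist(mu, se)
lower, upper = mu - se, mu + se
prob = xbar_dist.cdf(upper) - xbar_dist.cdf(lower)

print(f"SE = {se:.3f}")
print(f"interval = [{lower:.3f}, {upper:.3f}]")
print(f"P(within one SE) = {prob:.4f}")
```

Here the standard error is about 0.456, so the interval runs from roughly 11.54 to 12.46, and the probability comes out near 0.68.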
\(\mu = 12\) and \(\sigma = 2.5\) and \(n=30\)
Question: what’s the probability \(\bar{x}\) is closer than \(2\cdot SE\) to \(\mu\)?
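One way to answer the question is to work with the standard normal directly, since the probability of being within \(k\) standard errors does not depend on \(\mu\), \(\sigma\), or \(n\). The helper `prob_within` is a hypothetical name for this sketch, which again assumes \(\bar{x}\) is normally distributed.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal: mean 0, SD 1


def prob_within(k):
    """P(|xbar - mu| <= k * SE) when xbar is normally distributed."""
    return Z.cdf(k) - Z.cdf(-k)


print(f"P(within 2 SE) = {prob_within(2):.4f}")
print(f"multiplier for exactly 95% = {Z.inv_cdf(0.975):.3f}")
```

Two standard errors give a probability slightly above 0.95, which is why the multiplier for an exact 95% interval is about 1.96 rather than 2.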
Calculate the 95% confidence interval for waiting times.
Generate some sample data.
Calculate sample statistics.
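The steps above might look like the following sketch. The waiting-time data here are simulated and purely hypothetical (drawn from a normal distribution around 12 with the known \(\sigma = 2.5\)); the seed and variable names are illustrative assumptions, not the original slide code.

```python
import random
import statistics
from math import sqrt

rng = random.Random(42)
sigma, n = 2.5, 30  # sigma is treated as known

# Hypothetical sample of waiting times, simulated around a true mean of 12
waits = [rng.gauss(12.0, sigma) for _ in range(n)]

xbar = statistics.fmean(waits)              # sample mean
se = sigma / sqrt(n)                        # standard error with known sigma
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96
ci = (xbar - z * se, xbar + z * se)

print(f"xbar = {xbar:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The interval is centered on the sample mean and has a half-width of about \(1.96 \cdot \sigma/\sqrt{n}\).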
> if we took many samples, 95% of the time this interval would contain the truth
> we often just say: “we’re 95% confident the truth is in this interval”
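The "many samples" interpretation can be illustrated by simulation. This sketch assumes a normal population with the slide's \(\mu = 12\) and \(\sigma = 2.5\) and counts how often the interval \(\bar{x} \pm 1.96 \cdot \sigma/\sqrt{n}\) contains \(\mu\); the function name `coverage` and the number of repetitions are illustrative.

```python
import random
import statistics
from math import sqrt


def coverage(mu=12.0, sigma=2.5, n=30, reps=5000, seed=1):
    """Share of simulated 95% intervals (known sigma) that contain mu."""
    rng = random.Random(seed)
    half_width = 1.96 * sigma / sqrt(n)
    hits = 0
    for _ in range(reps):
        xbar = statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        # The interval xbar +/- half_width contains mu when xbar is close enough
        hits += abs(xbar - mu) <= half_width
    return hits / reps


print(f"share of intervals containing mu: {coverage():.3f}")
```

The share should come out close to 0.95, though any single interval either contains \(\mu\) or it does not.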
What if we don’t know \(\sigma\) either?
> we used \(\bar{x}\) to estimate \(\mu\)
> can we use \(s\) to estimate \(\sigma\)?
> yes, but there’s a catch…
Sample standard deviation (\(s\)) has its own sampling variability.

> this adds extra uncertainty to our interval
The t-distribution precisely accounts for the variation in \(s\) around \(\sigma\).
> \(\bar{x}\) follows a normal distribution with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\)
> key insight: since \(s\) is random, using it instead of \(\sigma\) introduces another r.v.
> this gives us the t-distribution with \(n-1\) degrees of freedom

… accounts for the extra uncertainty in \(s\) around \(\sigma\).

> t-distribution has heavier tails than normal
> approaches normal as sample size (\(n\)) increases
The probability of closeness will be too large if we use the Normal instead of t.
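A small simulation can show the size of the gap. The sketch below assumes a normal population (\(\mu = 12\), \(\sigma = 2.5\)) and a deliberately small sample, \(n = 5\), where the effect is easy to see. It builds intervals from \(s\) using either the normal multiplier 1.96 or the t multiplier 2.776 (the standard table value for 4 degrees of freedom); the function name `coverage_with_s` is illustrative.

```python
import random
import statistics
from math import sqrt


def coverage_with_s(multiplier, n=5, mu=12.0, sigma=2.5, reps=10000, seed=2):
    """Share of intervals xbar +/- multiplier * s/sqrt(n) that contain mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)  # sample SD, n-1 in the denominator
        hits += abs(xbar - mu) <= multiplier * s / sqrt(n)
    return hits / reps


normal_cov = coverage_with_s(1.96)   # normal multiplier
t_cov = coverage_with_s(2.776)       # t multiplier, 4 degrees of freedom
print(f"normal multiplier: {normal_cov:.3f}, t multiplier: {t_cov:.3f}")
```

With the normal multiplier, the intervals contain \(\mu\) noticeably less often than the claimed 95%, while the t multiplier brings the share back to roughly 0.95. The gap shrinks as \(n\) grows.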
Now we can quantify our uncertainty about an unknown \(\mu\).
Next Time: We’ll center the distribution on \(\bar{x}\) instead of \(\mu\) to develop a test of closeness.
How would the confidence interval change if we:
Why use t-distribution instead of normal?
What does “95% confident” really mean?
How could this help with economic decision-making?