The economist’s data analysis skillset.
If all we see is the sample, how do we learn about a population?
If we know the random variable, we can learn many things about the population.
> when we know the probability function, we can calculate everything exactly
If we know the random variable, we can learn many things about the population.
> but what can we know about the population if we only see the sample?
But if all we see is the sample, what can we know about a population?
> how do we learn about \(\mu\) if all we have is \(n\), \(\bar{x}\), and \(S\)?
Let’s pretend we don’t know the probability function for dice.
Let’s start with something simple.
Your samples have a lot of variability!
> this variability perfectly matches what we would expect from a fair die
Now take a sample of two rolls and compute the mean.
Next is something slightly less boring.
Each sample has a slightly different sample mean.
> there’s a lot of variability in your sample means!
> what do you expect to see when we plot these sample means (\(\bar{x}\))?
The distribution of sample means bunches in the middle.
> our sample means are more bunched (like a pyramid) in the middle! why?
> there are more ways to get 7/2 than 2/2!
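The counting argument can be checked exactly: enumerate all 36 equally likely outcomes of two fair dice and count how many land on each sample mean. This is an illustrative sketch, assuming Python (which the later code in this section also uses).

```python
from itertools import product
from collections import Counter
from fractions import Fraction

# All 36 equally likely outcomes of two fair dice, mapped to their sample mean
counts = Counter(Fraction(a + b, 2) for a, b in product(range(1, 7), repeat=2))

# Only (1, 1) gives a mean of 2/2, but six outcomes give a mean of 7/2
print(counts[Fraction(2, 2)], counts[Fraction(7, 2)])  # -> 1 6
```

Six ways to land on 7/2 versus one way to land on 2/2 is exactly why the sample means pyramid up in the middle.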
Now take a sample of three rolls and compute the mean.
Next is something even less boring.
The distribution of sample means with n=3.
> what do you notice about the shape with n=3?
The distribution of sample means with n=3.
> there’s some curvature to the shape — the edges are rounding into a curve
Now let’s really increase the sample size.
Next is something very un-boring.
Your individual samples each look different.
> there are even more ways your sample could look!
> what do you expect to see when we plot these sample means (\(\bar{x}\))?
What happens when we really increase the sample size?
> the distribution of sample means gets tighter and more bell-shaped
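A quick simulation makes the "tighter and more bell-shaped" claim concrete. This is a minimal sketch, assuming numpy and scipy; the seed and sample sizes are arbitrary choices for the illustration.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # seeded for reproducibility

# 10,000 sample means, each computed from n = 50 fair-die rolls
means = rng.integers(1, 7, size=(10_000, 50)).mean(axis=1)

print(means.mean())       # close to the population mean, 3.5
print(stats.skew(means))  # close to 0: the bell shape is symmetric
```

Even though each individual roll is uniform (flat), the means of 50 rolls pile up symmetrically around 3.5.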
What happens when we really increase the sample size?
> what is this probability function in red?
The distribution of sample means approximates a normal distribution as sample size increases.
\[\bar{x} \sim N\Big(\mu, \frac{\sigma}{\sqrt{n}}\Big)\]
> here the second argument is the standard deviation of \(\bar{x}\) (its standard error), not the variance
Where does \(\sigma / \sqrt{n}\) come from?
Each observation \(x_i\) is drawn independently with variance \(\sigma^2\), so: \[Var(x_1 + x_2 + \cdots + x_n) = n\sigma^2\]
Dividing by \(n\) divides the variance by \(n^2\): \[Var\Big(\frac{x_1 + x_2 + \cdots + x_n}{n}\Big) = \frac{n\sigma^2}{n^2} = \frac{\sigma^2}{n}\]
Take the square root: \[SD(\bar{x}) = \sqrt{\frac{\sigma^2}{n}} = \frac{\sigma}{\sqrt{n}}\]
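The derivation above can be verified by simulation. A sketch, assuming numpy; for a fair die, \(\mu = 3.5\) and \(\sigma^2 = 35/12\), so with \(n = 100\) the theory predicts \(SD(\bar{x}) \approx 0.171\).

```python
import numpy as np

rng = np.random.default_rng(1)

# Population sigma for a fair die: Var(X) = 35/12
sigma = np.sqrt(35 / 12)
n = 100

# Standard deviation of 10,000 simulated sample means of size n
means = rng.integers(1, 7, size=(10_000, n)).mean(axis=1)
print(means.std())         # simulated SD of the sample means
print(sigma / np.sqrt(n))  # theoretical sigma / sqrt(n), about 0.171
```

The simulated spread of the sample means matches \(\sigma / \sqrt{n}\) closely.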
Does the CLT work for distributions that aren’t as nice?
Question: Does the CLT still work when the population looks asymmetric?
Simulate 1,000 sample means from a chi-squared population with n=1.
> with n=1, the sample means are just the raw observations
The distribution of sample means looks just like the population — very skewed.
> now increase the sample size to n=5
The skew is already diminishing.
> now increase to n=30
It looks normal — despite the skewed population.
> now increase to n=1000
Very tight, very normal.
> the skew has completely disappeared
Overlay the n=1 distribution behind the n=1000 distribution.
# Overlay n=1 (raw population) behind n=1000 (tight, normal)
from scipy import stats
import seaborn as sns
import matplotlib.pyplot as plt

means_1 = stats.chi2.rvs(df=3, size=(1000, 1)).mean(axis=1)
means_1000 = stats.chi2.rvs(df=3, size=(1000, 1000)).mean(axis=1)
sns.histplot(means_1, bins=30, alpha=0.3, stat='density', label='Sample means (n=1)')
sns.histplot(means_1000, bins=30, alpha=0.7, stat='density', label='Sample means (n=1000)')
plt.legend()
From skewed population to normal sampling distribution.
> the CLT works for (nearly) any distribution shape
The full picture — sample means converge to normal as n increases.
Three things to notice about the sampling distribution of \(\bar{x}\).
The CLT isn’t magic. There are a few conditions.
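One condition is that the population must have a finite variance. As an illustrative sketch (assuming numpy), the standard Cauchy distribution violates this: the mean of \(n\) standard Cauchy draws is itself standard Cauchy, so averaging never tightens the distribution.

```python
import numpy as np

rng = np.random.default_rng(0)  # seeded for reproducibility

# Cauchy has no finite variance, so the CLT does not apply
def iqr_of_sample_means(n, reps=10_000):
    means = rng.standard_cauchy((reps, n)).mean(axis=1)
    q25, q75 = np.percentile(means, [25, 75])
    return q75 - q25

print(iqr_of_sample_means(1))     # about 2, the IQR of a standard Cauchy
print(iqr_of_sample_means(1000))  # still about 2: no CLT-style tightening
```

Compare this with the dice and chi-squared examples, where the spread of the sample means shrinks like \(1/\sqrt{n}\).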
From an unobservable population to a knowable sampling distribution.
We know the sampling distribution. Now what do we do with it?
> the CLT gives us the distribution — Parts 3.3 and 3.4 show us how to use it