Making sense of probabilities

Core idea

Probability tells you what to expect when you know the underlying chances. Statistics is the inverse problem: you see the outcomes and have to infer the chances. Two theorems stitch the two together. The Law of Large Numbers (LLN) says that as you collect more independent observations of the same random quantity, their average homes in on the true mean. The Central Limit Theorem (CLT) describes the shape of the wobble around that true mean: the distribution of the sample average across different possible samples approaches a bell curve, whatever the underlying distribution looks like, provided its variance is finite.

Author's framing: A single random outcome is unpredictable. The average of many such outcomes is one of the most predictable objects in mathematics.

That asymmetry is the engine of empirical science. We cannot say what your next coin flip will be, but we can be all but certain that the proportion of heads in a million flips of a fair coin will sit within a hair's breadth of one-half, and we can quantify how tight that hair is: about a tenth of a percentage point either side, 95% of the time.

Why it matters

Without LLN, polling, insurance, drug trials, and quality control would all be incoherent — there would be no reason to believe a sample tells you anything about a population. Without CLT, we could compute the average but not the uncertainty around it; we'd know our estimate, not how much to trust it. Together they explain why a well-designed sample of 1,000 voters can predict 150 million votes, why casinos always win in the long run despite losing many individual hands, and why a manufacturer can guarantee tolerances on parts they never directly measure.

Why individual outcomes stay stubborn

The LLN promises convergence of averages, not of individuals. A fair coin that has come up tails nine times in a row has exactly a one-half chance of heads on the tenth toss — the coin has no memory, and there is no cosmic ledger forcing balance. The intuition that "things should even out soon" is the gambler's fallacy. What actually happens is that future flips dilute the imbalance: nine surplus tails embedded in a million tosses is invisible. Convergence comes from drowning, not correcting.
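To put numbers on the dilution, here is a minimal Python sketch. The function names `expected_tail_share` and `expected_surplus` are illustrative, not from any library, and the sketch assumes every future flip is fair and independent. It tracks what you should expect after a streak of nine tails as more flips pile on top.

```python
def expected_tail_share(streak, future_flips):
    """Expected fraction of tails after `streak` tails in a row,
    followed by `future_flips` fair, independent tosses."""
    total = streak + future_flips
    expected_tails = streak + 0.5 * future_flips  # the streak is never "paid back"
    return expected_tails / total


def expected_surplus(streak, future_flips):
    """Expected number of tails beyond an even split of all tosses so far."""
    total = streak + future_flips
    return streak + 0.5 * future_flips - 0.5 * total


for future in (0, 10, 1_000, 1_000_000):
    share = expected_tail_share(9, future)
    surplus = expected_surplus(9, future)
    print(f"{future:>9} further flips: tail share {share:.6f}, "
          f"expected surplus {surplus:.1f}")
```

The expected surplus stays at 4.5 tails however long you keep flipping. Only the share drifts toward one-half, because the denominator grows.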

Why the bell curve is everywhere

The CLT explains an empirical observation that bewildered nineteenth-century statisticians: bell curves keep showing up in places they have no business showing up, such as human heights, measurement errors, exam scores, and asset returns over short windows. The reason is that each of these quantities is, in effect, the sum of many small independent contributions. The CLT says that any such sum drifts toward a bell shape, however weirdly the individual contributions are distributed, as long as each has finite variance and none dominates the total. The normal distribution is not a law of nature; it is a law of aggregation.

Key takeaways

Mental model — LLN and CLT as two parallel guarantees

How the two theorems work

The Law of Large Numbers

LLN is the formal statement of an old intuition: average enough independent observations and the noise cancels. Toss a fair coin ten times and you might see seven heads — a 20-percentage-point gap from the expected 50%. Toss it a thousand times and the same proportional gap would require 700 heads, which is overwhelmingly unlikely. The key fact is that the absolute number of excess heads can grow with N, but the proportion of excess heads must shrink. That shrinkage is LLN.
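A quick simulation makes the shrinking proportion visible. This is a sketch rather than a proof: it uses Python's built-in pseudo-random generator with a fixed seed, and the helper names `count_heads` and `excess_summary` are made up for illustration.

```python
import random


def count_heads(n_flips, seed=0):
    """Simulate n_flips tosses of a fair coin and return the number of heads."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return sum(rng.random() < 0.5 for _ in range(n_flips))


def excess_summary(n_flips, seed=0):
    """Return (heads, absolute excess over N/2, gap between proportion and 0.5)."""
    heads = count_heads(n_flips, seed)
    excess = heads - n_flips / 2       # surplus or deficit, counted in flips
    gap = abs(heads / n_flips - 0.5)   # the same imbalance as a proportion
    return heads, excess, gap


for n in (10, 1_000, 100_000):
    heads, excess, gap = excess_summary(n)
    print(f"N={n:>7}  heads={heads:>6}  excess={excess:+9.1f}  "
          f"proportion gap={gap:.4f}")
```

On a typical run the excess column wanders and can grow in size, while the proportion gap falls toward zero. The exact figures depend on the seed.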

LLN comes in two strengths. The weak law says that for any tolerance you pick, the probability that the sample mean falls outside that tolerance goes to zero as N grows. The strong law says something tighter: the sample mean converges to the true mean almost surely, meaning the infinite sequences on which it fails to converge have total probability exactly zero. For everyday purposes the distinction does not matter; both say "averages settle down."

The Central Limit Theorem

The CLT goes further than LLN in a strange direction. It tells you not just where the sample mean ends up but how it's distributed across hypothetical repeats of the experiment. Imagine running your N-sample experiment ten thousand times and plotting a histogram of the ten thousand sample means. The CLT says that histogram approaches a bell curve, with:

  • centre at the true mean (the same point LLN promised),
  • width (standard deviation) equal to the underlying standard deviation divided by the square root of N.

The remarkable part is that the underlying distribution can be almost anything (uniform, skewed, lumpy, even discrete) and the bell shape still emerges for the sample mean. The one requirement is a finite variance; distributions with tails heavy enough to break that, such as the Cauchy, are the exception. This is why so many real-world quantities look bell-shaped: they are themselves sums or averages of many smaller effects.
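The thought experiment of repeating the study thousands of times is easy to run in code. The sketch below draws from an exponential distribution with mean 1 and standard deviation 1, chosen here only because it is strongly skewed. The sample size of 50, the 5,000 repeats, and the function name `sample_means` are all illustrative assumptions.

```python
import random
import statistics


def sample_means(n, repeats, seed=1):
    """Run `repeats` experiments of size `n` on Exp(1) data; return each sample mean."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(repeats)
    ]


means = sample_means(n=50, repeats=5_000)
centre = statistics.fmean(means)   # should sit near the true mean, 1.0
spread = statistics.stdev(means)   # should sit near sigma / sqrt(N)
predicted = 1.0 / 50 ** 0.5        # sigma = 1 for Exp(1)

# For a bell curve, about 68% of values fall within one standard deviation.
within_one = sum(abs(m - centre) <= spread for m in means) / len(means)

print(f"centre of sample means : {centre:.3f}")
print(f"spread of sample means : {spread:.3f} (CLT predicts {predicted:.3f})")
print(f"share within one spread: {within_one:.3f}")
```

Even though single exponential draws are lopsided, the collection of sample means should come out centred near 1 with a spread close to the predicted value.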

Statistical estimation

This is where probability flips into statistics. You no longer know the true mean; you have data, and you want to guess. The natural guess is the sample mean itself. LLN tells you this guess is consistent: with enough data, it lands on the truth. CLT tells you how uncertain it is once N is reasonably large: the sample mean is approximately normal around the truth, with a standard error equal to the underlying standard deviation divided by the square root of N. If you do not know the underlying standard deviation either, the sample standard deviation is itself a consistent estimate of it, so you can use it in its place.

The square-root scaling is humbling. To halve your uncertainty, you need four times the data; to cut it by a factor of ten, you need a hundred times the data. This is why national polls of a few thousand people can be remarkably precise about a population of millions, and why pushing precision further is expensive.
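The square-root rule can be written down in a few lines. This is a minimal sketch; `standard_error` and `n_for_standard_error` are names invented for this example, and sigma is assumed to be known.

```python
import math


def standard_error(sigma, n):
    """Standard error of the mean: sigma divided by the square root of N."""
    return sigma / math.sqrt(n)


def n_for_standard_error(sigma, target_se):
    """Smallest N whose standard error is at or below target_se."""
    return math.ceil((sigma / target_se) ** 2)


sigma = 1.0
for n in (100, 400, 10_000):
    print(f"N={n:>6}  standard error={standard_error(sigma, n):.4f}")

# Halving the target standard error quadruples the data requirement.
print(n_for_standard_error(sigma, 0.5), n_for_standard_error(sigma, 0.25))
```

Going from N = 100 to N = 400 halves the standard error, and reaching a tenth of the original takes N = 10,000.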

Confidence intervals

A confidence interval converts the CLT shape into a usable range. A 95% confidence interval for the true mean is, roughly, the sample mean plus or minus 1.96 standard errors. The 1.96 comes directly from the bell curve — 95% of a normal distribution's mass sits within that many standard deviations of its centre.
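If you want to check where 1.96 comes from, Python's standard library exposes the normal distribution through `statistics.NormalDist`. The helper name `z_multiplier` below is an illustrative choice, not part of that module.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_multiplier(confidence):
    """Number of standard errors that captures `confidence` of a bell curve,
    split evenly between the two tails."""
    tail = (1 - confidence) / 2
    return standard_normal.inv_cdf(1 - tail)


# Mass of the bell curve lying within 1.96 standard deviations of the centre.
mass = standard_normal.cdf(1.96) - standard_normal.cdf(-1.96)

print(f"multiplier for 95%: {z_multiplier(0.95):.3f}")
print(f"mass within ±1.96 : {mass:.4f}")
```

The same function gives the multiplier for other confidence levels, for example 90% or 99%, if 95% is not the level you need.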

The interpretation is subtler than most people realise. A 95% confidence interval does not say "there is a 95% probability that the truth is in this particular interval." The truth is a fixed (if unknown) number; it is either in the interval or not. What 95% refers to is the procedure: if you reran the experiment many times and built an interval each time, about 95% of those intervals would contain the true value. The reliability lives in the method, not in any single interval.
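The "reliability lives in the method" reading can be tested directly by simulation. In this sketch the true mean is known, because we generate the data ourselves, and we count how often the interval procedure captures it. The normal population, the parameter values, and the names `interval_95` and `coverage` are assumptions made for illustration.

```python
import math
import random
import statistics


def interval_95(sample):
    """Sample mean plus or minus 1.96 standard errors."""
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    return mean - 1.96 * se, mean + 1.96 * se


def coverage(true_mean=10.0, sigma=0.2, n=100, repeats=2_000, seed=2):
    """Fraction of repeated experiments whose interval contains true_mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sigma) for _ in range(n)]
        low, high = interval_95(sample)
        hits += low <= true_mean <= high  # True counts as 1
    return hits / repeats


print(f"share of intervals containing the truth: {coverage():.3f}")
```

Any single interval either contains 10.0 or it does not. The share printed at the end, which should land close to 0.95, is a property of the procedure as a whole.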

Practical application

  1. Decide what you're estimating. A proportion (fraction of voters preferring candidate A)? A mean (average household income)? A difference (drug response minus placebo response)? The CLT applies to all three — but the standard error formula changes slightly for each.

  2. Pick a sample size N before you collect data. Use the square-root rule in reverse: if you want a margin of error of about 3 percentage points on a proportion, you need roughly N = 1,000; for 1 point, N ≈ 10,000. Aiming for a tighter margin than you actually need is wasted effort.

  3. Compute the sample mean and the sample standard deviation. The first is your point estimate. The second, divided by the square root of N, is your standard error.

  4. Build the 95% interval as point estimate ± 1.96 × standard error. Report both. A point estimate without a margin is misleading; a margin without a point estimate is unusable.

  5. Sanity-check the assumptions. LLN and CLT both assume independent observations from the same distribution. If your samples are clustered (siblings in a health study, votes from one neighbourhood), the effective N is smaller than the headcount, and an interval computed from the headcount is narrower than it should be.
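Steps 1 and 2 can be sketched in code as follows. These are the usual textbook standard-error formulas for a mean, a proportion, and a difference between two independent groups. The function names are illustrative, and the sample-size helper assumes the worst case p = 0.5 unless you pass a better guess.

```python
import math

Z_95 = 1.96  # multiplier for a 95% interval


def se_mean(sd, n):
    """Standard error of a sample mean."""
    return sd / math.sqrt(n)


def se_proportion(p, n):
    """Standard error of a sample proportion p."""
    return math.sqrt(p * (1 - p) / n)


def se_difference(sd_a, n_a, sd_b, n_b):
    """Standard error of (mean A - mean B), assuming independent groups."""
    return math.sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)


def sample_size_for_margin(margin, p=0.5):
    """Smallest N giving a 95% margin of error of `margin` on a proportion."""
    return math.ceil(Z_95 ** 2 * p * (1 - p) / margin ** 2)


for margin in (0.03, 0.01):
    print(f"margin ±{margin:.0%}: N of about {sample_size_for_margin(margin)}")
```

With p = 0.5 this gives a little over 1,000 respondents for a 3-point margin and a little under 10,000 for a 1-point margin, in line with the rule of thumb in step 2.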

Example: a quality-control inspector

A factory makes ball bearings with a target diameter of 10 mm. The manager wants to know whether the production line is drifting. She measures 100 bearings and finds an average diameter of 10.04 mm with a sample standard deviation of 0.20 mm.

  • Standard error of the mean: 0.20 ÷ √100 = 0.02 mm.
  • 95% confidence interval: 10.04 ± 1.96 × 0.02 = [10.001, 10.079] mm.

The interval sits entirely above 10 mm, so it is implausible — at the 5% level — that the line is still producing on target. The manager investigates. If she had measured only 25 bearings, the standard error would have been 0.04 mm, the interval would have been [9.96, 10.12] mm, and 10 mm would have been comfortably inside it. The same drift, undetectable with a small sample, becomes statistically clear with a larger one. That is the CLT at work: not by changing the mean, but by sharpening the bell around it.
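The arithmetic in this example is easy to reproduce. The sketch below works from the summary statistics in the text (mean 10.04 mm, sample standard deviation 0.20 mm) and compares the two sample sizes; the function name `mean_interval` is an illustrative choice.

```python
import math


def mean_interval(mean, sd, n, z=1.96):
    """Confidence interval for a mean, from summary statistics."""
    se = sd / math.sqrt(n)
    return mean - z * se, mean + z * se


TARGET_MM = 10.0

for n in (100, 25):
    low, high = mean_interval(10.04, 0.20, n)
    verdict = "contains" if low <= TARGET_MM <= high else "excludes"
    print(f"N={n:>3}: [{low:.3f}, {high:.3f}] mm {verdict} the 10 mm target")
```

Only N changes between the two runs. The sample mean and standard deviation are held fixed, which is what isolates the effect of sample size on the interval's width.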

The example also illustrates the gambler's-fallacy mistake the manager might be tempted to make. If the next ten bearings happen to come in below 10 mm, it is not because the line is "correcting itself" — the machinery has no memory. Each bearing is an independent draw from whatever distribution the (possibly misaligned) line currently produces. Convergence happens only over the long run, and only on the average.
