Choosing a sample

Core idea

A statistical study almost never measures the whole population; it samples. Three nested ideas make this rigorous: the population (everything you'd ultimately like to know about), the sampling frame (the enumerable list you can actually draw from), and the sample (the subset you measure). Simple random sampling is the gold standard for fairness, but systematic, stratified, and cluster sampling all trade some randomness for practical advantages. Larger samples reduce sampling error: the spread of the sample mean shrinks roughly in proportion to 1/√n, and the central limit theorem adds that the sample mean is approximately normally distributed around the population mean.

Why it matters

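The most rigorous analysis cannot rescue a badly drawn sample. The 1948 US presidential election was famously called for Dewey because the major polls relied on quota sampling, which let interviewers choose their own respondents and over-represented Republican voters, and because polling stopped weeks before election day. The telephone-list failure usually cited alongside it is the 1936 Literary Digest poll, whose frame of telephone directories and car registrations skewed toward wealthier households and predicted a Landon win. Survey methodology is upstream of every conclusion: pick the wrong sampling frame and the most beautiful regression line is meaningless. Conversely, a thoughtfully chosen sample of a few thousand can predict a national election, because the size of the sample matters far less than its representativeness.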

Mental model

Nested levels: population, frame, sample

The three terms are commonly conflated. Keep them straight and most sampling questions answer themselves.
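
As a minimal sketch of how the three levels nest, the Python below uses made-up integer IDs. The sizes (10,000 people, a frame that misses every fifth ID, a sample of 200) are assumptions chosen only for illustration.

```python
import random

# Hypothetical population: everyone we would like to know about.
population = set(range(10_000))

# Hypothetical frame: the list we can actually draw from.
# Here it misses every fifth person (for example, people with no listed phone).
frame = sorted(pid for pid in population if pid % 5 != 0)

# Sample: the subset of the frame we actually measure.
rng = random.Random(42)
sample = rng.sample(frame, k=200)

coverage = len(frame) / len(population)
print(f"frame covers {coverage:.0%} of the population")
print(f"sample is {len(sample)} of {len(frame)} frame members")
```

No amount of care in the final draw can reach the people the frame leaves out, which is why frame coverage is worth checking before anything else.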

The four sampling techniques

Random is the textbook ideal, but real-world projects often need cheaper or more targeted methods. Each technique has a niche.
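
The four techniques can be sketched in a few lines each. The function names, the integer frame, and the grouping rules below are hypothetical and exist only to show the shape of each method; a real survey would group units by meaningful strata and clusters.

```python
import random

def simple_random_sample(frame, k, rng):
    """Every subset of size k is equally likely."""
    return rng.sample(frame, k)

def systematic_sample(frame, k, rng):
    """Pick a random start, then take every step-th item."""
    step = len(frame) // k
    start = rng.randrange(step)
    return [frame[start + i * step] for i in range(k)]

def stratified_sample(frame, stratum_of, k_per_stratum, rng):
    """Draw a simple random sample of fixed size inside each stratum."""
    strata = {}
    for item in frame:
        strata.setdefault(stratum_of(item), []).append(item)
    return {name: rng.sample(members, k_per_stratum)
            for name, members in strata.items()}

def cluster_sample(frame, cluster_of, n_clusters, rng):
    """Pick whole clusters at random and keep every member of each."""
    clusters = {}
    for item in frame:
        clusters.setdefault(cluster_of(item), []).append(item)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return [item for name in chosen for item in clusters[name]]

rng = random.Random(0)
frame = list(range(1_000))  # hypothetical frame of 1,000 unit IDs

srs = simple_random_sample(frame, 50, rng)
sys_s = systematic_sample(frame, 50, rng)
# Made-up strata (even/odd IDs) and clusters (blocks of 100 IDs).
strat = stratified_sample(frame, lambda i: "even" if i % 2 == 0 else "odd", 25, rng)
clus = cluster_sample(frame, lambda i: i // 100, 2, rng)
```

Note the structural difference between the last two: stratified sampling takes a few units from every group, while cluster sampling takes every unit from a few groups.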

Sampling error and the central limit theorem

Even with perfect random sampling, two samples drawn from the same population will differ. The size of that natural variation shrinks as the sample size grows, roughly in proportion to 1/√n: a 100-tin sample estimates the true mean far more tightly than a 5-tin sample.
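
A small simulation makes the effect visible. This is a sketch under stated assumptions: the population of tin weights is synthetic, with a made-up mean of 423 g and standard deviation of 10 g, and the helper name `spread_of_sample_means` is hypothetical.

```python
import random
import statistics

rng = random.Random(1)

# Synthetic population of 50,000 tin weights in grams (made-up mean and spread).
population = [rng.gauss(423, 10) for _ in range(50_000)]

def spread_of_sample_means(n, repeats=2_000):
    """Standard deviation of the sample mean across many repeated samples of size n."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(repeats)]
    return statistics.stdev(means)

spread_5 = spread_of_sample_means(5)
spread_100 = spread_of_sample_means(100)
print(f"n=5:   sample means vary by about {spread_5:.2f} g")
print(f"n=100: sample means vary by about {spread_100:.2f} g")
```

Theory predicts a spread near σ/√n, so the 100-tin spread should come out around √20 ≈ 4.5 times smaller than the 5-tin spread.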

Practical application

To design a defensible survey, walk through this checklist before collecting any data.

  1. Define the population precisely. Not just "voters" — "registered voters who plan to vote in the May 2026 election in constituency X."

  2. Build (or choose) a sampling frame that covers as much of the population as practical. A voter roll is a frame; a telephone directory is a different frame; both miss different parts of the population.

  3. Pick a sampling technique. Random for fairness, systematic for cost, stratified for guaranteed subgroup coverage, cluster for geographic efficiency. Most professional surveys combine these.

  4. Choose a sample size large enough to give the precision you need. A useful intuition: halving the confidence-interval width requires roughly quadrupling the sample size.

  5. Generate random numbers honestly. Use a table, a calculator's RNG, or a computer's random function. Avoid "convenience samples" — friends, first-on-the-list, easiest to reach — which import unmodelled bias.

  6. Report the confidence interval, not just the point estimate. "Average tin weight is 423 g ± 4 g at 95% confidence" is honest. "Average tin weight is 423 g" is misleading.

  7. Audit for human error after data collection. Did all respondents answer? Did the wording bias them? Were any clusters under-represented in practice despite the design?
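
Steps 4 and 6 of the checklist can be sketched with the normal approximation. The function names are hypothetical, the 20 g standard deviation is an invented planning guess, and for small samples a t critical value would be more appropriate than the fixed 1.96 used here.

```python
import math
import statistics

Z_95 = 1.96  # normal critical value for 95% confidence

def required_sample_size(sigma, margin, z=Z_95):
    """Smallest n whose margin of error z * sigma / sqrt(n) is at most `margin`."""
    return math.ceil((z * sigma / margin) ** 2)

def mean_confidence_interval(values, z=Z_95):
    """Normal-approximation confidence interval for the mean of `values`."""
    n = len(values)
    mean = statistics.fmean(values)
    margin = z * statistics.stdev(values) / math.sqrt(n)
    return mean - margin, mean + margin

# Hypothetical planning guess: tin weights have a standard deviation near 20 g.
n_for_4g = required_sample_size(sigma=20, margin=4)
n_for_2g = required_sample_size(sigma=20, margin=2)
print(n_for_4g, n_for_2g)
```

Under these assumptions, halving the margin from 4 g to 2 g raises the required sample from 97 to 385 tins, which is the "roughly quadruple" rule from step 4.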

Example

Suppose you run a small SaaS company with 2,400 active users and you want to estimate average monthly time-in-app to inform a pricing decision.

Population. All 2,400 active users.

Sampling frame. Your user database is the natural frame. Coverage looks excellent — every active user is in there. But: how do you define "active"? Logged in this month? Paid this month? Your operational definition shapes who counts as eligible.

Sample size. With 2,400 users, you don't need to interview all of them. A random sample of 200 users (about 8%) will usually give a confidence interval narrow enough for pricing decisions, depending on how variable usage is. Doubling to 400 only narrows the interval by a factor of √2 ≈ 1.4, not 2, so diminishing returns set in fast.
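
The √2 figure follows from the usual normal-approximation margin of error for a mean. In the sketch below, s (the sample standard deviation) and z (the critical value, 1.96 for 95% confidence) are symbols introduced here for illustration.

```latex
E(n) = z\,\frac{s}{\sqrt{n}}
\quad\Longrightarrow\quad
\frac{E(200)}{E(400)} = \sqrt{\frac{400}{200}} = \sqrt{2} \approx 1.41
```

Because the sample is drawn without replacement from only N = 2,400 users, a finite population correction factor of √((N − n)/(N − 1)) shrinks the margin slightly further, by about 4% at n = 200.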

Sampling technique. Random sampling from the user-ID list is straightforward: your database has an RNG, and a script can pick 200 IDs. But are free-tier and paid-tier users mixed together? If pricing decisions hinge on paid-tier behaviour, stratified random sampling is better: split the frame into free and paid and sample 100 from each. This guarantees both tiers are represented even if the free tier dominates. When you later combine the tiers into one overall average, weight each tier by its share of the user base, because equal-sized strata over-represent the smaller tier.
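
A sketch of the stratified draw, including the step of weighting tier means by tier size to get an all-user average. Everything numeric here is an assumption for illustration: the 1,800/600 split between tiers and the usage distributions are synthetic, not taken from any real product.

```python
import random
import statistics

rng = random.Random(7)

# Synthetic frame, for illustration only: 1,800 free and 600 paid users,
# each with made-up monthly hours in the app.
frame = {
    "free": [max(0.0, rng.gauss(4, 2)) for _ in range(1_800)],
    "paid": [max(0.0, rng.gauss(12, 4)) for _ in range(600)],
}
total_users = sum(len(hours) for hours in frame.values())

# Stratified random sample: 100 users from each tier.
sample = {tier: rng.sample(hours, 100) for tier, hours in frame.items()}

# Per-tier estimates answer tier-specific pricing questions directly.
tier_means = {tier: statistics.fmean(hours) for tier, hours in sample.items()}

# For an all-user average, weight each tier by its share of the frame;
# a plain average of the 200 sampled users would over-count the paid tier.
weighted_mean = sum(
    tier_means[tier] * len(frame[tier]) / total_users for tier in frame
)
unweighted_mean = statistics.fmean(sample["free"] + sample["paid"])

print(f"weighted estimate:   {weighted_mean:.1f} h")
print(f"unweighted estimate: {unweighted_mean:.1f} h")
```

With these synthetic numbers the unweighted figure lands well above the weighted one, because the paid tier is a quarter of the users but half of the sample.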

Watch for human error. If you instead choose "the 200 users who replied to our email survey," you've drawn a self-selected convenience sample biased toward engaged customers. Their time-in-app will systematically overstate the population mean. The whole pricing decision could rest on a number inflated by ~20%, and you'd never know unless you explicitly compare the respondents against usage data for the full user base.
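
The direction of this bias can be illustrated with a toy simulation. The usage distribution and the response model below are invented purely to show self-selection at work, so the size of the gap they produce says nothing about any real product, and the number of simulated responders is not fixed at 200.

```python
import random
import statistics

rng = random.Random(3)

# Synthetic monthly hours for 2,400 users (made-up distribution).
hours = [rng.expovariate(1 / 6) for _ in range(2_400)]

# Assumed response model: heavier users are more likely to answer the email.
# The probability formula is invented purely to illustrate self-selection.
def responds(h):
    return rng.random() < min(1.0, 0.05 + h / 60)

responders = [h for h in hours if responds(h)]
random_sample = rng.sample(hours, 200)

print(f"all users:      {statistics.fmean(hours):.1f} h")
print(f"email replies:  {statistics.fmean(responders):.1f} h")
print(f"random sample:  {statistics.fmean(random_sample):.1f} h")
```

Comparing the responder mean against a figure computed from the whole user base, as the first two printed lines do, is one way to make the discrepancy visible.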

Report with a confidence interval. "Mean monthly time-in-app is 7.4 hours; the 95% confidence interval is [6.8, 8.0] hours" tells the pricing team both the estimate and the uncertainty, which is far more useful than a single point estimate that might be off by 15%.

The discipline of sampling is mostly a discipline of admitting where your knowledge is partial — and then quantifying how partial.
