Definition
Sampling bias is a systematic, non-random error introduced when some members of the target population have a higher or lower probability of being included in the sample than the study design intends or accounts for. The result is a sample whose characteristics differ from the population's in a predictable direction, and any estimate built from that sample inherits the same tilt. Crucially, sampling bias does not vanish as the sample grows; collecting more biased data merely produces a more confidently wrong answer.
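The claim that more data does not help can be made concrete with a short sketch under simplifying assumptions. Suppose the sample mean $\bar{y}_n$ of $n$ independent draws has expectation $\mu + b$, where $\mu$ is the population mean and $b \neq 0$ is the bias induced by the selection mechanism, and the sampled values have variance $\sigma_s^2$. These symbols are introduced here for illustration only. The mean squared error then splits into a variance term and a bias term:

```latex
\mathbb{E}\big[(\bar{y}_n - \mu)^2\big]
  = \underbrace{\frac{\sigma_s^2}{n}}_{\text{variance}}
  + \underbrace{b^2}_{\text{bias}^2}
  \;\xrightarrow[\,n \to \infty\,]{}\; b^2
```

The variance term shrinks as $n$ grows, but $b^2$ does not depend on $n$ at all, so the estimate settles ever more tightly around the wrong value.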
Why it matters
The phenomenon is one of the most common reasons that careful-looking analyses produce conclusions that fail to replicate or fall apart when applied to a different population. It is also one of the easiest errors to commit and one of the hardest to diagnose from the data alone.
How it works
The classic example is the 1936 Literary Digest presidential poll, which drew millions of responses from a frame built on its own subscribers, car owners, and telephone users, and confidently predicted that Alf Landon would defeat Franklin Roosevelt in a landslide. Roosevelt won 46 of 48 states. The sample was enormous but drawn from a frame that systematically over-represented wealthier Americans, who in 1936 differed sharply from the broader electorate in their voting preferences. The bias was structural, baked into who could be reached at all, and no number of additional respondents from the same frame would have corrected the error. A much smaller sample collected by George Gallup that year, selected by quotas to mirror the electorate rather than by convenience, did call the result correctly.
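The contrast between a huge sample from a skewed frame and a small representative one can be illustrated with a toy simulation. This is a minimal sketch, not a reconstruction of the 1936 polls: the electorate, the 30% frame coverage, and the support rates are invented for illustration, and a simple random sample stands in for "a small sample that represents everyone".

```python
import random


def simulate_poll(seed=0):
    """Compare a large sample from a skewed frame with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical electorate (all numbers are illustrative assumptions):
    # 30% of voters appear in the sampling frame and back candidate A at 35%;
    # the other 70% are outside the frame and back candidate A at 70%.
    population = []
    for _ in range(200_000):
        in_frame = rng.random() < 0.30
        p_support = 0.35 if in_frame else 0.70
        votes_a = 1 if rng.random() < p_support else 0
        population.append((in_frame, votes_a))

    true_share = sum(v for _, v in population) / len(population)

    # Large sample, but drawn only from voters who are in the frame.
    frame = [v for in_frame, v in population if in_frame]
    big_biased = rng.sample(frame, 50_000)
    biased_estimate = sum(big_biased) / len(big_biased)

    # Small sample, but every voter is equally likely to be drawn.
    small_random = rng.sample(population, 2_000)
    random_estimate = sum(v for _, v in small_random) / len(small_random)

    return true_share, biased_estimate, random_estimate


if __name__ == "__main__":
    true_share, biased, unbiased = simulate_poll()
    print(f"true share for A:          {true_share:.3f}")
    print(f"50,000 from skewed frame:  {biased:.3f}")
    print(f"2,000 drawn at random:     {unbiased:.3f}")
```

Under these assumptions the 50,000-person estimate lands near the frame's own support rate rather than the electorate's, while the 2,000-person random sample stays within sampling noise of the true share. Increasing the size of the biased sample only tightens it around the frame's value.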
Survivorship bias works similarly. Studying the habits of successful companies, durable buildings, or long-lived investors tells you about the survivors but says nothing about the full population of attempts, many of which had identical habits and failed. Non-response bias arises when the people who decline to answer differ systematically from those who reply; a patient-satisfaction survey that only the happy patients return skews the rating upward. The common pattern is that the bias lives in the gap between the population we want to know about and the population we actually observe, and closing that gap requires thinking about the sampling mechanism, not just the numbers it produced.
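The non-response case can be worked through with plain arithmetic. The sketch below assumes a hypothetical distribution of satisfaction ratings (1 to 5) across all patients and a hypothetical response rate that rises with satisfaction; both tables are made up for illustration. It compares the mean rating of the whole population with the mean among those who return the survey.

```python
def nonresponse_means(share, response_rate):
    """Return (true_mean, observed_mean) for a rating survey.

    share:         rating -> fraction of all patients holding that rating
    response_rate: rating -> probability such a patient returns the survey
    """
    # Mean over everyone, which is what we want to know.
    true_mean = sum(rating * frac for rating, frac in share.items())

    # Mass of each rating among the surveys that actually come back.
    returned = {r: share[r] * response_rate[r] for r in share}
    total_returned = sum(returned.values())

    # Mean over respondents only, which is what we actually observe.
    observed_mean = sum(r * w for r, w in returned.items()) / total_returned
    return true_mean, observed_mean


# Hypothetical inputs: happier patients are assumed more likely to reply.
share = {1: 0.10, 2: 0.15, 3: 0.25, 4: 0.30, 5: 0.20}
response_rate = {1: 0.05, 2: 0.10, 3: 0.20, 4: 0.40, 5: 0.60}

if __name__ == "__main__":
    true_mean, observed_mean = nonresponse_means(share, response_rate)
    print(f"true mean rating:     {true_mean:.2f}")      # about 3.35
    print(f"observed mean rating: {observed_mean:.2f}")  # about 4.08
```

With these assumed numbers the returned surveys average roughly 4.08 against a true mean of 3.35. The gap comes entirely from who replies, which is why it cannot be detected from the returned ratings alone.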