Definition
A spurious correlation is a statistical relationship between two variables that exists in the data but does not reflect any direct causal link between them. The numbers move together, sometimes with a high correlation coefficient and a tiny p-value, but neither one is producing the other. The apparent association is an artefact of a hidden third factor, of how the data was selected, or simply of chance across a large number of comparisons.
Why it matters
Spurious correlations are among the most common reasons that headline statistical findings turn out to be misleading. The mathematics behind correlation is silent about cause; only an outside argument (a controlled experiment, a domain mechanism, a careful design) can promote an association into a causal claim.
How it works
The classic textbook example is the strong positive correlation between ice cream sales and drowning deaths. Higher ice cream sales in a given month do not cause drowning, nor does drowning cause ice cream consumption. Both variables are driven by a third — hot weather — which independently increases swimming activity and ice cream purchases. The shared cause produces the association even though the two outcomes are causally independent. Statisticians call this third variable a confounder, and the only general defence against it is to identify and condition on it, or to randomise the input of interest so that confounders cannot align systematically with it.
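The confounding mechanism can be reproduced with synthetic data. The sketch below is illustrative only: the temperature distribution, the coefficients, and the noise levels are invented assumptions, not measurements. It generates two outcomes that each depend on temperature and on nothing else, then compares their raw correlation with the correlation that remains after the linear effect of temperature is removed from both (a partial correlation).

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def residuals(ys, zs):
    """Residuals of ys after a least-squares straight-line fit on zs."""
    n = len(ys)
    mz, my = sum(zs) / n, sum(ys) / n
    slope = (sum((z - mz) * (y - my) for z, y in zip(zs, ys))
             / sum((z - mz) ** 2 for z in zs))
    return [(y - my) - slope * (z - mz) for z, y in zip(zs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 5000

# Hypothetical confounder: daily temperature (made-up mean and spread).
temperature = [rng.gauss(20, 8) for _ in range(n)]

# Each outcome depends on temperature plus its own independent noise.
# Neither outcome appears in the other's formula.
ice_cream = [3.0 * t + rng.gauss(0, 10) for t in temperature]
drownings = [0.5 * t + rng.gauss(0, 3) for t in temperature]

raw_r = pearson(ice_cream, drownings)

# Conditioning on the confounder: correlate what is left of each
# outcome once the linear effect of temperature has been removed.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(drownings, temperature))

print(f"raw correlation:     {raw_r:.3f}")
print(f"partial correlation: {partial_r:.3f}")
```

Under these assumptions the raw correlation is strongly positive while the partial correlation sits close to zero, which is what conditioning on a confounder is meant to reveal. Linear residuals only remove a linear confounding effect; a real analysis would need to justify that modelling choice.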
A second mechanism is plain chance over many comparisons. If you compute correlations between every pair of variables in a dataset of a hundred columns, you will perform 100 × 99 / 2 = 4,950 tests. At a significance level of 0.05, roughly two hundred and fifty of those will reach significance by accident even if every pair is truly independent. Tyler Vigen, on his Spurious Correlations website, has assembled hundreds of these (divorce rates in Maine vs. per-capita margarine consumption, US spending on science vs. suicides by hanging) to make the point. The numbers are real; the relationships are not.
A third mechanism is selection: if the sample only includes individuals who scored highly on at least one of two unrelated traits, the two traits will appear negatively correlated within the sample even though they are independent in the population. This effect is often called Berkson's paradox.
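Both the chance and the selection mechanisms can be checked by simulation. The following sketch uses only synthetic, independent standard-normal data; the function names, the sample sizes, and the selection threshold are assumptions chosen for illustration. The significance cut-off for |r| uses the Fisher z approximation rather than an exact t-test.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def count_chance_hits(n_cols=100, n_rows=50, seed=0):
    """Count 'significant' pairs among mutually independent columns."""
    rng = random.Random(seed)
    cols = [[rng.gauss(0, 1) for _ in range(n_rows)] for _ in range(n_cols)]
    # Approximate two-sided 5% critical value for |r| (Fisher z transform).
    r_crit = math.tanh(1.96 / math.sqrt(n_rows - 3))
    pairs = list(combinations(range(n_cols), 2))
    hits = sum(1 for i, j in pairs
               if abs(pearson(cols[i], cols[j])) > r_crit)
    return len(pairs), hits


def selection_effect(n=20000, threshold=1.0, seed=1):
    """Correlation of two independent traits before and after selection."""
    rng = random.Random(seed)
    a = [rng.gauss(0, 1) for _ in range(n)]
    b = [rng.gauss(0, 1) for _ in range(n)]
    # Keep only individuals who score highly on at least one trait.
    kept = [(x, y) for x, y in zip(a, b) if x > threshold or y > threshold]
    kept_a, kept_b = zip(*kept)
    return pearson(a, b), pearson(kept_a, kept_b)


n_pairs, n_hits = count_chance_hits()
population_r, selected_r = selection_effect()

print(f"{n_pairs} pairs tested, {n_hits} significant by chance")
print(f"population r = {population_r:.3f}, selected-sample r = {selected_r:.3f}")
```

With a hundred independent columns the count of chance hits should land in the neighbourhood of 5% of the 4,950 pairs, and the selected sample should show a clearly negative correlation while the full population shows almost none. Exact figures depend on the seed and on the assumed sizes.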