Correlation: measuring the strength of a relationship
Core idea
Where regression draws the best-fit line, correlation measures how tightly the points hug that line. The product-moment correlation coefficient r summarises the relationship in a single number between -1 and 1: r = 1 is perfect positive correlation, r = -1 is perfect negative correlation, r = 0 is no linear association at all. For rank-only data (positions rather than measurements) the analogous statistic is Spearman's rank correlation rs, computed from the differences in ranks. Crucially, neither coefficient says anything about cause and effect. A high r is at best a hint of a causal link, never sufficient evidence for one, and a causal link can exist without a high r when the relationship is not linear.
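As a rough illustration of the two coefficients, the sketch below computes r from deviations about the means and rs from squared rank differences. The function names and the sample data (hours studied, marks, two judges' rankings) are made up for this example, and the rank-difference formula shown assumes there are no tied ranks.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def spearman_rs(rank_x, rank_y):
    """Spearman's rs from rank differences d: 1 - 6*sum(d^2) / (n(n^2 - 1)).
    Assumes both inputs are already ranks with no ties."""
    n = len(rank_x)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical measured data: hours studied and marks for five students.
hours = [1, 2, 3, 4, 5]
marks = [52, 55, 61, 70, 68]
r = pearson_r(hours, marks)            # strong positive, roughly 0.95

# Hypothetical rank data: two judges ranking the same five entries.
judge_a = [1, 2, 3, 4, 5]
judge_b = [2, 1, 3, 5, 4]
rs = spearman_rs(judge_a, judge_b)     # 0.8
```

Both results sit inside the range -1 to 1, and neither says anything about why the variables move together.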
Why it matters
The correlation coefficient is one of the most cited statistics in everyday reporting and one of the most misread. Newspaper headlines casually treat r = 0.7 between two variables as proof that one drives the other; the careful statistician treats it as a clue that demands further investigation. Knowing what r does and does not mean separates literate consumers of statistics from those who routinely get fooled by spurious correlations.
Mental model
The number line of strength
The coefficient occupies a fixed range. Anchoring a few typical values on a mental number line lets you translate any r you encounter into a quick verdict on the relationship.
From cloud to coefficient
Regression and correlation answer different questions about the same scattergraph. Regression asks "what is the trend?". Correlation asks "how tight is the trend?". You compute both routinely; you interpret them separately.
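One way to see that the two questions are separate is to compute both from the same sums. In the sketch below, the helper name and both datasets are invented for illustration: the two sets of points share the same least-squares slope, yet one hugs the line perfectly and the other does not.

```python
from math import sqrt

def slope_and_r(xs, ys):
    """Return (least-squares slope of y on x, Pearson r) from the same three sums."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy / sqrt(sxx * syy)

xs = [1, 2, 3, 4, 5]
tight = [2, 4, 6, 8, 10]   # every point lies on the line y = 2x
loose = [3, 2, 8, 6, 11]   # same trend, points scattered around it

slope_tight, r_tight = slope_and_r(xs, tight)   # slope 2.0, r = 1.0
slope_loose, r_loose = slope_and_r(xs, loose)   # slope 2.0, r about 0.86
```

The trend (slope) is identical in both cases; only the tightness (r) differs.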
The causation ladder
Even a near-perfect correlation does not imply causation. There are at least four distinct reasons two variables can correlate, only one of which is a direct cause-and-effect link. Climbing the ladder of plausibility requires ruling out the other three.
Practical application
- Always plot before computing. A scattergraph with an obvious curve, an outlier, or two clusters can produce a misleadingly tidy r. The number summarises; the picture reveals.
- Pick the right coefficient. Use Pearson's r for two measured variables. Use Spearman's rs when one or both variables are ranks, or when the relationship is monotonic but not linear.
- Interpret with sample size in mind. A textbook rule of thumb: for "statistically meaningful" positive correlation, the threshold for r falls as sample size rises: about 0.6 at n = 20, about 0.3 at n = 90. Smaller samples need stronger coefficients to be convincing.
- List the rival explanations. For every correlation worth reporting, write out the four candidate explanations: A causes B, B causes A, common cause C, coincidence. Knock them out one by one with evidence.
- Refuse the causal claim without an experiment. If the data are observational, the most you can say is "associated with". Reserve "causes" for situations where you have manipulated the cause.
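The first two points can be checked numerically. The sketch below uses invented data: a perfect parabola, where r misses the relationship entirely, and a cubic, where the relationship is monotonic but not linear. The helper names are hypothetical and the ranking helper assumes no tied values.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def ranks(values):
    """Rank 1 = smallest value. Assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    """Spearman's rs from squared rank differences (no ties assumed)."""
    n = len(xs)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

xs = [-3, -2, -1, 0, 1, 2, 3]
parabola = [x ** 2 for x in xs]   # exact relationship, but curved
cubic = [x ** 3 for x in xs]      # monotonic, but not linear

r_parabola = pearson_r(xs, parabola)   # 0: r sees no linear association
r_cubic = pearson_r(xs, cubic)         # high, but below 1
rs_cubic = spearman_rs(xs, cubic)      # 1.0: the rank order agrees perfectly
```

A scattergraph would show the parabola at a glance, while r alone reports "no association". For the cubic, rs captures the perfect monotonic ordering that r understates.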
Example
A school district notices that students who eat breakfast score higher on standardised tests; the correlation is r = 0.62 across 800 students. A naive read says: feed students breakfast, scores will rise. The district could spend millions on free breakfast programmes based on this.
A statistically literate analyst climbs the causation ladder. Hypothesis 1: breakfast directly improves cognitive performance. Hypothesis 2: the link runs the other way, with high-performing students being the ones more likely to eat breakfast (perhaps because they are more organised). Hypothesis 3: family income drives both, since wealthier families both serve breakfast more reliably and provide tutoring, books, and a stable home environment that lifts test scores. Hypothesis 4: coincidence, which is unlikely with r = 0.62 across 800 students.
Hypothesis 3 is the killer. Once family income is held constant in the analysis, the correlation between breakfast and scores can shrink sharply or collapse altogether. A controlled experiment (randomly assigning some students to receive a free breakfast and comparing their scores to a control group six months later) is the only design that can promote breakfast to a cause. The observational correlation, however high, cannot.
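A small simulation can make the common-cause scenario concrete. Everything below is an assumption made for illustration: the data are randomly generated rather than the district's, income is built to drive both variables with no direct breakfast effect, and partial correlation is used as one possible way of "holding income constant". It does not try to reproduce r = 0.62.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Product-moment correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

def partial_r(r_xy, r_xz, r_yz):
    """Correlation of x and y after removing the linear effect of z."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))

rng = random.Random(42)   # fixed seed so the run is repeatable
n = 800

# Hypothetical model: income is the only thing linking the two variables.
income = [rng.gauss(0, 1) for _ in range(n)]
breakfast = [z + rng.gauss(0, 1) for z in income]   # breakfast-frequency index
score = [z + rng.gauss(0, 1) for z in income]       # standardised test score

r_raw = pearson_r(breakfast, score)
r_controlled = partial_r(
    r_raw,
    pearson_r(breakfast, income),
    pearson_r(score, income),
)
```

In this setup the raw correlation between breakfast and score should come out clearly positive, while the value controlled for income should sit near zero, even though breakfast has no effect on scores anywhere in the model.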
The general lesson: a correlation is a question, not an answer. It opens the investigation; only an experiment can close it.
Related concepts
- Correlation
- Correlation Coefficient
- Rank Correlation
- Spurious Correlation
- Causation