Concept

Correlation

Definition

Correlation is a scalar summary, bounded between minus one and plus one, of how two random variables co-move. Positive values indicate that high values of one variable tend to coincide with high values of the other; negative values indicate the opposite; a value near zero indicates no linear association (for Pearson) or no monotonic association (for rank-based measures), which is a weaker statement than independence. The most familiar variant is the Pearson correlation coefficient, which measures the strength of a linear relationship by dividing the covariance of the two variables by the product of their standard deviations.
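As a minimal sketch of the Pearson definition, the pure-Python function below divides the covariance by the product of the standard deviations. The name `pearson` and the sample data are illustrative assumptions, not taken from any particular library.

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance over the product of standard deviations."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # The 1/n factors of the covariance and the two variances cancel,
    # so plain sums of products and squares are enough.
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return cov / math.sqrt(var_x * var_y)

x = [1.0, 2.0, 3.0, 4.0, 5.0]
r_up = pearson(x, [2.0, 4.0, 6.0, 8.0, 10.0])    # perfectly increasing line
r_down = pearson(x, [10.0, 8.0, 6.0, 4.0, 2.0])  # perfectly decreasing line
```

The two toy series sit exactly on straight lines, so the coefficient reaches its bounds of plus one and minus one respectively.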

Two rank-based alternatives matter when the relationship is monotonic but non-linear. Spearman's rho computes Pearson correlation on the ranks rather than the raw values; it captures any monotonic pattern, is robust to outliers, and is unchanged by any strictly increasing transformation of either axis. Kendall's tau is the difference between the proportions of concordant and discordant pairs; it has better small-sample statistical properties, at the cost of being more expensive to compute when every pair is compared directly. Choosing among the three is a matter of what assumption about the relationship you are willing to make.
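The two rank-based measures can be sketched in a few lines of pure Python. The helper names (`average_ranks`, `spearman`, `kendall_tau_a`) are hypothetical, and the Kendall variant shown is the simple tau-a, which assumes no ties; library implementations usually default to tie-corrected variants.

```python
import math

def pearson(x, y):
    # Assumes equal-length, non-constant samples.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    # 1-based ranks; tied values share the mean of the positions they occupy.
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    # Spearman's rho is Pearson correlation applied to the ranks.
    return pearson(average_ranks(x), average_ranks(y))

def kendall_tau_a(x, y):
    # (concordant pairs - discordant pairs) / total pairs, ignoring tie corrections.
    n = len(x)
    concordant = discordant = 0
    for i in range(n):
        for j in range(i + 1, n):
            sign = (x[i] - x[j]) * (y[i] - y[j])
            if sign > 0:
                concordant += 1
            elif sign < 0:
                discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
cubes = [v ** 3 for v in x]   # monotonic but non-linear in x
swapped = [1, 3, 2, 5, 4]     # two neighbouring pairs out of order

r_cubes = pearson(x, cubes)          # below 1: the relationship is not a line
rho_cubes = spearman(x, cubes)       # 1: the ranks agree exactly
tau_cubes = kendall_tau_a(x, cubes)  # 1: every pair is concordant
rho_swapped = spearman(x, swapped)
tau_swapped = kendall_tau_a(x, swapped)
```

The pairwise loop in `kendall_tau_a` is quadratic in the sample size, which is the computational cost noted above.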

Why it matters

How it works

Pearson's coefficient assumes both variables are roughly continuous and that the relationship between them is linear; it is sensitive to outliers (one extreme point can dominate the calculation) and misleading when the underlying relationship is non-monotonic. Spearman relaxes the linearity assumption but still assumes the association is monotonic. If Y increases and then decreases as X grows, both Pearson and Spearman can come out near zero despite strong dependence; you need a measure of general dependence such as mutual information or distance correlation. Kendall is mathematically attractive when sample sizes are small but is rarely the default in production data work because Spearman is faster to compute and behaves similarly on most data.
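Both failure modes are easy to reproduce on toy data. The sketch below uses made-up numbers and hypothetical helper names: a symmetric parabola that Pearson and Spearman both miss, and a small uncorrelated sample to which a single extreme point is added.

```python
import math

def pearson(x, y):
    # Assumes equal-length, non-constant samples.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def average_ranks(values):
    # 1-based ranks; tied values share the mean of the positions they occupy.
    s = sorted(values)
    return [s.index(v) + (s.count(v) + 1) / 2 for v in values]

def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))

# Non-monotonic case: y is fully determined by x, yet both coefficients vanish.
xs = list(range(-5, 6))
ys = [v * v for v in xs]
r_parabola = pearson(xs, ys)
rho_parabola = spearman(xs, ys)

# Outlier case: a sample with zero correlation, then one extreme point appended.
base_x = [1, 2, 3, 4, 5, 6]
base_y = [3, 1, 5, 5, 1, 3]
r_base = pearson(base_x, base_y)
r_outlier = pearson(base_x + [30], base_y + [30])
rho_outlier = spearman(base_x + [30], base_y + [30])
```

On this data the single added point pushes Pearson from zero to above 0.9, while Spearman moves far less because the outlier contributes only its rank, not its magnitude.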

Three practical traps recur in applied work. Spurious correlation appears when two unrelated variables both trend over time: both go up, the Pearson correlation between them is high, and neither causes the other. Differencing or detrending the series is the standard fix. Tail dependence is the phenomenon that correlation estimated in calm periods systematically understates the comovement that emerges in crisis periods; equities that looked diversifying in normal times all crashed together in 2008. Mixed populations can produce a strong correlation in pooled data that disappears or reverses within each subgroup (Simpson's paradox); the resolution is to model the subgroups explicitly. Always plot the data: a scatter plot will reveal at a glance what a correlation coefficient summarises away.
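Two of these traps can be illustrated with synthetic data. In the sketch below, the series, the seed, and the group values are all made up for illustration: two independent noisy trends correlate strongly in levels but not after first differencing, and two subgroups with perfectly negative within-group correlation produce a positive pooled correlation.

```python
import math
import random

def pearson(x, y):
    # Assumes equal-length, non-constant samples.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def first_difference(series):
    # Change from one observation to the next; removes a linear trend.
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Spurious correlation: two series that share nothing except an upward trend.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 500
a = [t + rng.gauss(0, 1) for t in range(n)]
b = [2 * t + rng.gauss(0, 1) for t in range(n)]
r_levels = pearson(a, b)                                     # dominated by the trend
r_diffs = pearson(first_difference(a), first_difference(b))  # trend removed

# Simpson's paradox: negative within each group, positive when pooled.
group_a_x, group_a_y = [1, 2, 3, 4], [4, 3, 2, 1]
group_b_x, group_b_y = [6, 7, 8, 9], [9, 8, 7, 6]
r_group_a = pearson(group_a_x, group_a_y)
r_group_b = pearson(group_b_x, group_b_y)
r_pooled = pearson(group_a_x + group_b_x, group_a_y + group_b_y)
```

The pooled coefficient is positive only because the second group sits higher on both axes than the first; conditioning on group membership recovers the negative relationship.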

Where it goes next
