Summarising data
Core idea
To compress a batch of numbers into something you can think with, answer two questions. Where is the centre? Use the mean (arithmetic average), the median (middle value when sorted), or the mode (most common value). How spread out are the values around that centre? Use the range, inter-quartile range, mean deviation, variance, or standard deviation. Together, a centre statistic and a spread statistic describe most of what matters about a distribution.
Why it matters
An average without a measure of spread can mislead. "Average earnings rose 8%" sounds positive, but if four workers had their pay cut and one got a huge raise, four in five are worse off even though the average rose. Spread reveals inequality, risk, and variability that a single number hides. Centre plus spread is the smallest honest summary of a dataset, and every richer statistical method builds on this pair.
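The earnings scenario can be sketched in a few lines of Python. The pay figures below are made up for illustration (they are not data from the lesson); they are chosen so that four workers take a cut, one gets a large raise, and the mean still rises by 8%.

```python
from statistics import mean, median

# Hypothetical weekly pay for five workers, before and after (illustrative only).
before = [500, 500, 500, 500, 500]
after = [480, 480, 480, 480, 780]   # four small cuts, one large raise

mean_change = (mean(after) - mean(before)) / mean(before)
median_change = (median(after) - median(before)) / median(before)
worse_off = sum(a < b for a, b in zip(after, before))

print(f"mean change:   {mean_change:+.1%}")
print(f"median change: {median_change:+.1%}")
print(f"workers worse off: {worse_off} of {len(before)}")
```

With these invented numbers the mean rises while the median falls, which is the gap between the headline and the typical worker's experience.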
Mental model
The three averages — when each is appropriate
Mean, median, and mode are not interchangeable. The right average depends on the data type and on whether extreme values should pull the centre or be tolerated as outliers.
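As a rough illustration, Python's standard `statistics` module provides all three averages. The shirt sizes and numbers below are invented for the example; the point is only to show which average suits which kind of data.

```python
from statistics import mean, median, mode

# Hypothetical data, invented for illustration.
shirt_sizes = ["M", "L", "M", "S", "M", "XL"]   # categories: only the mode applies
symmetric = [48, 49, 50, 51, 52]                # clean, symmetric numbers
skewed = [48, 49, 50, 51, 500]                  # same batch with one extreme value

print(mode(shirt_sizes))                        # most common category
print(mean(symmetric), median(symmetric))       # mean and median agree
print(mean(skewed), median(skewed))             # the mean moves, the median does not
```

Replacing a single value with an extreme one shifts the mean substantially but leaves the median where it was.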
The five-figure summary and boxplot
A boxplot encodes five numbers — the two extremes, the two quartiles, and the median — into one visual. It reveals centre, spread, skewness, and outliers at a single glance.
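One way to compute the five numbers behind a boxplot is sketched below. Two things here are assumptions rather than part of the lesson: the quartile convention (Python's "exclusive" method, one of several in use) and the 1.5 × IQR fence for flagging outliers, which is a common boxplot convention. The sample batch is hypothetical.

```python
from statistics import quantiles

def five_figure_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # "exclusive" places the quartiles at positions (n + 1)/4, (n + 1)/2 and
    # 3(n + 1)/4 in sorted order; other conventions can differ slightly.
    q1, q2, q3 = quantiles(data, n=4, method="exclusive")
    return data[0], q1, q2, q3, data[-1]

def outliers_by_fences(values, k=1.5):
    """Flag points more than k * IQR beyond the quartiles (assumed convention)."""
    _, q1, _, q3, _ = five_figure_summary(values)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

sample = [2, 4, 4, 5, 7, 9, 10, 12, 13, 15, 40]   # hypothetical batch
print(five_figure_summary(sample))
print(outliers_by_fences(sample))
```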
Variance and standard deviation — why we square
The mean deviation (the average of the absolute deviations from the mean) gets us partway. The variance averages the squared deviations instead; squaring turns negatives into positives and amplifies the contribution of outliers. The square root of the variance, the standard deviation, brings the result back to the original units.
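The three spread measures can be computed side by side, as in this minimal sketch. It assumes the population form of the variance (dividing by n); the lesson does not say which divisor it intends, and the sample form divides by n − 1 instead. The batch of values is hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev, pvariance

def spread_measures(values):
    """Return (mean deviation, variance, standard deviation) about the mean."""
    centre = mean(values)
    deviations = [x - centre for x in values]
    mean_deviation = sum(abs(d) for d in deviations) / len(values)
    variance = sum(d * d for d in deviations) / len(values)   # divides by n (assumed)
    return mean_deviation, variance, sqrt(variance)

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical batch with mean 5
md, var, sd = spread_measures(values)
print(md, var, sd)

# Cross-check against the standard library's population versions.
print(pvariance(values), pstdev(values))
```

Note that the variance is in squared units, while the mean deviation and the standard deviation are both in the units of the data.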
Practical application
To produce a defensible numerical summary of a batch of data, follow this routine.
- Sort the values. Many calculations (median, quartiles, extremes) depend on the sorted order.
- Pick the centre statistic based on the data type and on whether you want extremes to influence the answer. Use the mode for categories, the median for skewed or outlier-prone numerical data, and the mean for clean symmetric data.
- Pick a spread statistic to pair with the centre. If you reported the median, report the inter-quartile range. If you reported the mean, report the standard deviation. The pairings travel together because the median and IQR are both based on ranks, while the mean and SD are both based on arithmetic distances.
- For weighted data, use weighted formulas. When values come with frequencies (3 households of size 2, 7 households of size 3), multiply each value by its frequency before summing, then divide by the total frequency:
  weighted mean = Σ(fX) / Σf
- Standardise spread if you compare batches with different magnitudes. A spread of 4 mm looks larger than a spread of 3 mm, but against medians of 80 mm and 60 mm respectively, both come to 5% of the centre. Divide spread by median to get a unit-free comparison.
- Draw the boxplot. A picture of the five-figure summary often surfaces patterns (skewness, outliers, gaps) that the numbers alone do not advertise.
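The weighted-mean and standardising steps in the routine above can be sketched as two small helpers. The function names `weighted_mean` and `relative_spread` are hypothetical, chosen for this illustration; the inputs are the household and millimetre figures from the routine.

```python
def weighted_mean(values, frequencies):
    """Mean of values that each occur `frequency` times: sum(f * x) / sum(f)."""
    total = sum(f * x for x, f in zip(values, frequencies))
    return total / sum(frequencies)

def relative_spread(spread, median):
    """Unit-free spread: the spread statistic divided by the median."""
    return spread / median

# Household sizes from the routine: 3 households of size 2, 7 households of size 3.
print(weighted_mean([2, 3], [3, 7]))   # (3*2 + 7*3) / (3 + 7)

# The two batches from the standardising step: spread and median in mm.
print(relative_spread(3, 60), relative_spread(4, 80))
```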
Example
A small online bookstore wants to summarise the past month's order values. Eleven orders are recorded in pounds: 12, 15, 18, 20, 22, 24, 25, 28, 30, 35, 280.
Sorted. Already sorted above. Notice the £280 order — a clear outlier (likely a bulk gift purchase).
Mean. ΣX = 12 + 15 + 18 + 20 + 22 + 24 + 25 + 28 + 30 + 35 + 280 = 509. n = 11. Mean = 509 / 11 ≈ £46.27.
Median. Eleven values, so the median is the sixth in sorted order: £24.
Mode. No value repeats, so there is no useful mode here.
Range. 280 − 12 = £268. Distorted by the outlier.
Inter-quartile range. Lower quartile (the 3rd value) = £18. Upper quartile (the 9th value) = £30. IQR = 30 − 18 = £12.
The summary. A typical order is around £24 (median), and the middle 50% of orders span £12, from £18 to £30 (IQR). The mean of £46 is misleading on its own because it is pulled up by one large outlier.
If we strip the £280 order, the mean falls to (509 − 280) / 10 = £22.90, much closer to the median. This is exactly the dynamic that makes the median + IQR pair the better summary whenever your data has heavy tails or extreme values — and most real-world data does.
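The bookstore figures can be checked with Python's standard library. This is a sketch, and it assumes the "exclusive" quartile method, which for eleven values picks the 3rd and 9th in sorted order, matching the quartiles used in the example.

```python
from statistics import mean, median, quantiles

# Order values in pounds, from the example.
orders = [12, 15, 18, 20, 22, 24, 25, 28, 30, 35, 280]

q1, _, q3 = quantiles(orders, n=4, method="exclusive")   # 3rd and 9th values for n = 11
summary = {
    "mean": round(mean(orders), 2),
    "median": median(orders),
    "range": max(orders) - min(orders),
    "iqr": q3 - q1,
    "mean_without_outlier": mean(orders[:-1]),   # drop the £280 order
}
print(summary)
```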
Related concepts
- Central Tendency
- Mean
- Median
- Standard Deviation
- Variance
- Descriptive Statistics