Deciding on differences

Core idea

Statistical hypothesis testing asks one disciplined question: how likely is the observed difference under the assumption that nothing real is going on? Start with the null hypothesis (H0), the conservative claim that there is no effect, no difference, nothing to see. Compute the probability of obtaining data at least as extreme as the data you saw, assuming H0 is true. If that probability is small enough (conventionally below 5 percent, or below 1 percent for stricter standards), reject H0 in favour of the alternative hypothesis (H1). The framework is the statistical analogue of the legal principle of innocent until proven guilty, where the burden of proof is on the prosecution.

Why it matters

Every controlled experiment, every clinical trial, every A/B test on a website ultimately reaches a moment where someone has to decide: is the observed difference real, or is it noise? Without a disciplined framework, that decision degenerates into vibes-based reasoning, motivated optimism, or paralytic skepticism. Hypothesis testing forces explicitness: state the null, state the alternative, state the threshold, then let the data answer. It is one of the great organising achievements of twentieth-century statistics, and it underpins much of the quantitative research you will read.

Mental model

The cleanest mental anchor: criminal trials. The defendant is presumed innocent (the null hypothesis). The prosecution must show evidence so compelling that innocence becomes implausible beyond reasonable doubt. The jury never declares the defendant "innocent" — only "guilty" or "not guilty". Statistical hypothesis testing follows the same logic, with H0 as the presumption and the data as the evidence.

The decision flow

A hypothesis test is a five-step pipeline. Lay out the null and alternative explicitly, choose a significance level before peeking at the data, compute the test statistic, compare to the threshold, and report the decision in plain language.

What the p-value is, and is not

The p-value is probably the most commonly misread statistic in science. It is a conditional probability: it answers "how surprising are data like these, assuming H0 is true?", not "how likely is H0?". A small p-value means that data at least this extreme would be unusual if the null were true, which counts as evidence against the null, but the p-value is never the probability that the null is true.
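
One way to make the conditional nature of the p-value concrete is to simulate it. The sketch below is illustrative only: the coin-flip setup, the 60-heads-in-100 figure, and the function name simulated_p_value are assumptions made for the example, not part of the original text. It generates many datasets in a world where H0 is true and counts how often they are at least as extreme as the observed one.

```python
import random


def simulated_p_value(observed_heads, n_flips, n_sims=20_000, seed=0):
    """Estimate P(heads >= observed_heads | H0: the coin is fair) by simulation."""
    rng = random.Random(seed)
    at_least_as_extreme = 0
    for _ in range(n_sims):
        # Generate one dataset in a world where H0 is true.
        heads = sum(rng.random() < 0.5 for _ in range(n_flips))
        if heads >= observed_heads:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_sims


# Hypothetical observation: 60 heads in 100 flips.
p = simulated_p_value(observed_heads=60, n_flips=100)
print(f"Estimated one-sided p-value: {p:.3f}")
```

Every simulated dataset assumes H0 from the start, so the resulting fraction describes the data given H0. Nothing in the calculation assigns a probability to H0 itself.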

Practical application

  1. State the question crisply. "Does drug X reduce blood pressure more than placebo?" — not "is drug X good?".

  2. Write the null and alternative explicitly. H0: mean BP change under X equals mean BP change under placebo. H1: mean BP change under X differs from mean BP change under placebo.

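  3. Choose the significance level (α) before collecting data. Common conventions are 0.05 for exploratory work and 0.01 for confirmatory or medical decisions. Some fields are far stricter: particle physics uses a five-sigma discovery criterion, which corresponds to a one-sided p-value of roughly 3 in 10 million. Commit before the data arrive.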

  4. Collect the data, compute the test statistic, compute the p-value. Standard tools — t-tests, chi-squared tests, z-tests — have built-in functions in any spreadsheet or statistics package.

  5. Compare and decide. If p < α, reject H0. If not, fail to reject. Report both the p-value and the size of the effect — a tiny effect can be "significant" with enough data, and a large effect can be "non-significant" with too little.

  6. Interpret in plain language. "We have evidence consistent with drug X reducing blood pressure" beats "we reject H0 at the 0.05 level" for any non-statistical audience.
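
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the blood-pressure numbers are invented for the example, the function name permutation_test is hypothetical, and a permutation test is swapped in for the t-test mentioned in step 4 so that the sketch needs only the standard library.

```python
import random
from statistics import mean


def permutation_test(treated, control, n_perm=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns (observed_difference, p_value).
    """
    rng = random.Random(seed)
    observed = mean(treated) - mean(control)
    pooled = list(treated) + list(control)
    n_treated = len(treated)
    extreme = 0
    for _ in range(n_perm):
        # Under H0 the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = mean(pooled[:n_treated]) - mean(pooled[n_treated:])
        if abs(diff) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return observed, (extreme + 1) / (n_perm + 1)


# Step 3: fix the significance level before looking at the data.
ALPHA = 0.01

# Hypothetical changes in systolic blood pressure (mmHg); negative = reduction.
drug_x = [-12, -9, -15, -7, -11, -14, -8, -10, -13, -6]
placebo = [-3, -5, 1, -4, -2, -6, 0, -3, -1, -4]

# Step 4: compute the test statistic and the p-value.
effect, p_value = permutation_test(drug_x, placebo)

# Step 5: compare and decide, reporting the effect size with the p-value.
decision = "reject H0" if p_value < ALPHA else "fail to reject H0"
print(f"difference in mean change: {effect:.1f} mmHg, p = {p_value:.4f}, {decision}")
```

The permutation approach makes the null hypothesis tangible: if the drug did nothing, the group labels would be arbitrary, so reshuffling them shows what differences chance alone produces.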

Example

A coffee chain reformulates one of its drinks and wants to know whether customers prefer the new recipe. They run a blind taste test: 200 customers taste both versions in random order and pick the one they like better. Of the 200, 117 pick the new recipe.

State the null and alternative. H0: customers have no preference, so each version is picked with probability 0.5. H1: customers prefer the new recipe, so it is picked with probability greater than 0.5. This is a one-sided alternative.

Under H0, the number of customers choosing the new recipe follows a binomial distribution with 200 trials and p = 0.5. The expected count is 100, with standard deviation about 7.1 (the square root of 200 × 0.5 × 0.5). The observed 117 sits roughly 2.4 standard deviations above the mean. The corresponding one-sided p-value under the normal approximation is about 0.008 (an exact binomial tail is slightly larger), well under the 0.05 threshold and below the stricter 0.01 threshold as well.
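
The arithmetic in this paragraph can be checked with the Python standard library. The sketch below computes the standard deviation, the z-score, and the one-sided p-value in two ways: with the normal approximation and with an exact binomial tail sum. The variable names are illustrative choices, not part of the original example.

```python
from math import comb, sqrt
from statistics import NormalDist

n, successes, p0 = 200, 117, 0.5

mean_h0 = n * p0                     # expected count under H0
sd_h0 = sqrt(n * p0 * (1 - p0))      # standard deviation under H0
z = (successes - mean_h0) / sd_h0    # distance from the mean in standard deviations

# Normal approximation to the upper tail P(X >= 117 | H0), no continuity correction.
p_normal = 1 - NormalDist().cdf(z)

# Exact binomial upper tail: add up the probabilities of 117, 118, ..., 200.
p_exact = sum(
    comb(n, k) * p0**k * (1 - p0) ** (n - k) for k in range(successes, n + 1)
)

print(f"sd = {sd_h0:.2f}, z = {z:.2f}")
print(f"p (normal approximation) = {p_normal:.4f}")
print(f"p (exact binomial)       = {p_exact:.4f}")
```

The two p-values differ a little because the normal curve is a continuous stand-in for a discrete count. When a result lands close to a threshold, the exact calculation is the safer one to report.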

Conclusion: the data are very unlikely under H0, so reject H0. The chain has evidence customers prefer the new recipe. The effect size is also reportable: 58.5 percent preferred the new recipe versus 50 percent expected by chance — an 8.5-percentage-point lift that is both statistically significant and practically meaningful.
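
An effect size is easier to interpret with a measure of its uncertainty. As a rough sketch, the snippet below computes the observed preference, the lift over chance, and a 95 percent confidence interval using the simple normal-approximation (Wald) formula. The choice of the Wald interval is an assumption made for brevity; other interval methods exist and may behave better for small samples.

```python
from math import sqrt
from statistics import NormalDist

n, successes = 200, 117

p_hat = successes / n                      # observed share preferring the new recipe
lift_points = (p_hat - 0.5) * 100          # lift over chance, in percentage points

# 95 percent Wald interval: estimate plus or minus z times the standard error.
z_crit = NormalDist().inv_cdf(0.975)
standard_error = sqrt(p_hat * (1 - p_hat) / n)
ci_low = p_hat - z_crit * standard_error
ci_high = p_hat + z_crit * standard_error

print(f"preference = {p_hat:.3f}, lift = {lift_points:.1f} points")
print(f"95% CI for the preference: ({ci_low:.3f}, {ci_high:.3f})")
```

If the whole interval sits above 0.5, that agrees with rejecting H0, and the interval's width shows how precisely the preference has been measured.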

Compare with a counterfactual: if only 105 of 200 customers had picked the new recipe, the one-sided p-value would be roughly 0.24, far above 0.05. The chain would fail to reject H0, but this would not mean customers were indifferent. It would mean only that a sample of 200 cannot distinguish a small preference from chance. If a small preference matters to the business, the right next step in that scenario is not "ship the old recipe" but "test with a larger sample".
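
The point about sample size can be made concrete with a rough power calculation. The sketch below uses the normal approximation throughout and assumes, purely for illustration, a true preference of 52.5 percent and a target power of 80 percent; neither figure comes from the original example.

```python
from math import ceil, sqrt
from statistics import NormalDist

std_normal = NormalDist()
n, p0, alpha = 200, 0.5, 0.05

# Counterfactual result: 105 of 200 pick the new recipe.
sd_h0 = sqrt(n * p0 * (1 - p0))
z = (105 - n * p0) / sd_h0
p_value = 1 - std_normal.cdf(z)

# Power: the chance of rejecting H0 if the true preference were p_true.
p_true = 0.525  # hypothetical true preference (an assumption)
z_alpha = std_normal.inv_cdf(1 - alpha)
critical_count = n * p0 + z_alpha * sd_h0      # smallest count that rejects H0
sd_true = sqrt(n * p_true * (1 - p_true))
power = 1 - std_normal.cdf((critical_count - n * p_true) / sd_true)

# Approximate sample size needed to reach 80 percent power for the same effect.
z_beta = std_normal.inv_cdf(0.80)
n_needed = ceil(
    ((z_alpha * sqrt(p0 * (1 - p0)) + z_beta * sqrt(p_true * (1 - p_true)))
     / (p_true - p0)) ** 2
)

print(f"p-value = {p_value:.2f}, power at n = {n}: {power:.2f}")
print(f"approximate n for 80% power: {n_needed}")
```

Under these assumptions the power at 200 customers falls well short of 80 percent, which is what "underpowered" means in practice: a real but small preference would usually go undetected.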
