Concept

Bayes' Theorem

Definition

Bayes' theorem states that P(A | B) = P(B | A) × P(A) / P(B). It expresses the probability of hypothesis A given evidence B in terms of the reverse conditional — the probability of the evidence given the hypothesis — weighted by the prior probability of A and the marginal probability of B.

The theorem is mathematically trivial: it falls out in two lines from the definition of conditional probability, since P(A and B) equals both P(A | B) × P(B) and P(B | A) × P(A). Its conceptual reach, however, is enormous. Bayes' theorem is the formal rule for updating beliefs in the light of evidence, and it underlies modern machine learning, scientific inference, medical diagnosis, legal reasoning, and the entire Bayesian school of statistics. Whenever you have to move from "how likely is this evidence if the hypothesis is true?" to "how likely is the hypothesis now that I have seen the evidence?", Bayes' theorem is the bridge.
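
Written out, the two-line derivation looks like this. It is only a restatement of the sentence above, and it assumes P(A) and P(B) are both nonzero so that the conditional probabilities are defined:

```latex
\begin{align*}
P(A \cap B) &= P(A \mid B)\,P(B) = P(B \mid A)\,P(A) \\
\Longrightarrow \quad P(A \mid B) &= \frac{P(B \mid A)\,P(A)}{P(B)}
\end{align*}
```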

How it works

The formula in plain language

Three ingredients go in and one comes out. The prior P(A) is what you believed about hypothesis A before seeing the evidence. The likelihood P(B | A) is how probable the evidence B would be if A were true. The marginal P(B) is the overall probability of seeing evidence B at all: the likelihood of B under each competing hypothesis, weighted by that hypothesis's prior and summed. With just two hypotheses, A and not A, this is P(B | A) × P(A) + P(B | not A) × P(not A). The output is the posterior P(A | B), your updated belief in A now that B has been observed.

Read the formula as a multiplication and a normalisation. Multiply the prior by the likelihood to get the joint probability of "A and B"; divide by the marginal probability of B so that the answer is a proper conditional probability on the slice of the world where B is true. Everything else in Bayesian inference — sequential updating, model comparison, parameter estimation — is repeated application of this single rule.
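
The multiply-then-normalise reading can be sketched in a few lines of Python. This is a minimal illustration for the two-hypothesis case (A versus not A). The function name `posterior` and its parameter names are invented for this example, and the numbers in the usage lines are illustrative rather than taken from any real problem.

```python
def posterior(prior: float, likelihood: float, likelihood_if_false: float) -> float:
    """Return P(A | B) for a two-hypothesis problem (A versus not A).

    prior               -- P(A)
    likelihood          -- P(B | A)
    likelihood_if_false -- P(B | not A)
    """
    joint = likelihood * prior                              # P(A and B)
    marginal = joint + likelihood_if_false * (1.0 - prior)  # P(B)
    if marginal == 0.0:
        raise ValueError("P(B) is zero, so the posterior is undefined")
    return joint / marginal                                 # normalise


# Illustrative numbers only: a 30% prior, and evidence that is four times
# as likely under A as under not-A.
p = posterior(prior=0.3, likelihood=0.8, likelihood_if_false=0.2)
print(round(p, 3))
```

Evidence that is equally likely under both hypotheses leaves the prior untouched: `posterior(0.3, 0.5, 0.5)` is, up to floating-point rounding, 0.3 again.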

Prior, likelihood, posterior

The three Bayesian quantities each play a distinct role. The prior encodes your starting assumptions, and Bayesian methods are unapologetic that you must have some. The likelihood is set by the structure of the problem — the probability model for how the evidence is generated under each hypothesis — and is usually the least contentious ingredient. The posterior is the answer, and it becomes the new prior the next time evidence arrives. This loop — prior, likelihood, posterior, repeat — is the engine of belief updating and is often called Bayesian updating.

What makes the framework powerful is that priors and posteriors share a common currency: both are probabilities over the same set of hypotheses. There is no awkward translation step between "before the evidence" and "after the evidence". A scientist running a third experiment uses the posterior from the second experiment as the prior for the third. The accumulated weight of all prior data lives inside that single number.
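
The prior, likelihood, posterior loop can be sketched as follows. The three experiments and their likelihoods are hypothetical, as is the starting prior of 0.5, and the helper name `update` is made up for the example. The sketch also checks that updating one experiment at a time agrees with a single update on all the evidence, which holds when the results are conditionally independent given each hypothesis.

```python
import math


def update(prior: float, p_e_given_h: float, p_e_given_not_h: float) -> float:
    """One Bayesian update for hypothesis H versus not-H."""
    numerator = p_e_given_h * prior
    return numerator / (numerator + p_e_given_not_h * (1.0 - prior))


# HYPOTHETICAL experiments: each pair is (P(result | H), P(result | not H)).
experiments = [(0.7, 0.4), (0.9, 0.5), (0.6, 0.2)]

belief = 0.5  # assumed starting prior
history = [belief]
for p_h, p_not_h in experiments:
    belief = update(belief, p_h, p_not_h)  # the posterior becomes the next prior
    history.append(belief)

# One update on all the evidence at once, assuming the results are
# conditionally independent given H and given not-H.
joint_h = math.prod(a for a, _ in experiments)
joint_not_h = math.prod(b for _, b in experiments)
batch = update(0.5, joint_h, joint_not_h)

print([round(b, 3) for b in history], round(batch, 3))
```

The last entry of `history` and `batch` agree up to floating-point rounding, which is the sense in which the running posterior carries the weight of all earlier data.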

A worked example — medical screening

Suppose a disease affects 1 in 100 people in the population at risk, so the prior P(D) = 0.01. A diagnostic test has 95% sensitivity (P(positive | D) = 0.95) and 95% specificity (so P(positive | not D) = 0.05). Your patient tests positive. What is P(D | positive)?

The marginal probability of a positive test is 0.95 × 0.01 + 0.05 × 0.99 = 0.0095 + 0.0495 = 0.059. Plug everything into Bayes' theorem: P(D | positive) = 0.95 × 0.01 / 0.059 ≈ 0.161. About 16% — not 95%. The intuition trap is to read the test's sensitivity as the probability the patient is sick given a positive result; that confuses P(positive | D) with P(D | positive). Because the disease is rare, the small false-positive rate of 5% acting on the 99% of healthy people generates more positives than the true-positive rate acting on the 1% of sick people. The prior dominates. Change the prior to 50% — say the patient already has symptoms — and the same positive test now yields a posterior of about 95%.
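
The arithmetic above is easy to reproduce. The sketch below uses the figures from the example (1% prevalence, 95% sensitivity, 95% specificity); the function name `p_disease_given_positive` is invented for illustration.

```python
def p_disease_given_positive(prevalence: float, sensitivity: float,
                             specificity: float) -> float:
    """Posterior probability of disease after one positive test."""
    true_pos = sensitivity * prevalence                   # P(positive and D)
    false_pos = (1.0 - specificity) * (1.0 - prevalence)  # P(positive and not D)
    return true_pos / (true_pos + false_pos)


rare = p_disease_given_positive(0.01, 0.95, 0.95)         # screening a low-risk group
symptomatic = p_disease_given_positive(0.50, 0.95, 0.95)  # prior raised to 50%
print(f"{rare:.3f} {symptomatic:.3f}")  # prints "0.161 0.950"
```

The same answer in head-counts: out of 10,000 people at risk, 100 have the disease and 95 of them test positive, while 495 of the 9,900 healthy people also test positive, so 95 of the 590 positives (about 16%) are true cases.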

A worked example — kangaroos and Australia

Priest's neat illustration of inverse probability uses a wild kangaroo. P(Australia | wild kangaroo) is close to 1: almost every wild kangaroo on Earth lives there. But P(wild kangaroo | Australia), the chance that a creature picked at random in Australia turns out to be a wild kangaroo, is tiny. The two conditionals point in opposite directions and Bayes' theorem is the bridge: P(wild kangaroo | Australia) = P(Australia | wild kangaroo) × P(wild kangaroo) / P(Australia). The high inverse conditional is dragged back down by the very small prior probability that any given creature is a wild kangaroo. The example is silly on purpose; the same structure is hidden inside every serious confusion between P(evidence | hypothesis) and P(hypothesis | evidence).

The prosecutor's fallacy

The most dangerous real-world version of confusing a conditional with its inverse turns up in courtrooms. A forensic expert testifies that the probability of a DNA match given an innocent defendant is one in a million. A jury hears this as the probability that the defendant is innocent given the match; that is the prosecutor's fallacy. The two are linked by Bayes' theorem, and the gap between them is the prior. If the defendant was picked out of a database of ten million people on the strength of the match alone, the prior probability of guilt was one in ten million. About ten innocent people in that database would be expected to match by chance, so even if the true culprit is certain to match, the posterior probability of guilt is only about 1 in 11, or roughly 9%. The "one in a million" evidence does not mean "one in a million chance of innocence". Bayes' theorem makes the correct calculation routine; ignoring it has cost real people their freedom.
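
The courtroom numbers can be checked with exact fractions. This is a sketch under two stated assumptions that the text does not spell out: the guilty person is certain to match, and exactly one person in the database is guilty. The variable names are invented for the example.

```python
from fractions import Fraction

prior_guilt = Fraction(1, 10_000_000)         # one suspect among ten million
p_match_if_innocent = Fraction(1, 1_000_000)  # the expert's "one in a million"
p_match_if_guilty = Fraction(1)               # ASSUMPTION: a guilty person always matches

numerator = p_match_if_guilty * prior_guilt                    # P(match and guilty)
p_match = numerator + p_match_if_innocent * (1 - prior_guilt)  # P(match)
posterior_guilt = numerator / p_match                          # P(guilty | match)

print(posterior_guilt, float(posterior_guilt))
```

The exact posterior is 1000000/10999999, a little over 9%, which is very far from the "999,999 in a million" a jury might hear.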

Why priors are unavoidable — the Argument to Design

Priest uses Bayes' theorem to deflate a famous theological argument. The Argument to Design claims that the ordered universe o is evidence for a creator g because P(o | g) greatly exceeds P(o | not g). Granted. But the argument actually requires P(g | o) to exceed P(not g | o), and by Bayes' theorem the likelihood inequality guarantees that only if the prior P(g) is at least as large as P(not g); with a smaller prior, the conclusion follows only if the likelihood ratio is large enough to outweigh the prior odds against g. There is no neutral reason to grant a prior that favourable. Among the vast space of possible universes very few are significantly ordered, which is precisely the observation the argument tries to exploit, but the same scarcity makes a designer the less probable hypothesis a priori. The argument's apparent force comes entirely from confusing P(o | g) with P(g | o). The point generalises: any inductive argument that ignores its prior is hiding a thumb on the scale.
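
The role of the prior is easiest to see in the odds form of Bayes' theorem, obtained by dividing the theorem for g by the theorem for not g so that P(o) cancels. The notation follows the paragraph above, and the display assumes P(o), P(g), and P(not g) are all nonzero:

```latex
\[
\frac{P(g \mid o)}{P(\lnot g \mid o)}
  = \frac{P(o \mid g)}{P(o \mid \lnot g)} \times \frac{P(g)}{P(\lnot g)}
\]
```

The first factor on the right is the likelihood ratio the argument appeals to; the second is the prior odds, about which the argument says nothing. The posterior odds exceed 1 only when the product of the two does, so a likelihood ratio above 1 settles nothing until the prior odds are supplied.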

Inverse reasoning is not doomed, though. Watch two roulette wheels, A and B, one of which is secretly biased toward red. With no reason to favour one wheel over the other, each gets prior 1/2 of being the biased one, and Bayes' theorem then lets a string of red spins on wheel A tip the posterior firmly toward "wheel A is the biased one". The method works exactly when the priors can be defended.
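
A rough sketch of the roulette case, under assumptions the text does not supply: the fair wheel is a European one (18 red pockets out of 37), the biased wheel lands on red with probability 3/5, and all the observed spins come from the first wheel, labelled A here. The function name `p_a_is_biased` is invented for the example.

```python
from fractions import Fraction

P_RED_FAIR = Fraction(18, 37)  # European wheel: 18 red pockets out of 37
P_RED_BIASED = Fraction(3, 5)  # ASSUMPTION: illustrative size of the bias


def p_a_is_biased(spins: str, prior: Fraction = Fraction(1, 2)) -> Fraction:
    """Posterior that wheel A is the biased one, given spins seen on wheel A.

    `spins` is a string of 'R' (red) and 'N' (not red) outcomes.
    """
    belief = prior
    for s in spins:
        like_biased = P_RED_BIASED if s == "R" else 1 - P_RED_BIASED
        like_fair = P_RED_FAIR if s == "R" else 1 - P_RED_FAIR
        numerator = like_biased * belief
        belief = numerator / (numerator + like_fair * (1 - belief))
    return belief


for n in (0, 5, 20, 50):
    print(n, round(float(p_a_is_biased("R" * n)), 3))
```

With these assumed numbers each red spin multiplies the odds in favour of "A is biased" by (3/5) / (18/37), a little over 1.23, so the posterior starts at 1/2 and climbs steadily toward 1 as reds accumulate, while a non-red spin pulls it back down.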

How conditional probability composes

Haigh's Probability: A Very Short Introduction puts Bayes' theorem in the wider toolkit of probabilistic bookkeeping. Probability is not a single number attached to an event in isolation; it is a number attached to an event in a context, and the context shifts as new information arrives. The Multiplication Law, the Addition Law, independence, and conditional probability are the tools that track those shifts. Bayes' theorem is the device that converts P(B | A) into P(A | B), letting you keep your books in order as evidence rolls in.

A central pitfall is treating dependent events as if they were independent. Half of all students may be female and one in five students may study engineering, but the fraction of students who are female engineers is much smaller than the naive product 1/10 suggests, because gender and choice of degree are correlated. Multiplying probabilities as if the events were independent, without checking that they are, has produced wrongful convictions, misdiagnoses, and bad policy. Bayes' theorem is one half of the cure; the other half is remembering to ask "conditional on what?" before quoting any number.
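
A minimal sketch of the independence pitfall, using made-up head-counts in which half of all students are female and one in five study engineering. The counts are hypothetical and chosen only so that the two attributes are correlated; the point is the gap between the naive product and the Multiplication Law.

```python
# HYPOTHETICAL head-counts for 1,000 students, chosen so that half are
# female and one in five study engineering.
students = {
    ("female", "engineering"): 40,
    ("female", "other"): 460,
    ("male", "engineering"): 160,
    ("male", "other"): 340,
}
total = sum(students.values())

p_female = sum(n for (sex, _), n in students.items() if sex == "female") / total
p_eng = sum(n for (_, subject), n in students.items() if subject == "engineering") / total
p_female_and_eng = students[("female", "engineering")] / total

naive = p_female * p_eng                       # valid only under independence
p_female_given_eng = p_female_and_eng / p_eng  # conditional probability
correct = p_female_given_eng * p_eng           # Multiplication Law

print(naive, p_female_and_eng, p_female_given_eng)
```

In this invented cohort the naive product is 0.1, but only 4% of students are female engineers, because P(female | engineering) is 0.2 rather than the unconditional 0.5. Multiplying by the conditional probability recovers the right joint figure.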

Bayesian vs frequentist framing

Bayes' theorem is mathematically uncontroversial — every statistician of every school accepts it as a consequence of the probability axioms. The split between Bayesian and frequentist statistics is about what kinds of things probabilities apply to. Frequentists reserve probability for long-run frequencies of repeatable events; the probability that this particular hypothesis is true is not, on their account, a meaningful quantity. Bayesians interpret probability as a degree of belief, so it is perfectly sensible to assign a probability to a one-off hypothesis and to update that probability with Bayes' theorem.

The practical consequence is that Bayesians use the theorem ubiquitously — for parameter estimation, model selection, and decision under uncertainty — while classical frequentist methods (confidence intervals, p-values, hypothesis tests) avoid putting probabilities on hypotheses and instead reason about the behaviour of estimators across hypothetical repetitions of the experiment. Modern data analysis increasingly draws from both traditions; the theorem itself is neutral. Where it does favour the Bayesian side is in applied inference: any time you want a direct probability that a defendant is guilty, a patient is sick, or a model is correct, only the Bayesian reading delivers it.
