Definition
Probability is a numerical measure, on a scale from 0 to 1, of how likely an event is to occur — 0 for impossible, 1 for certain, and every value in between expressing a graded degree of expectation. It is the formal grammar of uncertainty: a way to give chance a number so that statements about the unknown can be combined, compared, and reasoned about with the same rigor as statements about the known.
The same number can be reached three different ways — by counting symmetries on a fair die, by measuring long-run frequencies from data, or by stating a personal degree of belief — and the three interpretations disagree about what the number means while agreeing on how numbers must combine. Probability is therefore both a mathematical object (a measure on a sample space, obeying the Kolmogorov axioms) and a philosophical one (a contested account of what it means for the future to be uncertain).
Why it matters
How it works
The axiomatic core
Every probability problem starts with a sample space, the complete catalog of distinct outcomes the situation can produce, and a set of events, which are subsets of that space. A probability function assigns a number between 0 and 1 to each event, subject to three rules from Kolmogorov: probabilities are non-negative, the entire sample space has probability 1, and the probability of a union of pairwise disjoint events (finitely or countably many) is the sum of their probabilities. The field's standard formulas follow from those three rules, together with the definition of conditional probability: the complement rule (pr(not A) = 1 - pr(A)), the inclusion-exclusion rule for disjunctions (pr(A or B) = pr(A) + pr(B) - pr(A and B)), and the multiplication rule for conjunctions (pr(A and B) = pr(A | B) × pr(B)).
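As a minimal sketch of the axioms and the rules derived from them, the following Python checks them on a small finite sample space. The fair six-sided die, the equal weighting of outcomes, and the names SAMPLE_SPACE, pr, even, and high are assumptions made for illustration, not part of the text above.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (an illustrative assumption).
SAMPLE_SPACE = frozenset(range(1, 7))

def pr(event):
    """Probability of an event (a subset of SAMPLE_SPACE), all outcomes equally weighted."""
    event = frozenset(event)
    if not event <= SAMPLE_SPACE:
        raise ValueError("an event must be a subset of the sample space")
    return Fraction(len(event), len(SAMPLE_SPACE))

even = {2, 4, 6}
high = {5, 6}

# The three axioms: non-negativity, total probability 1, additivity for disjoint events.
assert pr(set()) >= 0
assert pr(SAMPLE_SPACE) == 1
assert pr({1} | {2}) == pr({1}) + pr({2})

# Derived rules: complement and inclusion-exclusion.
assert pr(SAMPLE_SPACE - even) == 1 - pr(even)
assert pr(even | high) == pr(even) + pr(high) - pr(even & high)

print(pr(even | high))  # 2/3
```

Exact fractions are used instead of floats so that the equalities hold exactly and are not subject to rounding.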
This minimal scaffolding is what unifies the discipline. Coin flips, radioactive decay, queues at a hospital, fluctuations in a stock price, and the reliability of a bridge component all live in different sample spaces, but they share the same algebra. As Haigh puts it in his Very Short Introduction, naming the sample space before reaching for a fraction is the single most valuable habit a student of probability can install — most "trick questions" in the field are not really questions about counting, they are questions about which sample space the wording implicitly assumes.
Three interpretations, one algebra
Probability is unusual among mathematical fields in that its central object has three competing philosophical interpretations, none of which has won the argument. Classical probability, the oldest, counts symmetric outcomes — favorable cases over total cases — and applies cleanly to dice, decks of cards, and lotteries where the sample space has obvious symmetries. Frequentist probability defines pr(A) as the limit of the relative frequency of A in a long sequence of repeatable trials; it is the interpretation behind clinical trial results, manufacturing defect rates, and insurance actuarial tables. Subjective Bayesian probability treats pr(A) as a personal degree of belief, constrained only by the axioms and by updating in light of evidence; it is the interpretation behind a weather forecaster's "30% chance of rain" and a court's assessment of a defendant's guilt.
The three interpretations frequently produce the same numerical answer, but they disagree about what that number means. Suppose a startup CEO says "I think there's an 80% chance this product succeeds," a CFO replies "historical launch data puts us at 40%," and a junior analyst calculates "by symmetry there are four equally likely market reactions and three count as success, so 75%." These are answers to three different questions, not three competing answers to one question. The right response is not to pick a winner but to triangulate: the gap between the CEO's subjective credence and the CFO's frequentist base rate is the value of the private information the CEO is claiming to possess, and if that information cannot be articulated, the credence should regress toward the base rate.
Conditional probability and inductive validity
The decisive notion for reasoning is conditional probability, written pr(A | B) — the probability of A given that B holds. It is computed by restricting attention to the cases where B is true, then asking what fraction of those also have A: pr(A | B) = pr(A and B) / pr(B), undefined when pr(B) is zero. Conditional probability is what lets evidence move belief — it is the formal machinery behind every diagnostic test, every legal inference, and every Bayesian update.
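The restrict-then-count definition can be sketched in Python on a small example. The two-dice sample space and the events total_is_eight and first_is_even are illustrative assumptions; pr_given returns None where the text says the quantity is undefined.

```python
from fractions import Fraction
from itertools import product

# All 36 ordered outcomes of rolling two fair dice (illustrative sample space).
OUTCOMES = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability that the predicate `event` holds, all outcomes equally likely."""
    return Fraction(sum(1 for o in OUTCOMES if event(o)), len(OUTCOMES))

def pr_given(a, b):
    """pr(A | B) = pr(A and B) / pr(B); None when pr(B) is zero (undefined)."""
    p_b = pr(b)
    if p_b == 0:
        return None
    return pr(lambda o: a(o) and b(o)) / p_b

def total_is_eight(o):
    return o[0] + o[1] == 8

def first_is_even(o):
    return o[0] % 2 == 0

print(pr(total_is_eight))                       # 5/36
print(pr_given(total_is_eight, first_is_even))  # 1/6
```

Learning that the first die is even moves the probability of a total of eight from 5/36 to 1/6, a small instance of evidence moving belief.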
In Graham Priest's Logic: A Very Short Introduction, conditional probability is also what rescues inductive reasoning from its weakness relative to deduction. A deductive inference guarantees its conclusion if the premises hold; an inductive inference cannot. But Priest gives induction a precise standard: an inference is inductively valid just when the conditional probability of the conclusion given the premises is greater than the conditional probability of its negation given the same premises. Sherlock Holmes's famous "deductions" are really inductions in this sense — a worn cuff makes "writes a lot for a living" more probable than not, conditional on the cuff-wear, even though it does not make it certain. Probability is the tool that turns Holmes's pattern-matching into something with a rule attached.
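The criterion as stated above can be written as a one-line comparison. The sketch below assumes the conditional probability is estimated from counts; the function name inductively_valid and the counts themselves are invented purely for illustration and are not taken from Priest or from any data set.

```python
from fractions import Fraction

def inductively_valid(count_premise_and_conclusion, count_premise):
    """Criterion as stated in the text: pr(C | P) > pr(not C | P)."""
    if count_premise == 0:
        raise ValueError("pr(C | P) is undefined when the premise never holds")
    p_conclusion = Fraction(count_premise_and_conclusion, count_premise)
    p_negation = 1 - p_conclusion
    return p_conclusion > p_negation

# Hypothetical counts: of 200 people with a worn cuff,
# 130 write a lot for a living in the first case, 90 in the second.
print(inductively_valid(130, 200))  # True
print(inductively_valid(90, 200))   # False
```

Because pr(C | P) and pr(not C | P) sum to 1, the test is equivalent to asking whether pr(C | P) exceeds one half.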
The reference-class problem
Every probability statement is implicitly relative to a class of cases, and the choice of class can change the number dramatically. A screening test reported as "90% accurate" is 90% accurate within some class (everyone screened, the symptomatic, a particular age band), and each class yields a different probability that a positive result means real disease. Priest's discussion of probability pushes this observation to its limit. The narrowest, most specific reference class for any individual is the class containing only that individual. But then either the person has the condition or they do not, so the probability collapses to 1 or 0, and the inference can no longer be used to discover whether the condition is present. Taken to that extreme, the reference-class problem threatens to make inductive validity useless.
Practical probability lives with this tension by accepting that the reference class is a modeling choice, not a fact. The discipline is to state the class out loud, defend the choice against alternatives, and treat probability statements as conditional on that choice. A probability without a named reference class is a number without units.
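A rough sketch of how much the reference class matters for the screening example: the code below reads "90% accurate" as 90% sensitivity and 90% specificity, which is an assumption, and the prevalence figures for each class are hypothetical numbers chosen only to show the effect.

```python
def positive_predictive_value(prevalence, sensitivity=0.9, specificity=0.9):
    """pr(disease | positive test) within a reference class of given prevalence."""
    true_positives = prevalence * sensitivity
    false_positives = (1 - prevalence) * (1 - specificity)
    return true_positives / (true_positives + false_positives)

# Hypothetical prevalence of the condition within each reference class.
reference_classes = {
    "everyone screened": 0.01,
    "symptomatic patients": 0.30,
}

for name, prevalence in reference_classes.items():
    print(f"{name}: {positive_predictive_value(prevalence):.3f}")
# everyone screened: 0.083
# symptomatic patients: 0.794
```

The same test and the same positive result give roughly 8% in one class and roughly 79% in the other. Setting the prevalence to 0 or 1, the single-individual class, returns exactly 0 or 1, which is the collapse described above.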
Combinatorics and the structure of finite sample spaces
For finite sample spaces with equally likely outcomes, classical probability reduces to counting. The two combinatorial workhorses are permutations (where order matters) and combinations (where it does not). The 1,326 distinct two-card hands in a 52-card deck come from C(52, 2) = 52 × 51 / 2. The 64 "blackjack" hands come from 4 aces × 16 ten-valued cards (tens, jacks, queens, and kings). The probability of being dealt a blackjack is therefore 64 / 1326, just under 5%. Every classical-probability answer is a fraction whose numerator and denominator are both the results of counting.
This is why combinatorics and probability are usually taught together: once the sample space is specified and its outcomes are equally likely, the calculation reduces to enumeration. The bookkeeping can be elaborate (multi-stage experiments, conditional sub-spaces, sampling with and without replacement), but the underlying move is always the same: count the favorable outcomes, count the total, divide.
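The blackjack count can be checked by brute-force enumeration, which makes the count-favorable, count-total, divide move explicit. This is a sketch; the deck representation and the names RANKS, SUITS, TEN_VALUED, and is_blackjack are choices made for illustration.

```python
from fractions import Fraction
from itertools import combinations
from math import comb

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["spades", "hearts", "diamonds", "clubs"]
TEN_VALUED = {"10", "J", "Q", "K"}

deck = [(rank, suit) for rank in RANKS for suit in SUITS]

def is_blackjack(hand):
    """True when a two-card hand is one ace plus one ten-valued card."""
    (rank_a, _), (rank_b, _) = hand
    return (rank_a == "A" and rank_b in TEN_VALUED) or (
        rank_b == "A" and rank_a in TEN_VALUED
    )

# Count the total, count the favorable, divide.
hands = list(combinations(deck, 2))
favorable = sum(1 for hand in hands if is_blackjack(hand))
p_blackjack = Fraction(favorable, len(hands))

# The enumeration agrees with the closed-form counts.
assert len(hands) == comb(52, 2)
assert favorable == 4 * 16

print(len(hands), favorable)                     # 1326 64
print(p_blackjack, f"{float(p_blackjack):.4f}")  # 32/663 0.0483
```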
Probability as a quantitative lens for general reasoning
In The Great Mental Models, Volume 3, probability appears not as a calculation tool but as one of nine quantitative shapes the mind should learn to recognize. Alongside compounding, regression to the mean, the destructive power of multiplying by zero, and the topography of distributions, probability is treated as an intuition to be installed — a default lens for situations where the numbers are messy, missing, or unreliable.
The argument is that most reasoning errors are quantitative errors wearing other disguises. When you confuse an anecdote for evidence, you are misusing the implicit sample. When you mistake a hot streak for permanent improvement, you are forgetting regression to the mean. When you assume that ten years of preparation will pay back linearly, you are missing compounding. People who appear to be unusually good with numbers are rarely faster at arithmetic — they are faster at recognizing which shape a situation has, and therefore which quantitative intuition applies. Probability, in this framing, is less about computing the chance of an event than about noticing when a problem is governed by chance in the first place.
Distributions, expectations, and the law of large numbers
Once probabilities are attached to outcomes, random variables turn outcomes into numbers and distributions describe how those numbers spread. The headline distributions (binomial for counts of successes in a fixed number of independent trials, Poisson for counts of rare events in a fixed interval of time, normal for sums of many small independent contributions) are the workhorses behind nearly every applied use of probability, from quality control to clinical trials to financial risk models. The expected value of a random variable is the average of its possible values, each weighted by its probability; the variance is the expected squared deviation from that average, and so measures how spread out the variable is around it.
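Expected value and variance can be computed directly from a distribution's probabilities. The sketch below does so for a binomial distribution and compares the results with the standard closed forms n·p and n·p·(1 - p); the parameters n = 10 and p = 1/4 and the function names are arbitrary illustrative choices.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(n, p):
    """Map each success count k to its probability under Binomial(n, p)."""
    return {k: comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)}

def expectation(pmf):
    """Probability-weighted average of the values."""
    return sum(value * weight for value, weight in pmf.items())

def variance(pmf):
    """Expected squared deviation from the expectation."""
    mu = expectation(pmf)
    return sum((value - mu) ** 2 * weight for value, weight in pmf.items())

n, p = 10, Fraction(1, 4)
pmf = binomial_pmf(n, p)

assert sum(pmf.values()) == 1  # the probabilities cover the whole sample space
print(expectation(pmf))        # 5/2
print(variance(pmf))           # 15/8
```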
The law of large numbers says that as the number of independent, identically distributed trials grows, the observed average converges to the expected value (provided that expected value exists), which is why insurance works, why casinos profit from house edges over time, and why one experiment is rarely enough to settle a scientific question. The central limit theorem says that the sum of many independent random contributions, each with finite variance and none dominating the rest, is approximately normally distributed, which is why the bell curve appears everywhere from measurement errors to heights of trees to test scores. Together these two results form the bridge from probability (reasoning forward from models to data) to statistics (reasoning backward from data to models).
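Both results can be seen in a small simulation. This is a sketch under stated assumptions: fair six-sided dice, Python's pseudo-random generator with fixed seeds for repeatability, and sample sizes picked arbitrarily. The function names are hypothetical.

```python
import random

def sample_mean_of_rolls(n_rolls, seed=0):
    """Average of n_rolls simulated fair-die rolls; the expected value is 3.5."""
    rng = random.Random(seed)
    return sum(rng.randint(1, 6) for _ in range(n_rolls)) / n_rolls

def share_within_one_sd(n_dice=50, n_samples=2000, seed=1):
    """Share of simulated sums of n_dice rolls within one standard deviation of the mean."""
    rng = random.Random(seed)
    mu = n_dice * 3.5                 # expected value of the sum
    sd = (n_dice * 35 / 12) ** 0.5    # one die has variance 35/12
    sums = [
        sum(rng.randint(1, 6) for _ in range(n_dice)) for _ in range(n_samples)
    ]
    return sum(abs(s - mu) <= sd for s in sums) / n_samples

# Law of large numbers: the sample mean settles near 3.5 as trials accumulate.
for n in (10, 1_000, 100_000):
    print(n, sample_mean_of_rolls(n))

# Central limit theorem: a normal distribution puts about 68% of its mass
# within one standard deviation, and the simulated share lands in that region.
print(share_within_one_sd())
```

The exact printed values depend on the seeds. With these settings the simulated share tends to sit a little above 68%, because dice sums are whole numbers and the normal curve is only an approximation to them.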