3 Historical sketch

Core idea

Probability did not arrive fully formed. It was built up over four centuries by gamblers, actuaries, astronomers, biologists, and finally pure mathematicians, each pulling the subject toward their own problem. The historical arc has two intertwined threads. One is technical: counting outcomes, then summing infinite series, then proving limit theorems, then axiomatising the whole edifice on measure theory in 1933. The other is philosophical: what does the number we call a probability actually mean — a symmetry of physical objects, a long-run frequency, a personal degree of belief, or an abstract measure on a sample space? The same calculation can be defended from any of these positions, and the history of probability is, just beneath the surface, the history of an unresolved argument about what the calculation is about.

Author's framing: The same formula can be read four different ways. The history of probability is the slow process of building the formula while still arguing about what it refers to.

Why it matters

Knowing the history is not decoration. It explains why probability has multiple legitimate interpretations and why working statisticians switch between them depending on the problem. When a frequentist quotes a confidence interval and a Bayesian quotes a credible interval, they are not making the same statement — and the difference traces directly back to whether you stand with Bernoulli (long-run frequencies are the real thing) or with Bayes and Laplace (probabilities are degrees of belief, updatable by evidence). Cardano's gambling questions, Pascal and Fermat's letters, Laplace's monumental synthesis, and Kolmogorov's axioms are not antiquarian trivia; they are the layers underneath every probability statement you read today.

A four-act story

Early developments: gamblers learn to count (1500s-1650s)

The first questions about chance came from people with money on the table. Girolamo Cardano, a 16th-century Italian polymath, gambler, and physician, wrote Liber de Ludo Aleae (Book on Games of Chance), the first systematic treatment of dice problems, though it was not published until 1663, long after his death. In the early 1600s a group of Florentine dice players ran into a puzzle that perfectly captures the era's confusion. A popular game scored the total of three ordinary dice. There appear to be six combinations summing to nine (6+2+1, 5+3+1, 4+4+1, 4+3+2, 3+3+3, 5+2+2) and six summing to ten (6+3+1, 6+2+2, 5+4+1, 5+3+2, 4+4+2, 4+3+3), so a total of ten ought to occur as often as nine. Players noticed it did not. They asked Galileo.

Galileo's answer was a counting lesson that still echoes today. Imagine the dice coloured red, green, and blue. Then 3+3+3 is a single ordered outcome, (3,3,3); but 5+2+2 is three (the 5 can fall on any die), and 6+2+1 is six (every permutation of three distinct numbers). Count ordered outcomes, not abstract combinations, and nine can be made in 25 ways while ten can be made in 27, out of 6 × 6 × 6 = 216 equally likely outcomes. Ten genuinely is more probable than nine. The Florentine gamblers had discovered the prerequisite for all of probability: you must count the right thing.
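Galileo's counting argument can be checked by brute force. The sketch below enumerates every ordered outcome of three six-sided dice and tallies the totals; the function name `ordered_counts` is an illustrative choice, not something from the historical record.

```python
from collections import Counter
from itertools import product


def ordered_counts(num_dice=3, faces=6):
    """Tally ordered outcomes (one slot per coloured die) by their total."""
    rolls = product(range(1, faces + 1), repeat=num_dice)
    return Counter(sum(roll) for roll in rolls)


counts = ordered_counts()
total_outcomes = sum(counts.values())  # 6 * 6 * 6 equally likely outcomes

ways_nine = counts[9]
ways_ten = counts[10]
print(f"P(9)  = {ways_nine}/{total_outcomes}")
print(f"P(10) = {ways_ten}/{total_outcomes}")
```

Enumerating ordered triples is exactly the "coloured dice" device: each triple is equally likely, so the tallies can be compared directly.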

Classical period: Pascal, Fermat, and the Bernoullis (1654-1812)

In the summer of 1654, Pascal in Paris and Fermat in Toulouse exchanged a now-famous series of letters on the problem of points: Smith and Jones agree to play until one wins three games; when their match is interrupted with Smith leading 2-1, how should the prize be split? Pascal and Fermat independently arrived at the same recipe: split the prize in the ratio of the probabilities each player would win if play continued under fair conditions. Here Jones can win only by taking the next two games, which happens with probability 1/2 × 1/2 = 1/4, so Smith should get three-quarters. The technique generalised to any score, any target. Probability had a method.
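The recipe generalises naturally as a recursion on the number of games each player still needs. The following is a minimal sketch in modern notation, not a reconstruction of either correspondent's own method; `win_probability` and its parameters are hypothetical names, and each game is assumed fair and independent.

```python
from fractions import Fraction
from functools import lru_cache


@lru_cache(maxsize=None)
def win_probability(a_needs, b_needs, p=Fraction(1, 2)):
    """Chance that player A wins the match if play continues.

    a_needs, b_needs: games each player still has to win.
    p: assumed chance that A wins any single game (1/2 for a fair game).
    """
    if a_needs == 0:
        return Fraction(1)
    if b_needs == 0:
        return Fraction(0)
    # Condition on the next game: A wins it with chance p, B otherwise.
    return (p * win_probability(a_needs - 1, b_needs, p)
            + (1 - p) * win_probability(a_needs, b_needs - 1, p))


# Smith leads 2-1 in a race to three: Smith needs 1 game, Jones needs 2.
smith_share = win_probability(1, 2)
jones_share = 1 - smith_share
print(smith_share, jones_share)
```

Using exact fractions keeps the shares readable as a ratio, which is how the prize would be divided.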

Pascal also produced the first famous application of subjective probability — his wager on God's existence. The two possibilities ("God is" / "God is not") are not repeatable trials with countable outcomes; the probabilities have to come from somewhere else, namely the bettor's personal judgement. Pascal pioneered both interpretive traditions in the same decade.

Three generations of the Bernoulli family in Basel pushed the subject forward through ferocious sibling and cousin rivalry. The decisive result came from Jacob Bernoulli's posthumous Ars Conjectandi (1713) — the Law of Large Numbers. Given a repeatable experiment with a fixed objective probability p of success, the observed proportion of successes over many trials converges to p in a precise and quantifiable sense. For any tolerance you choose (say, within 2% of the true value) and any odds you want against being outside it (say, a thousand-to-one), enough trials will deliver the guarantee. The Law of Large Numbers is the formal bridge between theory and observation.
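To make "enough trials will deliver the guarantee" concrete, one can compute a sufficient number of trials. The sketch below uses Chebyshev's inequality, a later and simpler tool than Bernoulli's own argument, and reads "within 2%" as an absolute tolerance of 0.02 on the proportion; the function name is hypothetical.

```python
from fractions import Fraction
from math import ceil


def chebyshev_trials(tolerance, failure_odds):
    """Trials sufficient for the observed proportion to land within
    `tolerance` of p, except with chance at most 1 / (failure_odds + 1).

    Chebyshev gives P(|X/n - p| >= eps) <= p(1 - p) / (n * eps**2),
    and p(1 - p) <= 1/4 for every p, so the bound needs no knowledge of p.
    """
    delta = Fraction(1, failure_odds + 1)  # "a thousand to one" -> 1/1001
    eps = Fraction(tolerance)
    return ceil(1 / (4 * eps**2 * delta))


n_trials = chebyshev_trials(Fraction(2, 100), 1000)
print(n_trials)
```

The bound is deliberately crude: it holds for any p, so the number it returns is much larger than sharper arguments require. What matters is that a finite number exists for every choice of tolerance and odds.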

Abraham de Moivre, a French Huguenot exiled to London, extended Bernoulli's work in his Doctrine of Chances (1718, expanded 1738). He showed that the typical deviation of an observed count from its expectation scales as the square root of the number of trials, so quadrupling the trials only doubles the spread. Isaac Newton himself reportedly sent enquirers to de Moivre. The expanded edition also presented the normal curve as the limiting shape of the binomial distribution, giving probability its first general-purpose continuous distribution.
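Both of de Moivre's observations can be illustrated numerically. This sketch, written in modern notation with hypothetical helper names, computes the binomial standard deviation for two sample sizes and compares the exact binomial probability at the centre with the normal density there.

```python
from math import comb, exp, pi, sqrt


def binomial_sd(n, p):
    """Standard deviation of the number of successes in n trials."""
    return sqrt(n * p * (1 - p))


def binomial_pmf(n, k, p):
    """Exact probability of k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def normal_density(x, mean, sd):
    """Density of the normal curve with the given mean and sd."""
    return exp(-((x - mean) ** 2) / (2 * sd**2)) / (sd * sqrt(2 * pi))


n, p = 100, 0.5
mean, sd = n * p, binomial_sd(n, p)

exact = binomial_pmf(n, 50, p)          # exact chance of exactly 50 heads
approx = normal_density(50, mean, sd)   # normal-curve value at the same point
print(sd, binomial_sd(4 * n, p))        # quadrupling n doubles the spread
print(exact, approx)
```

For a fair coin and 100 tosses the two values at the centre already agree to about three decimal places, and the agreement improves as the number of trials grows.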

Pierre-Simon Laplace then carried the classical programme to its peak. In work running from 1774 to 1812 he built up the theory that he synthesised in Théorie analytique des probabilités (1812), applying inverse probability to demographic data (he concluded from Paris birth records that the chance of a male birth slightly exceeded one half, beyond reasonable doubt) and to celestial mechanics. Laplace's classical definition, that probability is the ratio of favourable cases to the total number of equally likely cases, became the canonical textbook formulation for over a century.
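The style of inverse-probability reasoning behind the birth-rate conclusion can be sketched as follows. Assuming a uniform prior on the male-birth rate p, the posterior after observing the counts is a Beta distribution, and for whole-number counts its tail above one half reduces to a binomial sum. The counts used here are made up for illustration and are not Laplace's Paris data; the function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def prob_rate_exceeds_half(boys, girls):
    """P(p > 1/2 | data) under a uniform prior on the male-birth rate p.

    The posterior is Beta(boys + 1, girls + 1). For integer counts its
    tail above 1/2 equals P(Bin(n + 1, 1/2) <= boys), with n = boys + girls.
    """
    n = boys + girls
    favourable = sum(comb(n + 1, k) for k in range(boys + 1))
    return Fraction(favourable, 2 ** (n + 1))


# Hypothetical counts, chosen only to show the calculation.
posterior = prob_rate_exceeds_half(520, 480)
print(float(posterior))
```

With hundreds of thousands of births and a small but steady excess of boys, the same calculation pushes the posterior probability extremely close to 1, which is the sense of "beyond reasonable doubt" above.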

Frequentist rise: from biology to inference (1860s-1930s)

The 19th century shifted the centre of gravity from gambling and astronomy to biology and social statistics. Francis Galton, a Victorian polymath and half-cousin of Darwin, studied heredity and discovered regression to the mean while measuring the heights of parents and offspring. To quantify the strength of association, he and his protégé Karl Pearson developed correlation. Pearson founded the first modern statistics department (University College London, 1911) and introduced the chi-squared test for goodness of fit. R.A. Fisher, working at Rothamsted agricultural research station, built up the machinery of modern inference: maximum likelihood estimation, analysis of variance, the design of experiments, and the concept of statistical significance. By the 1930s, Fisher's Statistical Methods for Research Workers (1925) had turned probability into the engine of an applied discipline.

This new tradition was emphatically frequentist. Probability meant long-run relative frequency, and personal degrees of belief had no place in scientific inference. Fisher in particular regarded Bayes-style use of prior probabilities as illegitimate when the prior could not itself be derived from data. The Bayesian school did not disappear — Harold Jeffreys defended it through the same decades — but it was a minority view by the 1930s.

Modern foundations: Kolmogorov axiomatises (1933)

The classical and frequentist developments left probability mathematically uneasy. What is a probability, formally? Definitions in terms of "equally likely cases" assume the thing they want to define. Definitions in terms of long-run frequency depend on limits that need justifying. In 1933 the Russian mathematician Andrey Kolmogorov ended the matter — at least technically — by publishing Grundbegriffe der Wahrscheinlichkeitsrechnung. He set out three axioms over a sample space and a sigma-algebra of events:

Every event has a probability between 0 and 1.
The whole sample space has probability 1.
The probability of a countable disjoint union of events equals the sum of their probabilities.

That is all. From these three axioms, the theorems used in classical, frequentist, and Bayesian probability alike can be derived. The axioms say nothing about what probability means; they specify only the rules it must obey. The interpretive debate continued, but the mathematical foundation was finally watertight.
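In symbols, the three axioms above can be written as follows. The notation is the conventional modern one and is assumed here, not quoted from the Grundbegriffe: Ω is the sample space, 𝓕 the sigma-algebra of events, and P the probability measure.

```latex
\begin{align*}
&\text{(1)}\quad 0 \le P(A) \le 1 \quad \text{for every } A \in \mathcal{F},\\
&\text{(2)}\quad P(\Omega) = 1,\\
&\text{(3)}\quad P\Bigl(\bigcup_{i=1}^{\infty} A_i\Bigr) = \sum_{i=1}^{\infty} P(A_i)
  \quad \text{for pairwise disjoint } A_1, A_2, \ldots \in \mathcal{F}.
\end{align*}
% A first consequence: A and its complement are disjoint with union \Omega,
% so (2) and (3) give the complement rule.
\[
P(A) + P(A^{c}) = P(\Omega) = 1
\quad\Longrightarrow\quad
P(A^{c}) = 1 - P(A).
\]
```

The complement rule is a small instance of the general pattern: familiar rules of calculation appear as theorems, without any commitment to what P is taken to mean.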

Practical application

Spot the interpretation. When you meet a probability claim, ask whether it concerns repeated trials of the same experiment (frequentist) or a one-shot proposition like "this defendant is guilty" (Bayesian / subjective).
Locate the calculation in the historical lineage. Counting equally-likely outcomes is Pascal-Fermat. Long-run averages converging to true probability is Bernoulli. Deviations scaling as the square root of the sample size is de Moivre. Updating from data to an unknown parameter is Bayes-Laplace. Axiomatic manipulation that does not commit to either interpretation is Kolmogorov.
Pick the right toolkit. For repeatable scientific experiments with no strong prior, Fisher-style frequentist inference is the default. For unique events, prior knowledge, or decision-making under uncertainty, Bayesian methods are usually clearer. For pure theory (derivations, proofs), Kolmogorov's axioms are the lingua franca.

Example: a courtroom DNA match

A laboratory reports that DNA from a crime scene matches the suspect, and that the random-match probability for an unrelated person in the population is one in a million. The prosecutor concludes the suspect is almost certainly guilty.

This is the prosecutor's fallacy — and it is the same confusion Pascal and Fermat would have warned against. The lab has reported P(match given innocence) ~ 1 in a million. That is a frequentist statement about the technique's behaviour across hypothetical innocent suspects. The court must judge P(innocence given match), which is not the same number. To convert one to the other you need Bayes' Rule plus a prior — typically the probability that this suspect, given everything else investigators know, was at the scene. In a large city with millions of innocent residents, the posterior probability of innocence given a match can still be substantial.
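A short calculation shows how far apart the two numbers can be. The sketch below applies Bayes' Rule under loudly hypothetical assumptions: a pool of five million people who could have left the sample, each equally likely beforehand, and a guilty person who is certain to match. None of these figures comes from a real case.

```python
from fractions import Fraction


def posterior_guilt(prior_guilt, random_match_prob, match_if_guilty=Fraction(1)):
    """P(guilt | match) by Bayes' Rule.

    prior_guilt: P(guilt) before the DNA evidence (an assumption).
    random_match_prob: P(match | innocence), the figure the lab reports.
    match_if_guilty: P(match | guilt), assumed to be 1 here.
    """
    numerator = prior_guilt * match_if_guilty
    evidence = numerator + (1 - prior_guilt) * random_match_prob
    return numerator / evidence


pool_size = 5_000_000                 # hypothetical number of possible sources
prior = Fraction(1, pool_size)        # everyone equally likely beforehand
rmp = Fraction(1, 1_000_000)          # random-match probability from the lab

p_guilt = posterior_guilt(prior, rmp)
p_innocent = 1 - p_guilt
print(float(p_guilt), float(p_innocent))
```

Under these assumptions the probability of innocence given a match is roughly five in six, even though the probability of a match given innocence is one in a million. Other evidence that raises the prior changes the answer sharply, which is why the prior cannot be left out.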

The conceptual machinery for getting this right is exactly the synthesis set out in Bayes's essay, published posthumously in 1764, and operationalised by Laplace in 1812. Two-and-a-half centuries later, courtrooms still get it wrong, because the prosecutor's quoted number sounds objective and the philosophical move from frequency to belief still feels foreign.
