Regression: describing relationships between things

Core idea

Linear regression takes a cloud of paired observations and fits a single straight line through the middle of them: Y = a + bX, where a is the intercept (where the line meets the vertical axis) and b is the slope (how much Y changes per unit change in X). The line is chosen by the least-squares rule: it is the line for which the sum of squared vertical distances from each point to the line is as small as possible. Once you have the line, you can read off predictions: pick any X, move vertically to the line, then read across to the vertical axis, and the Y you land on is your estimate. The catch is that those predictions are honest only inside the range of the original data.
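
The least-squares rule can be made concrete with a short sketch. The code below uses the standard closed-form solution (slope = sum of cross-deviations divided by sum of squared X deviations, intercept = mean Y minus slope times mean X). The function names and the five data points are invented for illustration and do not come from any real dataset.

```python
def least_squares_line(xs, ys):
    """Return (a, b) for the line Y = a + bX that minimises the sum of
    squared vertical distances from the points to the line."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Spread of X around its mean, and how X and Y move together.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all X values are identical, so the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx              # slope
    a = mean_y - b * mean_x    # intercept: the line passes through the means
    return a, b


def sum_of_squares(xs, ys, a, b):
    """Sum of squared vertical distances from each point to Y = a + bX."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Hypothetical paired observations, made up for this sketch.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = least_squares_line(xs, ys)
print(f"fitted line: Y = {a:.2f} + {b:.2f}X")
print("squared error of fitted line:", sum_of_squares(xs, ys, a, b))
print("squared error of a nudged line:", sum_of_squares(xs, ys, a + 0.1, b))
```

Any other line, such as the nudged one in the last print, gives a larger sum of squares than the fitted line. That is all "least squares" means.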

Why it matters

Regression is the workhorse of applied statistics. Sales forecasts, economic projections, dosage calculations, school value-added scores — all rest on fitting a line (or a curve) to paired data and reading predictions off it. Understanding what the slope means in context, recognising that the intercept is sometimes literal and sometimes a mathematical artefact, and knowing where the prediction can be trusted are the everyday skills of statistically literate decision-making.

Mental model

The shape of paired data

A scattergraph plots one variable on the horizontal axis and the other on the vertical axis. The eye can immediately see whether the points trend upward, downward, or have no pattern — and whether they hug the trend tightly or scatter widely around it. Regression formalises the trend; correlation (the next topic) formalises the tightness.

Slope and intercept, geometrically

The intercept a is where the line crosses the vertical axis: it is the predicted Y when X equals zero. The slope b is rise over run: pick any two X values one unit apart, and b is the corresponding change in Y. A negative slope means Y falls as X rises.
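
A small sketch of those two readings, using the rain-and-sunshine coefficients quoted later in this section (a = 6.7, b = −0.6). The `predict` helper is a hypothetical name introduced here.

```python
import math

# Coefficients from the rain example: rain = 6.7 - 0.6 * sunshine hours.
a = 6.7
b = -0.6


def predict(x):
    """Predicted Y for a given X on the line Y = a + bX."""
    return a + b * x


# The intercept is the prediction at X = 0.
intercept_reading = predict(0)

# The slope is the change in Y between any two X values one unit apart.
slope_reading = predict(5) - predict(4)

print("prediction at X = 0:", intercept_reading)
print("change in Y per unit of X:", round(slope_reading, 6))
print("matches b:", math.isclose(slope_reading, b))
```

The same one-unit difference appears wherever on the line you measure it, which is what makes the slope a single number.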

Interpolation vs extrapolation

Predictions inside the range of the original data lean on observed patterns; predictions outside that range assume the same pattern continues — an assumption that frequently breaks. Pulse rates rise modestly with age in adults but fall sharply with age in children, so a line fitted to adults will give absurd predictions for toddlers.

Practical application

  1. Plot the data first. Before fitting anything, eyeball the scattergraph. If the cloud is not roughly linear, linear regression is the wrong tool — consider a transformation or a curve fit.

  2. Decide which variable is on which axis. The independent variable (cause, predictor, time) goes on the horizontal axis; the dependent variable (effect, outcome) goes on the vertical. The choice matters: regressing Y on X and regressing X on Y generally give different lines, so swapping the axes changes the slope.

  3. Let the software do the arithmetic. In a spreadsheet, use =SLOPE(y_range, x_range) and =INTERCEPT(y_range, x_range). On a graphing calculator, enter the paired data and read off a and b.

  4. Write the equation in plain language. "Average inches of rain per month = 6.7 − 0.6 × hours of sunshine per day" beats Y = 6.7 − 0.6X for any non-mathematical audience.

  5. Sanity-check predictions. Plug in the smallest and largest observed X values; do the corresponding Y values make sense? If not, the line is misleading even inside the data.

  6. Refuse to extrapolate without flagging it. When asked for a prediction outside the observed range, give the number but label it clearly as an extrapolation that assumes the linear pattern holds.

Example

Suppose a coffee shop tracks, hour by hour on a Monday morning, how long customers wait in line and how many complaints it receives per hour. Six paired observations give a scattergraph that climbs from bottom-left to top-right with a clear linear trend. The spreadsheet returns complaints = 0.4 + 1.8 × wait minutes.

Interpretation in plain terms: even at zero wait the shop receives about 0.4 complaints per hour (intercept — perhaps from food quality or noise), and every additional minute of waiting adds about 1.8 complaints per hour to the count.

Interpolation: at the typical 3-minute wait the line predicts 0.4 + 1.8 × 3 = 5.8 complaints per hour. The observed Monday waits range from 1 to 6 minutes, so 3 minutes lies inside the data and this prediction is honest.

Extrapolation: at a hypothetical 20-minute wait the line predicts 0.4 + 1.8 × 20 = 36.4 complaints per hour. The number is meaningless. By 20 minutes most customers have left without ordering, so there is nobody left to complain. The relationship that held for 1 to 6 minutes does not extend to 20 minutes, and the line cannot warn you that it doesn't.
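
The two predictions can be reproduced with a small helper that also labels each result as interpolation or extrapolation. The coefficients and the 1-to-6-minute range come from the example; the function and constant names are hypothetical.

```python
# Fitted line from the example: complaints = 0.4 + 1.8 * wait minutes.
INTERCEPT = 0.4
SLOPE = 1.8

# Range of wait times actually observed on the Monday morning.
OBSERVED_MIN = 1.0
OBSERVED_MAX = 6.0


def predict_complaints(wait_minutes):
    """Return (estimate, label), where label says whether the wait time
    lies inside the observed range or outside it."""
    estimate = INTERCEPT + SLOPE * wait_minutes
    inside = OBSERVED_MIN <= wait_minutes <= OBSERVED_MAX
    label = "interpolation" if inside else "extrapolation"
    return estimate, label


for wait in (3, 20):
    estimate, label = predict_complaints(wait)
    print(f"{wait} min wait: {estimate:.1f} complaints per hour ({label})")
```

Returning the label with the number keeps the warning attached to the estimate, so a 20-minute figure is not passed along as if it were as reliable as the 3-minute one.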
