Definition
A data pipeline is the orchestrated flow of data through a sequence of stages — typically ingest, transform, validate, store, and serve — connecting raw sources to the consumers that act on the result. Each stage has a defined input, a defined output, and explicit failure semantics. The pipeline as a whole is a contract: given a stream of source records, it produces a clean, queryable, downstream-ready dataset on a known schedule or in response to a triggering event.
Why it matters
Pipelines are the connective tissue between raw production systems and the analytical, machine-learning, and operational surfaces that depend on derived data. They are also where most data quality, latency, and reliability problems are diagnosed and fixed: the pipeline is the layer at which the abstract idea of "trustworthy data" becomes a concrete set of jobs, monitors, and SLAs.
How it works
A pipeline begins with one or more ingest stages that pull from sources (files, APIs, streams, databases, event buses) and land the raw payload in a staging area. Transform stages clean, normalize, deduplicate, enrich, and aggregate the staged data into the target shape. Validation stages check that the result satisfies invariants (row counts, schema, value ranges, referential integrity) before it is published. The serving stage exposes the final dataset to consumers through a warehouse table, a feature store, a cache, a dashboard, or an API. An orchestrator (Airflow, Prefect, Dagster, or a managed equivalent) sequences the stages, retries failures, and emits observability signals.
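The stage sequence described above can be sketched in a few lines of plain Python. This is a minimal illustration, not how Airflow, Prefect, or Dagster are actually used: the record format (`id,amount` lines), the function names, the in-memory `table` dict standing in for a serving layer, and the choice to retry only `ConnectionError` are all assumptions made for the example.

```python
from typing import Callable, Dict, List


class ValidationError(Exception):
    """Raised when output violates an invariant; deterministic, so never retried."""


def ingest(raw_lines: List[str]) -> List[Dict[str, str]]:
    """Land raw 'id,amount' lines in a staging list without cleaning values."""
    staged = []
    for line in raw_lines:
        rec_id, amount = line.split(",")
        staged.append({"id": rec_id, "amount": amount})
    return staged


def transform(staged: List[Dict[str, str]]) -> List[dict]:
    """Normalize types and deduplicate on id (the last record seen wins)."""
    by_id = {}
    for rec in staged:
        rec_id = rec["id"].strip()
        by_id[rec_id] = {"id": rec_id, "amount": float(rec["amount"])}
    return list(by_id.values())


def validate(rows: List[dict], min_rows: int = 1) -> List[dict]:
    """Check row count, schema, and value range before anything is published."""
    if len(rows) < min_rows:
        raise ValidationError(f"expected at least {min_rows} rows, got {len(rows)}")
    for row in rows:
        if set(row) != {"id", "amount"}:
            raise ValidationError(f"unexpected schema: {sorted(row)}")
        if row["amount"] < 0:
            raise ValidationError(f"negative amount for id {row['id']}")
    return rows


def serve(rows: List[dict], table: Dict[str, dict]) -> None:
    """Publish the validated rows; the dict is a stand-in for a warehouse table."""
    table.clear()
    table.update({row["id"]: row for row in rows})


def run_stage(stage: Callable, data, max_retries: int = 2):
    """Run one stage, retrying only failures assumed to be transient."""
    for attempt in range(max_retries + 1):
        try:
            return stage(data)
        except ConnectionError:
            if attempt == max_retries:
                raise


def run_pipeline(raw_lines: List[str], table: Dict[str, dict]) -> Dict[str, dict]:
    """A toy orchestrator: sequence the stages, then publish the result."""
    data = raw_lines
    for stage in (ingest, transform, validate):
        data = run_stage(stage, data)
    serve(data, table)
    return table
```

Because `serve` runs only after `validate` returns, a failed invariant leaves the previously published table untouched, which is one simple way to give the validation stage explicit failure semantics.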
The hard parts are rarely the individual transforms. They are the cross-cutting concerns: idempotency (re-running a stage produces no duplicates or drift), backfills (rebuilding history when logic changes), schema evolution (new columns appear, old ones get retyped), late-arriving data (records show up after the window that should have included them has already closed), and lineage (which downstream artifacts depend on which upstream sources). A pipeline that handles these well becomes the trusted source of truth for an organization. A pipeline that handles them poorly becomes a tax that every analyst pays for the rest of the system's life.
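One common way to approach idempotency, backfills, and late-arriving data together is to make each load overwrite a whole partition inside a single transaction, so that re-running a load for a given day always converges to the same state. The sketch below uses Python's built-in `sqlite3` module. The `daily_orders` table, its columns, and the helper names are hypothetical, and partition overwrite is only one of several possible strategies (keyed upserts are another).

```python
import sqlite3
from typing import List, Tuple


def create_table(conn: sqlite3.Connection) -> None:
    """Create the hypothetical target table, partitioned logically by day."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS daily_orders (
            day TEXT NOT NULL,
            order_id TEXT NOT NULL,
            amount REAL NOT NULL,
            PRIMARY KEY (day, order_id)
        )
        """
    )


def load_partition(
    conn: sqlite3.Connection, day: str, rows: List[Tuple[str, float]]
) -> None:
    """Replace one day's rows atomically so that re-runs leave identical state."""
    # The connection context manager commits on success and rolls back on error,
    # so readers never observe a half-replaced partition.
    with conn:
        conn.execute("DELETE FROM daily_orders WHERE day = ?", (day,))
        conn.executemany(
            "INSERT INTO daily_orders (day, order_id, amount) VALUES (?, ?, ?)",
            [(day, order_id, amount) for order_id, amount in rows],
        )


def row_count(conn: sqlite3.Connection, day: str) -> int:
    """Count the rows currently published for one day."""
    cursor = conn.execute("SELECT COUNT(*) FROM daily_orders WHERE day = ?", (day,))
    return cursor.fetchone()[0]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    create_table(conn)
    first_batch = [("o1", 10.0), ("o2", 5.0)]
    load_partition(conn, "2024-01-01", first_batch)
    load_partition(conn, "2024-01-01", first_batch)  # re-run: no duplicates
    # A late record for the same day is handled by re-running the partition.
    load_partition(conn, "2024-01-01", first_batch + [("o3", 7.0)])
    print(row_count(conn, "2024-01-01"))
```

The same `load_partition` call serves as a backfill when transform logic changes: recompute the rows for each affected day and reload them, and the table ends up as if the new logic had always been in place.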