Definition
ETL stands for Extract, Transform, Load — the three-step pattern that pulls data out of operational source systems, reshapes it into a schema and quality suitable for analysis, and lands the result in a destination store such as a data warehouse or mart. Extract reaches into the source (a transactional database, an API, a file drop, a log stream) and pulls the raw records. Transform cleans, normalizes, enriches, deduplicates, joins, and aggregates them. Load writes the finished records into the analytical target, where downstream tools — BI dashboards, ML pipelines, ad-hoc SQL — consume them.
ETL emerged in the 1970s data-warehousing era as a way to keep heavy analytical work off the transactional databases that ran the business. The shape of the pattern — three named stages, batch-oriented, schema-on-write — became the dominant idiom for two decades, and it is still the right architecture for many analytical workloads. The modern variants ELT (Extract, Load, Transform — push raw data into a powerful warehouse first, then transform inside it) and CDC (Change Data Capture — stream row-level changes from source databases continuously) are rearrangements of the same three operations, not replacements for the underlying idea.
Why it matters
How it works
A classical ETL job runs on a schedule — nightly, hourly, every fifteen minutes. The extract stage queries the source for changed or new records (using a timestamp watermark, an incrementing ID, or a full snapshot for small tables) and lands them in a staging area. The transform stage cleans field formats, resolves keys across systems, deduplicates, computes derived columns, builds dimensional or denormalized shapes, and validates results against expected ranges and row counts. The load stage writes the result into the target warehouse, often in a way that supports both inserts and updates (merge or upsert semantics). An orchestrator sequences the stages, retries failed runs, and emits the signals that the data team monitors.
ELT inverts the second and third steps. With modern warehouses (Snowflake, BigQuery, Redshift, Databricks) charging little for storage and offering elastic compute, it became cheaper to load raw data first and run all transforms as SQL inside the warehouse — version-controlled, code-reviewed, and tested with tools like dbt. ELT preserves the raw record as an audit trail and lets analysts iterate on transform logic without re-extracting. CDC takes a different cut: rather than poll the source on a schedule, it taps the source database's transaction log and emits each row-level change as an event, which a streaming pipeline applies to the destination in near-real time. Each variant is a response to a specific bottleneck — warehouse compute cost, freshness, source-system load — and modern organizations typically run a mixture of all three depending on what each dataset needs.