ETL Pipelines — Basics

ETL (Extract-Transform-Load) — transform before loading. Classic, on-prem, typed targets.
ELT (Extract-Load-Transform) — load raw to warehouse, transform in-warehouse. Modern cloud, BigQuery / Snowflake / Redshift.
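
To make the difference concrete, here is a minimal sketch that runs both styles over the same records. It assumes an in-memory SQLite database as a stand-in for the warehouse; the table names (orders_etl, orders_raw, orders_elt) and the sample records are hypothetical.

```python
import sqlite3

# Hypothetical raw records, as an extractor might deliver them (everything is a string).
raw_events = [
    {"order_id": "1", "amount": "19.90", "country": "de"},
    {"order_id": "2", "amount": "5.00", "country": "US"},
    {"order_id": "3", "amount": "12.50", "country": "de"},
]

conn = sqlite3.connect(":memory:")  # stand-in for the warehouse

# ETL: cast and normalize in the pipeline, load only the typed result.
conn.execute("CREATE TABLE orders_etl (order_id INTEGER, amount REAL, country TEXT)")
conn.executemany(
    "INSERT INTO orders_etl VALUES (?, ?, ?)",
    [(int(e["order_id"]), float(e["amount"]), e["country"].upper()) for e in raw_events],
)

# ELT: load the raw strings untouched, then transform with SQL inside the warehouse.
conn.execute("CREATE TABLE orders_raw (order_id TEXT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO orders_raw VALUES (?, ?, ?)",
    [(e["order_id"], e["amount"], e["country"]) for e in raw_events],
)
conn.execute(
    """
    CREATE TABLE orders_elt AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           UPPER(country)            AS country
    FROM orders_raw
    """
)

etl_rows = conn.execute("SELECT * FROM orders_etl ORDER BY order_id").fetchall()
elt_rows = conn.execute("SELECT * FROM orders_elt ORDER BY order_id").fetchall()
print(etl_rows == elt_rows)
```

Both paths end with the same typed table. The practical difference is that ELT keeps orders_raw around, so a transform can be changed and re-run later without re-extracting from the source.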

Today’s data stack is mostly ELT.

Batch vs streaming: choose by latency requirement. Most analytics workloads are fine with batch.

Compression: snappy (fast, a common default for Parquet), zstd (better ratio), gzip (good ratio, slower).

Storage layout for fast querying:

Partition by date / region — directories dt=2026-05-10/. Pruned at query time.
Bucket / cluster — files split by hash of column. Helps joins.
Avoid tiny files (well under ~128 MB each): per-file open and metadata overhead dominates the scan.
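
The layout ideas above can be sketched with nothing but the standard library. This is an illustration under assumptions, not how any particular engine implements it: CSV files stand in for Parquet, and the helper names (write_partitioned, read_pruned, bucket_for) are hypothetical.

```python
import csv
import tempfile
import zlib
from pathlib import Path

def write_partitioned(rows, root):
    """Write rows into Hive-style directories: <root>/dt=<date>/part-0.csv."""
    by_date = {}
    for row in rows:
        by_date.setdefault(row["dt"], []).append(row)
    for dt, group in by_date.items():
        part_dir = Path(root) / f"dt={dt}"
        part_dir.mkdir(parents=True, exist_ok=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=["user_id", "amount"])
            writer.writeheader()
            for row in group:
                # The dt value lives in the directory name, not in the file.
                writer.writerow({"user_id": row["user_id"], "amount": row["amount"]})

def read_pruned(root, dt_from, dt_to):
    """Read only partitions with dt in [dt_from, dt_to]; other files are never opened."""
    rows, files_opened = [], 0
    for part_dir in sorted(Path(root).glob("dt=*")):
        dt = part_dir.name.split("=", 1)[1]
        if not (dt_from <= dt <= dt_to):  # ISO dates compare correctly as strings
            continue
        for path in sorted(part_dir.glob("*.csv")):
            files_opened += 1
            with open(path, newline="") as f:
                rows.extend({**r, "dt": dt} for r in csv.DictReader(f))
    return rows, files_opened

def bucket_for(key, num_buckets=4):
    """Stable bucket id from a hash of the key, so equal keys share a bucket."""
    return zlib.crc32(str(key).encode("utf-8")) % num_buckets

events = [
    {"dt": "2026-05-09", "user_id": "u1", "amount": "10"},
    {"dt": "2026-05-10", "user_id": "u2", "amount": "20"},
    {"dt": "2026-05-10", "user_id": "u1", "amount": "5"},
    {"dt": "2026-05-11", "user_id": "u3", "amount": "7"},
]

with tempfile.TemporaryDirectory() as root:
    write_partitioned(events, root)
    rows, files_opened = read_pruned(root, "2026-05-10", "2026-05-10")

print(len(rows), files_opened)
```

Three partitions exist on disk, but the one-day query opens a single file. bucket_for uses CRC32 rather than Python's built-in hash(), which is randomized per process for strings and would scatter the same key across buckets between runs.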

Change data capture (CDC): detect changes in the source DB and push them downstream. Patterns:

Log-based: read DB WAL/binlog → events. Debezium is the standard.
Timestamp-based: query WHERE updated_at > last_run. Simple but misses deletes.
Trigger-based: DB triggers populate audit table. Invasive.
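
The timestamp-based pattern, and the reason it misses deletes, can be shown in a few lines. This is a minimal sketch assuming an in-memory SQLite table as the source and a plain dict as the downstream copy; the customers table, the integer updated_at values, and the sync helper are all hypothetical.

```python
import sqlite3

src = sqlite3.connect(":memory:")  # stand-in for the source OLTP database
src.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, updated_at INTEGER)")
src.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Ada", 100), (2, "Bob", 100), (3, "Cy", 100)],
)

replica = {}  # downstream copy, keyed by primary key

def sync(last_run):
    """Pull rows with updated_at > last_run and upsert them into the replica."""
    changed = src.execute(
        "SELECT id, name, updated_at FROM customers WHERE updated_at > ?",
        (last_run,),
    ).fetchall()
    for row_id, name, _updated_at in changed:
        replica[row_id] = name
    # New high-water mark: newest timestamp seen, or the old one if nothing changed.
    return max([last_run] + [row[2] for row in changed])

last_run = sync(0)  # initial load picks up all three rows

src.execute("UPDATE customers SET name = 'Bobby', updated_at = 200 WHERE id = 2")
src.execute("DELETE FROM customers WHERE id = 3")

last_run = sync(last_run)  # incremental run
print(replica, last_run)
```

The update to row 2 is picked up, but row 3 is still present in the replica: a deleted row no longer exists to match the WHERE clause. Log-based CDC avoids this because the delete itself is an event in the WAL/binlog; with timestamps the usual workaround is soft deletes (a deleted_at column) in the source.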

Downstream: Kafka → consumer → warehouse / search index / cache.

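Idempotency: re-runs must converge to the same final state, e.g. overwrite a partition or upsert by key instead of blindly appending.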
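
One common way to get converging re-runs is to replace a whole partition inside a single transaction. The sketch below contrasts that with a plain append, assuming an in-memory SQLite table as the target; the daily_sales table and the load_append / load_overwrite helpers are hypothetical.

```python
import sqlite3

wh = sqlite3.connect(":memory:")  # stand-in for the warehouse
wh.execute("CREATE TABLE daily_sales (dt TEXT, store TEXT, revenue REAL)")

def load_append(dt, rows):
    """Not idempotent: every re-run inserts the same rows again."""
    with wh:
        wh.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(dt, store, revenue) for store, revenue in rows],
        )

def load_overwrite(dt, rows):
    """Idempotent: delete and reinsert the partition for dt in one transaction."""
    with wh:  # commits on success, rolls back if anything raises
        wh.execute("DELETE FROM daily_sales WHERE dt = ?", (dt,))
        wh.executemany(
            "INSERT INTO daily_sales VALUES (?, ?, ?)",
            [(dt, store, revenue) for store, revenue in rows],
        )

def count_rows(dt):
    return wh.execute("SELECT COUNT(*) FROM daily_sales WHERE dt = ?", (dt,)).fetchone()[0]

batch = [("berlin", 120.0), ("paris", 80.0)]

for _ in range(3):  # simulate a job retried three times
    load_overwrite("2026-05-10", batch)
for _ in range(3):
    load_append("2026-05-11", batch)

overwrite_count = count_rows("2026-05-10")
append_count = count_rows("2026-05-11")
print(overwrite_count, append_count)
```

After three runs the overwritten partition still holds two rows, while the appended one holds six. The same property is what makes retries and backfills safe to run partition by partition.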

DB / API / Logs → Fivetran/Airbyte → S3 (raw) → Snowflake/BigQuery → dbt → marts → BI

Postgres WAL → Debezium → Kafka → Spark/Flink → Iceberg → Trino

Raw events → batch + stream → Feature Store (Feast, Tecton) → Model serving

Batch vs streaming when? Latency requirement; fault tolerance need.
How to handle late-arriving data in streaming? Watermarks + windows; allow late firing within tolerance.
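Schema evolution? Avro or Protobuf with a schema registry; enforce backward-compatibility rules (new fields get defaults); never reuse field IDs (Protobuf/Thrift tags) or the names of removed fields.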
Backfill a year of data? Idempotent pipeline + parallel partitions; throttle to avoid source DB pressure.
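CDC with Debezium: what guarantees? At-least-once delivery, so duplicates are possible; consumers must be idempotent, e.g. dedupe or upsert by primary key.
Why not query the OLTP DB directly for analytics? Long scans contend with transactional traffic (locks, I/O), row-oriented storage makes analytical queries slow, and one bad query can hurt production (blast radius).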
Why columnar storage? Most analytical queries touch few columns; column compression; vectorized scans.
Star vs snowflake schema? Star: a central fact table joined to denormalized dimension tables. Snowflake: dimensions further normalized into sub-tables. Star usually wins for BI (fewer joins, simpler queries).
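
The late-data answer above (watermarks + windows, late firing within tolerance) can be illustrated with a toy simulation. This is a sketch of the idea only, not how Flink, Beam, or Spark implement it; the constants and the run function are hypothetical, and the watermark is assumed to be "max event time seen minus a fixed out-of-orderness bound".

```python
from collections import defaultdict

WINDOW = 10            # tumbling window size, in event-time seconds
OUT_OF_ORDER = 5       # watermark lags the max event time seen by this much
ALLOWED_LATENESS = 10  # a fired window still accepts events this long after its end

def run(events):
    """events: (event_time, value) pairs in arrival order.

    Returns (emissions, dropped), where each emission is
    (window_start, running_sum, "on-time" | "late").
    """
    sums = defaultdict(int)
    fired = set()
    emissions, dropped = [], []
    max_ts = float("-inf")
    for ts, value in events:
        max_ts = max(max_ts, ts)
        watermark = max_ts - OUT_OF_ORDER
        start = ts - ts % WINDOW
        end = start + WINDOW
        if watermark >= end + ALLOWED_LATENESS:
            dropped.append((ts, value))  # beyond tolerance: discard
        else:
            sums[start] += value
            if start in fired:
                # Late firing: re-emit the window with its corrected sum.
                emissions.append((start, sums[start], "late"))
        # On-time firing for every window the watermark has now passed.
        for w in sorted(sums):
            if w not in fired and watermark >= w + WINDOW:
                fired.add(w)
                emissions.append((w, sums[w], "on-time"))
    return emissions, dropped

stream = [(1, 1), (7, 1), (12, 1), (16, 1), (3, 1), (31, 1), (5, 1)]
emissions, dropped = run(stream)
print(emissions)  # [(0, 2, 'on-time'), (0, 3, 'late'), (10, 2, 'on-time')]
print(dropped)    # [(5, 1)]
```

The event at time 3 arrives after window [0, 10) has fired but within the lateness tolerance, so the window fires again with an updated sum. The event at time 5 arrives after the tolerance has expired and is dropped. Window [30, 40) never fires because the stream ends before the watermark passes it. Downstream consumers have to accept corrected results for the same window, which is another reason sinks should upsert by key.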