Skip to content

ETL Pipelines — Basics

  • ETL (Extract-Transform-Load) — transform before loading. Classic, on-prem, typed targets.
  • ELT (Extract-Load-Transform) — load raw to warehouse, transform in-warehouse. Modern cloud, BigQuery / Snowflake / Redshift.

Today’s data stack is mostly ELT.

  • Batch — periodic jobs (daily / hourly). Simple, cheaper.
  • Streaming — continuous, low latency.
  • Micro-batch — small batches every minute or so. Compromise.

Choose by latency requirement. Most analytics is fine with batch.

  • Source — DB (CDC), API, files (S3/GCS), logs, queues.
  • Ingestion — extract data. Tools: Fivetran, Airbyte, Stitch, custom.
  • Storage / lake — S3 / GCS / Azure Blob (raw zone, often Parquet).
  • Warehouse — Snowflake, BigQuery, Redshift, Databricks SQL.
  • Transform — SQL (dbt), Spark, Python, Beam.
  • Orchestrator — Airflow, Dagster, Prefect, Temporal.
  • Quality — Great Expectations, Soda, dbt tests.
  • BI / consumer — Looker, Tableau, Superset, Metabase.
  • Catalog / lineage — DataHub, OpenLineage, Marquez, Atlan.
  • Parquet — columnar, compressed, splittable. Default.
  • ORC — similar to Parquet, more Hadoop-aligned.
  • Avro — row-based, schema-evolving. Ingestion / streaming.
  • JSON / CSV — interchange, not for analytical scale.

Compression: snappy (default), zstd (better ratio), gzip (slower).

Storage layout for fast querying:

  • Partition by date / region — directories dt=2026-05-10/. Pruned at query time.
  • Bucket / cluster — files split by hash of column. Helps joins.
  • Avoid tiny files (< 128MB each) — overhead.

Detect changes in source DB, push to downstream. Patterns:

  • Log-based: read DB WAL/binlog → events. Debezium is the standard.
  • Timestamp-based: query WHERE updated_at > last_run. Simple but misses deletes.
  • Trigger-based: DB triggers populate audit table. Invasive.

Downstream → Kafka → consumer → warehouse / search index / cache.

Re-runs must converge:

  • Deterministic outputs — same input → same output.
  • Upsert (MERGE) by primary key — re-running is safe.
  • Track watermark — last processed timestamp per source.
  • Backfill mode — replay specific date ranges without duplicating.
  • Date-partition: rebuilds replace whole partition atomically.
DB / API / Logs → Fivetran/Airbyte → S3 (raw) → Snowflake/BigQuery → dbt → marts → BI
Postgres WAL → Debezium → Kafka → Spark/Flink → Iceberg → Trino
Raw events → batch + stream → Feature Store (Feast, Tecton) → Model serving
  • Models = SELECT statements; dbt builds a DAG.
  • Tests: not_null, unique, accepted_values, relationships.
  • Docs auto-generated.
  • Snapshots for slowly-changing dimensions.
  • Macros (Jinja) for reuse.
  • Recommended for any warehouse-based stack.
  • Kafka Streams — JVM, simple, embedded.
  • Apache Flink — stateful, low latency, exactly-once semantics, Java/Scala.
  • Spark Structured Streaming — micro-batch + continuous, mature.
  • Materialize / RisingWave — incremental SQL views over streams.
  • ksqlDB — SQL on Kafka.
  • Airflow — most common, mature, Python DAGs. Heavy.
  • Dagster — modern, asset-aware, type-safe.
  • Prefect — pythonic, flow-based.
  • Temporal — durable workflow engine, great for sagas + reliable jobs.
  • Argo Workflows — K8s-native.
  • Schema validation (Avro/JSON Schema).
  • Constraint checks: row count, freshness, nullability, uniqueness.
  • Anomaly detection.
  • Tools: Great Expectations, dbt tests, Soda, Monte Carlo.
  1. Batch vs streaming when? Latency requirement; fault tolerance need.
  2. How to handle late-arriving data in streaming? Watermarks + windows; allow late firing within tolerance.
  3. Schema evolution? Avro with schema registry; backwards-compat rules; never reuse field IDs.
  4. Backfill a year of data? Idempotent pipeline + parallel partitions; throttle to avoid source DB pressure.
  5. CDC with Debezium — guarantees? At-least-once; consumers must dedupe by primary key.
  6. Why not query the OLTP DB directly for analytics? Locks, slow analytical queries, blast radius.
  7. Why columnar storage? Most analytical queries touch few columns; column compression; vectorized scans.
  8. Star vs snowflake schema? Star: denormalized facts + dim tables. Snowflake: dims further normalized. Star usually wins for BI.