Skip to content

ETL Pipelines — Theory

Source ─► Ingestion ─► Lake (S3 / Iceberg / Delta / Hudi) ─► Warehouse ─► Transform ─► Marts ─► BI
(dbt)
  • Ingestion: Fivetran, Airbyte, Debezium (CDC), custom.
  • Lake: object storage with table format (Iceberg / Delta / Hudi) for ACID-on-S3.
  • Warehouse: Snowflake, BigQuery, Redshift, Databricks SQL.
  • Transform: dbt (SQL DAG).
  • Orchestrator: Airflow, Dagster, Prefect, Temporal.
  • Quality: Great Expectations, dbt tests, Soda.
  • Catalog / lineage: DataHub, OpenLineage, Marquez.
  • Lambda: separate batch + speed (real-time) layers.
  • Kappa: single streaming layer; replay from log for historical.

Modern: Kappa-ish via streaming + lakehouse formats; pure batch still common.

A watermark is “all data up to time T has arrived”. Beyond watermark = late data. Strategies:

  • Drop late.
  • Hold window open (allowed lateness).
  • Side output for separate handling.

Achievable with:

  • Idempotent producer + transactional sink (Kafka EOS).
  • Two-phase commit sinks (Flink → Kafka or JDBC).
  • Application-level dedupe by event ID.

If sink isn’t transactional, design consumer to be idempotent (upsert by primary key).

  • SCD Type 1: overwrite.
  • SCD Type 2: add row with valid_from / valid_to.
  • SCD Type 3: previous-value column.
  • dbt snapshots automate Type 2.
  • Star: facts + denormalized dims. Query simple.
  • Snowflake: dims further normalized. Less duplication, more joins.
  • BI tools usually prefer star.

Re-runs converge:

  • Date-partitioned outputs replace partition.
  • MERGE / upsert by primary key.
  • Watermark stored, advanced atomically.
  1. CDC vs polling? CDC real-time, captures deletes, lower DB load.
  2. Backfill year of data? Idempotent pipeline; partition replays; throttle source.
  3. Schema evolution streaming? Schema registry (Avro, Protobuf). Backwards-compat rules.
  4. Why columnar? Analytics scans subset of columns; compression; vectorized engines.
  5. Why not OLTP DB for analytics? Locks, blast radius, slow OLAP.
  6. dbt vs raw SQL? DAG, tests, docs, env refs, snapshots, lineage.
  7. Airflow vs Dagster vs Temporal? Airflow workhorse; Dagster asset-first; Temporal durable workflows with retries.
  8. Spark vs Flink? Spark batch + micro-batch; Flink true streaming + stateful ops.
  • One huge orchestrator job overnight.
  • No data quality tests.
  • Missing schema versioning.
  • Producer changes without consumer notice.
  • OLTP DB for analytics.
  • Hand-edited SQL not in version control.
  • No lineage.
  • Long-running streaming job without checkpointing.