ETL Pipelines — Theory

ETL / Data Pipelines — Theory (concise)

Modern data stack

Source ─► Ingestion ─► Lake (S3 / Iceberg / Delta / Hudi) ─► Warehouse ─► Transform ─► Marts ─► BI
                                                                 (dbt)

Ingestion: Fivetran, Airbyte, Debezium (CDC), custom.
Lake: object storage with table format (Iceberg / Delta / Hudi) for ACID-on-S3.
Warehouse: Snowflake, BigQuery, Redshift, Databricks SQL.
Transform: dbt (SQL DAG).
Orchestrator: Airflow, Dagster, Prefect, Temporal.
Quality: Great Expectations, dbt tests, Soda.
Catalog / lineage: DataHub, OpenLineage, Marquez.

Lambda vs Kappa

Lambda: separate batch + speed (real-time) layers.
Kappa: single streaming layer; replay from log for historical.

Modern: Kappa-ish via streaming + lakehouse formats; pure batch still common.

Watermarks

A watermark is “all data up to time T has arrived”. Beyond watermark = late data. Strategies:

Drop late.
Hold window open (allowed lateness).
Side output for separate handling.

Exactly-once

Achievable with:

Idempotent producer + transactional sink (Kafka EOS).
Two-phase commit sinks (Flink → Kafka or JDBC).
Application-level dedupe by event ID.

If sink isn’t transactional, design consumer to be idempotent (upsert by primary key).

Slowly Changing Dimensions

SCD Type 1: overwrite.
SCD Type 2: add row with valid_from / valid_to.
SCD Type 3: previous-value column.
dbt snapshots automate Type 2.

Star vs snowflake

Star: facts + denormalized dims. Query simple.
Snowflake: dims further normalized. Less duplication, more joins.
BI tools usually prefer star.

Idempotency

Re-runs converge:

Date-partitioned outputs replace partition.
MERGE / upsert by primary key.
Watermark stored, advanced atomically.

Common interview Qs

CDC vs polling? CDC real-time, captures deletes, lower DB load.
Backfill year of data? Idempotent pipeline; partition replays; throttle source.
Schema evolution streaming? Schema registry (Avro, Protobuf). Backwards-compat rules.
Why columnar? Analytics scans subset of columns; compression; vectorized engines.
Why not OLTP DB for analytics? Locks, blast radius, slow OLAP.
dbt vs raw SQL? DAG, tests, docs, env refs, snapshots, lineage.
Airflow vs Dagster vs Temporal? Airflow workhorse; Dagster asset-first; Temporal durable workflows with retries.
Spark vs Flink? Spark batch + micro-batch; Flink true streaming + stateful ops.

Anti-patterns

One huge orchestrator job overnight.
No data quality tests.
Missing schema versioning.
Producer changes without consumer notice.
OLTP DB for analytics.
Hand-edited SQL not in version control.
No lineage.
Long-running streaming job without checkpointing.