ETL Pipelines — Theory
ETL / Data Pipelines — Theory (concise)
Section titled “ETL / Data Pipelines — Theory (concise)”Modern data stack
Section titled “Modern data stack”Source ─► Ingestion ─► Lake (S3 / Iceberg / Delta / Hudi) ─► Warehouse ─► Transform ─► Marts ─► BI (dbt)- Ingestion: Fivetran, Airbyte, Debezium (CDC), custom.
- Lake: object storage with table format (Iceberg / Delta / Hudi) for ACID-on-S3.
- Warehouse: Snowflake, BigQuery, Redshift, Databricks SQL.
- Transform: dbt (SQL DAG).
- Orchestrator: Airflow, Dagster, Prefect, Temporal.
- Quality: Great Expectations, dbt tests, Soda.
- Catalog / lineage: DataHub, OpenLineage, Marquez.
Lambda vs Kappa
Section titled “Lambda vs Kappa”- Lambda: separate batch + speed (real-time) layers.
- Kappa: single streaming layer; replay from log for historical.
Modern: Kappa-ish via streaming + lakehouse formats; pure batch still common.
Watermarks
Section titled “Watermarks”A watermark is “all data up to time T has arrived”. Beyond watermark = late data. Strategies:
- Drop late.
- Hold window open (allowed lateness).
- Side output for separate handling.
Exactly-once
Section titled “Exactly-once”Achievable with:
- Idempotent producer + transactional sink (Kafka EOS).
- Two-phase commit sinks (Flink → Kafka or JDBC).
- Application-level dedupe by event ID.
If sink isn’t transactional, design consumer to be idempotent (upsert by primary key).
Slowly Changing Dimensions
Section titled “Slowly Changing Dimensions”- SCD Type 1: overwrite.
- SCD Type 2: add row with valid_from / valid_to.
- SCD Type 3: previous-value column.
- dbt snapshots automate Type 2.
Star vs snowflake
Section titled “Star vs snowflake”- Star: facts + denormalized dims. Query simple.
- Snowflake: dims further normalized. Less duplication, more joins.
- BI tools usually prefer star.
Idempotency
Section titled “Idempotency”Re-runs converge:
- Date-partitioned outputs replace partition.
- MERGE / upsert by primary key.
- Watermark stored, advanced atomically.
Common interview Qs
Section titled “Common interview Qs”- CDC vs polling? CDC real-time, captures deletes, lower DB load.
- Backfill year of data? Idempotent pipeline; partition replays; throttle source.
- Schema evolution streaming? Schema registry (Avro, Protobuf). Backwards-compat rules.
- Why columnar? Analytics scans subset of columns; compression; vectorized engines.
- Why not OLTP DB for analytics? Locks, blast radius, slow OLAP.
- dbt vs raw SQL? DAG, tests, docs, env refs, snapshots, lineage.
- Airflow vs Dagster vs Temporal? Airflow workhorse; Dagster asset-first; Temporal durable workflows with retries.
- Spark vs Flink? Spark batch + micro-batch; Flink true streaming + stateful ops.
Anti-patterns
Section titled “Anti-patterns”- One huge orchestrator job overnight.
- No data quality tests.
- Missing schema versioning.
- Producer changes without consumer notice.
- OLTP DB for analytics.
- Hand-edited SQL not in version control.
- No lineage.
- Long-running streaming job without checkpointing.