ETL Pipelines — Basics
ETL / Data Pipelines — Basics
Section titled “ETL / Data Pipelines — Basics”ETL vs ELT
Section titled “ETL vs ELT”- ETL (Extract-Transform-Load) — transform before loading. Classic, on-prem, typed targets.
- ELT (Extract-Load-Transform) — load raw to warehouse, transform in-warehouse. Modern cloud, BigQuery / Snowflake / Redshift.
Today’s data stack is mostly ELT.
Batch vs streaming
Section titled “Batch vs streaming”- Batch — periodic jobs (daily / hourly). Simple, cheaper.
- Streaming — continuous, low latency.
- Micro-batch — small batches every minute or so. Compromise.
Choose by latency requirement. Most analytics is fine with batch.
Components
Section titled “Components”- Source — DB (CDC), API, files (S3/GCS), logs, queues.
- Ingestion — extract data. Tools: Fivetran, Airbyte, Stitch, custom.
- Storage / lake — S3 / GCS / Azure Blob (raw zone, often Parquet).
- Warehouse — Snowflake, BigQuery, Redshift, Databricks SQL.
- Transform — SQL (dbt), Spark, Python, Beam.
- Orchestrator — Airflow, Dagster, Prefect, Temporal.
- Quality — Great Expectations, Soda, dbt tests.
- BI / consumer — Looker, Tableau, Superset, Metabase.
- Catalog / lineage — DataHub, OpenLineage, Marquez, Atlan.
File formats for analytics
Section titled “File formats for analytics”- Parquet — columnar, compressed, splittable. Default.
- ORC — similar to Parquet, more Hadoop-aligned.
- Avro — row-based, schema-evolving. Ingestion / streaming.
- JSON / CSV — interchange, not for analytical scale.
Compression: snappy (default), zstd (better ratio), gzip (slower).
Partitioning + bucketing
Section titled “Partitioning + bucketing”Storage layout for fast querying:
- Partition by date / region — directories
dt=2026-05-10/. Pruned at query time. - Bucket / cluster — files split by hash of column. Helps joins.
- Avoid tiny files (< 128MB each) — overhead.
CDC (Change Data Capture)
Section titled “CDC (Change Data Capture)”Detect changes in source DB, push to downstream. Patterns:
- Log-based: read DB WAL/binlog → events. Debezium is the standard.
- Timestamp-based: query
WHERE updated_at > last_run. Simple but misses deletes. - Trigger-based: DB triggers populate audit table. Invasive.
Downstream → Kafka → consumer → warehouse / search index / cache.
Idempotency in pipelines
Section titled “Idempotency in pipelines”Re-runs must converge:
- Deterministic outputs — same input → same output.
- Upsert (MERGE) by primary key — re-running is safe.
- Track watermark — last processed timestamp per source.
- Backfill mode — replay specific date ranges without duplicating.
- Date-partition: rebuilds replace whole partition atomically.
Common pipelines
Section titled “Common pipelines”Modern ELT (canonical)
Section titled “Modern ELT (canonical)”DB / API / Logs → Fivetran/Airbyte → S3 (raw) → Snowflake/BigQuery → dbt → marts → BICDC streaming
Section titled “CDC streaming”Postgres WAL → Debezium → Kafka → Spark/Flink → Iceberg → TrinoML feature pipeline
Section titled “ML feature pipeline”Raw events → batch + stream → Feature Store (Feast, Tecton) → Model servingSQL transformations (dbt)
Section titled “SQL transformations (dbt)”- Models = SELECT statements; dbt builds a DAG.
- Tests: not_null, unique, accepted_values, relationships.
- Docs auto-generated.
- Snapshots for slowly-changing dimensions.
- Macros (Jinja) for reuse.
- Recommended for any warehouse-based stack.
Streaming engines
Section titled “Streaming engines”- Kafka Streams — JVM, simple, embedded.
- Apache Flink — stateful, low latency, exactly-once semantics, Java/Scala.
- Spark Structured Streaming — micro-batch + continuous, mature.
- Materialize / RisingWave — incremental SQL views over streams.
- ksqlDB — SQL on Kafka.
Orchestrators
Section titled “Orchestrators”- Airflow — most common, mature, Python DAGs. Heavy.
- Dagster — modern, asset-aware, type-safe.
- Prefect — pythonic, flow-based.
- Temporal — durable workflow engine, great for sagas + reliable jobs.
- Argo Workflows — K8s-native.
Data quality
Section titled “Data quality”- Schema validation (Avro/JSON Schema).
- Constraint checks: row count, freshness, nullability, uniqueness.
- Anomaly detection.
- Tools: Great Expectations, dbt tests, Soda, Monte Carlo.
Common interview Qs
Section titled “Common interview Qs”- Batch vs streaming when? Latency requirement; fault tolerance need.
- How to handle late-arriving data in streaming? Watermarks + windows; allow late firing within tolerance.
- Schema evolution? Avro with schema registry; backwards-compat rules; never reuse field IDs.
- Backfill a year of data? Idempotent pipeline + parallel partitions; throttle to avoid source DB pressure.
- CDC with Debezium — guarantees? At-least-once; consumers must dedupe by primary key.
- Why not query the OLTP DB directly for analytics? Locks, slow analytical queries, blast radius.
- Why columnar storage? Most analytical queries touch few columns; column compression; vectorized scans.
- Star vs snowflake schema? Star: denormalized facts + dim tables. Snowflake: dims further normalized. Star usually wins for BI.