Skip to content

PostgreSQL — Theory

  • Tuple has hidden xmin (created), xmax (deleted), ctid (location).
  • UPDATE = new tuple + old marked dead.
  • Readers don’t block writers; writers don’t block readers.
  • Cost: dead tuples → bloat → vacuum needed.
  • Autovacuum runs continuously. VACUUM FULL rewrites table (locks!).
  • xid wraparound: 32-bit. Past ~2B txns → freeze old. Monitor datfrozenxid.
LevelDirtyNon-repeatablePhantomSerialization anomaly
RC
RR✗ (PG snapshot)
Serializable
  • RC: each statement own snapshot. Bug: SELECT ... ; UPDATE ... value can change.
  • RR: snapshot at first statement. Concurrent update → 40001 retry.
  • Serializable: SSI detects RW dependency cycles.
  • Row: FOR UPDATE, FOR NO KEY UPDATE, FOR SHARE, FOR KEY SHARE.
  • Table: 8 levels, ACCESS SHARE (SELECT) → ACCESS EXCLUSIVE (DROP/ALTER).
  • Advisory: pg_advisory_lock(key) — app-level, survives statements.
  • Deadlock = abort one txn. Acquire locks in same order.
  • Cost-based. Stats from ANALYZE/autoanalyze.
  • EXPLAIN (ANALYZE, BUFFERS) for actual + cache hits.
  • Bad estimates: skewed data, correlated cols. Fix: ALTER TABLE ... STATISTICS, CREATE STATISTICS.
  • Function on column kills B-tree → expression index.
  • Implicit cast: int_col = '1' may bypass.
  • NOT IN, <>, LIKE '%foo' typically scan.
  • Index-only scan needs all cols + vacuum-current visibility map.
  • Declarative (10+): RANGE, LIST, HASH.
  • Pruning at plan + execution.
  • Indexes per-partition (cascade via parent).
  • Drop old = DROP TABLE (instant) vs DELETE (slow + bloat).
  • Streaming (physical, WAL): byte-identical. Sync or async.
  • Logical (decode WAL): per-table, version-tolerant. CDC, sharding, upgrades.
  • HA: Patroni, RDS Multi-AZ.
  • Connection pool (PgBouncer transaction mode).
  • Bloat: pg_stat_user_tables dead tuples. pg_repack for online rebuild.
  • HOT updates: if updated cols not indexed AND tuple fits page → no index update.
  • TOAST: large fields out-of-line, compressed.
  • pg_stat_statements for top queries.
  1. count(*) vs count(col) — col skips NULL.
  2. LIMIT 10 OFFSET 1000000 slow → keyset pagination.
  3. Upsert: ON CONFLICT (key) DO UPDATE.
  4. UNION dedupes (sort), UNION ALL doesn’t.
  5. jsonb vs separate cols — flexibility vs query/index ease.
  6. Online add NOT NULL on huge: nullable → backfill batched → SET NOT NULL.
  7. VACUUM vs VACUUM FULL — online reclaim vs offline rewrite.
  8. SELECT FOR UPDATE on read replica → fails (replicas read-only).
  9. Autovacuum behind → tune per-table thresholds.
  10. SSI write-skew anomaly — concurrent reads + writes that violate invariant.
  • id BIGINT GENERATED ALWAYS AS IDENTITY or UUID v7.
  • created_at/updated_at timestamptz NOT NULL DEFAULT now() + trigger.
  • Soft delete via deleted_at + partial index excluding deleted.
  • Index FK columns.

ACID is a per-transaction contract: Atomicity (all-or-nothing), Consistency (transaction takes the database from one valid state to another, respecting all declared constraints — note this “C” is not the same C as in CAP), Isolation (concurrent transactions appear to run serially; Postgres implements this with MVCC and three distinct levels — SQL standard names four, but Postgres silently maps READ UNCOMMITTED to READ COMMITTED), Durability (committed data survives crashes — usually via WAL/oplog fsync). Postgres has been ACID since the beginning.

BASE — coined by Eric Brewer alongside CAP — is the loose counterpart for distributed/scale-out systems: Basically Available (the system answers, possibly with stale data), Soft state (state may drift between replicas without input), Eventual consistency (given no new writes, replicas converge).

CAP says under a network partition you must pick C (refuse the write/read to stay consistent) or A (accept and risk divergence). PACELC (Daniel Abadi, 2010/2012) extends this: if Partition then A vs C, Else (normal operation) Latency vs Consistency. PACELC matters because partitions are rare; the daily trade-off in any replicated system is “do I wait for a quorum (consistency) or return now (latency)?” Postgres synchronous replication is PC/EC; MongoDB with w:"majority" + readConcern:"majority" is PA/EC-ish; Cassandra/DynamoDB default are PA/EL.

Gotchas

  • The “C” in ACID ≠ “C” in CAP. ACID-C is integrity constraints; CAP-C is linearizability.
  • Postgres durability depends on synchronous_commit and fsync — turning either off trades durability for throughput; rarely worth it.

Q: Explain PACELC for an audit log.

An audit log demands consistency in both modes. Pick PC/EC: Postgres with synchronous replication to at least one standby. Accept higher write latency and reduced availability during partition because lost or reordered audit events are unacceptable for a regulated system.

Sources: postgresql.org transaction-iso, Abadi (PACELC, 2012), Kleppmann DDIA ch. 7–9.


Pick Postgres when:

  • Domain has well-defined entities with foreign-key relationships.
  • You need ad-hoc queries and joins (analysts, BI, regulatory reporting).
  • Strong transactional integrity required (anything financial, identity, audit).
  • You can scale vertically + read replicas — Postgres on modern hardware easily handles 10s of TB and 10k+ TPS.

Pick Mongo when:

  • Document is the natural unit of work and rarely joined.
  • Schema varies widely across rows (sparse attributes).
  • You genuinely need horizontal write scaling beyond a single node.
  • Team prefers dynamic schemas and quick iteration.

A useful rule from Kleppmann (DDIA ch. 2): if your data has many-to-many relationships, the relational model usually wins because joins are first-class; if it’s a tree of one-to-many (a document with embedded line items, comments, tags) and you fetch the whole tree, documents win because there’s no join cost.

For a regulated/enterprise system: licensing/inspection records → SQL (relational, audited, joined to entities and people); document attachments/metadata → Postgres JSONB or Mongo; session store → Redis (KV); event/audit log → append-only specialized store (Postgres partitioned table or Kafka + cold store).

WorkloadBest fitWhy
Financial ledgerPostgresSERIALIZABLE, FK constraints, double-entry checks
Product catalog with variant attrsPostgres + JSONB or MongoSparse attrs; tie-breaker is whether you join to orders/inventory
Event/audit logPostgres partitioned by day, or Kafka → S3Append-only, time-range queries
Session storeRedisTTL, sub-ms KV
CMS / docsMongoDocument model fits; rarely joined
GeospatialPostGIS on PostgresMature spatial joins, GIST indexes

Gotchas

  • “We’ll outgrow Postgres” is almost always wrong — you’ll outgrow your indexing strategy first.
  • Mongo’s “schemaless” is a lie at scale — you end up enforcing schema in app code or via JSON Schema validators.
  • Joins in Mongo ($lookup) are slow if the foreign collection isn’t indexed on the join field.
  • Postgres can do JSONB and relational — you don’t have to pick.
  • Operational cost: managing a Mongo sharded cluster (config servers, mongos, shards) is materially harder than scaling Postgres vertically + replicas.

Q: When does NoSQL win?

When the unit of read/write is a self-contained document fetched by ID, and you need horizontal write scale beyond what one Postgres node can do. Example: IoT telemetry firehose with 100k writes/sec across many devices, each device’s recent state read independently — Mongo with a hashed shard key on deviceId distributes both reads and writes evenly.


JSONB is Postgres’s binary-decomposed JSON type (since 9.4). Unlike json (which stores raw text and re-parses on every access), jsonb parses on insert, deduplicates keys, drops insignificant whitespace, and stores in a binary format that supports fast extraction and indexing.

Operators: -> extracts value as jsonb, ->> extracts as text, @> is containment (“does the LHS contain the RHS?”), ?/?|/?& test top-level key existence, @? and @@ evaluate jsonpath expressions.

Indexed with GIN using either:

  • jsonb_ops (default) — supports @>, ?, ?|, ?&, @?, @@; larger index.
  • jsonb_path_ops — supports only @>, @?, @@; smaller and faster but no key-existence queries.

The often-overlooked point: Postgres JSONB gives you Mongo’s flexible-schema benefit while keeping the relational engine — you can JOIN a normalized users table to a JSONB preferences column in the same query and wrap the whole thing in a SERIALIZABLE transaction.

Gotchas

  • jsonb_path_ops does not support ?, ?|, ?& — pick the operator class to match your queries.
  • The ? operator only checks top-level keys. '{"a":{"b":1}}'::jsonb ? 'b' is false.
  • GIN indexes are slower to update than B-tree; very write-heavy JSONB columns can be a bottleneck. Tune gin_pending_list_limit / fastupdate.
  • JSONB key order is not preserved; if you need to round-trip exact text, use json instead.
  • Don’t “JSONB-everything.” Columns that always exist and have a fixed type belong in proper columns — smaller rows, better stats, FK constraints.

Sources: postgresql.org datatype-json.


Postgres ships six index access methods:

  • B-tree (default) — handles =, <, <=, >, >=, BETWEEN, IN, IS NULL, anchored LIKE 'foo%'; can also serve ORDER BY directly.
  • Hash (WAL-logged and crash-safe since PG 10) — supports only =; rarely worth choosing over B-tree.
  • GIN (Generalized Inverted Index) — for composite values — JSONB, arrays, full-text tsvector.
  • GiST (Generalized Search Tree) — framework for geometry, ranges, trigrams; supports nearest-neighbor ORDER BY col <-> point.
  • SP-GiST — for non-balanced structures (quadtrees, tries).
  • BRIN (Block Range INdex) — stores per-block min/max; tiny on disk, ideal for very large tables where indexed column correlates with physical row order (time-series).

Beyond access methods, four index shapes matter:

  • Multicolumn indexes obey the leftmost-prefix rule (Markus Winand): an index on (a,b,c) serves predicates on a, (a,b), (a,b,c) — but not b or c alone. Choose the leading column to match your most common predicate, not “the most selective column” — selectivity matters only after the prefix rule lets you use the index.
  • Partial indexes include a WHERE clause and only index matching rows (CREATE INDEX ON orders(id) WHERE status='open' is tiny if 99% are closed).
  • Expression indexes index the result of a function (LOWER(email)).
  • Covering indexes with INCLUDE add non-key payload columns to enable index-only scans without bloating the search tree.

Gotchas

  • Leftmost-prefix is real: WHERE b=? AND c=? on (a,b,c) does not use the index unless a is also constrained.
  • LIKE '%foo' (leading wildcard) cannot use a B-tree — needs a trigram GiST/GIN (pg_trgm).
  • Functional predicates need expression indexes — WHERE LOWER(email)=? won’t use an index on email.
  • GIN indexes are slow to update; bulk loads are faster if you drop & recreate.
  • INCLUDE columns aren’t searchable — they only enable index-only scans. Visibility map must be up to date (VACUUM) for index-only scans to actually skip the heap.

Q: How do you decide column order in a compound index?

Start with what’s always in the WHERE clause (the prefix has to be usable). Among those, put the equality predicates first, then the column you want to sort/range on. This is essentially Mongo’s ESR, and it’s universal — Winand calls it the leftmost-prefix rule. Selectivity only breaks ties.

Q: How would you read EXPLAIN ANALYZE on a slow query?

Top-down: (1) the access method on the leaf — Seq Scan on a big table is the first red flag; (2) Rows Removed by Filter — if huge, the index isn’t selective enough; (3) gap between estimated rows and actual rows — bad stats, run ANALYZE; (4) Buffers: shared read= vs hit= — high reads mean cache misses; (5) sort spilling to disk (external merge Disk:). Then look for missing indexes, partial-index opportunities, or rewrites that change the access path.

Sources: postgresql.org indexes-types, use-the-index-luke.com.


Postgres uses MVCC: every row version carries xmin/xmax transaction IDs and visibility is decided per snapshot. Readers never block writers; writers never block readers.

Three offered isolation levels:

  • READ COMMITTED (default) — every statement sees its own snapshot of committed data; allows nonrepeatable and phantom reads.
  • REPEATABLE READ — one snapshot for the whole transaction. Postgres also blocks phantoms, exceeding the SQL standard; raises could not serialize access due to concurrent update on conflicts.
  • SERIALIZABLE using SSI (Serializable Snapshot Isolation, Cahill et al.) — adds non-blocking predicate locks (SIReadLock) that detect read/write dependency cycles and abort one transaction with could not serialize access due to read/write dependencies. Apps must implement retry loops.

Gotchas

  • Postgres SERIALIZABLE never deadlocks but will abort with serialization failures — you must retry.
  • Postgres sequences are not transactionalnextval increments are not rolled back. Don’t expect contiguous IDs.
  • INSERT ... ON CONFLICT in READ COMMITTED can affect rows not visible in the snapshot — docs explicitly call this out.

Q: Walk me through SSI.

At REPEATABLE READ + dependency tracking, Postgres watches read/write conflicts between concurrent transactions via SIReadLock predicate locks. If it detects a cycle in the dependency graph that would prevent any serial ordering, it aborts one transaction. The locks don’t block — they only mark dependencies — so SSI gives serializable correctness without the throughput hit of strict 2PL.


Deep dive — replication and partitioning

Section titled “Deep dive — replication and partitioning”

Postgres offers two replication models:

  • Physical (streaming) replication ships WAL byte-for-byte from primary to standbys; standbys can be warm (no queries) or hot (read-only queries). All-or-nothing (whole cluster) and replicas must be same major version.
  • Logical replication (PG 10+) uses a publish/subscribe model on (publication, subscription) pairs — table-level granularity, supports cross-version replication, foundation for CDC pipelines (Debezium reads the logical decoding stream).

Sync vs async is per-standby (synchronous_standby_names); sync gives durability (commit waits for standby ack) at the cost of latency.

Declarative partitioning (PG 10+) divides one logical table into RANGE/LIST/HASH partitions; the planner does partition pruning at plan- and execution-time. For true horizontal sharding, Citus turns Postgres into a distributed database with a coordinator + workers.

Gotchas

  • Postgres logical replication does not replicate DDL, sequences, or large objects — you migrate schema separately; sequences need manual sync at cutover.
  • Postgres physical streaming replication is async by default — failover can lose committed transactions. Use synchronous + at least one quorum standby for regulated workloads.

Q: How would you scale Postgres before reaching for sharding?

In order: tune queries and indexes; vertical scale (Postgres on modern NVMe + lots of RAM goes very far); add read replicas for read-heavy workloads; partition large hot tables; introduce CDC via logical replication for analytics offload; only then consider Citus or a different store. Sharding is operational complexity — defer it.

Sources: postgresql.org high-availability, postgresql.org logical-replication, postgresql.org ddl-partitioning.


A slow query is a measurable problem, not a vibe. Disciplined path: measure → explain → fix → verify.

Enable pg_stat_statements (extension) and auto_explain (loadable module).

  • pg_stat_statements gives normalized query fingerprints with calls, total_exec_time, mean_exec_time, rows, shared_blks_hit/read — sort by total_exec_time DESC to find queries that cost the most aggregate time. A 50ms query running 10k times/min beats a 2s query running once/min.
  • auto_explain.log_min_duration = 500ms dumps a plan for every slow execution into the log.

EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) is the flight recorder. Read bottom-up (leaves first — that’s where the work happens). Canonical red flags (Winand):

  • Seq Scan on a large table when the predicate should be selective.
  • Rows Removed by Filter greatly exceeding Rows returned — the index is too coarse.
  • Gap between rows= (estimate) and actual rows= more than ~10× — stats are stale, run ANALYZE or raise default_statistics_target.
  • Sort Method: external merge Disk: 218MBwork_mem too small.
  • Buffers: shared read=… dominating hit=… — cold cache or table too big for shared buffers.
  • Nested Loop with an outer side of millions — planner thought it would be small.

Step 3 — write the next query efficiently (SQL Antipatterns, Karwin)

Section titled “Step 3 — write the next query efficiently (SQL Antipatterns, Karwin)”
  • Implicit Columns (SELECT *) bloats network and breaks index-only scans; always project explicit columns.
  • Random Selection (ORDER BY random() LIMIT 1 on a big table) is O(N) — use TABLESAMPLE or a primary-key gap technique.
  • Poor Man’s Search Engine (LIKE '%foo%') defeats B-tree indexes — use pg_trgm GIN, or full-text tsvector.
  • Spaghetti Query (one mega-SELECT with eight joins and three aggregates) — break into CTEs, materialize intermediates, or use WITH … AS MATERIALIZED in PG 12+ to force materialization.
  • Phantom Files (OFFSET 1000000 LIMIT 20) — paginate by keyset (WHERE (created_at, id) < ($last_ts, $last_id) ORDER BY created_at DESC, id DESC LIMIT 20); offset re-scans every prior row.

Re-run with the same parameters, compare executionTimeMillis / Total Time, and watch pg_stat_statements over a real workload window — single-shot improvements lie.

Gotchas

  • EXPLAIN without ANALYZE shows estimates, not reality — always use ANALYZE (it executes the query). On destructive statements wrap in a transaction and ROLLBACK.
  • pg_stat_statements parameterizes literals — WHERE id=1 and WHERE id=2 share one row. Check pg_stat_statements.track = all and that the extension is in shared_preload_libraries.
  • ANALYZE (statistics) ≠ ANALYZE (the EXPLAIN flag). First refreshes planner stats; second executes the plan.
  • Plan cache poisoning — a prepared statement that picked a generic plan for skewed data may stay bad. Postgres 14+ improves this; in PG ≤13 set plan_cache_mode = force_custom_plan for outliers.

Q: You have a query that’s been getting slower over six months. What do you do?

Pull pg_stat_statements to see if mean_exec_time drifted (data growth, plan flip) vs calls climbed (traffic). Run EXPLAIN (ANALYZE, BUFFERS) and compare against a known-good prior plan if you keep them. Check pg_stat_user_tables.n_dead_tup and last_autovacuum — bloat from missed vacuums is a frequent culprit on update-heavy tables. Look for index drift: an index that was selective at 1M rows may now match 30% of the table. Fixes layer: ANALYZE, VACUUM, rewrite query, add/replace index, partition.

Q: How do you paginate 50 million rows?

Never OFFSET. Use keyset/seek pagination: index on (created_at DESC, id DESC), page boundary carries the last (ts, id) tuple. Constant time per page regardless of depth. Trade-off: no random jumps to “page 1000” — but real users page sequentially. (Winand’s SQL Performance Explained.)

Sources: Karwin SQL Antipatterns ch. 18–20, Winand SQL Performance Explained, postgresql.org pg_stat_statements, postgresql.org auto-explain.


Deep dive — schema design (SQL Antipatterns)

Section titled “Deep dive — schema design (SQL Antipatterns)”

The pragmatic working rule from DDIA ch. 2: normalize until queries hurt, then denormalize deliberately with a plan to keep redundant copies consistent (usually via triggers, application code, or CDC).

Denormalization trades write cost (and risk of drift) for read performance. Two principled forms:

  • Materialized views (Postgres MATERIALIZED VIEW — refreshed manually or via REFRESH MATERIALIZED VIEW CONCURRENTLY).
  • Derived stores (project an event stream into Elasticsearch for search, Redis for hot reads, a wide column store for analytics).

DDIA argues this is the only honest way to denormalize at scale — treat the log as source of truth and rebuild derived views deterministically.

  1. Jaywalking — comma-separated lists in a column (tags = 'red,green,blue'). Can’t index individual values, can’t FK them, can’t enforce uniqueness. Fix: junction table. Postgres exception: text[] with a GIN index is acceptable when list is small and never joined.
  2. Naive Trees (Adjacency List)parent_id self-reference is fine for small trees but hellish for “all descendants.” Alternatives: Path Enumeration (path = '/1/4/9/'), Nested Sets, Closure Table (best general-purpose). PG-specific: ltree extension is purpose-built for hierarchies with GiST indexes.
  3. Entity-Attribute-Value (EAV)entity_id, attribute_name, value rows. Disastrous for type safety, NULL handling, query complexity. In Postgres, use JSONB instead.
  4. Polymorphic Associations — one FK column pointing at multiple parent tables. Breaks referential integrity. Fix: one junction table per parent type, or one comments table per parent.
  5. Multicolumn Attributestag1, tag2, tag3, tag4 columns. Same problem as Jaywalking. Fix: junction table.
  6. Metadata Tribbles — table-per-period (orders_2024_q1). Use declarative partitioning instead.
  7. ID Required — surrogate id everywhere even when a natural key exists. Often fine, but always declare the natural unique constraint as UNIQUE even if you also keep id.
  8. Keyless Entry — skipping foreign keys “for performance.” FKs cost almost nothing on modern hardware and they’re the only thing preventing orphan rows.

Online migrations on a hot table need care:

  • ALTER TABLE … ADD COLUMN with a non-volatile DEFAULT is metadata-only since PG 11 (free).
  • Adding NOT NULL requires a full table rewrite unless paired with DEFAULT.
  • Adding an index in production: always CREATE INDEX CONCURRENTLY (no AccessExclusiveLock, slower, can’t be inside a transaction).
  • For type changes and renames, the safe pattern is expand–migrate–contract: add new column, dual-write, backfill, switch reads, drop old.

Q: How would you model a hierarchical org chart in Postgres?

For depths ≤ ~6 levels: adjacency list + WITH RECURSIVE (simple, fits the relational model). For deeper, frequent ancestor/descendant queries: closure table — separate table of (ancestor, descendant, depth) updated by triggers. PG shortcut: the ltree extension with GiST indexes — purpose-built, well-tuned, and queries read naturally (path <@ 'top.org.team').

Q: When is denormalization the right answer?

When the read pattern is fixed, hot, and the join cost dominates the workload — after you’ve exhausted indexing and query rewrites. Pair the denormalization with a strict update path (trigger, CDC, or rebuild from event log) so you can prove no drift.

Sources: Karwin SQL Antipatterns ch. 1–9, Kleppmann DDIA ch. 2, 11–12.


Deep dive — storage engine algorithms (B-tree, LSM, Bloom, HLL)

Section titled “Deep dive — storage engine algorithms (B-tree, LSM, Bloom, HLL)”

B-trees and B+-trees (Bayer & McCreight, 1972). A balanced multi-way search tree where each node spans a disk page (typically 8 KB in Postgres). Search, insert, delete are O(log_B N) where B is fan-out (often 100–500), so a 4-level tree indexes hundreds of millions of rows with at most 4 page reads. B+-trees (used by Postgres, MySQL InnoDB, Oracle) keep all values in leaves and link leaves into a doubly linked list — what makes range scans cheap. Updates use page splits when a leaf overflows; this is where index bloat comes from on update-heavy workloads. Postgres mitigates with HOT updates (heap-only tuple — same page, no index update) when no indexed column changed, and with deduplication in PG 13+.

LSM-trees and SSTables (O’Neil 1996; SSTable from Bigtable paper 2006). Used by Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB’s WiredTiger (LSM mode), HBase. Writes go to an in-memory memtable + a WAL. When the memtable is full it’s flushed to disk as an immutable SSTable. Background compaction merges SSTables (size-tiered or leveled). Reads check memtable → recent SSTables → older SSTables, often skipping files via Bloom filters. Trade-off vs B-tree: higher write throughput at the cost of read amplification + space amplification during compaction. DDIA chart: B-trees for read-optimized, LSM for write-heavy.

Bloom filters (Bloom, 1970). A probabilistic bit array with k hash functions. add(x) sets k bits; contains(x) checks those k bits. Zero false negatives, tunable FP rate (~1% with ~10 bits/element and 7 hashes). Every LSM engine uses them to skip SSTable I/O. Postgres ships a bloom index access method (extension) — useful for “any-of-N equality columns” filters.

HyperLogLog (Flajolet et al., 2007). A streaming cardinality estimator. ~12 KB of memory → cardinality up to ~10⁹ at ~0.81% standard error. Implementations: Redis PFADD/PFCOUNT/PFMERGE, Postgres postgresql-hll extension, ClickHouse uniqHLL12. Right call when you need how many distinct but not which ones.

Skip lists, tries, count-min sketch, T-digest worth a mention: Redis ZSETs are skip lists (O(log N) ordered structure, simpler than balanced tree, easier to make lock-free); tries underpin some autocomplete indexes and ltree; count-min sketch estimates frequency the way HLL estimates cardinality; T-digest approximates quantiles (p50/p99) in bounded memory.

Q: Why is Postgres a B-tree shop while Cassandra is LSM?

Postgres optimizes for transactional read/write mixes with ordered range queries — exactly B-trees’ sweet spot. Cassandra targets very high write throughput on commodity disks where random writes would melt the hardware; LSM converts every write to a sequential append, batches them, merges in the background. DDIA framing: B-trees for balanced read/write, LSM for write-heavy.

Q: When would you use a Bloom filter in application code?

Any “is this maybe in the set, cheaply?” boundary. The CDN check before a costly origin fetch; the “have we seen this URL?” check in a crawler; the “is this email in a leaked-credentials dataset?” check. Inside the database you get them for free in LSM engines and as Postgres’s bloom extension.

Sources: Kleppmann DDIA ch. 3, Petrov Database Internals ch. 1–7, Flajolet et al. HyperLogLog (2007), Bloom (1970).