PostgreSQL — Theory
- Tuple has hidden
xmin(created),xmax(deleted),ctid(location). - UPDATE = new tuple + old marked dead.
- Readers don’t block writers; writers don’t block readers.
- Cost: dead tuples → bloat → vacuum needed.
- Autovacuum runs continuously. VACUUM FULL rewrites table (locks!).
- xid wraparound: 32-bit. Past ~2B txns → freeze old. Monitor
datfrozenxid.
Isolation prevents
Section titled “Isolation prevents”| Level | Dirty | Non-repeatable | Phantom | Serialization anomaly |
|---|---|---|---|---|
| RC | ✗ | ✓ | ✓ | ✓ |
| RR | ✗ | ✗ | ✗ (PG snapshot) | ✓ |
| Serializable | ✗ | ✗ | ✗ | ✗ |
- RC: each statement own snapshot. Bug:
SELECT ... ; UPDATE ...value can change. - RR: snapshot at first statement. Concurrent update →
40001retry. - Serializable: SSI detects RW dependency cycles.
- Row:
FOR UPDATE,FOR NO KEY UPDATE,FOR SHARE,FOR KEY SHARE. - Table: 8 levels,
ACCESS SHARE(SELECT) →ACCESS EXCLUSIVE(DROP/ALTER). - Advisory:
pg_advisory_lock(key)— app-level, survives statements. - Deadlock = abort one txn. Acquire locks in same order.
Query planning
Section titled “Query planning”- Cost-based. Stats from
ANALYZE/autoanalyze. EXPLAIN (ANALYZE, BUFFERS)for actual + cache hits.- Bad estimates: skewed data, correlated cols. Fix:
ALTER TABLE ... STATISTICS,CREATE STATISTICS.
Index gotchas
Section titled “Index gotchas”- Function on column kills B-tree → expression index.
- Implicit cast:
int_col = '1'may bypass. NOT IN,<>,LIKE '%foo'typically scan.- Index-only scan needs all cols + vacuum-current visibility map.
Partitioning
Section titled “Partitioning”- Declarative (10+): RANGE, LIST, HASH.
- Pruning at plan + execution.
- Indexes per-partition (cascade via parent).
- Drop old =
DROP TABLE(instant) vsDELETE(slow + bloat).
Replication
Section titled “Replication”- Streaming (physical, WAL): byte-identical. Sync or async.
- Logical (decode WAL): per-table, version-tolerant. CDC, sharding, upgrades.
- HA: Patroni, RDS Multi-AZ.
Performance hot topics
Section titled “Performance hot topics”- Connection pool (PgBouncer transaction mode).
- Bloat:
pg_stat_user_tablesdead tuples.pg_repackfor online rebuild. - HOT updates: if updated cols not indexed AND tuple fits page → no index update.
- TOAST: large fields out-of-line, compressed.
pg_stat_statementsfor top queries.
Common interview Qs
Section titled “Common interview Qs”count(*)vscount(col)— col skips NULL.LIMIT 10 OFFSET 1000000slow → keyset pagination.- Upsert:
ON CONFLICT (key) DO UPDATE. - UNION dedupes (sort), UNION ALL doesn’t.
- jsonb vs separate cols — flexibility vs query/index ease.
- Online add NOT NULL on huge: nullable → backfill batched → SET NOT NULL.
- VACUUM vs VACUUM FULL — online reclaim vs offline rewrite.
- SELECT FOR UPDATE on read replica → fails (replicas read-only).
- Autovacuum behind → tune per-table thresholds.
- SSI write-skew anomaly — concurrent reads + writes that violate invariant.
Schema design
Section titled “Schema design”id BIGINT GENERATED ALWAYS AS IDENTITYor UUID v7.created_at/updated_at timestamptz NOT NULL DEFAULT now()+ trigger.- Soft delete via
deleted_at+ partial index excluding deleted. - Index FK columns.
Deep dive — ACID, BASE, CAP, PACELC
Section titled “Deep dive — ACID, BASE, CAP, PACELC”ACID is a per-transaction contract: Atomicity (all-or-nothing), Consistency (transaction takes the database from one valid state to another, respecting all declared constraints — note this “C” is not the same C as in CAP), Isolation (concurrent transactions appear to run serially; Postgres implements this with MVCC and three distinct levels — SQL standard names four, but Postgres silently maps READ UNCOMMITTED to READ COMMITTED), Durability (committed data survives crashes — usually via WAL/oplog fsync). Postgres has been ACID since the beginning.
BASE — coined by Eric Brewer alongside CAP — is the loose counterpart for distributed/scale-out systems: Basically Available (the system answers, possibly with stale data), Soft state (state may drift between replicas without input), Eventual consistency (given no new writes, replicas converge).
CAP says under a network partition you must pick C (refuse the write/read to stay consistent) or A (accept and risk divergence). PACELC (Daniel Abadi, 2010/2012) extends this: if Partition then A vs C, Else (normal operation) Latency vs Consistency. PACELC matters because partitions are rare; the daily trade-off in any replicated system is “do I wait for a quorum (consistency) or return now (latency)?” Postgres synchronous replication is PC/EC; MongoDB with w:"majority" + readConcern:"majority" is PA/EC-ish; Cassandra/DynamoDB default are PA/EL.
Gotchas
- The “C” in ACID ≠ “C” in CAP. ACID-C is integrity constraints; CAP-C is linearizability.
- Postgres durability depends on
synchronous_commitandfsync— turning either off trades durability for throughput; rarely worth it.
Q: Explain PACELC for an audit log.
An audit log demands consistency in both modes. Pick PC/EC: Postgres with synchronous replication to at least one standby. Accept higher write latency and reduced availability during partition because lost or reordered audit events are unacceptable for a regulated system.
Sources: postgresql.org transaction-iso, Abadi (PACELC, 2012), Kleppmann DDIA ch. 7–9.
Deep dive — when to pick SQL vs NoSQL
Section titled “Deep dive — when to pick SQL vs NoSQL”Pick Postgres when:
- Domain has well-defined entities with foreign-key relationships.
- You need ad-hoc queries and joins (analysts, BI, regulatory reporting).
- Strong transactional integrity required (anything financial, identity, audit).
- You can scale vertically + read replicas — Postgres on modern hardware easily handles 10s of TB and 10k+ TPS.
Pick Mongo when:
- Document is the natural unit of work and rarely joined.
- Schema varies widely across rows (sparse attributes).
- You genuinely need horizontal write scaling beyond a single node.
- Team prefers dynamic schemas and quick iteration.
A useful rule from Kleppmann (DDIA ch. 2): if your data has many-to-many relationships, the relational model usually wins because joins are first-class; if it’s a tree of one-to-many (a document with embedded line items, comments, tags) and you fetch the whole tree, documents win because there’s no join cost.
For a regulated/enterprise system: licensing/inspection records → SQL (relational, audited, joined to entities and people); document attachments/metadata → Postgres JSONB or Mongo; session store → Redis (KV); event/audit log → append-only specialized store (Postgres partitioned table or Kafka + cold store).
| Workload | Best fit | Why |
|---|---|---|
| Financial ledger | Postgres | SERIALIZABLE, FK constraints, double-entry checks |
| Product catalog with variant attrs | Postgres + JSONB or Mongo | Sparse attrs; tie-breaker is whether you join to orders/inventory |
| Event/audit log | Postgres partitioned by day, or Kafka → S3 | Append-only, time-range queries |
| Session store | Redis | TTL, sub-ms KV |
| CMS / docs | Mongo | Document model fits; rarely joined |
| Geospatial | PostGIS on Postgres | Mature spatial joins, GIST indexes |
Gotchas
- “We’ll outgrow Postgres” is almost always wrong — you’ll outgrow your indexing strategy first.
- Mongo’s “schemaless” is a lie at scale — you end up enforcing schema in app code or via JSON Schema validators.
- Joins in Mongo (
$lookup) are slow if the foreign collection isn’t indexed on the join field. - Postgres can do JSONB and relational — you don’t have to pick.
- Operational cost: managing a Mongo sharded cluster (config servers, mongos, shards) is materially harder than scaling Postgres vertically + replicas.
Q: When does NoSQL win?
When the unit of read/write is a self-contained document fetched by ID, and you need horizontal write scale beyond what one Postgres node can do. Example: IoT telemetry firehose with 100k writes/sec across many devices, each device’s recent state read independently — Mongo with a hashed shard key on deviceId distributes both reads and writes evenly.
Deep dive — JSONB vs MongoDB documents
Section titled “Deep dive — JSONB vs MongoDB documents”JSONB is Postgres’s binary-decomposed JSON type (since 9.4). Unlike json (which stores raw text and re-parses on every access), jsonb parses on insert, deduplicates keys, drops insignificant whitespace, and stores in a binary format that supports fast extraction and indexing.
Operators: -> extracts value as jsonb, ->> extracts as text, @> is containment (“does the LHS contain the RHS?”), ?/?|/?& test top-level key existence, @? and @@ evaluate jsonpath expressions.
Indexed with GIN using either:
jsonb_ops(default) — supports@>,?,?|,?&,@?,@@; larger index.jsonb_path_ops— supports only@>,@?,@@; smaller and faster but no key-existence queries.
The often-overlooked point: Postgres JSONB gives you Mongo’s flexible-schema benefit while keeping the relational engine — you can JOIN a normalized users table to a JSONB preferences column in the same query and wrap the whole thing in a SERIALIZABLE transaction.
Gotchas
jsonb_path_opsdoes not support?,?|,?&— pick the operator class to match your queries.- The
?operator only checks top-level keys.'{"a":{"b":1}}'::jsonb ? 'b'isfalse. - GIN indexes are slower to update than B-tree; very write-heavy JSONB columns can be a bottleneck. Tune
gin_pending_list_limit/fastupdate. - JSONB key order is not preserved; if you need to round-trip exact text, use
jsoninstead. - Don’t “JSONB-everything.” Columns that always exist and have a fixed type belong in proper columns — smaller rows, better stats, FK constraints.
Sources: postgresql.org datatype-json.
Deep dive — indexing
Section titled “Deep dive — indexing”Postgres ships six index access methods:
- B-tree (default) — handles
=,<,<=,>,>=,BETWEEN,IN,IS NULL, anchoredLIKE 'foo%'; can also serveORDER BYdirectly. - Hash (WAL-logged and crash-safe since PG 10) — supports only
=; rarely worth choosing over B-tree. - GIN (Generalized Inverted Index) — for composite values — JSONB, arrays, full-text
tsvector. - GiST (Generalized Search Tree) — framework for geometry, ranges, trigrams; supports nearest-neighbor
ORDER BY col <-> point. - SP-GiST — for non-balanced structures (quadtrees, tries).
- BRIN (Block Range INdex) — stores per-block min/max; tiny on disk, ideal for very large tables where indexed column correlates with physical row order (time-series).
Beyond access methods, four index shapes matter:
- Multicolumn indexes obey the leftmost-prefix rule (Markus Winand): an index on
(a,b,c)serves predicates ona,(a,b),(a,b,c)— but notborcalone. Choose the leading column to match your most common predicate, not “the most selective column” — selectivity matters only after the prefix rule lets you use the index. - Partial indexes include a
WHEREclause and only index matching rows (CREATE INDEX ON orders(id) WHERE status='open'is tiny if 99% are closed). - Expression indexes index the result of a function (
LOWER(email)). - Covering indexes with
INCLUDEadd non-key payload columns to enable index-only scans without bloating the search tree.
Gotchas
- Leftmost-prefix is real:
WHERE b=? AND c=?on(a,b,c)does not use the index unlessais also constrained. LIKE '%foo'(leading wildcard) cannot use a B-tree — needs a trigram GiST/GIN (pg_trgm).- Functional predicates need expression indexes —
WHERE LOWER(email)=?won’t use an index onemail. - GIN indexes are slow to update; bulk loads are faster if you drop & recreate.
INCLUDEcolumns aren’t searchable — they only enable index-only scans. Visibility map must be up to date (VACUUM) for index-only scans to actually skip the heap.
Q: How do you decide column order in a compound index?
Start with what’s always in the WHERE clause (the prefix has to be usable). Among those, put the equality predicates first, then the column you want to sort/range on. This is essentially Mongo’s ESR, and it’s universal — Winand calls it the leftmost-prefix rule. Selectivity only breaks ties.
Q: How would you read EXPLAIN ANALYZE on a slow query?
Top-down: (1) the access method on the leaf — Seq Scan on a big table is the first red flag; (2) Rows Removed by Filter — if huge, the index isn’t selective enough; (3) gap between estimated rows and actual rows — bad stats, run ANALYZE; (4) Buffers: shared read= vs hit= — high reads mean cache misses; (5) sort spilling to disk (external merge Disk:). Then look for missing indexes, partial-index opportunities, or rewrites that change the access path.
Sources: postgresql.org indexes-types, use-the-index-luke.com.
Deep dive — transactions with SSI
Section titled “Deep dive — transactions with SSI”Postgres uses MVCC: every row version carries xmin/xmax transaction IDs and visibility is decided per snapshot. Readers never block writers; writers never block readers.
Three offered isolation levels:
- READ COMMITTED (default) — every statement sees its own snapshot of committed data; allows nonrepeatable and phantom reads.
- REPEATABLE READ — one snapshot for the whole transaction. Postgres also blocks phantoms, exceeding the SQL standard; raises
could not serialize access due to concurrent updateon conflicts. - SERIALIZABLE using SSI (Serializable Snapshot Isolation, Cahill et al.) — adds non-blocking predicate locks (
SIReadLock) that detect read/write dependency cycles and abort one transaction withcould not serialize access due to read/write dependencies. Apps must implement retry loops.
Gotchas
- Postgres SERIALIZABLE never deadlocks but will abort with serialization failures — you must retry.
- Postgres sequences are not transactional —
nextvalincrements are not rolled back. Don’t expect contiguous IDs. INSERT ... ON CONFLICTin READ COMMITTED can affect rows not visible in the snapshot — docs explicitly call this out.
Q: Walk me through SSI.
At REPEATABLE READ + dependency tracking, Postgres watches read/write conflicts between concurrent transactions via SIReadLock predicate locks. If it detects a cycle in the dependency graph that would prevent any serial ordering, it aborts one transaction. The locks don’t block — they only mark dependencies — so SSI gives serializable correctness without the throughput hit of strict 2PL.
Deep dive — replication and partitioning
Section titled “Deep dive — replication and partitioning”Postgres offers two replication models:
- Physical (streaming) replication ships WAL byte-for-byte from primary to standbys; standbys can be warm (no queries) or hot (read-only queries). All-or-nothing (whole cluster) and replicas must be same major version.
- Logical replication (PG 10+) uses a publish/subscribe model on
(publication, subscription)pairs — table-level granularity, supports cross-version replication, foundation for CDC pipelines (Debezium reads the logical decoding stream).
Sync vs async is per-standby (synchronous_standby_names); sync gives durability (commit waits for standby ack) at the cost of latency.
Declarative partitioning (PG 10+) divides one logical table into RANGE/LIST/HASH partitions; the planner does partition pruning at plan- and execution-time. For true horizontal sharding, Citus turns Postgres into a distributed database with a coordinator + workers.
Gotchas
- Postgres logical replication does not replicate DDL, sequences, or large objects — you migrate schema separately; sequences need manual sync at cutover.
- Postgres physical streaming replication is async by default — failover can lose committed transactions. Use synchronous + at least one quorum standby for regulated workloads.
Q: How would you scale Postgres before reaching for sharding?
In order: tune queries and indexes; vertical scale (Postgres on modern NVMe + lots of RAM goes very far); add read replicas for read-heavy workloads; partition large hot tables; introduce CDC via logical replication for analytics offload; only then consider Citus or a different store. Sharding is operational complexity — defer it.
Sources: postgresql.org high-availability, postgresql.org logical-replication, postgresql.org ddl-partitioning.
Deep dive — diagnosing slow queries
Section titled “Deep dive — diagnosing slow queries”A slow query is a measurable problem, not a vibe. Disciplined path: measure → explain → fix → verify.
Step 1 — find it
Section titled “Step 1 — find it”Enable pg_stat_statements (extension) and auto_explain (loadable module).
pg_stat_statementsgives normalized query fingerprints withcalls,total_exec_time,mean_exec_time,rows,shared_blks_hit/read— sort bytotal_exec_time DESCto find queries that cost the most aggregate time. A 50ms query running 10k times/min beats a 2s query running once/min.auto_explain.log_min_duration = 500msdumps a plan for every slow execution into the log.
Step 2 — read the plan
Section titled “Step 2 — read the plan”EXPLAIN (ANALYZE, BUFFERS, VERBOSE, SETTINGS) is the flight recorder. Read bottom-up (leaves first — that’s where the work happens). Canonical red flags (Winand):
Seq Scanon a large table when the predicate should be selective.Rows Removed by Filtergreatly exceedingRowsreturned — the index is too coarse.- Gap between
rows=(estimate) andactual rows=more than ~10× — stats are stale, runANALYZEor raisedefault_statistics_target. Sort Method: external merge Disk: 218MB—work_memtoo small.Buffers: shared read=…dominatinghit=…— cold cache or table too big for shared buffers.Nested Loopwith an outer side of millions — planner thought it would be small.
Step 3 — write the next query efficiently (SQL Antipatterns, Karwin)
Section titled “Step 3 — write the next query efficiently (SQL Antipatterns, Karwin)”- Implicit Columns (
SELECT *) bloats network and breaks index-only scans; always project explicit columns. - Random Selection (
ORDER BY random() LIMIT 1on a big table) is O(N) — use TABLESAMPLE or a primary-key gap technique. - Poor Man’s Search Engine (
LIKE '%foo%') defeats B-tree indexes — usepg_trgmGIN, or full-texttsvector. - Spaghetti Query (one mega-SELECT with eight joins and three aggregates) — break into CTEs, materialize intermediates, or use
WITH … AS MATERIALIZEDin PG 12+ to force materialization. - Phantom Files (
OFFSET 1000000 LIMIT 20) — paginate by keyset (WHERE (created_at, id) < ($last_ts, $last_id) ORDER BY created_at DESC, id DESC LIMIT 20); offset re-scans every prior row.
Step 4 — verify
Section titled “Step 4 — verify”Re-run with the same parameters, compare executionTimeMillis / Total Time, and watch pg_stat_statements over a real workload window — single-shot improvements lie.
Gotchas
EXPLAINwithoutANALYZEshows estimates, not reality — always useANALYZE(it executes the query). On destructive statements wrap in a transaction andROLLBACK.pg_stat_statementsparameterizes literals —WHERE id=1andWHERE id=2share one row. Checkpg_stat_statements.track = alland that the extension is inshared_preload_libraries.ANALYZE(statistics) ≠ANALYZE(the EXPLAIN flag). First refreshes planner stats; second executes the plan.- Plan cache poisoning — a prepared statement that picked a generic plan for skewed data may stay bad. Postgres 14+ improves this; in PG ≤13 set
plan_cache_mode = force_custom_planfor outliers.
Q: You have a query that’s been getting slower over six months. What do you do?
Pull pg_stat_statements to see if mean_exec_time drifted (data growth, plan flip) vs calls climbed (traffic). Run EXPLAIN (ANALYZE, BUFFERS) and compare against a known-good prior plan if you keep them. Check pg_stat_user_tables.n_dead_tup and last_autovacuum — bloat from missed vacuums is a frequent culprit on update-heavy tables. Look for index drift: an index that was selective at 1M rows may now match 30% of the table. Fixes layer: ANALYZE, VACUUM, rewrite query, add/replace index, partition.
Q: How do you paginate 50 million rows?
Never OFFSET. Use keyset/seek pagination: index on (created_at DESC, id DESC), page boundary carries the last (ts, id) tuple. Constant time per page regardless of depth. Trade-off: no random jumps to “page 1000” — but real users page sequentially. (Winand’s SQL Performance Explained.)
Sources: Karwin SQL Antipatterns ch. 18–20, Winand SQL Performance Explained, postgresql.org pg_stat_statements, postgresql.org auto-explain.
Deep dive — schema design (SQL Antipatterns)
Section titled “Deep dive — schema design (SQL Antipatterns)”The pragmatic working rule from DDIA ch. 2: normalize until queries hurt, then denormalize deliberately with a plan to keep redundant copies consistent (usually via triggers, application code, or CDC).
Denormalization trades write cost (and risk of drift) for read performance. Two principled forms:
- Materialized views (Postgres
MATERIALIZED VIEW— refreshed manually or viaREFRESH MATERIALIZED VIEW CONCURRENTLY). - Derived stores (project an event stream into Elasticsearch for search, Redis for hot reads, a wide column store for analytics).
DDIA argues this is the only honest way to denormalize at scale — treat the log as source of truth and rebuild derived views deterministically.
Karwin’s SQL Antipatterns to avoid
Section titled “Karwin’s SQL Antipatterns to avoid”- Jaywalking — comma-separated lists in a column (
tags = 'red,green,blue'). Can’t index individual values, can’t FK them, can’t enforce uniqueness. Fix: junction table. Postgres exception:text[]with a GIN index is acceptable when list is small and never joined. - Naive Trees (Adjacency List) —
parent_idself-reference is fine for small trees but hellish for “all descendants.” Alternatives: Path Enumeration (path = '/1/4/9/'), Nested Sets, Closure Table (best general-purpose). PG-specific:ltreeextension is purpose-built for hierarchies with GiST indexes. - Entity-Attribute-Value (EAV) —
entity_id,attribute_name,valuerows. Disastrous for type safety, NULL handling, query complexity. In Postgres, use JSONB instead. - Polymorphic Associations — one FK column pointing at multiple parent tables. Breaks referential integrity. Fix: one junction table per parent type, or one
commentstable per parent. - Multicolumn Attributes —
tag1, tag2, tag3, tag4columns. Same problem as Jaywalking. Fix: junction table. - Metadata Tribbles — table-per-period (
orders_2024_q1). Use declarative partitioning instead. - ID Required — surrogate
ideverywhere even when a natural key exists. Often fine, but always declare the natural unique constraint asUNIQUEeven if you also keepid. - Keyless Entry — skipping foreign keys “for performance.” FKs cost almost nothing on modern hardware and they’re the only thing preventing orphan rows.
Schema evolution
Section titled “Schema evolution”Online migrations on a hot table need care:
ALTER TABLE … ADD COLUMNwith a non-volatile DEFAULT is metadata-only since PG 11 (free).- Adding
NOT NULLrequires a full table rewrite unless paired withDEFAULT. - Adding an index in production: always
CREATE INDEX CONCURRENTLY(noAccessExclusiveLock, slower, can’t be inside a transaction). - For type changes and renames, the safe pattern is expand–migrate–contract: add new column, dual-write, backfill, switch reads, drop old.
Q: How would you model a hierarchical org chart in Postgres?
For depths ≤ ~6 levels: adjacency list + WITH RECURSIVE (simple, fits the relational model). For deeper, frequent ancestor/descendant queries: closure table — separate table of (ancestor, descendant, depth) updated by triggers. PG shortcut: the ltree extension with GiST indexes — purpose-built, well-tuned, and queries read naturally (path <@ 'top.org.team').
Q: When is denormalization the right answer?
When the read pattern is fixed, hot, and the join cost dominates the workload — after you’ve exhausted indexing and query rewrites. Pair the denormalization with a strict update path (trigger, CDC, or rebuild from event log) so you can prove no drift.
Sources: Karwin SQL Antipatterns ch. 1–9, Kleppmann DDIA ch. 2, 11–12.
Deep dive — storage engine algorithms (B-tree, LSM, Bloom, HLL)
Section titled “Deep dive — storage engine algorithms (B-tree, LSM, Bloom, HLL)”B-trees and B+-trees (Bayer & McCreight, 1972). A balanced multi-way search tree where each node spans a disk page (typically 8 KB in Postgres). Search, insert, delete are O(log_B N) where B is fan-out (often 100–500), so a 4-level tree indexes hundreds of millions of rows with at most 4 page reads. B+-trees (used by Postgres, MySQL InnoDB, Oracle) keep all values in leaves and link leaves into a doubly linked list — what makes range scans cheap. Updates use page splits when a leaf overflows; this is where index bloat comes from on update-heavy workloads. Postgres mitigates with HOT updates (heap-only tuple — same page, no index update) when no indexed column changed, and with deduplication in PG 13+.
LSM-trees and SSTables (O’Neil 1996; SSTable from Bigtable paper 2006). Used by Cassandra, RocksDB, LevelDB, ScyllaDB, MongoDB’s WiredTiger (LSM mode), HBase. Writes go to an in-memory memtable + a WAL. When the memtable is full it’s flushed to disk as an immutable SSTable. Background compaction merges SSTables (size-tiered or leveled). Reads check memtable → recent SSTables → older SSTables, often skipping files via Bloom filters. Trade-off vs B-tree: higher write throughput at the cost of read amplification + space amplification during compaction. DDIA chart: B-trees for read-optimized, LSM for write-heavy.
Bloom filters (Bloom, 1970). A probabilistic bit array with k hash functions. add(x) sets k bits; contains(x) checks those k bits. Zero false negatives, tunable FP rate (~1% with ~10 bits/element and 7 hashes). Every LSM engine uses them to skip SSTable I/O. Postgres ships a bloom index access method (extension) — useful for “any-of-N equality columns” filters.
HyperLogLog (Flajolet et al., 2007). A streaming cardinality estimator. ~12 KB of memory → cardinality up to ~10⁹ at ~0.81% standard error. Implementations: Redis PFADD/PFCOUNT/PFMERGE, Postgres postgresql-hll extension, ClickHouse uniqHLL12. Right call when you need how many distinct but not which ones.
Skip lists, tries, count-min sketch, T-digest worth a mention: Redis ZSETs are skip lists (O(log N) ordered structure, simpler than balanced tree, easier to make lock-free); tries underpin some autocomplete indexes and ltree; count-min sketch estimates frequency the way HLL estimates cardinality; T-digest approximates quantiles (p50/p99) in bounded memory.
Q: Why is Postgres a B-tree shop while Cassandra is LSM?
Postgres optimizes for transactional read/write mixes with ordered range queries — exactly B-trees’ sweet spot. Cassandra targets very high write throughput on commodity disks where random writes would melt the hardware; LSM converts every write to a sequential append, batches them, merges in the background. DDIA framing: B-trees for balanced read/write, LSM for write-heavy.
Q: When would you use a Bloom filter in application code?
Any “is this maybe in the set, cheaply?” boundary. The CDN check before a costly origin fetch; the “have we seen this URL?” check in a crawler; the “is this email in a leaked-credentials dataset?” check. Inside the database you get them for free in LSM engines and as Postgres’s bloom extension.
Sources: Kleppmann DDIA ch. 3, Petrov Database Internals ch. 1–7, Flajolet et al. HyperLogLog (2007), Bloom (1970).