Distributed Systems — Theory
Distributed Systems — Theory (interview deep-dive)
Section titled “Distributed Systems — Theory (interview deep-dive)”Raft in a nutshell (the part you must know)
Section titled “Raft in a nutshell (the part you must know)”Raft splits consensus into 3 sub-problems: leader election, log replication, safety.
- Leader — receives client commands, replicates log entries.
- Follower — passive; respond to leader/candidate RPCs.
- Candidate — trying to become leader.
Monotonic counter. Each term has at most one leader. Stale messages with old terms are ignored.
Leader election
Section titled “Leader election”- Followers expect heartbeats from leader. On timeout, become candidate, increment term, vote for self, request votes.
- Win when majority votes received. Else timeout → another election.
- Raft restricts: only candidates with up-to-date logs can win (prevents data loss).
Log replication
Section titled “Log replication”- Client → leader → leader appends to its log → replicates to followers via
AppendEntriesRPC. - Once majority acks, entry is committed → leader applies to state machine, replies to client.
- Followers eventually apply. On leader change, new leader uses log overlap to converge followers.
Safety guarantees
Section titled “Safety guarantees”- Leader’s log contains all committed entries.
- Once an entry is committed, it stays committed across leader changes.
Paxos vs Raft (interview answer)
Section titled “Paxos vs Raft (interview answer)”Both are equivalent in correctness under same failure model (fail-stop, async network). Raft is easier to understand and implement because:
- Only the leader writes. No multi-decree complexity.
- Log entries committed in order.
- Explicit phases (election, replication).
Many real systems use Raft (etcd, Consul, CockroachDB). Some use Paxos (Spanner uses a variant). Don’t try to explain Paxos step-by-step in an interview; mention it as the foundation, then describe Raft.
Quorums
Section titled “Quorums”For N replicas:
- W = write quorum, R = read quorum.
- W + R > N ensures any read sees the latest write.
- W > N/2 ensures only one write quorum at a time.
Cassandra exposes W and R as consistency_level per query (ONE, QUORUM, ALL).
Two-generals problem
Section titled “Two-generals problem”Two armies must coordinate attack via unreliable messengers. Cannot guarantee agreement with unreliable channel — every ack itself can be lost.
Practical implication: exactly-once delivery is impossible without app-layer dedupe. You always have at-least-once with a deduplication key.
FLP impossibility
Section titled “FLP impossibility”Fischer, Lynch, Paterson (1985): in fully async system with one possibly-faulty node, deterministic consensus is impossible.
Real systems work around:
- Partial synchrony assumption (eventually network behaves).
- Randomization (random timeouts in Raft elections).
- Failure detectors (timeouts decide who’s “dead”).
Linearizability vs serializability
Section titled “Linearizability vs serializability”Often confused.
- Linearizability — about operations on a single object appearing atomic and respecting real-time order.
- Serializability — about transactions over multiple objects equivalent to some serial order. Doesn’t require real-time.
- Strict serializability — both. Hardest to provide.
Postgres at Serializable: serializable but not linearizable across replicas.
Consensus is fundamental to
Section titled “Consensus is fundamental to”- Distributed locks — fenced via consensus (etcd, ZooKeeper).
- Configuration store — same.
- Leader election for replicated services.
- Distributed transactions (Spanner via Paxos groups, CockroachDB via Raft).
- Atomic commit across shards (2PC built atop consensus per shard).
Idempotency & message delivery
Section titled “Idempotency & message delivery”Networks deliver:
- At most once — drop on network loss; never duplicate.
- At least once — retry until ack; may duplicate.
- Exactly once — only via dedupe at receiver.
Design APIs/consumers to be idempotent. Dedupe by message id stored in receiver’s DB (inbox).
Replication conflict resolution (multi-leader / leaderless)
Section titled “Replication conflict resolution (multi-leader / leaderless)”- Last-write-wins (LWW) — based on timestamp. Lossy under clock skew.
- CRDT (Conflict-free Replicated Data Type) — math-defined merge: G-counter, OR-set, RGA.
- Operational transforms (OT) — used in collaborative editors (Google Docs).
- Per-app merge — application chooses (e.g., shopping cart unions items).
Latency / availability math
Section titled “Latency / availability math”- 3-9s = 99.9% (8.7h downtime/year).
- 4-9s = 99.99% (53min/year).
- 5-9s = 99.999% (5.3min/year).
- Availability of chained services multiplies: A × B × C. Degraded modes / async paths help.
Fan-out / scatter-gather
Section titled “Fan-out / scatter-gather”Reading from many shards to assemble result:
- Tail latency dominates: response time = max of all shards’ response times.
- Mitigation: hedged requests (send second request to slow shard after p95 deadline), request reduction, timeout + partial result.
Jeff Dean’s “Achieving Rapid Response Times in Large Online Services” — classic reading.
Common interview questions
Section titled “Common interview questions”- Design a distributed lock (use etcd lease + fencing tokens).
- Implement a counter that handles 1M ops/sec across regions (CRDT G-counter).
- How does Spanner achieve external consistency globally? TrueTime API + Paxos groups + commit-wait.
- Describe split-brain and how to prevent. Quorum + fencing tokens (monotonic increasing token from coordinator).
- Explain why dual-write between DB and cache/broker fails atomicity. Outbox.
- Why do retries multiply load? Backoff + jitter + breaker.
- Hot key in a distributed cache — what to do? L1 cache layer, key sharding, request coalescing.
Recommended reading (for prep)
Section titled “Recommended reading (for prep)”- Designing Data-Intensive Applications — Kleppmann.
- Database Internals — Petrov.
- Raft paper “In Search of an Understandable Consensus Algorithm”.
- Google Spanner paper.
- Jeff Dean’s tail latency papers.