Skip to content

Distributed Systems — Theory

Distributed Systems — Theory (interview deep-dive)

Section titled “Distributed Systems — Theory (interview deep-dive)”

Raft in a nutshell (the part you must know)

Section titled “Raft in a nutshell (the part you must know)”

Raft splits consensus into 3 sub-problems: leader election, log replication, safety.

  • Leader — receives client commands, replicates log entries.
  • Follower — passive; respond to leader/candidate RPCs.
  • Candidate — trying to become leader.

Monotonic counter. Each term has at most one leader. Stale messages with old terms are ignored.

  • Followers expect heartbeats from leader. On timeout, become candidate, increment term, vote for self, request votes.
  • Win when majority votes received. Else timeout → another election.
  • Raft restricts: only candidates with up-to-date logs can win (prevents data loss).
  • Client → leader → leader appends to its log → replicates to followers via AppendEntries RPC.
  • Once majority acks, entry is committed → leader applies to state machine, replies to client.
  • Followers eventually apply. On leader change, new leader uses log overlap to converge followers.
  • Leader’s log contains all committed entries.
  • Once an entry is committed, it stays committed across leader changes.

Both are equivalent in correctness under same failure model (fail-stop, async network). Raft is easier to understand and implement because:

  • Only the leader writes. No multi-decree complexity.
  • Log entries committed in order.
  • Explicit phases (election, replication).

Many real systems use Raft (etcd, Consul, CockroachDB). Some use Paxos (Spanner uses a variant). Don’t try to explain Paxos step-by-step in an interview; mention it as the foundation, then describe Raft.

For N replicas:

  • W = write quorum, R = read quorum.
  • W + R > N ensures any read sees the latest write.
  • W > N/2 ensures only one write quorum at a time.

Cassandra exposes W and R as consistency_level per query (ONE, QUORUM, ALL).

Two armies must coordinate attack via unreliable messengers. Cannot guarantee agreement with unreliable channel — every ack itself can be lost.

Practical implication: exactly-once delivery is impossible without app-layer dedupe. You always have at-least-once with a deduplication key.

Fischer, Lynch, Paterson (1985): in fully async system with one possibly-faulty node, deterministic consensus is impossible.

Real systems work around:

  • Partial synchrony assumption (eventually network behaves).
  • Randomization (random timeouts in Raft elections).
  • Failure detectors (timeouts decide who’s “dead”).

Often confused.

  • Linearizability — about operations on a single object appearing atomic and respecting real-time order.
  • Serializability — about transactions over multiple objects equivalent to some serial order. Doesn’t require real-time.
  • Strict serializability — both. Hardest to provide.

Postgres at Serializable: serializable but not linearizable across replicas.

  • Distributed locks — fenced via consensus (etcd, ZooKeeper).
  • Configuration store — same.
  • Leader election for replicated services.
  • Distributed transactions (Spanner via Paxos groups, CockroachDB via Raft).
  • Atomic commit across shards (2PC built atop consensus per shard).

Networks deliver:

  • At most once — drop on network loss; never duplicate.
  • At least once — retry until ack; may duplicate.
  • Exactly once — only via dedupe at receiver.

Design APIs/consumers to be idempotent. Dedupe by message id stored in receiver’s DB (inbox).

Replication conflict resolution (multi-leader / leaderless)

Section titled “Replication conflict resolution (multi-leader / leaderless)”
  • Last-write-wins (LWW) — based on timestamp. Lossy under clock skew.
  • CRDT (Conflict-free Replicated Data Type) — math-defined merge: G-counter, OR-set, RGA.
  • Operational transforms (OT) — used in collaborative editors (Google Docs).
  • Per-app merge — application chooses (e.g., shopping cart unions items).
  • 3-9s = 99.9% (8.7h downtime/year).
  • 4-9s = 99.99% (53min/year).
  • 5-9s = 99.999% (5.3min/year).
  • Availability of chained services multiplies: A × B × C. Degraded modes / async paths help.

Reading from many shards to assemble result:

  • Tail latency dominates: response time = max of all shards’ response times.
  • Mitigation: hedged requests (send second request to slow shard after p95 deadline), request reduction, timeout + partial result.

Jeff Dean’s “Achieving Rapid Response Times in Large Online Services” — classic reading.

  1. Design a distributed lock (use etcd lease + fencing tokens).
  2. Implement a counter that handles 1M ops/sec across regions (CRDT G-counter).
  3. How does Spanner achieve external consistency globally? TrueTime API + Paxos groups + commit-wait.
  4. Describe split-brain and how to prevent. Quorum + fencing tokens (monotonic increasing token from coordinator).
  5. Explain why dual-write between DB and cache/broker fails atomicity. Outbox.
  6. Why do retries multiply load? Backoff + jitter + breaker.
  7. Hot key in a distributed cache — what to do? L1 cache layer, key sharding, request coalescing.
  • Designing Data-Intensive Applications — Kleppmann.
  • Database Internals — Petrov.
  • Raft paper “In Search of an Understandable Consensus Algorithm”.
  • Google Spanner paper.
  • Jeff Dean’s tail latency papers.