Distributed Systems — Theory

Distributed Systems — Theory (interview deep-dive)

Raft in a nutshell (the part you must know)

Raft splits consensus into 3 sub-problems: leader election, log replication, safety.

Roles

Leader — receives client commands, replicates log entries.
Follower — passive; respond to leader/candidate RPCs.
Candidate — trying to become leader.

Term

Monotonic counter. Each term has at most one leader. Stale messages with old terms are ignored.

Leader election

Followers expect heartbeats from leader. On timeout, become candidate, increment term, vote for self, request votes.
Win when majority votes received. Else timeout → another election.
Raft restricts: only candidates with up-to-date logs can win (prevents data loss).

Log replication

Client → leader → leader appends to its log → replicates to followers via AppendEntries RPC.
Once majority acks, entry is committed → leader applies to state machine, replies to client.
Followers eventually apply. On leader change, new leader uses log overlap to converge followers.

Safety guarantees

Leader’s log contains all committed entries.
Once an entry is committed, it stays committed across leader changes.

Paxos vs Raft (interview answer)

Both are equivalent in correctness under same failure model (fail-stop, async network). Raft is easier to understand and implement because:

Only the leader writes. No multi-decree complexity.
Log entries committed in order.
Explicit phases (election, replication).

Many real systems use Raft (etcd, Consul, CockroachDB). Some use Paxos (Spanner uses a variant). Don’t try to explain Paxos step-by-step in an interview; mention it as the foundation, then describe Raft.

Quorums

For N replicas:

W = write quorum, R = read quorum.
W + R > N ensures any read sees the latest write.
W > N/2 ensures only one write quorum at a time.

Cassandra exposes W and R as consistency_level per query (ONE, QUORUM, ALL).

Two-generals problem

Two armies must coordinate attack via unreliable messengers. Cannot guarantee agreement with unreliable channel — every ack itself can be lost.

Practical implication: exactly-once delivery is impossible without app-layer dedupe. You always have at-least-once with a deduplication key.

FLP impossibility

Fischer, Lynch, Paterson (1985): in fully async system with one possibly-faulty node, deterministic consensus is impossible.

Real systems work around:

Partial synchrony assumption (eventually network behaves).
Randomization (random timeouts in Raft elections).
Failure detectors (timeouts decide who’s “dead”).

Linearizability vs serializability

Often confused.

Linearizability — about operations on a single object appearing atomic and respecting real-time order.
Serializability — about transactions over multiple objects equivalent to some serial order. Doesn’t require real-time.
Strict serializability — both. Hardest to provide.

Postgres at Serializable: serializable but not linearizable across replicas.

Consensus is fundamental to

Distributed locks — fenced via consensus (etcd, ZooKeeper).
Configuration store — same.
Leader election for replicated services.
Distributed transactions (Spanner via Paxos groups, CockroachDB via Raft).
Atomic commit across shards (2PC built atop consensus per shard).

Idempotency & message delivery

Networks deliver:

At most once — drop on network loss; never duplicate.
At least once — retry until ack; may duplicate.
Exactly once — only via dedupe at receiver.

Design APIs/consumers to be idempotent. Dedupe by message id stored in receiver’s DB (inbox).

Replication conflict resolution (multi-leader / leaderless)

Last-write-wins (LWW) — based on timestamp. Lossy under clock skew.
CRDT (Conflict-free Replicated Data Type) — math-defined merge: G-counter, OR-set, RGA.
Operational transforms (OT) — used in collaborative editors (Google Docs).
Per-app merge — application chooses (e.g., shopping cart unions items).

Latency / availability math

3-9s = 99.9% (8.7h downtime/year).
4-9s = 99.99% (53min/year).
5-9s = 99.999% (5.3min/year).
Availability of chained services multiplies: A × B × C. Degraded modes / async paths help.

Fan-out / scatter-gather

Reading from many shards to assemble result:

Tail latency dominates: response time = max of all shards’ response times.
Mitigation: hedged requests (send second request to slow shard after p95 deadline), request reduction, timeout + partial result.

Jeff Dean’s “Achieving Rapid Response Times in Large Online Services” — classic reading.

Common interview questions

Design a distributed lock (use etcd lease + fencing tokens).
Implement a counter that handles 1M ops/sec across regions (CRDT G-counter).
How does Spanner achieve external consistency globally? TrueTime API + Paxos groups + commit-wait.
Describe split-brain and how to prevent. Quorum + fencing tokens (monotonic increasing token from coordinator).
Explain why dual-write between DB and cache/broker fails atomicity. Outbox.
Why do retries multiply load? Backoff + jitter + breaker.
Hot key in a distributed cache — what to do? L1 cache layer, key sharding, request coalescing.