Distributed Systems — Basics
Distributed Systems — Basics
Section titled “Distributed Systems — Basics”What makes a system “distributed”
Section titled “What makes a system “distributed””Multiple machines, communicating over an unreliable network, must coordinate to solve a problem. Failures are partial — some nodes fail, network drops, clocks drift — and you can’t always tell from the outside.
The 8 fallacies of distributed computing
Section titled “The 8 fallacies of distributed computing”- The network is reliable.
- Latency is zero.
- Bandwidth is infinite.
- The network is secure.
- Topology doesn’t change.
- There is one administrator.
- Transport cost is zero.
- The network is homogeneous.
Every design must assume these are false.
- No global clock — wall clocks drift, NTP corrects, but always slightly off.
- Logical clocks: Lamport timestamps (per-event counters), vector clocks (per-node array).
- Hybrid logical clocks — combine wall + logical for causal ordering.
- Don’t use timestamp ordering for correctness across machines (clock skew).
CAP theorem
Section titled “CAP theorem”Under network Partition you must pick:
- C — every read sees latest write (linearizable).
- A — every request gets a response (success/error, not hang).
Real systems navigate the trade-off:
- CP: refuses to serve when partitioned (etcd, ZooKeeper, MongoDB w/ majority writes).
- AP: keeps serving, may return stale (Cassandra, DynamoDB w/ eventual reads, Riak).
CAP is binary; reality is shades of grey. PACELC extends: even when no partition, choose between Latency and Consistency.
Consistency models (strongest → weakest)
Section titled “Consistency models (strongest → weakest)”- Linearizable — operations appear to happen in some real-time order, atomic.
- Sequential — all nodes see same order, but not necessarily real-time.
- Causal — operations causally related preserve order.
- Read-your-writes — your own writes immediately visible to you.
- Monotonic reads — reads don’t go back in time.
- Eventual — converges if writes stop.
Replication strategies
Section titled “Replication strategies”- Leader-based (single-leader): writes go to leader, replicate to followers. Simple, easy to reason.
- Multi-leader: multiple write nodes, requires conflict resolution.
- Leaderless (Dynamo-style): client writes to N replicas, reads from N replicas, uses quorums (W+R > N).
- Sync vs async replication: durability vs latency.
Partitioning (sharding)
Section titled “Partitioning (sharding)”Divide data across nodes:
- Range — by key range. Risk: hot range. Used: HBase, BigTable.
- Hash — by hash(key). Even distribution. Range scans difficult. Used: Cassandra, DynamoDB.
- Consistent hashing — minimal reshuffling on node add/remove. Used: Dynamo, Riak.
- Composite — hash for partition + range within (Cassandra clustering keys).
Rebalancing strategy: fixed N partitions × node count vs dynamic.
Failure modes
Section titled “Failure modes”- Crash-stop — node fails permanently, never returns.
- Crash-recovery — restarts and rejoins (with possibly stale state).
- Byzantine — node lies / corrupts. Rare in trusted internal clusters; matters for blockchain, cross-org systems.
- Network partition — split-brain.
- Slow node (“gray failure”) — alive but laggy. Often worse than crash because timeouts don’t trigger.
Consensus
Section titled “Consensus”Multiple nodes must agree on a value (leader, log entry, configuration).
- Paxos — Lamport’s classic. Provably correct, infamously hard to implement.
- Raft — simpler, modular: leader election, log replication, safety. Most modern systems use Raft.
- ZAB (ZooKeeper Atomic Broadcast) — total-order broadcast, similar guarantees to Paxos.
- Multi-Paxos, EPaxos, Raft-like variants.
In quorum-based consensus: majority needed. Tolerates f failures with 2f+1 nodes.
Real systems using Raft: etcd (Kubernetes), Consul, CockroachDB, TiKV, RethinkDB.
Common terminology
Section titled “Common terminology”- Quorum — minimum number of nodes that must respond.
- Heartbeat — periodic liveness signal.
- Gossip protocol — randomized peer-to-peer state propagation (Cassandra, Consul).
- Anti-entropy — periodic background sync to resolve divergence.
- Read repair — fix divergence on read by writing newest value back to lagging replicas.
- Hinted handoff — buffer writes for an offline replica, replay on recovery.
Distributed transactions
Section titled “Distributed transactions”- 2PC (two-phase commit) — prepare + commit. Blocking on coordinator failure.
- 3PC — adds extra phase to avoid blocking. Rarely used.
- Saga — sequence of local txns + compensations. Eventual consistency.
- Calvin / deterministic — preorder transactions, execute deterministically.
Common interview Qs
Section titled “Common interview Qs”- CAP — pick one and design accordingly.
- Implement leader election sketch (Raft basics).
- Why is exactly-once delivery hard? — duplicate detection at app layer; networks can drop or delay acks.
- Two-generals problem — provably impossible to agree over unreliable channel without some compromise.
- How does ZooKeeper / etcd guarantee linearizability? Quorum write + leader read with read-index.
- Strong vs eventual — give product examples.