Skip to content

Distributed Systems — Basics

Multiple machines, communicating over an unreliable network, must coordinate to solve a problem. Failures are partial — some nodes fail, network drops, clocks drift — and you can’t always tell from the outside.

  1. The network is reliable.
  2. Latency is zero.
  3. Bandwidth is infinite.
  4. The network is secure.
  5. Topology doesn’t change.
  6. There is one administrator.
  7. Transport cost is zero.
  8. The network is homogeneous.

Every design must assume these are false.

  • No global clock — wall clocks drift, NTP corrects, but always slightly off.
  • Logical clocks: Lamport timestamps (per-event counters), vector clocks (per-node array).
  • Hybrid logical clocks — combine wall + logical for causal ordering.
  • Don’t use timestamp ordering for correctness across machines (clock skew).

Under network Partition you must pick:

  • C — every read sees latest write (linearizable).
  • A — every request gets a response (success/error, not hang).

Real systems navigate the trade-off:

  • CP: refuses to serve when partitioned (etcd, ZooKeeper, MongoDB w/ majority writes).
  • AP: keeps serving, may return stale (Cassandra, DynamoDB w/ eventual reads, Riak).

CAP is binary; reality is shades of grey. PACELC extends: even when no partition, choose between Latency and Consistency.

Consistency models (strongest → weakest)

Section titled “Consistency models (strongest → weakest)”
  • Linearizable — operations appear to happen in some real-time order, atomic.
  • Sequential — all nodes see same order, but not necessarily real-time.
  • Causal — operations causally related preserve order.
  • Read-your-writes — your own writes immediately visible to you.
  • Monotonic reads — reads don’t go back in time.
  • Eventual — converges if writes stop.
  • Leader-based (single-leader): writes go to leader, replicate to followers. Simple, easy to reason.
  • Multi-leader: multiple write nodes, requires conflict resolution.
  • Leaderless (Dynamo-style): client writes to N replicas, reads from N replicas, uses quorums (W+R > N).
  • Sync vs async replication: durability vs latency.

Divide data across nodes:

  • Range — by key range. Risk: hot range. Used: HBase, BigTable.
  • Hash — by hash(key). Even distribution. Range scans difficult. Used: Cassandra, DynamoDB.
  • Consistent hashing — minimal reshuffling on node add/remove. Used: Dynamo, Riak.
  • Composite — hash for partition + range within (Cassandra clustering keys).

Rebalancing strategy: fixed N partitions × node count vs dynamic.

  • Crash-stop — node fails permanently, never returns.
  • Crash-recovery — restarts and rejoins (with possibly stale state).
  • Byzantine — node lies / corrupts. Rare in trusted internal clusters; matters for blockchain, cross-org systems.
  • Network partition — split-brain.
  • Slow node (“gray failure”) — alive but laggy. Often worse than crash because timeouts don’t trigger.

Multiple nodes must agree on a value (leader, log entry, configuration).

  • Paxos — Lamport’s classic. Provably correct, infamously hard to implement.
  • Raft — simpler, modular: leader election, log replication, safety. Most modern systems use Raft.
  • ZAB (ZooKeeper Atomic Broadcast) — total-order broadcast, similar guarantees to Paxos.
  • Multi-Paxos, EPaxos, Raft-like variants.

In quorum-based consensus: majority needed. Tolerates f failures with 2f+1 nodes.

Real systems using Raft: etcd (Kubernetes), Consul, CockroachDB, TiKV, RethinkDB.

  • Quorum — minimum number of nodes that must respond.
  • Heartbeat — periodic liveness signal.
  • Gossip protocol — randomized peer-to-peer state propagation (Cassandra, Consul).
  • Anti-entropy — periodic background sync to resolve divergence.
  • Read repair — fix divergence on read by writing newest value back to lagging replicas.
  • Hinted handoff — buffer writes for an offline replica, replay on recovery.
  • 2PC (two-phase commit) — prepare + commit. Blocking on coordinator failure.
  • 3PC — adds extra phase to avoid blocking. Rarely used.
  • Saga — sequence of local txns + compensations. Eventual consistency.
  • Calvin / deterministic — preorder transactions, execute deterministically.
  1. CAP — pick one and design accordingly.
  2. Implement leader election sketch (Raft basics).
  3. Why is exactly-once delivery hard? — duplicate detection at app layer; networks can drop or delay acks.
  4. Two-generals problem — provably impossible to agree over unreliable channel without some compromise.
  5. How does ZooKeeper / etcd guarantee linearizability? Quorum write + leader read with read-index.
  6. Strong vs eventual — give product examples.