Distributed Systems — Practical patterns

Designing for partial failure

Every remote call must answer:

What’s the timeout?
What’s the retry policy (count, backoff, jitter, idempotency)?
What’s the fallback? (default value, cached value, degraded mode, error to user)
Is this call inside a circuit breaker?
Is this call cancellable (deadline propagation)?
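
As a rough illustration of the checklist above, the sketch below wraps a remote call with a per-attempt timeout, an overall deadline, bounded retries with jittered backoff, and a fallback. `callWithPolicy`, `Policy` and its fields are hypothetical names, not taken from any library. It assumes the wrapped operation is idempotent and honors the `AbortSignal` it receives; it retries every error and leaves out the circuit breaker.

```typescript
// Hypothetical policy shape; every name here is illustrative.
interface Policy<T> {
  timeoutMs: number;     // per-attempt timeout
  deadlineMs: number;    // overall budget across all attempts
  maxAttempts: number;   // total attempts, including the first
  baseBackoffMs: number; // base for exponential backoff
  fallback: () => T;     // degraded result when every attempt fails
}

const delay = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms));

async function callWithPolicy<T>(
  op: (signal: AbortSignal) => Promise<T>,
  policy: Policy<T>,
): Promise<T> {
  const deadline = Date.now() + policy.deadlineMs;
  for (let attempt = 0; attempt < policy.maxAttempts; attempt++) {
    const remaining = deadline - Date.now();
    if (remaining <= 0) break; // overall budget spent: stop retrying

    // Cancel this attempt after the per-attempt timeout or the remaining budget.
    const controller = new AbortController();
    const timer = setTimeout(() => controller.abort(), Math.min(policy.timeoutMs, remaining));
    try {
      return await op(controller.signal);
    } catch {
      // Fall through to backoff and retry. Assumes op is idempotent.
    } finally {
      clearTimeout(timer);
    }

    if (attempt + 1 < policy.maxAttempts) {
      // Full jitter: random delay in [0, base * 2^attempt), never past the deadline.
      const backoff = Math.random() * policy.baseBackoffMs * 2 ** attempt;
      await delay(Math.max(0, Math.min(backoff, deadline - Date.now())));
    }
  }
  return policy.fallback();
}
```

Passing the signal down to the transport (for example as the `signal` option of `fetch`) is what makes the call cancellable; a real service would also forward the remaining budget to the callee so the deadline propagates.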

Distributed lock with fencing token (etcd)

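// Inside a func returning error: acquire lease + lock; fencing token = revision
session, err := concurrency.NewSession(client, concurrency.WithTTL(30)) // lease TTL in seconds
if err != nil {
    return err
}
defer session.Close()
mu := concurrency.NewMutex(session, "/locks/job-X")
if err := mu.Lock(ctx); err != nil {
    return err
}
defer mu.Unlock(ctx)
fencing := mu.Header().Revision // monotonically increasing across acquisitions

// Use fencing in downstream:
// downstream rejects writes with fencing <= last seen
res, err := db.Exec("UPDATE state SET v=$1, last_token=$2 WHERE last_token < $2", v, fencing)
if err != nil {
    return err
}
if n, _ := res.RowsAffected(); n == 0 {
    return errors.New("stale fencing token: write rejected")
}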

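A lock alone is not enough: the holder can pause (GC, swap, OS scheduling) long enough for its lease to expire, and another process then acquires the lock. When the old process wakes up, it still believes it holds the lock and mutates state. The fencing token lets the downstream reject that stale write.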
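
To make the downstream side of the check concrete, here is a minimal in-memory sketch of a resource that enforces fencing tokens. `FencedStore` is a hypothetical name, and the token values are made up for illustration.

```typescript
// Hypothetical in-memory resource that enforces fencing tokens.
class FencedStore<V> {
  private value: V | undefined;
  private lastToken = -1;

  // Applies the write only if the token is strictly newer than any seen so far.
  write(value: V, token: number): boolean {
    if (token <= this.lastToken) return false; // stale lock holder: reject
    this.value = value;
    this.lastToken = token;
    return true;
  }

  read(): V | undefined {
    return this.value;
  }
}

const store = new FencedStore<string>();
const a = store.write("from holder A", 33);        // A holds the lock with token 33
const b = store.write("from holder B", 34);        // A paused; B acquired with token 34
const late = store.write("late write from A", 33); // A wakes up and writes anyway
console.log(a, b, late, store.read()); // true true false from holder B
```

This mirrors the `last_token < $2` condition above, so each lock acquisition gets one accepted write. A variant that lets the same holder write repeatedly would reject only tokens strictly lower than the last one seen.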

Idempotency via inbox table

CREATE TABLE inbox (
  message_id  UUID PRIMARY KEY,
  received_at TIMESTAMPTZ NOT NULL DEFAULT now()
);

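// Apply each message at most once: the inbox insert and the effect commit atomically.
// Assumes t.query returns a node-postgres-style result with rowCount.
async function handle(msg) {
  await db.tx(async t => {
    const r = await t.query(
      `INSERT INTO inbox (message_id) VALUES ($1) ON CONFLICT DO NOTHING RETURNING 1`,
      [msg.id]);
    if (!r.rowCount) return; // already processed: the insert hit the conflict, no row returned
    await applyEffect(t, msg); // must run on the same transaction t
  });
}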

Hedged requests (reduce tail latency)

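// Only safe for idempotent calls: the same request may run twice.
async function hedged<T>(call: () => Promise<T>, hedgeAfterMs: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const first = call();
  // Launch the second attempt only if the first is still pending after hedgeAfterMs.
  const hedge = new Promise<T>((resolve, reject) => {
    timer = setTimeout(() => call().then(resolve, reject), hedgeAfterMs);
  });
  try {
    return await Promise.race([first, hedge]);
  } finally {
    clearTimeout(timer); // no-op if the hedge already fired
  }
}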

Consistent hashing (sketch)

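// FNV-1a (32-bit) over UTF-16 code units: an illustrative stand-in for a real hash.
function hash(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return h >>> 0;
}

class HashRing {
  private ring = new Map<number, string>(); // hash → node
  private sortedHashes: number[] = [];

  add(node: string, virtual = 100) {
    for (let i = 0; i < virtual; i++) {
      const h = hash(`${node}#${i}`);
      this.ring.set(h, node);
    }
    this.sortedHashes = [...this.ring.keys()].sort((a, b) => a - b);
  }

  pick(key: string): string {
    if (this.sortedHashes.length === 0) throw new Error("hash ring is empty");
    const h = hash(key);
    // First point clockwise from h, wrapping to the start of the ring.
    // Linear scan for brevity; use binary search on large rings.
    const idx = this.sortedHashes.findIndex(x => x >= h);
    return this.ring.get(this.sortedHashes[idx === -1 ? 0 : idx])!;
  }
}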

Use virtual nodes (100+ per physical) to smooth distribution.

Time-based dedup (sliding window)

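For event streams where an exact dedup table would grow forever, keep a bounded structure instead: a TTL’d Bloom filter or an LRU of recent ids. The price of bounded memory is approximate dedup: a Bloom filter has a small false-positive rate (a new event is occasionally dropped as a duplicate), while an LRU can miss a duplicate that arrives after its id was evicted.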
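
A minimal sketch of the bounded-window idea, using an insertion-ordered map with a TTL and a size cap rather than a Bloom filter. `DedupWindow` and its parameters are hypothetical names; the clock is injectable so the behavior can be tested.

```typescript
// Hypothetical bounded dedup window: remembers ids for ttlMs, at most maxEntries of them.
class DedupWindow {
  // id → expiry time (ms). A Map iterates in insertion order, which here is also
  // expiry order because every entry gets the same TTL.
  private seen = new Map<string, number>();

  constructor(
    private ttlMs: number,
    private maxEntries: number,
    private now: () => number = () => Date.now(),
  ) {}

  // Returns true the first time an id is seen within the window, false for a duplicate.
  firstSeen(id: string): boolean {
    const t = this.now();
    // Drop expired entries from the oldest end.
    for (const [key, expiry] of this.seen) {
      if (expiry > t) break;
      this.seen.delete(key);
    }
    if (this.seen.has(id)) return false;
    this.seen.set(id, t + this.ttlMs);
    // Enforce the memory bound by evicting the oldest id.
    if (this.seen.size > this.maxEntries) {
      const oldest = this.seen.keys().next().value as string;
      this.seen.delete(oldest);
    }
    return true;
  }
}
```

Within the window this structure is exact. The approximation comes from eviction: once an id expires or is pushed out by the size cap, a late duplicate is treated as new.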

Outbox + relay (cross-system reliability)

-- in producer DB transaction
BEGIN;
UPDATE orders SET status='paid' WHERE id=$1;
INSERT INTO outbox(id, event_type, payload) VALUES (gen_random_uuid(), 'OrderPaid', $2);
COMMIT;

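A relay (a separate worker, or change data capture such as Debezium) reads the outbox and publishes to the broker. Delivery is at-least-once, so the consumer must be idempotent and dedupe, for example with the inbox table above.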
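
One pass of a polling relay might look like the sketch below. `OutboxStore`, `Broker` and `relayOnce` are hypothetical names; the sketch assumes the outbox has some way to track which rows were already published (for instance an extra column), which the table above does not show.

```typescript
// Hypothetical shapes; a real relay would back these with SQL and a broker client.
interface OutboxRow { id: string; eventType: string; payload: string }

interface OutboxStore {
  fetchUnpublished(limit: number): Promise<OutboxRow[]>; // oldest first
  markPublished(id: string): Promise<void>;
}

interface Broker {
  publish(row: OutboxRow): Promise<void>;
}

// Publishes one batch in order; returns how many rows were published.
// Any error stops the pass so that later rows are not sent ahead of a failed one.
async function relayOnce(store: OutboxStore, broker: Broker, batch = 100): Promise<number> {
  const rows = await store.fetchUnpublished(batch);
  let published = 0;
  for (const row of rows) {
    // Publish first, mark second: a crash in between re-sends the row on the
    // next pass (at-least-once), so the consumer dedupes on row.id.
    await broker.publish(row);
    await store.markPublished(row.id);
    published++;
  }
  return published;
}
```

Marking before publishing would flip the failure mode to at-most-once and could silently lose events, which is why the order above is the usual choice.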

Health/readiness for orchestrators

liveness — am I alive? If false, restart me.
readiness — should I receive traffic now? If false, drain.
startup — initial probe with longer grace (K8s).

Don’t conflate them. An overloaded service is alive but may be unready: it should stop receiving new traffic, not be restarted.
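
The separation can be sketched as a small in-process state object that three probe handlers would read. `HealthState` and its method names are hypothetical, and wiring it to HTTP endpoints is left out.

```typescript
type Probe = { status: 200 | 503; body: string };

// Hypothetical in-process health state that probe handlers would read.
class HealthState {
  private started = false;    // set once initialization has finished
  private draining = false;   // set on shutdown so traffic drains before exit
  private overloaded = false; // set by the service's own load shedding logic

  markStarted(): void { this.started = true; }
  markDraining(): void { this.draining = true; }
  setOverloaded(v: boolean): void { this.overloaded = v; }

  // Liveness: only "can the process respond at all?". No load or dependency
  // checks here, otherwise a busy or degraded service would get restarted.
  liveness(): Probe {
    return { status: 200, body: "alive" };
  }

  // Startup: fails until initialization is done.
  startup(): Probe {
    return this.started ? { status: 200, body: "started" } : { status: 503, body: "starting" };
  }

  // Readiness: may flip back and forth over the process lifetime.
  readiness(): Probe {
    const ready = this.started && !this.draining && !this.overloaded;
    return ready ? { status: 200, body: "ready" } : { status: 503, body: "not ready" };
  }
}
```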

Backoff with jitter (essential)

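// Delay in ms before retry `attempt` (0-based): between 50% and 100% of base * 2^attempt.
const backoffMs = (base: number, attempt: number) =>
  base * 2 ** attempt * (0.5 + Math.random() * 0.5);

The formula above keeps at least half of the exponential delay (sometimes called “equal jitter”). “Full jitter” is a recommended variant that spreads retries over the whole interval: Math.random() * base * 2**attempt.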

Without jitter, retries synchronize → thundering herd.
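
In practice the exponential term is usually capped so that late retries do not wait unboundedly long. A small sketch of capped full jitter follows; `fullJitterDelay` and the `capMs` parameter are illustrative names, and the random source is injectable so the schedule can be checked deterministically.

```typescript
// Delay before retry number `attempt` (0-based), in milliseconds.
// `rand` is injectable so the schedule can be tested deterministically.
function fullJitterDelay(
  baseMs: number,
  attempt: number,
  capMs: number,
  rand: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt); // capped exponential growth
  return rand() * ceiling; // uniform in [0, ceiling) with the default rand
}

// Upper bounds of the first six delays for base = 100 ms and cap = 2000 ms,
// obtained by pinning rand to 1.
const ceilings = [0, 1, 2, 3, 4, 5].map(a => fullJitterDelay(100, a, 2000, () => 1));
console.log(ceilings); // [ 100, 200, 400, 800, 1600, 2000 ]
```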

Distributed tracing — what to record per span

Service name, span name, kind (server/client/internal/producer/consumer).
Start + end time.
trace_id, span_id, parent_span_id.
Attributes: HTTP method/status, DB statement (sanitized), entity ids, user id, region.
Events: error, retry, cache hit/miss.

Propagate via traceparent header (W3C Trace Context).
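
A minimal sketch of reading and writing that header, assuming only version 00 of the format (`00-<trace-id>-<parent-id>-<flags>`, lowercase hex). The function and type names are hypothetical; a production service would normally rely on an OpenTelemetry propagator instead.

```typescript
interface TraceContext {
  traceId: string;      // 32 hex chars, shared by every span in the trace
  parentSpanId: string; // 16 hex chars, the caller's span id
  sampled: boolean;     // lowest bit of the flags byte
}

const TRACEPARENT = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Returns null for anything that is not a valid version-00 traceparent value.
function parseTraceparent(header: string): TraceContext | null {
  const m = TRACEPARENT.exec(header.trim());
  if (!m) return null;
  const [, traceId, parentSpanId, flags] = m;
  // All-zero ids are invalid per the spec.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Header for an outgoing call: same trace id, this service's span id as the parent.
function childTraceparent(ctx: TraceContext, spanId: string): string {
  return `00-${ctx.traceId}-${spanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = parseTraceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01");
if (incoming) {
  console.log(childTraceparent(incoming, "b7ad6b7169203331"));
  // 00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01
}
```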

Useful “back of envelope” numbers

Op	Time
L1 cache	0.5 ns
Branch mispredict	5 ns
Memory access	100 ns
SSD random read	100 µs
Round trip same DC	0.5 ms
Disk seek (HDD)	5-10 ms
Round trip same continent	50 ms
Round trip cross-Atlantic	150 ms
HTTP request to typical API	100-300 ms

(Rounded, order-of-magnitude figures adapted from Jeff Dean’s “Numbers Everyone Should Know”.)

Common production issues & fixes

Symptom	Cause	Fix
Spike in p99 only	Tail latency in one shard	Hedging, replication
Cascading 500s	No timeout on downstream	Per-call deadline
OOM after deploy	Connection leak	Pool with max + close in finally
Slow recovery from outage	Retry storm	Jitter + breaker
Duplicate orders	At-least-once + non-idempotent	Idempotency key
Phantom split-brain writes	Lock without fencing	Fencing token
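
The “breaker” in the retry-storm row refers to a circuit breaker. A minimal sketch of its state machine follows, assuming a consecutive-failure threshold and a fixed cool-down; `CircuitBreaker` and its parameters are hypothetical names, and the clock is injectable for testing.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0; // consecutive failures while closed
  private openedAt = 0;
  private state: BreakerState = "closed";

  constructor(
    private failureThreshold: number,
    private resetAfterMs: number,
    private now: () => number = () => Date.now(),
  ) {}

  // Call before each request; false means fail fast without touching the downstream.
  allow(): boolean {
    if (this.state === "open" && this.now() - this.openedAt >= this.resetAfterMs) {
      this.state = "half-open"; // cool-down over: let exactly one trial request through
      return true;
    }
    return this.state === "closed";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = "closed";
  }

  recordFailure(): void {
    this.failures++;
    // A failed trial reopens immediately; otherwise open once the threshold is hit.
    if (this.state === "half-open" || this.failures >= this.failureThreshold) {
      this.state = "open";
      this.openedAt = this.now();
    }
  }

  current(): BreakerState {
    return this.state;
  }
}
```

A caller checks `allow()` before each request and reports the outcome with `recordSuccess()` or `recordFailure()`. While the breaker is open, requests are rejected locally, which gives the downstream room to recover.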

Tools to know

etcd / ZooKeeper / Consul — coordination.
Temporal / Cadence — durable workflows.
Vitess / CockroachDB / TiDB / YugabyteDB — distributed SQL.
Cassandra / ScyllaDB — eventually consistent KV at scale.
OpenTelemetry — tracing/metrics/logs standard.
Jepsen — testing distributed systems for consistency violations.