System Design — Practical

System Design — Practical (worked examples)

Companion to theory.md. Theory = concepts; this doc = how to defend each choice under interview probing. Every example follows: scope → capacity math → API → data model → read/write path → bottlenecks → failure modes → observability.

1. URL shortener

Requirements

Functional: create short URL; redirect; expire; custom slug (optional); per-user analytics (clicks, geo, referrer).
Non-functional: 100M URLs/year (~5 yr lifetime = 500M live); 10k QPS reads, 100 QPS writes; p99 redirect < 50 ms; 99.95% availability; reads ≫ writes (100:1).

Capacity math (back-of-envelope, write it on the whiteboard)

500M URLs × 500 B avg row (id + longUrl + meta) ≈ 250 GB primary store. Fits a single DynamoDB table or a sharded Postgres.
10k QPS × 200 B response ≈ 2 MB/s egress at app tier; CDN absorbs most.
Cache working set: hot 20% of URLs serve 80% of reads → cache top 100M entries × 100 B ≈ 10 GB Redis (one r6g.xlarge).
Analytics: 10k QPS × 100 B event × 86400 ≈ 86 GB/day raw; aggregated rollups much smaller.

High-level

Client → CDN (1m TTL on 3xx) → Anycast LB → API GW → URL Service ──► DynamoDB (shortId PK)
                                                          │
                                                          └─► Redis (LRU, 99% hit)
                                                          │
                                                          └─► outbox → Kafka → ClickHouse (analytics)

API

POST /shorten { url, customSlug?, ttlDays? } → 201 { shortId, shortUrl, expiresAt }. Idempotency-Key header required.
GET /:id → 301 + Cache-Control: public, max-age=60. (301 caches indefinitely in browsers — use 302 if you ever revoke; trade-off: extra traffic.)
DELETE /:id (owner only).
GET /:id/stats → owner analytics (separate datastore).

Data model

Primary:    shortId (PK, base62), longUrl, ownerId, createdAt, expiresAt, hits (eventually-consistent)
ByOwner:    ownerId (PK), shortId (SK) — GSI for "list my links"
Analytics:  ClickHouse (ts, shortId, geo, ua, referrer) — append-only

ID generation — defend the choice

Option	Pros	Cons	Verdict
7-char random base62 (62^7 ≈ 3.5T)	No coordination; collision rate ~birthday-paradox negligible	Need `INSERT IF NOT EXISTS` retry; not URL-guessing-safe	Default for public shortener
Snowflake → base62	Sortable, no collision	Reveals creation rate to anyone watching	Internal use
Hash(longUrl) prefix	Same URL → same short (dedup)	Different URLs collide; not revocable per-user	Rejected — kills owner semantics
DB sequence	Simple	Single point of contention; predictable IDs	Rejected at this scale

Read path (the 99% case)

CDN check → most 301s served at edge.
Miss → API → Redis GET sid:<id> → 99% hit → 301.
Cache miss → DynamoDB GetItem(shortId) (single-digit ms) → set Redis with TTL=1h → respond.
Expired? Return 410 Gone (and Sunset header on the create response).

Write path

Validate URL (length, scheme http/https only, no internal IPs — SSRF guard).
Generate id; PutItem(shortId, ..., ConditionExpression: attribute_not_exists).
On ConditionalCheckFailedException → retry with new id (≤3 attempts, then 503).
Same transaction inserts outbox row → Debezium → Kafka → ClickHouse.

Bottlenecks / hot spots

Celebrity short URL (a tweet links to one short URL → 100k QPS on one key). Fix: pin to multi-AZ Redis with read replicas; probabilistic counter for hits (CMS or HLL) to avoid write contention; in extreme cases, push the redirect into CDN with a high TTL.
Custom slug land grab — race condition on slug uniqueness. Fix: same conditional write; rate-limit slug creation per user.
Write storm at link creation (a popular SaaS bulk-creating 1M links) → token-bucket rate-limit per ownerId.

Failure modes

DynamoDB region down → fail over to Global Tables replica (eventual lag <1s).
Redis cluster down → app degrades to DB-only reads; p99 doubles but stays under 100 ms; raise alert, don’t page.
ClickHouse down → outbox queues; hits lag visible to owner but doesn’t block redirects.
Bad URL (malicious / phishing) → integrate Google Safe Browsing API on POST /shorten; mark on hit, serve interstitial.

Observability

SLI: p99 redirect latency; redirect 5xx rate.
SLO: 99.95% of redirects < 50 ms over 28-day window.
Key dashboards: Redis hit rate (target >98%), DDB throttles (target 0), CDN hit ratio, outbox lag.
Alerts: SLO burn-rate alerts at 2% and 5% over 1h/6h windows.

Likely follow-ups

“Now make it work in 3 regions.” → Global Tables for primary; per-region Redis (don’t cross-region replicate cache); ClickHouse single region with regional ingesters → cross-region replication.
“Owner wants real-time click counts.” → ClickHouse MATERIALIZED VIEW on Kafka topic; owner dashboard queries directly (sub-second).
“Bot abuse on /shorten.” → per-IP + per-user token bucket at edge + reCAPTCHA on anomaly.

2. Twitter feed (home timeline)

Requirements

Functional: post tweet (≤280 chars + media); home timeline (people you follow, reverse-chron + ranked); like/retweet/reply counts.
Non-functional: 200M DAU × ~5 tweets/day → ~12k tweets/sec write; 200M × 50 reads/day → ~120k timeline reads/sec; p99 timeline < 200 ms; fanout amplification = avg 200 followers × 12k tweets/sec ≈ 2.4M timeline writes/sec.

Capacity math

Tweet: ~300 B compressed → 12k × 300 B × 86400 ≈ 300 GB/day new tweets; ~110 TB/year.
Timeline cache: 200M users × 800 cached tweet-IDs × 12 B ≈ 2 TB — sharded Redis cluster, say 200 nodes.
Followers graph: 200M users × 200 avg followers (long tail to 100M) → store edges in dedicated graph store (DynamoDB adjacency list, or specialized like Twitter’s FlockDB historically).

High-level

Tweet write ─► Tweet service ─► Tweet DB (Manhattan / DynamoDB, PK=user, SK=ts)
                  │
                  ├─► Fanout service ─► Redis timelines (ZADD per follower)        [push path]
                  │
                  └─► Search/index pipeline ─► Kafka ─► Elasticsearch

Timeline read ─► Timeline service ─► Redis ZREVRANGE (cached push timeline)
                        │
                        └─► merge celebrity tweets (pull) ─► Tweet DB             [hybrid]
                        │
                        └─► ranker (ML) ─► return top N

Fanout strategies — defend the hybrid

Strategy	Cost	When
Push (write-time fanout)	Write = O(followers) — bad for celebs (Cristiano ~600M followers = 600M writes per tweet)	Default for users < 10k followers
Pull (read-time fanout)	Read = O(followed users) — expensive for users following many	Default for tweets by celebrities
Hybrid	Push for normals; pull-merge for celeb authors	What Twitter actually does

Threshold tuning: 10k followers is rough; pick by p99 fanout latency budget. Pre-compute “is celeb” flag on user; sticky.

Storage layout

Tweets (canonical): KV store sharded by tweetId (snowflake). Hot recent tweets in cache.
User → tweets index: PK=userId, SK=ts DESC — drives author profile pages.
Timelines (Redis ZSET per user): ZADD user:{u}:timeline ts tweetId; trim to ~800 entries (ZREMRANGEBYRANK); TTL 30 days for cold users (rebuild on-demand).
Followers / following: bidirectional adjacency lists, sharded by user ID.

Read path (home timeline)

ZREVRANGE user:{u}:timeline 0 199 → 200 candidate tweet IDs (cached push path).
Merge celebrity tweets: for each celeb in following(u) WHERE isCeleb, MGET last 50 → merge sorted by ts.
Multiget tweets by ID (batch MGET tweets).
Rank (ML scorer) → return top 100.
Cache the assembled page for 30 s (next pull-down refresh hits cache).

Write path (tweet)

Persist tweet (canonical store).
Update author’s profile timeline.
Enqueue fanout job (Kafka).
Fanout worker:
- Get follower list (paginated; for normal users one batch).
- For each follower NOT marked “do not push” (inactive >30d), ZADD to their timeline.
- For celeb authors: skip fanout entirely; readers will pull.
Async: index for search; analytics events.

Bottlenecks / hot spots

Celebrity tweet spike read traffic — 600M followers refresh in minutes. Solution: pull-on-read + cache the celeb’s last-N tweets globally (single hot key replicated via Redis read replicas + in-process L1).
Mega-fanout latency for the in-between users (1M followers): batch fanout into 10k-follower chunks, parallelize across workers, target p99 fanout completion < 5 s.
Inactive user fanout waste — 50% of accounts haven’t logged in for 30 days. Skip them; rebuild timeline on next login from pull path (slower first load OK).
Cache stampede on celeb tweet — use request coalescing (singleflight) at the timeline service.

Failure modes

Fanout worker lag → push timelines stale; UI shows “X new tweets” via separate counter (also cached). Page Stories at >60 s lag.
Redis shard down → that shard’s users see pull-path-only (slow but correct).
Tweet DB hot partition (a celeb posts 100× in a minute) → pre-shard celeb users into sub-partitions by (userId, hourBucket).

Observability

SLI: p99 timeline read; p99 fanout completion; cache hit rate per shard.
SLO: 99.9% of home timeline reads < 200 ms; 99% of tweets in followers’ timelines within 10 s.
Watchlist: fanout queue depth; Redis evictions; ranker latency.

Likely follow-ups

“How to add replies/quote-tweets without exploding storage?” → store as tweets with inReplyTo pointer; thread reconstruction is a separate read path with its own cache.
“What if a celeb deletes a tweet?” → write tombstone to Tweet DB; readers filter; eventual GC from timelines.
“Rate-limit posting?” → per-user token bucket (300/3h is Twitter’s), enforced at API GW.

3. Chat (1:1 + group, real-time)

Requirements

Functional: 1:1 + group (≤500) chat; presence; typing indicators; read receipts; offline delivery via push; history pagination; media.
Non-functional: 100M DAU; 10M concurrent WebSocket connections at peak; p99 message delivery < 500 ms in-region; durability — no message lost on server crash; history retention 1 year.

Capacity math

Connections: 10M concurrent / ~50k per node (Linux file-descriptor + memory limits, Envoy/Netty tuned) = ~200 connection-gateway nodes.
Messages: 100M DAU × 50 msgs/day = 5B/day → ~60k msgs/sec avg, 200k+ peak.
Message size: ~500 B incl. metadata → 60k × 500 B × 86400 ≈ 2.5 TB/day; ~1 PB/year (need cold tiering).
Presence: 10M × 1 Hz heartbeat = 10M ops/sec on Redis — shard heavily.

High-level

Client ──WS──► Conn Gateway (sticky) ─► Chat Service ─► Message DB (Cassandra/Scylla, PK=convId, CK=ts)
                       │                       │
                       │                       └─► Kafka (msg events) ─► fanout dispatcher
                       │                                                       │
                       └◄─── (msg push) ◄─────  Redis pub/sub (per-conn-node) ◄┘
                                                                                │
                                                            offline ─► APNS / FCM

Connection layer — critical detail

WebSocket pinned to gateway node via consistent hashing on userId; control plane maintains userId → nodeId in Redis (TTL 30s, heartbeat refresh).
Sticky load balancer (Envoy with RING_HASH LB) keeps reconnects on the same node when possible (avoids cache thrash).
Node failure: client reconnects, lands on new node via consistent-hash next-bucket; in-flight unack’d messages re-delivered (idempotent via clientMessageId).

Send flow

Client sends { convId, clientMsgId, body } over WS.
Gateway forwards to Chat Service.
Chat Service:
- INSERT into messages table — Cassandra with convId as partition key, ts (timeuuid) as clustering key; LWT only for clientMsgId dedup (avoid the slow path unless needed).
- Publish to Kafka topic messages.{shard}.
Fanout dispatcher consumes Kafka; for each recipient:
- Look up their conn node in Redis.
- Publish to that node’s channel.
- Node delivers over WS; awaits client ACK; retry on disconnect.
Recipient offline → write to APNS/FCM queue + persist in pending_pushes table.
Sender receives MESSAGE_SAVED ACK once Cassandra write commits (≤50 ms).

Read history

SELECT FROM messages WHERE convId=? AND ts < ? LIMIT 50 (cursor on ts).
Cassandra clustering order means reverse-chron is a single seek per partition — fast.
Avoid loading all members on every fetch — fetch member list on conversation open, cache.

Presence

SETEX presence:{userId} 30 nodeId on heartbeat (every 10 s).
“online” = key exists.
Cross-node presence subscription: client subscribes via WS → gateway joins Redis pub/sub channel presence:{contactList} (or batched user-list query every N seconds — cheaper than per-key SUB).

Bottlenecks / hot spots

Mega-group (500 members) message → 500 fanout sends per message; pre-fetch member list, batch Redis pub/sub publishes.
Large group + offline majority → bulk-enqueue push notifications; APNS/FCM have batch APIs.
Cassandra wide-row for very long conversations → cap partition at 10k messages (bucket by month: convId_2026-05).
Redis presence churn on flapping connections → use longer TTL (60s) + explicit DEL on graceful disconnect.

Failure modes

Conn node OOM under spike → connection limit per node; LB sheds excess (503 upgrade response, client retries elsewhere).
Kafka partition lag → fanout delayed; show “delivering…” indicator in UI; messages persisted so no loss.
Push provider (APNS) outage → DLQ + exponential backoff; eventually drop with audit log; in-app badge updates when user opens.
Duplicate delivery (client lost ACK) → dedup on clientMessageId in client-side store.

Observability

SLI: p99 send-to-deliver latency; WS reconnect rate; push success rate.
SLO: 99.9% messages delivered in-region within 500 ms; 99% push success within 30 s.
Watchlist: connections per node, Cassandra p99 write, Kafka consumer lag, Redis presence ops/sec.

Likely follow-ups

“E2EE?” → Signal protocol or libsignal; server stores only ciphertext; key exchange via published prekeys; complicates group (sender keys / MLS).
“Search across history?” → Kafka → Elasticsearch indexer; per-conversation index permissions enforced at query time.
“Multi-device sync?” → device list per user; fanout to each device’s conn node; per-device ack tracking.

4. Distributed rate limiter

Requirements

Functional: allow/deny per (key, window); multiple algorithms (fixed, sliding log, sliding window, token bucket, leaky bucket); per-route configurable.
Non-functional: >100k decisions/sec; p99 decision < 1 ms (it’s on every request’s hot path); fail-open vs fail-closed configurable per route.

Algorithm choice — defend it

Algorithm	Memory	Accuracy	Burst behavior	Use
Fixed window counter	1 int per key	Boundary burst (2× at window flip)	None	Cheap edge counters
Sliding log (ZSET of timestamps)	O(N) per key	Exact	Smooth	Low-volume strict limits
Sliding window counter	2 ints (current + prev window)	~99% accurate	Smooth	Default for API rate limiting
Token bucket	(tokens, lastRefillTs)	Exact	Bursty up to bucket size	Best for “10 req/s but allow burst of 50”
Leaky bucket	Queue / counter	Exact, paces requests	None (smooths output)	Traffic shaping toward downstream

Token bucket via Redis Lua (atomic, no race)

-- KEYS[1] = bucket key
-- ARGV    = { now_ms, refill_per_ms, capacity, cost }
local data    = redis.call('HMGET', KEYS[1], 'tokens', 'ts')
local tokens  = tonumber(data[1]) or tonumber(ARGV[3])
local last    = tonumber(data[2]) or tonumber(ARGV[1])
local elapsed = tonumber(ARGV[1]) - last
tokens = math.min(tonumber(ARGV[3]), tokens + elapsed * tonumber(ARGV[2]))
local cost = tonumber(ARGV[4])
local allowed = 0
if tokens >= cost then
  allowed = 1
  tokens = tokens - cost
end
redis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[1])
redis.call('PEXPIRE', KEYS[1], 60000)
return { allowed, tokens }

Why Lua: EVAL runs atomically on the Redis single-thread, eliminating the read-modify-write race that pure GET/SET cannot avoid.

Layered defense — depth, not one wall

Edge (Cloudflare / nginx)   per IP, coarse, mostly DDoS shield
        │
API gateway                 per API key / user, business limits
        │
Service                     per endpoint class (expensive routes get tighter limits)
        │
Downstream call sites       circuit breaker (rate-limit is a poor fit here; breaker is)

Scale-out

Redis Cluster sharded by key (hash-tag if you need atomicity across related keys: {user:42}:write and {user:42}:read).
Local-first counters (per-pod) flushed to Redis every 100 ms — saves 95% of RTTs at cost of small over-allowance during the flush interval; acceptable for non-billing limits.
For billing limits, never local-first; always centralized.

Bottlenecks / failure

Redis cell down → fail-open for public endpoints (don’t block legitimate traffic), fail-closed for POST write endpoints (prevent abuse). Configure per route.
Hot key (one API key driving all traffic) → shard by (key, bucketIndex) and sum; or push that customer’s limit-check to a dedicated Redis shard.
Time skew between app servers → use Redis’s TIME command as authoritative now_ms — same source for all callers.

Response shape (matters for client retry hygiene)

HTTP/1.1 429 Too Many Requests
X-RateLimit-Limit: 100
X-RateLimit-Remaining: 0
X-RateLimit-Reset: 1716000060
Retry-After: 12

Observability

SLI: decision p99; Redis Lua error rate.
Watchlist: % requests rate-limited per route; per-customer denial spikes (signals abuse OR a misbehaving integration).

5. Payment flow

Requirements

Functional: authorize + capture; refund; partial refund; chargeback handling; ledger reconciliation; multi-currency.
Non-functional: zero double-charge (highest priority, dwarfs latency); idempotent end-to-end; durable audit; p99 < 1.5 s including PSP round-trip; PCI scope minimized.

Idempotency — defend the design

Client supplies Idempotency-Key (UUIDv4) on every mutating request.
Server stores (key, requestHash, responseBody, status, createdAt) in Postgres with UNIQUE(key).
On replay:
- Same key + same requestHash → return cached response.
- Same key + different requestHash → 409 Conflict (client bug).
- key not found → process; insert row in same transaction as charge.
TTL on idempotency rows: 24h minimum, ideally aligned to PSP retry policy (Stripe = 24h).

Saga orchestration (the canonical flow)

                  ┌── reserveInventory ──► InventoryService
                  ├── authorizePayment ──► PSP (Stripe/Adyen)
Orchestrator (Temporal) ── capturePayment ──► PSP
                  ├── markOrderPaid ─────► OrderService
                  └── notifyCustomer ────► NotificationService

  on failure: compensations run in reverse
       refundCapture → voidAuthorization → releaseInventory

Why orchestration not choreography: payment flow has explicit ordering, branching (e.g., 3DS challenge), and time-outs. Auditors want a single state machine.

Pivot point: capture is the point of no return; everything after it must succeed (retries) — compensation = refund, not rollback. (See microservices theory for saga semantics.)

Ledger — double-entry

Journal: append-only (txId, account, side {debit|credit}, amount, currency, ts, metadata). Every entry has a paired counterpart; sum-by-side per txId must equal zero.
Balance: materialized view OR SUM(amount * sign) per account; for hot accounts use materialized with row-level lock on update.
Invariant check in CI + nightly: SELECT account, SUM(amount * CASE side WHEN 'debit' THEN -1 ELSE 1 END) FROM journal GROUP BY account matches balances. Any drift = bug.

Race conditions

Concurrent capture + refund — SELECT ... FOR UPDATE on the order row; or use Postgres advisory lock keyed by orderId.
Inventory + auth race (auth succeeds, inventory already sold out) — compensate with voidAuthorization; never capture if inventory is gone.
PSP webhook arrives before our DB commit — webhook handler must be idempotent on psp_event_id; store-then-process; reconcile against expected state.

Reconciliation

Daily: pull PSP settlement file (CSV via SFTP for old providers, API for Stripe).
Diff (amount, currency, ourTxId) against journal entries for ts BETWEEN settlement.start AND settlement.end.
Mismatches → alert + manual review queue. Never auto-resolve money differences.

Bottlenecks / failure

PSP timeout (call to Stripe hung at 10 s) — we don’t know if the charge happened. Resolution: poll PSP GET /charges/:id with our idempotency key; never retry blindly.
Workflow worker dies mid-saga — Temporal recovers from durable state; idempotent activities replay safely.
Database failover during capture — saga activity returns RETRY; Temporal re-invokes with same args; idempotency layer dedups.
Webhook flood during PSP retry storm → queue webhooks into Kafka, process by worker pool.

PCI scope

Card data never touches your servers; tokenize via PSP SDK (Stripe Elements / Adyen Web Drop-in).
Server stores only PSP token + last4 + brand.
Reduces PCI from SAQ-D to SAQ-A.

Observability

SLI: auth success rate (by PSP, by issuer, by currency); refund latency; reconciliation diff per day.
SLO: zero unreconciled cents at 30-day close.
Alerts: journal-sum drift; webhook lag > 5 min; auth success drop > 2σ from baseline (PSP outage signal).

6. Notification system

Requirements

Functional: triggered by domain events; route to channel(s) (email, SMS, push, in-app); respect user prefs + quiet hours + opt-outs; multilingual templates; A/B variant support.
Non-functional: 1B notifications/day → ~12k/sec avg, 50k+ peak; provider failures isolated per channel; near-zero duplicates; transactional must beat marketing on priority queue.

High-level

Event (Kafka domain topic) ─► Notification Service (preference + template resolution)
                                     │
                                     ├─► email queue ─► email workers ─► SES/SendGrid
                                     ├─► sms queue   ─► sms workers   ─► Twilio
                                     ├─► push queue  ─► push workers  ─► APNS / FCM
                                     └─► inapp queue ─► inapp writer  ─► user inbox DB

  All workers: idempotency (user, event_id, channel), retry+DLQ, provider failover.

Preference resolution

DB: user_prefs(userId, channel, category, enabled, quietHoursStart, quietHoursEnd, tz).
Service caches prefs in Redis with 5-min TTL; updates invalidate on write.
Resolution: for each (user, event, channel) candidate → check enabled → check category opt-out → check quiet hours in user tz → enqueue or drop.

Templates

Stored in a CMS (or Git-versioned YAML) with { subject, bodyHtml, bodyText, locale }.
Render at send time with user data (don’t pre-render; user may update name between trigger and send).
A/B: template_id has variants; assignment sticky per user.

Idempotency

Key: (userId, eventId, channel). Stored in Redis (TTL 7d) + Postgres sent_notifications table.
Workers SETNX before send; if exists → skip. After send: write to sent_notifications with provider message ID for audit.

Retry / DLQ

Exponential backoff: 30s, 2m, 10m, 1h, 6h.
After 5 attempts → DLQ (Kafka topic notifications.dlq) for analysis + optional manual replay.
Distinguish transient (5xx, timeout) from permanent (invalid email, opted out at provider) — permanent goes straight to DLQ.

Priority lanes

Separate Kafka topics + consumer groups for transactional (password reset, payment receipt) vs marketing.
Transactional workers scale aggressively; marketing throttled to provider sending limits.

Bottlenecks / failure

Provider outage (SES down) → failover route to backup provider; if both fail, queue grows — alert when DLQ rate > 1% of input.
Per-recipient flood (system bug triggers 1k events for one user) — per-user-per-channel rate cap (token bucket: 10/min) before send.
Email reputation hit from spam complaints → bounce/complaint feedback loop → auto-disable for that recipient + audit log.

Observability

SLI: time-to-send p99; provider delivery success rate; DLQ rate.
SLO: 99% of transactional notifications sent within 30 s; bounce rate < 0.5%; complaint rate < 0.1%.
Dashboards: per-channel throughput; per-template open/click (for marketing); DLQ depth.

7. Geo dispatch (Uber/DoorDash-style)

Requirements

Functional: rider requests ride from (lat, lng); system matches nearest available driver within radius/ETA; surge pricing per cell.
Non-functional: 500k drivers in one city, ping at 1 Hz; 10k matches/sec at peak; p99 match < 2 s (driver-side latency includes accept/decline RTT).

Spatial index — defend the choice

Index	Pros	Cons
Geohash	Simple, lexicographic prefix = bbox; widely supported (Redis Geo)	Bad at antimeridian/poles; uneven cell sizes
S2 cells (Google)	Hierarchical, near-uniform area, robust	Library required; less built-in DB support
H3 (Uber)	Hexagonal — uniform neighbor distance, no diagonal/orthogonal mismatch	Less universal than S2
Quadtree	Adaptive density	Rebalancing overhead

Pick S2 or H3 for production; geohash for a quick demo.

High-level

Driver app ── 1Hz ping ──► Location Gateway ─► Redis Geo / Tile38 (live grid)
                                                       │
Rider request ──► Match Service ─► query candidates (S2 ring) ─► score (ETA, rating, fee, recency) ─► offer top K
                                                       │
                                              dispatch ─► driver app via push/WS
                                                       │
                                              accept/decline timeout (15s) → next candidate

Surge: per-cell (pending_requests / available_drivers) over 1-min window → Pricing Service

Match algorithm

Compute requester’s S2 cell (level ~13, ~1km²).
Get k-ring of neighbor cells (radius scales with city density).
Query Redis Geo / Tile38 for drivers in those cells (GEORADIUSBYMEMBER).
Filter: status=available, vehicle type matches, rating ≥ threshold.
Score: ETA (50%) + driver acceptance rate (20%) + fairness (recency since last trip — 30%).
Offer to top candidate; on decline within 15 s → next.

Storage

Live driver state (status, lat, lng, vehicleType, lastSeen) — Redis hash + Geo index. TTL on lastSeen; expired → marked offline.
Trip history — Cassandra/DynamoDB (PK=driverId, SK=tripId/ts).
Surge state — Redis sorted set per cell; aggregator job rolls 1-min window every 10 s.

Bottlenecks / failure

Hot cell at rush hour (downtown) — 50k drivers, 5k requests/min in one cell. Solution: shard cell by sub-cell at higher S2 level when load > threshold; parallelize matching.
Driver location write storm — 500k × 1 Hz = 500k WPS on Redis Geo. Solution: per-region Redis Geo cluster; or use Tile38 which is purpose-built; or batch updates to 2 Hz with interpolation client-side.
Match starvation (popular driver gets all offers) — recency-fairness term in score function.
Driver app offline / network flap — lastSeen > 30 s removes from matchable set; on reconnect, requires explicit “go online” to re-enter.

Failure modes

Redis Geo down → fall back to PostGIS read replica (slower but correct); raise critical alert.
Match service partition → dispatch from neighbor region (with degraded quality), surface “extended wait” to rider.
Driver double-assigned (race between two riders) — SETNX driver:{id}:tripLock with 30 s TTL on offer; release on decline.

Observability

SLI: match p99 latency; offer acceptance rate; unmatched-request rate; ping-to-grid lag.
SLO: 99% of requests matched in < 30 s; 95% of dispatched drivers arrive within stated ETA ± 20%.
Watchlist: Redis Geo write QPS, per-cell surge multiplier (anomaly alert > 3×), driver lastSeen freshness.

Likely follow-ups

“Batch matching (UberPool)” — buffer requests in 5-s window; solve as constrained optimization (assignment problem) per cell; Hungarian algorithm or greedy approximation.
“Driver incentives / heatmap” — derive from surge data; broadcast to driver app as overlay; respects cooldown to avoid driver herding.
“Anti-fraud (GPS spoofing)” — cross-validate against accelerometer + cell-tower triangulation; flag if speed/teleport implausible.

Worked checklist for any interview problem

1. Functional reqs            (what user can do — list 5–7)
2. Non-functional             (QPS, latency p99, durability, availability target)
3. Capacity estimate          (DAU × ops/user/day → QPS; data × retention → storage)
4. API contract               (5–10 endpoints + idempotency + auth)
5. Data model                 (entities + PK/SK/index choices + sharding key)
6. Architecture diagram       (boxes + arrows + data store choices)
7. Read path walkthrough      (the hot path — most reads should be cached)
8. Write path walkthrough     (durability, fanout, async vs sync)
9. Identify bottlenecks       (hot key, hot shard, fanout amplification, tail latency)
10. Failure modes             (DB down, region down, cache stampede, retry storm)
11. Observability + SLOs      (SLIs, dashboards, alerts, burn-rate)
12. Deploy + evolve           (multi-region, GDPR/data residency, cost — if asked)

Timing in a 45-min interview

5 min clarification + capacity math (numbers on whiteboard)
10 min high-level diagram + API + data model
15–20 min deep-dive on the area the interviewer probes (usually a bottleneck)
5 min wrap-up: failure modes, observability, “what would I do with more time”

Common pitfalls (you’ve been warned)

Diving into implementation before clarifying scope.
Skipping capacity math → can’t defend architecture.
One-size-fits-all sharding (always hash) without considering query patterns.
Forgetting the write path (interviewers love asking about it).
No idempotency story.
Treating “cache” as a magic solution without considering invalidation/stampede.
Picking the trendy DB (Cassandra, Spanner) for a problem a Postgres replica would solve.
No failure-mode discussion — interviewers expect it from senior+.