System Design — Practical
System Design — Practical (worked examples)
Section titled “System Design — Practical (worked examples)”Companion to theory.md. Theory = concepts; this doc = how to defend each choice under interview probing. Every example follows: scope → capacity math → API → data model → read/write path → bottlenecks → failure modes → observability.
1. URL shortener
Section titled “1. URL shortener”Requirements
Section titled “Requirements”- Functional: create short URL; redirect; expire; custom slug (optional); per-user analytics (clicks, geo, referrer).
- Non-functional: 100M URLs/year (~5 yr lifetime = 500M live); 10k QPS reads, 100 QPS writes; p99 redirect < 50 ms; 99.95% availability; reads ≫ writes (100:1).
Capacity math (back-of-envelope, write it on the whiteboard)
Section titled “Capacity math (back-of-envelope, write it on the whiteboard)”- 500M URLs × 500 B avg row (id + longUrl + meta) ≈ 250 GB primary store. Fits a single DynamoDB table or a sharded Postgres.
- 10k QPS × 200 B response ≈ 2 MB/s egress at app tier; CDN absorbs most.
- Cache working set: hot 20% of URLs serve 80% of reads → cache top 100M entries × 100 B ≈ 10 GB Redis (one r6g.xlarge).
- Analytics: 10k QPS × 100 B event × 86400 ≈ 86 GB/day raw; aggregated rollups much smaller.
High-level
Section titled “High-level”Client → CDN (1m TTL on 3xx) → Anycast LB → API GW → URL Service ──► DynamoDB (shortId PK) │ └─► Redis (LRU, 99% hit) │ └─► outbox → Kafka → ClickHouse (analytics)POST /shorten { url, customSlug?, ttlDays? }→201 { shortId, shortUrl, expiresAt }. Idempotency-Key header required.GET /:id→301+Cache-Control: public, max-age=60. (301 caches indefinitely in browsers — use 302 if you ever revoke; trade-off: extra traffic.)DELETE /:id(owner only).GET /:id/stats→ owner analytics (separate datastore).
Data model
Section titled “Data model”Primary: shortId (PK, base62), longUrl, ownerId, createdAt, expiresAt, hits (eventually-consistent)ByOwner: ownerId (PK), shortId (SK) — GSI for "list my links"Analytics: ClickHouse (ts, shortId, geo, ua, referrer) — append-onlyID generation — defend the choice
Section titled “ID generation — defend the choice”| Option | Pros | Cons | Verdict |
|---|---|---|---|
| 7-char random base62 (62^7 ≈ 3.5T) | No coordination; collision rate ~birthday-paradox negligible | Need INSERT IF NOT EXISTS retry; not URL-guessing-safe | Default for public shortener |
| Snowflake → base62 | Sortable, no collision | Reveals creation rate to anyone watching | Internal use |
| Hash(longUrl) prefix | Same URL → same short (dedup) | Different URLs collide; not revocable per-user | Rejected — kills owner semantics |
| DB sequence | Simple | Single point of contention; predictable IDs | Rejected at this scale |
Read path (the 99% case)
Section titled “Read path (the 99% case)”- CDN check → most 301s served at edge.
- Miss → API → Redis
GET sid:<id>→ 99% hit → 301. - Cache miss → DynamoDB
GetItem(shortId)(single-digit ms) → set Redis with TTL=1h → respond. - Expired? Return 410 Gone (and
Sunsetheader on the create response).
Write path
Section titled “Write path”- Validate URL (length, scheme http/https only, no internal IPs — SSRF guard).
- Generate id;
PutItem(shortId, ..., ConditionExpression: attribute_not_exists). - On
ConditionalCheckFailedException→ retry with new id (≤3 attempts, then 503). - Same transaction inserts outbox row → Debezium → Kafka → ClickHouse.
Bottlenecks / hot spots
Section titled “Bottlenecks / hot spots”- Celebrity short URL (a tweet links to one short URL → 100k QPS on one key). Fix: pin to multi-AZ Redis with read replicas; probabilistic counter for
hits(CMS or HLL) to avoid write contention; in extreme cases, push the redirect into CDN with a high TTL. - Custom slug land grab — race condition on slug uniqueness. Fix: same conditional write; rate-limit slug creation per user.
- Write storm at link creation (a popular SaaS bulk-creating 1M links) → token-bucket rate-limit per
ownerId.
Failure modes
Section titled “Failure modes”- DynamoDB region down → fail over to Global Tables replica (eventual lag <1s).
- Redis cluster down → app degrades to DB-only reads; p99 doubles but stays under 100 ms; raise alert, don’t page.
- ClickHouse down → outbox queues;
hitslag visible to owner but doesn’t block redirects. - Bad URL (malicious / phishing) → integrate Google Safe Browsing API on
POST /shorten; mark on hit, serve interstitial.
Observability
Section titled “Observability”- SLI: p99 redirect latency; redirect 5xx rate.
- SLO: 99.95% of redirects < 50 ms over 28-day window.
- Key dashboards: Redis hit rate (target >98%), DDB throttles (target 0), CDN hit ratio, outbox lag.
- Alerts: SLO burn-rate alerts at 2% and 5% over 1h/6h windows.
Likely follow-ups
Section titled “Likely follow-ups”- “Now make it work in 3 regions.” → Global Tables for primary; per-region Redis (don’t cross-region replicate cache); ClickHouse single region with regional ingesters → cross-region replication.
- “Owner wants real-time click counts.” → ClickHouse
MATERIALIZED VIEWon Kafka topic; owner dashboard queries directly (sub-second). - “Bot abuse on
/shorten.” → per-IP + per-user token bucket at edge + reCAPTCHA on anomaly.
2. Twitter feed (home timeline)
Section titled “2. Twitter feed (home timeline)”Requirements
Section titled “Requirements”- Functional: post tweet (≤280 chars + media); home timeline (people you follow, reverse-chron + ranked); like/retweet/reply counts.
- Non-functional: 200M DAU × ~5 tweets/day → ~12k tweets/sec write; 200M × 50 reads/day → ~120k timeline reads/sec; p99 timeline < 200 ms; fanout amplification = avg 200 followers × 12k tweets/sec ≈ 2.4M timeline writes/sec.
Capacity math
Section titled “Capacity math”- Tweet: ~300 B compressed → 12k × 300 B × 86400 ≈ 300 GB/day new tweets; ~110 TB/year.
- Timeline cache: 200M users × 800 cached tweet-IDs × 12 B ≈ 2 TB — sharded Redis cluster, say 200 nodes.
- Followers graph: 200M users × 200 avg followers (long tail to 100M) → store edges in dedicated graph store (DynamoDB adjacency list, or specialized like Twitter’s FlockDB historically).
High-level
Section titled “High-level”Tweet write ─► Tweet service ─► Tweet DB (Manhattan / DynamoDB, PK=user, SK=ts) │ ├─► Fanout service ─► Redis timelines (ZADD per follower) [push path] │ └─► Search/index pipeline ─► Kafka ─► Elasticsearch
Timeline read ─► Timeline service ─► Redis ZREVRANGE (cached push timeline) │ └─► merge celebrity tweets (pull) ─► Tweet DB [hybrid] │ └─► ranker (ML) ─► return top NFanout strategies — defend the hybrid
Section titled “Fanout strategies — defend the hybrid”| Strategy | Cost | When |
|---|---|---|
| Push (write-time fanout) | Write = O(followers) — bad for celebs (Cristiano ~600M followers = 600M writes per tweet) | Default for users < 10k followers |
| Pull (read-time fanout) | Read = O(followed users) — expensive for users following many | Default for tweets by celebrities |
| Hybrid | Push for normals; pull-merge for celeb authors | What Twitter actually does |
Threshold tuning: 10k followers is rough; pick by p99 fanout latency budget. Pre-compute “is celeb” flag on user; sticky.
Storage layout
Section titled “Storage layout”- Tweets (canonical): KV store sharded by
tweetId(snowflake). Hot recent tweets in cache. - User → tweets index: PK=
userId, SK=ts DESC— drives author profile pages. - Timelines (Redis ZSET per user):
ZADD user:{u}:timeline ts tweetId; trim to ~800 entries (ZREMRANGEBYRANK); TTL 30 days for cold users (rebuild on-demand). - Followers / following: bidirectional adjacency lists, sharded by user ID.
Read path (home timeline)
Section titled “Read path (home timeline)”ZREVRANGE user:{u}:timeline 0 199→ 200 candidate tweet IDs (cached push path).- Merge celebrity tweets: for each celeb in
following(u) WHERE isCeleb,MGETlast 50 → merge sorted by ts. - Multiget tweets by ID (batch
MGET tweets). - Rank (ML scorer) → return top 100.
- Cache the assembled page for 30 s (next pull-down refresh hits cache).
Write path (tweet)
Section titled “Write path (tweet)”- Persist tweet (canonical store).
- Update author’s profile timeline.
- Enqueue fanout job (Kafka).
- Fanout worker:
- Get follower list (paginated; for normal users one batch).
- For each follower NOT marked “do not push” (inactive >30d),
ZADDto their timeline. - For celeb authors: skip fanout entirely; readers will pull.
- Async: index for search; analytics events.
Bottlenecks / hot spots
Section titled “Bottlenecks / hot spots”- Celebrity tweet spike read traffic — 600M followers refresh in minutes. Solution: pull-on-read + cache the celeb’s last-N tweets globally (single hot key replicated via Redis read replicas + in-process L1).
- Mega-fanout latency for the in-between users (1M followers): batch fanout into 10k-follower chunks, parallelize across workers, target p99 fanout completion < 5 s.
- Inactive user fanout waste — 50% of accounts haven’t logged in for 30 days. Skip them; rebuild timeline on next login from pull path (slower first load OK).
- Cache stampede on celeb tweet — use request coalescing (singleflight) at the timeline service.
Failure modes
Section titled “Failure modes”- Fanout worker lag → push timelines stale; UI shows “X new tweets” via separate counter (also cached). Page Stories at >60 s lag.
- Redis shard down → that shard’s users see pull-path-only (slow but correct).
- Tweet DB hot partition (a celeb posts 100× in a minute) → pre-shard celeb users into sub-partitions by
(userId, hourBucket).
Observability
Section titled “Observability”- SLI: p99 timeline read; p99 fanout completion; cache hit rate per shard.
- SLO: 99.9% of home timeline reads < 200 ms; 99% of tweets in followers’ timelines within 10 s.
- Watchlist: fanout queue depth; Redis evictions; ranker latency.
Likely follow-ups
Section titled “Likely follow-ups”- “How to add replies/quote-tweets without exploding storage?” → store as tweets with
inReplyTopointer; thread reconstruction is a separate read path with its own cache. - “What if a celeb deletes a tweet?” → write tombstone to Tweet DB; readers filter; eventual GC from timelines.
- “Rate-limit posting?” → per-user token bucket (300/3h is Twitter’s), enforced at API GW.
3. Chat (1:1 + group, real-time)
Section titled “3. Chat (1:1 + group, real-time)”Requirements
Section titled “Requirements”- Functional: 1:1 + group (≤500) chat; presence; typing indicators; read receipts; offline delivery via push; history pagination; media.
- Non-functional: 100M DAU; 10M concurrent WebSocket connections at peak; p99 message delivery < 500 ms in-region; durability — no message lost on server crash; history retention 1 year.
Capacity math
Section titled “Capacity math”- Connections: 10M concurrent / ~50k per node (Linux file-descriptor + memory limits, Envoy/Netty tuned) = ~200 connection-gateway nodes.
- Messages: 100M DAU × 50 msgs/day = 5B/day → ~60k msgs/sec avg, 200k+ peak.
- Message size: ~500 B incl. metadata → 60k × 500 B × 86400 ≈ 2.5 TB/day; ~1 PB/year (need cold tiering).
- Presence: 10M × 1 Hz heartbeat = 10M ops/sec on Redis — shard heavily.
High-level
Section titled “High-level”Client ──WS──► Conn Gateway (sticky) ─► Chat Service ─► Message DB (Cassandra/Scylla, PK=convId, CK=ts) │ │ │ └─► Kafka (msg events) ─► fanout dispatcher │ │ └◄─── (msg push) ◄───── Redis pub/sub (per-conn-node) ◄┘ │ offline ─► APNS / FCMConnection layer — critical detail
Section titled “Connection layer — critical detail”- WebSocket pinned to gateway node via consistent hashing on
userId; control plane maintainsuserId → nodeIdin Redis (TTL 30s, heartbeat refresh). - Sticky load balancer (Envoy with
RING_HASHLB) keeps reconnects on the same node when possible (avoids cache thrash). - Node failure: client reconnects, lands on new node via consistent-hash next-bucket; in-flight unack’d messages re-delivered (idempotent via
clientMessageId).
Send flow
Section titled “Send flow”- Client sends
{ convId, clientMsgId, body }over WS. - Gateway forwards to Chat Service.
- Chat Service:
INSERTinto messages table — Cassandra withconvIdas partition key,ts(timeuuid) as clustering key; LWT only for clientMsgId dedup (avoid the slow path unless needed).- Publish to Kafka topic
messages.{shard}.
- Fanout dispatcher consumes Kafka; for each recipient:
- Look up their conn node in Redis.
- Publish to that node’s channel.
- Node delivers over WS; awaits client ACK; retry on disconnect.
- Recipient offline → write to APNS/FCM queue + persist in
pending_pushestable. - Sender receives
MESSAGE_SAVEDACK once Cassandra write commits (≤50 ms).
Read history
Section titled “Read history”SELECT FROM messages WHERE convId=? AND ts < ? LIMIT 50(cursor onts).- Cassandra clustering order means reverse-chron is a single seek per partition — fast.
- Avoid loading all members on every fetch — fetch member list on conversation open, cache.
Presence
Section titled “Presence”SETEX presence:{userId} 30 nodeIdon heartbeat (every 10 s).- “online” = key exists.
- Cross-node presence subscription: client subscribes via WS → gateway joins Redis pub/sub channel
presence:{contactList}(or batched user-list query every N seconds — cheaper than per-key SUB).
Bottlenecks / hot spots
Section titled “Bottlenecks / hot spots”- Mega-group (500 members) message → 500 fanout sends per message; pre-fetch member list, batch Redis pub/sub publishes.
- Large group + offline majority → bulk-enqueue push notifications; APNS/FCM have batch APIs.
- Cassandra wide-row for very long conversations → cap partition at 10k messages (bucket by month:
convId_2026-05). - Redis presence churn on flapping connections → use longer TTL (60s) + explicit
DELon graceful disconnect.
Failure modes
Section titled “Failure modes”- Conn node OOM under spike → connection limit per node; LB sheds excess (
503upgrade response, client retries elsewhere). - Kafka partition lag → fanout delayed; show “delivering…” indicator in UI; messages persisted so no loss.
- Push provider (APNS) outage → DLQ + exponential backoff; eventually drop with audit log; in-app badge updates when user opens.
- Duplicate delivery (client lost ACK) → dedup on
clientMessageIdin client-side store.
Observability
Section titled “Observability”- SLI: p99 send-to-deliver latency; WS reconnect rate; push success rate.
- SLO: 99.9% messages delivered in-region within 500 ms; 99% push success within 30 s.
- Watchlist: connections per node, Cassandra p99 write, Kafka consumer lag, Redis presence ops/sec.
Likely follow-ups
Section titled “Likely follow-ups”- “E2EE?” → Signal protocol or libsignal; server stores only ciphertext; key exchange via published prekeys; complicates group (sender keys / MLS).
- “Search across history?” → Kafka → Elasticsearch indexer; per-conversation index permissions enforced at query time.
- “Multi-device sync?” → device list per user; fanout to each device’s conn node; per-device ack tracking.
4. Distributed rate limiter
Section titled “4. Distributed rate limiter”Requirements
Section titled “Requirements”- Functional: allow/deny per
(key, window); multiple algorithms (fixed, sliding log, sliding window, token bucket, leaky bucket); per-route configurable. - Non-functional: >100k decisions/sec; p99 decision < 1 ms (it’s on every request’s hot path); fail-open vs fail-closed configurable per route.
Algorithm choice — defend it
Section titled “Algorithm choice — defend it”| Algorithm | Memory | Accuracy | Burst behavior | Use |
|---|---|---|---|---|
| Fixed window counter | 1 int per key | Boundary burst (2× at window flip) | None | Cheap edge counters |
| Sliding log (ZSET of timestamps) | O(N) per key | Exact | Smooth | Low-volume strict limits |
| Sliding window counter | 2 ints (current + prev window) | ~99% accurate | Smooth | Default for API rate limiting |
| Token bucket | (tokens, lastRefillTs) | Exact | Bursty up to bucket size | Best for “10 req/s but allow burst of 50” |
| Leaky bucket | Queue / counter | Exact, paces requests | None (smooths output) | Traffic shaping toward downstream |
Token bucket via Redis Lua (atomic, no race)
Section titled “Token bucket via Redis Lua (atomic, no race)”-- KEYS[1] = bucket key-- ARGV = { now_ms, refill_per_ms, capacity, cost }local data = redis.call('HMGET', KEYS[1], 'tokens', 'ts')local tokens = tonumber(data[1]) or tonumber(ARGV[3])local last = tonumber(data[2]) or tonumber(ARGV[1])local elapsed = tonumber(ARGV[1]) - lasttokens = math.min(tonumber(ARGV[3]), tokens + elapsed * tonumber(ARGV[2]))local cost = tonumber(ARGV[4])local allowed = 0if tokens >= cost then allowed = 1 tokens = tokens - costendredis.call('HMSET', KEYS[1], 'tokens', tokens, 'ts', ARGV[1])redis.call('PEXPIRE', KEYS[1], 60000)return { allowed, tokens }Why Lua: EVAL runs atomically on the Redis single-thread, eliminating the read-modify-write race that pure GET/SET cannot avoid.
Layered defense — depth, not one wall
Section titled “Layered defense — depth, not one wall”Edge (Cloudflare / nginx) per IP, coarse, mostly DDoS shield │API gateway per API key / user, business limits │Service per endpoint class (expensive routes get tighter limits) │Downstream call sites circuit breaker (rate-limit is a poor fit here; breaker is)Scale-out
Section titled “Scale-out”- Redis Cluster sharded by
key(hash-tag if you need atomicity across related keys:{user:42}:writeand{user:42}:read). - Local-first counters (per-pod) flushed to Redis every 100 ms — saves 95% of RTTs at cost of small over-allowance during the flush interval; acceptable for non-billing limits.
- For billing limits, never local-first; always centralized.
Bottlenecks / failure
Section titled “Bottlenecks / failure”- Redis cell down → fail-open for public endpoints (don’t block legitimate traffic), fail-closed for
POSTwrite endpoints (prevent abuse). Configure per route. - Hot key (one API key driving all traffic) → shard by
(key, bucketIndex)and sum; or push that customer’s limit-check to a dedicated Redis shard. - Time skew between app servers → use Redis’s
TIMEcommand as authoritativenow_ms— same source for all callers.
Response shape (matters for client retry hygiene)
Section titled “Response shape (matters for client retry hygiene)”HTTP/1.1 429 Too Many RequestsX-RateLimit-Limit: 100X-RateLimit-Remaining: 0X-RateLimit-Reset: 1716000060Retry-After: 12Observability
Section titled “Observability”- SLI: decision p99; Redis Lua error rate.
- Watchlist: % requests rate-limited per route; per-customer denial spikes (signals abuse OR a misbehaving integration).
5. Payment flow
Section titled “5. Payment flow”Requirements
Section titled “Requirements”- Functional: authorize + capture; refund; partial refund; chargeback handling; ledger reconciliation; multi-currency.
- Non-functional: zero double-charge (highest priority, dwarfs latency); idempotent end-to-end; durable audit; p99 < 1.5 s including PSP round-trip; PCI scope minimized.
Idempotency — defend the design
Section titled “Idempotency — defend the design”- Client supplies
Idempotency-Key(UUIDv4) on every mutating request. - Server stores
(key, requestHash, responseBody, status, createdAt)in Postgres withUNIQUE(key). - On replay:
- Same
key+ samerequestHash→ return cached response. - Same
key+ differentrequestHash→409 Conflict(client bug). keynot found → process; insert row in same transaction as charge.
- Same
- TTL on idempotency rows: 24h minimum, ideally aligned to PSP retry policy (Stripe = 24h).
Saga orchestration (the canonical flow)
Section titled “Saga orchestration (the canonical flow)” ┌── reserveInventory ──► InventoryService ├── authorizePayment ──► PSP (Stripe/Adyen)Orchestrator (Temporal) ── capturePayment ──► PSP ├── markOrderPaid ─────► OrderService └── notifyCustomer ────► NotificationService
on failure: compensations run in reverse refundCapture → voidAuthorization → releaseInventoryWhy orchestration not choreography: payment flow has explicit ordering, branching (e.g., 3DS challenge), and time-outs. Auditors want a single state machine.
Pivot point: capture is the point of no return; everything after it must succeed (retries) — compensation = refund, not rollback. (See microservices theory for saga semantics.)
Ledger — double-entry
Section titled “Ledger — double-entry”- Journal: append-only
(txId, account, side {debit|credit}, amount, currency, ts, metadata). Every entry has a paired counterpart; sum-by-side pertxIdmust equal zero. - Balance: materialized view OR
SUM(amount * sign)per account; for hot accounts use materialized with row-level lock on update. - Invariant check in CI + nightly:
SELECT account, SUM(amount * CASE side WHEN 'debit' THEN -1 ELSE 1 END) FROM journal GROUP BY accountmatchesbalances. Any drift = bug.
Race conditions
Section titled “Race conditions”- Concurrent capture + refund —
SELECT ... FOR UPDATEon the order row; or use Postgres advisory lock keyed by orderId. - Inventory + auth race (auth succeeds, inventory already sold out) — compensate with
voidAuthorization; never capture if inventory is gone. - PSP webhook arrives before our DB commit — webhook handler must be idempotent on
psp_event_id; store-then-process; reconcile against expected state.
Reconciliation
Section titled “Reconciliation”- Daily: pull PSP settlement file (CSV via SFTP for old providers, API for Stripe).
- Diff
(amount, currency, ourTxId)against journal entries forts BETWEEN settlement.start AND settlement.end. - Mismatches → alert + manual review queue. Never auto-resolve money differences.
Bottlenecks / failure
Section titled “Bottlenecks / failure”- PSP timeout (call to Stripe hung at 10 s) — we don’t know if the charge happened. Resolution: poll PSP
GET /charges/:idwith our idempotency key; never retry blindly. - Workflow worker dies mid-saga — Temporal recovers from durable state; idempotent activities replay safely.
- Database failover during capture — saga activity returns
RETRY; Temporal re-invokes with same args; idempotency layer dedups. - Webhook flood during PSP retry storm → queue webhooks into Kafka, process by worker pool.
PCI scope
Section titled “PCI scope”- Card data never touches your servers; tokenize via PSP SDK (Stripe Elements / Adyen Web Drop-in).
- Server stores only PSP token + last4 + brand.
- Reduces PCI from SAQ-D to SAQ-A.
Observability
Section titled “Observability”- SLI: auth success rate (by PSP, by issuer, by currency); refund latency; reconciliation diff per day.
- SLO: zero unreconciled cents at 30-day close.
- Alerts: journal-sum drift; webhook lag > 5 min; auth success drop > 2σ from baseline (PSP outage signal).
6. Notification system
Section titled “6. Notification system”Requirements
Section titled “Requirements”- Functional: triggered by domain events; route to channel(s) (email, SMS, push, in-app); respect user prefs + quiet hours + opt-outs; multilingual templates; A/B variant support.
- Non-functional: 1B notifications/day → ~12k/sec avg, 50k+ peak; provider failures isolated per channel; near-zero duplicates; transactional must beat marketing on priority queue.
High-level
Section titled “High-level”Event (Kafka domain topic) ─► Notification Service (preference + template resolution) │ ├─► email queue ─► email workers ─► SES/SendGrid ├─► sms queue ─► sms workers ─► Twilio ├─► push queue ─► push workers ─► APNS / FCM └─► inapp queue ─► inapp writer ─► user inbox DB
All workers: idempotency (user, event_id, channel), retry+DLQ, provider failover.Preference resolution
Section titled “Preference resolution”- DB:
user_prefs(userId, channel, category, enabled, quietHoursStart, quietHoursEnd, tz). - Service caches prefs in Redis with 5-min TTL; updates invalidate on write.
- Resolution: for each
(user, event, channel)candidate → check enabled → check category opt-out → check quiet hours in user tz → enqueue or drop.
Templates
Section titled “Templates”- Stored in a CMS (or Git-versioned YAML) with
{ subject, bodyHtml, bodyText, locale }. - Render at send time with user data (don’t pre-render; user may update name between trigger and send).
- A/B: template_id has variants; assignment sticky per user.
Idempotency
Section titled “Idempotency”- Key:
(userId, eventId, channel). Stored in Redis (TTL 7d) + Postgressent_notificationstable. - Workers
SETNXbefore send; if exists → skip. After send: write tosent_notificationswith provider message ID for audit.
Retry / DLQ
Section titled “Retry / DLQ”- Exponential backoff: 30s, 2m, 10m, 1h, 6h.
- After 5 attempts → DLQ (Kafka topic
notifications.dlq) for analysis + optional manual replay. - Distinguish transient (5xx, timeout) from permanent (invalid email, opted out at provider) — permanent goes straight to DLQ.
Priority lanes
Section titled “Priority lanes”- Separate Kafka topics + consumer groups for
transactional(password reset, payment receipt) vsmarketing. - Transactional workers scale aggressively; marketing throttled to provider sending limits.
Bottlenecks / failure
Section titled “Bottlenecks / failure”- Provider outage (SES down) → failover route to backup provider; if both fail, queue grows — alert when DLQ rate > 1% of input.
- Per-recipient flood (system bug triggers 1k events for one user) — per-user-per-channel rate cap (token bucket: 10/min) before send.
- Email reputation hit from spam complaints → bounce/complaint feedback loop → auto-disable for that recipient + audit log.
Observability
Section titled “Observability”- SLI: time-to-send p99; provider delivery success rate; DLQ rate.
- SLO: 99% of transactional notifications sent within 30 s; bounce rate < 0.5%; complaint rate < 0.1%.
- Dashboards: per-channel throughput; per-template open/click (for marketing); DLQ depth.
7. Geo dispatch (Uber/DoorDash-style)
Section titled “7. Geo dispatch (Uber/DoorDash-style)”Requirements
Section titled “Requirements”- Functional: rider requests ride from
(lat, lng); system matches nearest available driver within radius/ETA; surge pricing per cell. - Non-functional: 500k drivers in one city, ping at 1 Hz; 10k matches/sec at peak; p99 match < 2 s (driver-side latency includes accept/decline RTT).
Spatial index — defend the choice
Section titled “Spatial index — defend the choice”| Index | Pros | Cons |
|---|---|---|
| Geohash | Simple, lexicographic prefix = bbox; widely supported (Redis Geo) | Bad at antimeridian/poles; uneven cell sizes |
| S2 cells (Google) | Hierarchical, near-uniform area, robust | Library required; less built-in DB support |
| H3 (Uber) | Hexagonal — uniform neighbor distance, no diagonal/orthogonal mismatch | Less universal than S2 |
| Quadtree | Adaptive density | Rebalancing overhead |
Pick S2 or H3 for production; geohash for a quick demo.
High-level
Section titled “High-level”Driver app ── 1Hz ping ──► Location Gateway ─► Redis Geo / Tile38 (live grid) │Rider request ──► Match Service ─► query candidates (S2 ring) ─► score (ETA, rating, fee, recency) ─► offer top K │ dispatch ─► driver app via push/WS │ accept/decline timeout (15s) → next candidate
Surge: per-cell (pending_requests / available_drivers) over 1-min window → Pricing ServiceMatch algorithm
Section titled “Match algorithm”- Compute requester’s S2 cell (level ~13, ~1km²).
- Get k-ring of neighbor cells (radius scales with city density).
- Query Redis Geo / Tile38 for drivers in those cells (
GEORADIUSBYMEMBER). - Filter: status=available, vehicle type matches, rating ≥ threshold.
- Score: ETA (50%) + driver acceptance rate (20%) + fairness (recency since last trip — 30%).
- Offer to top candidate; on decline within 15 s → next.
Storage
Section titled “Storage”- Live driver state (
status, lat, lng, vehicleType, lastSeen) — Redis hash + Geo index. TTL on lastSeen; expired → marked offline. - Trip history — Cassandra/DynamoDB (PK=driverId, SK=tripId/ts).
- Surge state — Redis sorted set per cell; aggregator job rolls 1-min window every 10 s.
Bottlenecks / failure
Section titled “Bottlenecks / failure”- Hot cell at rush hour (downtown) — 50k drivers, 5k requests/min in one cell. Solution: shard cell by sub-cell at higher S2 level when load > threshold; parallelize matching.
- Driver location write storm — 500k × 1 Hz = 500k WPS on Redis Geo. Solution: per-region Redis Geo cluster; or use Tile38 which is purpose-built; or batch updates to 2 Hz with interpolation client-side.
- Match starvation (popular driver gets all offers) — recency-fairness term in score function.
- Driver app offline / network flap —
lastSeen > 30 sremoves from matchable set; on reconnect, requires explicit “go online” to re-enter.
Failure modes
Section titled “Failure modes”- Redis Geo down → fall back to PostGIS read replica (slower but correct); raise critical alert.
- Match service partition → dispatch from neighbor region (with degraded quality), surface “extended wait” to rider.
- Driver double-assigned (race between two riders) —
SETNX driver:{id}:tripLockwith 30 s TTL on offer; release on decline.
Observability
Section titled “Observability”- SLI: match p99 latency; offer acceptance rate; unmatched-request rate; ping-to-grid lag.
- SLO: 99% of requests matched in < 30 s; 95% of dispatched drivers arrive within stated ETA ± 20%.
- Watchlist: Redis Geo write QPS, per-cell surge multiplier (anomaly alert > 3×), driver lastSeen freshness.
Likely follow-ups
Section titled “Likely follow-ups”- “Batch matching (UberPool)” — buffer requests in 5-s window; solve as constrained optimization (assignment problem) per cell; Hungarian algorithm or greedy approximation.
- “Driver incentives / heatmap” — derive from surge data; broadcast to driver app as overlay; respects cooldown to avoid driver herding.
- “Anti-fraud (GPS spoofing)” — cross-validate against accelerometer + cell-tower triangulation; flag if speed/teleport implausible.
Worked checklist for any interview problem
Section titled “Worked checklist for any interview problem”1. Functional reqs (what user can do — list 5–7)2. Non-functional (QPS, latency p99, durability, availability target)3. Capacity estimate (DAU × ops/user/day → QPS; data × retention → storage)4. API contract (5–10 endpoints + idempotency + auth)5. Data model (entities + PK/SK/index choices + sharding key)6. Architecture diagram (boxes + arrows + data store choices)7. Read path walkthrough (the hot path — most reads should be cached)8. Write path walkthrough (durability, fanout, async vs sync)9. Identify bottlenecks (hot key, hot shard, fanout amplification, tail latency)10. Failure modes (DB down, region down, cache stampede, retry storm)11. Observability + SLOs (SLIs, dashboards, alerts, burn-rate)12. Deploy + evolve (multi-region, GDPR/data residency, cost — if asked)Timing in a 45-min interview
Section titled “Timing in a 45-min interview”- 5 min clarification + capacity math (numbers on whiteboard)
- 10 min high-level diagram + API + data model
- 15–20 min deep-dive on the area the interviewer probes (usually a bottleneck)
- 5 min wrap-up: failure modes, observability, “what would I do with more time”
Common pitfalls (you’ve been warned)
Section titled “Common pitfalls (you’ve been warned)”- Diving into implementation before clarifying scope.
- Skipping capacity math → can’t defend architecture.
- One-size-fits-all sharding (always hash) without considering query patterns.
- Forgetting the write path (interviewers love asking about it).
- No idempotency story.
- Treating “cache” as a magic solution without considering invalidation/stampede.
- Picking the trendy DB (Cassandra, Spanner) for a problem a Postgres replica would solve.
- No failure-mode discussion — interviewers expect it from senior+.