Performance & Profiling — Theory
Performance & Profiling — Theory
Section titled “Performance & Profiling — Theory”Latency math (Little’s Law)
Section titled “Latency math (Little’s Law)”L = λ × W
- L = avg requests in system.
- λ = arrival rate.
- W = avg residence time (latency).
Implications:
- Doubling traffic without lowering latency → doubles concurrency demand.
- If latency rises, queue length grows fast.
Universal scalability law (USL)
Section titled “Universal scalability law (USL)”C(N) = N / (1 + α(N-1) + β·N(N-1))Throughput vs N (workers/cores). Two coefficients:
- α — contention (serial portion).
- β — coherence (cross-talk overhead).
Real systems plateau and then regress as N grows because β > 0. More cores ≠ more throughput past a point.
Amdahl’s law
Section titled “Amdahl’s law”Speedup limited by serial fraction:
S(N) = 1 / (s + p/N)If 5% serial, max speedup ≈ 20× regardless of how many cores.
Where to start (interview answer template)
Section titled “Where to start (interview answer template)”- Define what’s slow: which request, p99 of which endpoint, on which env.
- Measure: confirm with metrics + traces. Don’t optimize on hunch.
- Bisect: which span / function consumes the time? Profile.
- Hypothesize root cause.
- Fix smallest change.
- Re-measure.
Profiling is the third step, not the first.
Common patterns
Section titled “Common patterns”Saturation (resource bottleneck)
Section titled “Saturation (resource bottleneck)”- Visible as queue growth, rising tail latency.
- Fix: scale, batch, reduce work, parallelize.
Lock contention
Section titled “Lock contention”- Many threads waiting on one mutex.
- Fix: reduce critical section, switch to lock-free data structure, partition state.
GC pauses
Section titled “GC pauses”- Periodic latency spikes that correlate with heap fluctuations.
- Fix: reduce allocs (object pools, sync.Pool), tune GC, smaller heap.
N+1 queries
Section titled “N+1 queries”- Many small DB calls inside a loop.
- Fix: batch / dataloader / eager-load.
Cold start
Section titled “Cold start”- First request after deploy slow.
- Fix: warm pool, prefetch, lazy init for non-critical, JIT precompile.
Coordinated omission
Section titled “Coordinated omission”- Load testers measure response time only when they’re already injecting; they pause if a response is slow.
- This UNDER-reports tail latency.
- Use
wrk2,k6 constant-arrival-ratefor honest results.
Benchmark vs profile
Section titled “Benchmark vs profile”- Benchmark — measure throughput / latency under controlled load.
- Profile — find where time goes inside the program.
You typically need both: benchmark to set targets and quantify gain, profile to identify what to fix.
Microbenchmark pitfalls
Section titled “Microbenchmark pitfalls”- Compiler optimizes dead code away.
- Cold caches.
- JIT not warmed up.
- Power scaling / thermal throttling on laptops.
- Use library that handles these (Java JMH, Go testing.B, Rust criterion, JS benchmark.js).
Common interview Qs
Section titled “Common interview Qs”- p99 latency degraded — investigate. Reproduce; check tracing for slow span; look for resource saturation; recent changes; downstream slowness; check GC; check DB.
- CPU at 30% but latency high — explanation. I/O bound, lock contention, slow downstream, GC stalls, network.
- Service slow only at p999. Rare events: GC, slow downstream, resource cliff. Look at tracing tail samples.
- GC pause causing tail latency in Go/Java. Smaller heap, tuning (
GOGC,-XX:MaxGCPauseMillis), allocation reduction (pool, reuse buffers), arenas. - What’s a flame graph? Stacked, aggregated stack samples; wide = hot.
- CPU vs wall-clock profile — when each? CPU shows compute hotspots; wall-clock shows where it waits (I/O, locks).
- Off-CPU profiling — what? Time spent NOT running (blocked on I/O, lock, sleep). Useful for I/O-bound services.
- Async event loop blocked — symptoms? Latency for unrelated requests rises. Detect via event loop lag instrumentation; fix by offloading sync work to worker threads.
Anti-patterns
Section titled “Anti-patterns”- Optimizing without measuring.
- Average-only metrics.
- “Benchmark says X is faster” without representative load.
- Caching everything (cache invalidation overhead, complexity).
- Premature parallelism.
- Long allocations in hot path.
- Adding more replicas to mask a per-request bug.