Skip to content

Performance & Profiling — Theory

L = λ × W

  • L = avg requests in system.
  • λ = arrival rate.
  • W = avg residence time (latency).

Implications:

  • Doubling traffic without lowering latency → doubles concurrency demand.
  • If latency rises, queue length grows fast.
C(N) = N / (1 + α(N-1) + β·N(N-1))

Throughput vs N (workers/cores). Two coefficients:

  • α — contention (serial portion).
  • β — coherence (cross-talk overhead).

Real systems plateau and then regress as N grows because β > 0. More cores ≠ more throughput past a point.

Speedup limited by serial fraction:

S(N) = 1 / (s + p/N)

If 5% serial, max speedup ≈ 20× regardless of how many cores.

Where to start (interview answer template)

Section titled “Where to start (interview answer template)”
  1. Define what’s slow: which request, p99 of which endpoint, on which env.
  2. Measure: confirm with metrics + traces. Don’t optimize on hunch.
  3. Bisect: which span / function consumes the time? Profile.
  4. Hypothesize root cause.
  5. Fix smallest change.
  6. Re-measure.

Profiling is the third step, not the first.

  • Visible as queue growth, rising tail latency.
  • Fix: scale, batch, reduce work, parallelize.
  • Many threads waiting on one mutex.
  • Fix: reduce critical section, switch to lock-free data structure, partition state.
  • Periodic latency spikes that correlate with heap fluctuations.
  • Fix: reduce allocs (object pools, sync.Pool), tune GC, smaller heap.
  • Many small DB calls inside a loop.
  • Fix: batch / dataloader / eager-load.
  • First request after deploy slow.
  • Fix: warm pool, prefetch, lazy init for non-critical, JIT precompile.
  • Load testers measure response time only when they’re already injecting; they pause if a response is slow.
  • This UNDER-reports tail latency.
  • Use wrk2, k6 constant-arrival-rate for honest results.
  • Benchmark — measure throughput / latency under controlled load.
  • Profile — find where time goes inside the program.

You typically need both: benchmark to set targets and quantify gain, profile to identify what to fix.

  • Compiler optimizes dead code away.
  • Cold caches.
  • JIT not warmed up.
  • Power scaling / thermal throttling on laptops.
  • Use library that handles these (Java JMH, Go testing.B, Rust criterion, JS benchmark.js).
  1. p99 latency degraded — investigate. Reproduce; check tracing for slow span; look for resource saturation; recent changes; downstream slowness; check GC; check DB.
  2. CPU at 30% but latency high — explanation. I/O bound, lock contention, slow downstream, GC stalls, network.
  3. Service slow only at p999. Rare events: GC, slow downstream, resource cliff. Look at tracing tail samples.
  4. GC pause causing tail latency in Go/Java. Smaller heap, tuning (GOGC, -XX:MaxGCPauseMillis), allocation reduction (pool, reuse buffers), arenas.
  5. What’s a flame graph? Stacked, aggregated stack samples; wide = hot.
  6. CPU vs wall-clock profile — when each? CPU shows compute hotspots; wall-clock shows where it waits (I/O, locks).
  7. Off-CPU profiling — what? Time spent NOT running (blocked on I/O, lock, sleep). Useful for I/O-bound services.
  8. Async event loop blocked — symptoms? Latency for unrelated requests rises. Detect via event loop lag instrumentation; fix by offloading sync work to worker threads.
  • Optimizing without measuring.
  • Average-only metrics.
  • “Benchmark says X is faster” without representative load.
  • Caching everything (cache invalidation overhead, complexity).
  • Premature parallelism.
  • Long allocations in hot path.
  • Adding more replicas to mask a per-request bug.