Performance & Profiling — Basics

  • CPU sampling — periodically capture stack; aggregate. Low overhead. Most common.
  • CPU instrumentation — exact counts, higher overhead. Rarely needed.
  • Heap / memory — allocations or live objects. Find leaks.
  • Wall-clock — includes time spent waiting (I/O), unlike CPU.
  • Off-CPU — time blocked. Useful for I/O-bound services.
  • Lock / contention — where threads wait for mutexes.
  • Goroutine / event loop — language-specific.
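As one concrete instance of the instrumentation style, Python's standard-library cProfile records exact call counts and timings for every function call. The sketch below is only illustrative: `parse_numbers` and `workload` are hypothetical functions invented for the example.

```python
import cProfile
import io
import pstats


def parse_numbers(text: str) -> list[int]:
    # Hypothetical hot function: split a comma-separated string into ints.
    return [int(tok) for tok in text.split(",")]


def workload() -> int:
    # Hypothetical workload that calls the hot function many times.
    payload = ",".join(str(i) for i in range(1000))
    total = 0
    for _ in range(200):
        total += sum(parse_numbers(payload))
    return total


profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# Render the five most expensive entries, sorted by cumulative time.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer)
stats.sort_stats("cumulative").print_stats(5)
report = buffer.getvalue()
print(report)
```

Because every call is intercepted, the counts are exact but the measured code runs slower than normal. A sampling tool such as py-spy trades that exactness for lower overhead.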

For each resource (CPU, mem, disk, net, FD, queue):

  • Utilization — % busy.
  • Saturation — work waiting (queue depth, run queue).
  • Errors.
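
For each request-driven service (RED):

  • Rate, Errors, Duration (requests per second, failed requests, latency distribution).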
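Rate, Errors, and Duration can be derived from plain request records. The following is a minimal sketch, assuming a hypothetical `Request` record and a hypothetical `red_summary` helper; neither comes from any particular monitoring library.

```python
from dataclasses import dataclass


@dataclass
class Request:
    # Hypothetical request record; the field names are assumptions for this sketch.
    duration_ms: float
    ok: bool


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """Summarize Rate, Errors, Duration for one observation window."""
    count = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_ms for r in requests)
    # Upper median of the durations; real dashboards usually track several percentiles.
    median = durations[count // 2] if count else 0.0
    return {
        "rate_per_s": count / window_s,
        "error_ratio": errors / count if count else 0.0,
        "median_ms": median,
    }


sample = [
    Request(12.0, True),
    Request(15.0, True),
    Request(250.0, False),
    Request(14.0, True),
]
summary = red_summary(sample, window_s=2.0)
print(summary)
```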

Track time in: on-CPU, runnable, sleep (I/O wait), blocked (lock).
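A rough way to see the on-CPU versus waiting split is to compare a wall clock with a CPU clock around the same call. This is a sketch under simplifying assumptions: `measure`, `io_like`, and `cpu_bound` are hypothetical names, and `time.process_time()` counts CPU time for the whole process, so in a multi-threaded program `time.thread_time()` may be the better fit.

```python
import time


def measure(fn) -> dict[str, float]:
    """Return wall-clock, CPU, and off-CPU seconds for one call of fn."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    # Off-CPU time: sleeping, waiting on I/O or locks, or runnable but not scheduled.
    return {"wall_s": wall, "cpu_s": cpu, "off_cpu_s": max(wall - cpu, 0.0)}


def io_like() -> None:
    # Stand-in for an I/O wait: the thread sleeps and uses almost no CPU.
    time.sleep(0.1)


def cpu_bound() -> None:
    # Busy computation that stays on-CPU.
    sum(i * i for i in range(1_000_000))


waiting = measure(io_like)
busy = measure(cpu_bound)
print("io_like:  ", waiting)
print("cpu_bound:", busy)
```

For `io_like`, almost all of the wall time shows up as off-CPU time. For `cpu_bound`, the wall and CPU figures should be close to each other on an otherwise idle machine.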

Flame graph: x = share of samples (frames sorted alphabetically, not in time order), y = stack depth. Wide bars = hot. Excellent for spotting hot functions.

Tools: Brendan Gregg's flamegraph.pl, async-profiler (Java), pprof (Go), pprof-rs (Rust), py-spy (Python), 0x (Node), Pyroscope (continuous).

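Read from the root frame toward the leaves (bottom-up in flamegraph.pl's default layout, top-down in icicle-style views): adjacent stacks share a prefix of parent frames. A wide plateau at the leaf edge = a function that is itself hot.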
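The aggregation behind such a graph can be sketched in a few lines. Sampled stacks are collapsed into the "folded" text form that flamegraph.pl accepts (semicolon-separated frames, a space, then a count). The stack data and the `fold` and `inclusive_samples` helpers below are hypothetical, made up for illustration.

```python
from collections import Counter

# Hypothetical sampled stacks, root frame first.
samples = [
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "parse_json"),
    ("main", "handle_request", "query_db"),
    ("main", "handle_request", "parse_json"),
    ("main", "gc"),
]


def fold(stacks) -> Counter:
    """Collapse identical stacks into 'frame;frame;frame' -> sample count."""
    return Counter(";".join(stack) for stack in stacks)


def inclusive_samples(folded: Counter) -> Counter:
    """Count the samples in which each function appears anywhere on the stack."""
    totals = Counter()
    for stack, count in folded.items():
        for frame in set(stack.split(";")):
            totals[frame] += count
    return totals


folded = fold(samples)
for stack, count in sorted(folded.items()):
    print(stack, count)  # one "stack count" line per unique stack

width = inclusive_samples(folded)
print(width.most_common())
```

Here `parse_json` is a leaf in 3 of 5 samples, so it would be drawn as the widest plateau. `main` covers all 5 samples because it is the shared root.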

  • Allocation — what allocates a lot? (drives GC pressure).
  • Heap snapshot — what’s currently live? (find leaks).
  • Take snapshots some time apart, then diff them to find classes whose live instances keep growing.

Tools: Chrome DevTools (Node), heaptrack (C/C++), pprof (Go heap), tracemalloc (Python), VisualVM/Eclipse MAT (Java).
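The snapshot-and-diff workflow can be tried with Python's tracemalloc. This is a minimal sketch: `leaky_cache` and `handle_request` are hypothetical stand-ins for code that retains memory by accident.

```python
import tracemalloc

leaky_cache = []  # hypothetical module-level list that is never trimmed


def handle_request(request_id: int) -> None:
    # Simulated leak: every call retains about 10 KB that is never released.
    leaky_cache.append(bytes(10_000))


tracemalloc.start()
before = tracemalloc.take_snapshot()

for request_id in range(100):
    handle_request(request_id)

after = tracemalloc.take_snapshot()
tracemalloc.stop()

# Diff the two snapshots grouped by source line; the largest changes come first.
growth = after.compare_to(before, "lineno")
for stat in growth[:3]:
    print(stat)

top = growth[0]
```

The top entry should point at the `append` line, with a size difference of roughly 1 MB across 100 new blocks.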

  • Mean lies; report p50/p90/p99/p999.
  • p999 sees rare events (GC pauses, slow DB query, network blip).
  • Coordinated omission — many benchmark tools stop issuing requests while a slow response is outstanding, so they understate tail latency under saturation.
  • Trace + log slow requests; investigate per-request via tracing.

Common bottlenecks:

  • DB — N+1, missing index, long lock, slow query.
  • External API — chained sync calls, no timeouts, retry storm.
  • Serialization — JSON parse/stringify on huge payload.
  • GC — large live heap → stop-the-world.
  • Lock contention — single mutex around hot path.
  • Sync I/O on event loop — Node, asyncio.
  • Cold cache — TLB, page cache, app cache.
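To make the percentile advice in the list above concrete, the sketch below computes nearest-rank percentiles over a made-up latency sample. The `percentile` helper and the data are assumptions for illustration; monitoring systems typically use histograms or sketches rather than sorting raw values.

```python
import math
import statistics


def percentile(sorted_values: list[float], p: float) -> float:
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


# Hypothetical latencies in ms: 980 fast requests, 15 slow ones, 5 very slow ones.
latencies = sorted([10.0] * 980 + [200.0] * 15 + [5000.0] * 5)

mean = statistics.fmean(latencies)
report = {
    name: percentile(latencies, p)
    for name, p in [("p50", 50), ("p90", 90), ("p99", 99), ("p999", 99.9)]
}
print(f"mean={mean:.1f} ms", report)
```

In this sample the mean is 37.8 ms, a value no request actually experienced. p50 and p90 are both 10 ms, while p99 (200 ms) and p999 (5000 ms) expose the slow tail.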

When demand > capacity:

  • Queues grow.
  • Latency rises.
  • Errors / timeouts cascade.

Solutions:

  • Rate limit / load shed.
  • Bulkhead resource pools.
  • Async + buffer with cap.
  • Auto-scale.
  • Reduce per-request work.
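
Load shedding with a capped buffer can be sketched with the standard-library queue module. The queue size and the `submit` helper are arbitrary choices for this example, and no worker drains the queue so that the overflow is easy to see.

```python
import queue

# Bounded buffer: the cap turns overload into fast rejections
# instead of unbounded queue growth. The size 3 is arbitrary.
work_queue = queue.Queue(maxsize=3)


def submit(job_id: int) -> bool:
    """Try to enqueue a job; shed it immediately if the buffer is full."""
    try:
        work_queue.put_nowait(job_id)
        return True
    except queue.Full:
        # Load shed: a real service might answer HTTP 429 or 503 here.
        return False


# Simulate a burst of 5 jobs arriving while nothing drains the queue.
accepted = [job for job in range(5) if submit(job)]
shed = [job for job in range(5) if job not in accepted]
print("accepted:", accepted, "shed:", shed)
```

Rejecting early keeps latency bounded for the work that is accepted, at the cost of visible errors for the rest. Whether that trade is right depends on the service.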

Language    | CPU                                      | Heap
Node        | node --prof, 0x, clinic, Chrome DevTools | DevTools heap snapshot, --inspect
Python      | cProfile, py-spy, scalene                | tracemalloc, objgraph, memray
Go          | pprof (CPU/heap/goroutine/block)         | same
Java/Kotlin | async-profiler, JFR                      | Eclipse MAT, VisualVM
Ruby        | stackprof, vernier                       | derailed_benchmarks
.NET        | dotnet-trace, PerfView                   | dotnet-dump, dotMemory

Continuous profiling: always-on, low-overhead sampling in production. Captures profiles even of issues that happen rarely.

Tools: Pyroscope, Parca, Datadog Continuous Profiler, Google Cloud Profiler, Polar Signals. Visualize flamegraphs over time, compare versions.
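The core idea of always-on sampling can be illustrated with a background thread that periodically records another thread's stack. This is a toy sketch, not how the tools above are implemented: it relies on `sys._current_frames()`, a CPython-specific helper, and `sample_stacks` and `busy_work` are hypothetical names. Production profilers usually sample from outside the process or from native code to keep overhead low.

```python
import collections
import sys
import threading
import time


def sample_stacks(target_thread_id: int, counts: collections.Counter,
                  stop: threading.Event, interval_s: float = 0.005) -> None:
    """Periodically record the target thread's stack as a 'root;...;leaf' string."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_thread_id)
        names = []
        while frame is not None:
            names.append(frame.f_code.co_name)
            frame = frame.f_back
        if names:
            counts[";".join(reversed(names))] += 1
        time.sleep(interval_s)


def busy_work(duration_s: float) -> None:
    # Hypothetical hot function that spins for the given duration.
    deadline = time.perf_counter() + duration_s
    while time.perf_counter() < deadline:
        sum(i * i for i in range(1000))


counts = collections.Counter()
stop = threading.Event()
sampler = threading.Thread(
    target=sample_stacks,
    args=(threading.get_ident(), counts, stop),
    daemon=True,
)
sampler.start()
busy_work(0.3)
stop.set()
sampler.join()

for stack, n in counts.most_common(3):
    print(n, stack)
```

Most samples should contain `busy_work`. Shipping such counts to a store at regular intervals is what allows flame graphs to be compared over time.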

  • Metrics tell you “p99 spiked at 12:03”.
  • Tracing tells you “in service X, span DB.query took 800ms”.
  • Profiling tells you “function fooParse took 60% of CPU”.

Use them together.