Performance & Profiling — Basics
Profile types
- CPU sampling — periodically capture the stack; aggregate. Low overhead. Most common.
- CPU instrumentation — exact counts, higher overhead. Rarely needed.
- Heap / memory — allocations or live objects. Find leaks.
- Wall-clock — includes time spent waiting (I/O), unlike CPU.
- Off-CPU — time blocked. Useful for I/O-bound services.
- Lock / contention — where threads wait for mutexes.
- Goroutine / event loop — language-specific.
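To make the CPU sampling idea concrete, here is a minimal sketch of a sampler written with the standard library only. It assumes CPython (it relies on `sys._current_frames()`), and `busy_work` and `profile_busy_work` are hypothetical names invented for the illustration. Real samplers such as py-spy work from outside the process and are far cheaper; this only shows the "capture stack periodically, then aggregate" loop.

```python
import collections
import sys
import threading
import time


def sample_thread(target_thread_id, interval_s, duration_s):
    """Sample one thread's stack every interval_s seconds.

    Returns a Counter mapping a stack (tuple of function names, root first)
    to the number of samples in which that exact stack was seen.
    """
    counts = collections.Counter()
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        frame = sys._current_frames().get(target_thread_id)
        stack = []
        while frame is not None:
            stack.append(frame.f_code.co_name)
            frame = frame.f_back
        if stack:
            counts[tuple(reversed(stack))] += 1
        time.sleep(interval_s)
    return counts


def busy_work(stop_event):
    # Hypothetical CPU-bound workload to be profiled.
    total = 0
    while not stop_event.is_set():
        total += sum(i * i for i in range(1000))
    return total


def profile_busy_work(duration_s=0.3, interval_s=0.01):
    """Run busy_work in a thread and sample it from the calling thread."""
    stop = threading.Event()
    worker = threading.Thread(target=busy_work, args=(stop,))
    worker.start()
    try:
        return sample_thread(worker.ident, interval_s, duration_s)
    finally:
        stop.set()
        worker.join()
```

Calling `profile_busy_work()` returns stack counts; the stacks with the highest counts are where the thread spent most of its sampled time.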
Methodology — pick one
USE (Brendan Gregg) — for resources
For each resource (CPU, mem, disk, net, FD, queue):
- Utilization — % busy.
- Saturation — work waiting (queue depth, run queue).
- Errors.
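The three USE checks can be sketched as a small function over per-resource counters. The `ResourceSample` fields and the 80% threshold are assumptions made for this example, not values from any particular monitoring system.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical counters collected for one resource over one interval."""
    name: str
    busy_seconds: float      # time the resource spent doing work
    interval_seconds: float  # length of the observation window
    queued: int              # work items still waiting at the end of the window
    errors: int              # error events during the window


def use_report(sample):
    """Return (utilization %, saturation, errors) for one resource."""
    utilization = 100.0 * sample.busy_seconds / sample.interval_seconds
    return round(utilization, 1), sample.queued, sample.errors


def flag(sample, util_threshold=80.0):
    """List which of the three USE checks look suspicious."""
    utilization, saturation, errors = use_report(sample)
    findings = []
    if utilization >= util_threshold:
        findings.append("utilization")
    if saturation > 0:
        findings.append("saturation")
    if errors > 0:
        findings.append("errors")
    return findings
```

Running the same three checks over every resource in turn is the point of the method: it gives a complete pass rather than a guess at the likely culprit.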
RED — for services
- Rate (requests/s), Errors (failed requests), Duration (latency distribution).
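As a rough illustration, the three RED numbers can be derived from a window of request records. The `(duration_ms, ok)` tuple shape is an assumption for this sketch; a real service would usually read these from its metrics library, and would report duration percentiles rather than only a median and maximum.

```python
import statistics


def red_metrics(requests, window_seconds):
    """Compute Rate, Errors, Duration for one observation window.

    requests: list of (duration_ms, ok) tuples seen within the window.
    """
    total = len(requests)
    failed = sum(1 for _, ok in requests if not ok)
    durations = [duration for duration, _ in requests]
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": failed / total if total else 0.0,
        "duration_ms_median": statistics.median(durations) if total else None,
        "duration_ms_max": max(durations) if total else None,
    }
```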
TSA (Thread State Analysis)
Track time in: on-CPU, runnable, sleep (I/O wait), blocked (lock).
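A sketch of the bookkeeping behind thread state analysis: given a list of state transitions for one thread, sum the time spent in each state. The input format (timestamp in seconds, state name) is hypothetical; in practice the transitions would come from a tracer or scheduler statistics.

```python
from collections import defaultdict


def time_in_state(transitions, end_time):
    """Total seconds spent in each state.

    transitions: list of (timestamp_s, state), sorted by timestamp.
    end_time: timestamp at which observation stopped.
    """
    totals = defaultdict(float)
    # Pair each transition with the start of the next one (or end_time).
    following = transitions[1:] + [(end_time, None)]
    for (start, state), (next_start, _) in zip(transitions, following):
        totals[state] += next_start - start
    return dict(totals)
```

The state with the largest total tells you which kind of tool to reach for next: a CPU profiler for on-CPU time, lock profiling for blocked time, and so on.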
Flame graph
Visualization: x = share of samples (aggregated and sorted, not a timeline), y = stack depth. Wide bars = hot. Excellent for spotting hot functions.
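Tools: Brendan Gregg’s flamegraph.pl, async-profiler (Java), pprof (Go), pprof-rs (Rust), py-spy (Python), 0x (Node), Pyroscope (continuous).
Read from the root outward: stacks share a prefix of parent frames (root at the bottom in a classic flame graph, at the top in an icicle layout). A plateau at the leaf edge = hot leaf function.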
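The aggregation step behind a flame graph can be sketched as follows: identical stacks are collapsed into one line of the form `root;child;leaf count`, which is the folded format that flamegraph.pl reads. The sample data and function names here are made up for illustration.

```python
from collections import Counter


def fold_stacks(samples):
    """Collapse stack samples into folded lines: 'a;b;c <count>'.

    samples: iterable of stacks, each a list of frame names, root first.
    """
    folded = Counter(";".join(stack) for stack in samples)
    return [f"{stack} {count}" for stack, count in sorted(folded.items())]


def self_samples(samples):
    """Count samples per leaf frame (the plateaus in the rendered graph)."""
    return Counter(stack[-1] for stack in samples)
```

Sorting the folded stacks is what makes shared prefixes sit next to each other, so the renderer can merge them into one wide parent bar.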
Memory profiling
- Allocation — what allocates a lot? (drives GC pressure).
- Heap snapshot — what’s currently live? (find leaks).
- Take snapshots some interval apart → diff them to find growing classes.
Tools: Chrome DevTools (Node), heaptrack (C/C++), pprof (Go heap), tracemalloc (Python), VisualVM/Eclipse MAT (Java).
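The snapshot-and-diff approach can be tried with Python's `tracemalloc`. In this sketch the leak is simulated: `_leak` and `handle_request` are hypothetical stand-ins for application code that keeps references it should have dropped.

```python
import tracemalloc

_leak = []  # hypothetical leak: a module-level list that only grows


def handle_request(payload_size=10_000):
    # Simulated bug: every request's buffer is retained forever.
    _leak.append(bytearray(payload_size))


def top_growth(n_requests=200, limit=3):
    """Snapshot before and after a burst of work; return the biggest diffs."""
    tracemalloc.start()
    before = tracemalloc.take_snapshot()
    for _ in range(n_requests):
        handle_request()
    after = tracemalloc.take_snapshot()
    tracemalloc.stop()
    # Each StatisticDiff describes how one source line's allocations changed.
    return after.compare_to(before, "lineno")[:limit]
```

Printing each returned entry shows the file and line responsible along with its size and count deltas; the line that keeps growing across repeated diffs is the leak candidate.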
Tail latency
- Mean lies; report p50/p90/p99/p999.
- p999 sees rare events (GC pauses, slow DB query, network blip).
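- Coordinated omission — many benchmark tools wait for a slow response before sending the next request, so they under-sample slow periods and understate the tail under saturation.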
- Trace + log slow requests; investigate per-request via tracing.
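A small sketch of why the mean misleads: with a minority of very slow requests, the mean lands on a value that almost no request actually experienced, while the percentiles separate the typical case from the tail. The nearest-rank percentile used here is one of several common definitions, and the function names are illustrative.

```python
import math
import statistics


def percentile(sorted_values, p):
    """Nearest-rank percentile of an ascending list; p in (0, 100]."""
    rank = math.ceil(p / 100 * len(sorted_values))
    return sorted_values[max(rank, 1) - 1]


def latency_summary(latencies_ms):
    """Mean plus the percentiles usually reported for latency."""
    values = sorted(latencies_ms)
    return {
        "mean": statistics.fmean(values),
        "p50": percentile(values, 50),
        "p90": percentile(values, 90),
        "p99": percentile(values, 99),
        "p999": percentile(values, 99.9),
    }
```

For example, 980 requests at 10 ms plus 20 requests at 2000 ms give a mean near 50 ms, a p50 and p90 of 10 ms, and a p99 of 2000 ms.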
Common bottlenecks
- DB — N+1, missing index, long lock, slow query.
- External API — chained sync calls, no timeouts, retry storm.
- Serialization — JSON parse/stringify on huge payload.
- GC — large live heap → stop-the-world.
- Lock contention — single mutex around hot path.
- Sync I/O on event loop — Node, asyncio.
- Cold cache — TLB, page cache, app cache.
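The N+1 pattern from the first item is easy to reproduce with an in-memory SQLite database. This sketch counts executed statements through `sqlite3`'s trace callback; the `authors` and `books` tables and all function names are invented for the example.

```python
import sqlite3


def make_db(n_authors=5):
    """Build a tiny in-memory database with one book per author."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT)"
    )
    for i in range(n_authors):
        conn.execute("INSERT INTO authors VALUES (?, ?)", (i, f"author-{i}"))
        conn.execute(
            "INSERT INTO books (author_id, title) VALUES (?, ?)", (i, f"book-{i}")
        )
    conn.commit()
    return conn


def count_queries(conn, fn):
    """Run fn(conn) and return (result, number of SQL statements executed)."""
    statements = []
    conn.set_trace_callback(statements.append)
    try:
        result = fn(conn)
    finally:
        conn.set_trace_callback(None)
    return result, len(statements)


def titles_n_plus_one(conn):
    # One query for the authors, then one more query per author.
    out = {}
    for author_id, name in conn.execute("SELECT id, name FROM authors").fetchall():
        rows = conn.execute(
            "SELECT title FROM books WHERE author_id = ?", (author_id,)
        ).fetchall()
        out[name] = [title for (title,) in rows]
    return out


def titles_joined(conn):
    # Same result from a single JOIN.
    out = {}
    rows = conn.execute(
        "SELECT a.name, b.title FROM authors a JOIN books b ON b.author_id = a.id"
    )
    for name, title in rows:
        out.setdefault(name, []).append(title)
    return out
```

Both functions return the same mapping, but the first issues one statement per author on top of the initial query, so its cost grows with the number of rows while the joined version stays at a single round trip.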
Backpressure / saturation
When demand > capacity:
- Queues grow.
- Latency rises.
- Errors / timeouts cascade.
Solutions:
- Rate limit / load shed.
- Bulkhead resource pools.
- Async + buffer with cap.
- Auto-scale.
- Reduce per-request work.
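Two of the solutions above, a capped buffer and load shedding, can be combined in a few lines. This is a minimal sketch using the standard library `queue.Queue`; the `BoundedIntake` class is a hypothetical name, and a real service would also report the shed count as a metric and return an explicit "busy" response to the caller.

```python
import queue


class BoundedIntake:
    """Accept work into a capped buffer; reject (shed) when it is full."""

    def __init__(self, capacity):
        self._queue = queue.Queue(maxsize=capacity)
        self.shed = 0  # number of rejected items

    def submit(self, item):
        """Return True if the item was queued, False if it was shed."""
        try:
            self._queue.put_nowait(item)
            return True
        except queue.Full:
            self.shed += 1
            return False

    def drain(self, max_items):
        """Remove and return up to max_items queued items, oldest first."""
        items = []
        for _ in range(max_items):
            try:
                items.append(self._queue.get_nowait())
            except queue.Empty:
                break
        return items
```

Rejecting at the door keeps queue depth, and therefore waiting time, bounded; an uncapped buffer would instead let latency grow until timeouts cascade.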
Tools by language
| Language | CPU | Heap |
|---|---|---|
| Node | node --prof, 0x, clinic, Chrome DevTools | DevTools heap snapshot, --inspect |
| Python | cProfile, py-spy, scalene | tracemalloc, objgraph, memray |
| Go | pprof (CPU/goroutine/block profiles) | pprof (heap profile) |
| Java/Kotlin | async-profiler, JFR | Eclipse MAT, VisualVM |
| Ruby | stackprof, vernier | derailed_benchmarks |
| .NET | dotnet-trace, PerfView | dotnet-dump, dotMemory |
Continuous profiling
Always-on, low-overhead. Captures profiles even for issues that occur rarely.
Tools: Pyroscope, Parca, Datadog Continuous Profiler, Google Cloud Profiler, Polar Signals. Visualize flamegraphs over time, compare versions.
Observability ↔ profiling
- Metrics tell you “p99 spiked at 12:03”.
- Tracing tells you “in service X, span DB.query took 800ms”.
- Profiling tells you “function fooParse took 60% of CPU”.
Use them together.