Performance & Profiling — Practical

Node.js

# Built-in CPU profile
node --prof app.js
# generates isolate-*.log; process:
node --prof-process isolate-*.log > profile.txt

# clinic (event loop, GC, heap)
npm i -g clinic
clinic doctor -- node app.js          # diagnoses
clinic flame -- node app.js           # flame graph
clinic bubbleprof -- node app.js      # async ops

# 0x flame graph
npx 0x app.js

# heap snapshot via SIGUSR2
node --heapsnapshot-signal=SIGUSR2 app.js
kill -USR2 <pid>

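Inspect with Chrome DevTools: node --inspect=127.0.0.1:9229 app.js → chrome://inspect. Bind to 0.0.0.0 only inside a container or behind a firewall or SSH tunnel: anyone who can reach the inspector port can run arbitrary code in the process.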

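// event loop lag in code (histogram values are in nanoseconds)
import { monitorEventLoopDelay } from 'node:perf_hooks';
const h = monitorEventLoopDelay({ resolution: 20 }); // sample every 20 ms
h.enable();
setInterval(() => {
  console.log('p99 lag', h.percentile(99) / 1e6, 'ms');
  h.reset(); // report each 5 s window on its own instead of since startup
}, 5000);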

Python

# cProfile + snakeviz
python -m cProfile -o prof.out app.py
snakeviz prof.out
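
When the hot path is one function rather than the whole script, the same profiler can be driven from code. A minimal sketch, assuming a hypothetical slow_sum workload standing in for your own function; profile_call is an illustrative helper, not part of cProfile:

```python
import cProfile
import io
import pstats


def slow_sum(n):
    """Hypothetical workload; replace with the function you want to profile."""
    return sum(i * i for i in range(n))


def profile_call(func, *args, top=5):
    """Run func under cProfile; return (result, text report sorted by cumulative time)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        result = func(*args)
    finally:
        profiler.disable()
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats(pstats.SortKey.CUMULATIVE).print_stats(top)
    return result, buf.getvalue()


result, report = profile_call(slow_sum, 200_000)
print(report)
```

Sorting by cumulative time surfaces the callers that own the cost; switch to pstats.SortKey.TIME to rank by time spent inside each function itself.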

# py-spy (no code change, attaches)
py-spy record -o flame.svg --pid <PID> --duration 30
py-spy top --pid <PID>

# scalene (CPU + memory + GPU)
scalene app.py

# tracemalloc (memory)
import tracemalloc
tracemalloc.start()
# ...
snap = tracemalloc.take_snapshot()
for s in snap.statistics('lineno')[:10]: print(s)
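
A single snapshot shows what is allocated; to see what is growing, two snapshots can be diffed. A minimal sketch, assuming a hypothetical leaky_cache list as the leak and an illustrative top_growth helper:

```python
import tracemalloc


def top_growth(before, after, limit=5):
    """Return the allocation sites that changed most between two snapshots."""
    return after.compare_to(before, "lineno")[:limit]


tracemalloc.start()
before = tracemalloc.take_snapshot()

# Hypothetical leak: about 2 MB of buffers kept alive by a module-level list.
leaky_cache = [bytes(1024) for _ in range(2000)]

after = tracemalloc.take_snapshot()
tracemalloc.stop()

growth = top_growth(before, after)
for diff in growth:
    frame = diff.traceback[0]
    print(f"{frame.filename}:{frame.lineno}  "
          f"{diff.size_diff / 1024:+.0f} KiB  {diff.count_diff:+d} blocks")
```

In a long-running service the two snapshots would typically be taken minutes apart, with the first one taken after warm-up so that one-time caches do not dominate the diff.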

# memray
python -m memray run -o memray-app.bin app.py   # default output name is memray-app.py.<pid>.bin
python -m memray flamegraph memray-app.bin

Go

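import "net/http"
import _ "net/http/pprof" // side-effect import: registers /debug/pprof/* on the default mux
// in main(): serve the default mux; bind to localhost so profiles are not exposed publicly
go func() { _ = http.ListenAndServe("localhost:6060", nil) }()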

# CPU
go tool pprof -http=:8080 'http://localhost:6060/debug/pprof/profile?seconds=30'

# heap
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/heap

# goroutine
go tool pprof -http=:8080 http://localhost:6060/debug/pprof/goroutine

# benchmark
go test -bench=. -benchmem -cpuprofile=cpu.out -memprofile=mem.out
go tool pprof cpu.out

Java

# async-profiler (recommended)
./profiler.sh -d 30 -f flame.html <pid>   # 2.x launcher script
asprof -d 30 -f flame.html <pid>          # 3.x and later

# JFR (Java Flight Recorder)
jcmd <pid> JFR.start name=p settings=profile filename=p.jfr duration=30s

# heap dump
jcmd <pid> GC.heap_dump dump.hprof
# inspect with Eclipse MAT

Linux-level

# perf flame
sudo perf record -F 99 -p <pid> -g -- sleep 30
sudo perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg

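# on-CPU sampling (eBPF)
sudo /usr/share/bcc/tools/profile -F 99 -p <pid> 30

# off-CPU (time spent blocked; folded stacks for flamegraph.pl)
sudo /usr/share/bcc/tools/offcputime -df -p <pid> 30 > offcpu.stacks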

# block latency
sudo /usr/share/bcc/tools/biolatency 5

# tcptracer
sudo /usr/share/bcc/tools/tcptracer

Continuous profiling

Pyroscope (Node example):

import Pyroscope from '@pyroscope/nodejs';
Pyroscope.init({
  serverAddress: 'http://pyroscope:4040',
  appName: 'api',
});
Pyroscope.start();

Alternatives: Parca for eBPF-based profiling that needs no code changes (Go and other compiled languages), and Datadog Continuous Profiler as a managed service.

Load test

# k6
k6 run --vus 100 --duration 1m script.js

# wrk2 (constant-arrival, no coordinated omission)
wrk2 -t8 -c200 -R 5000 --latency -d 1m http://localhost:3000/

# autocannon
autocannon -c 100 -d 30 -p 10 http://localhost:3000/

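Prefer constant-arrival-rate (open-model) load for honest tail latency, as wrk2 -R does. Closed-model runs (a fixed number of VUs or connections, as in the k6 and autocannon commands above) send the next request only after the previous one returns, so a stalled server also slows the load generator and the queueing delay never reaches the histogram (coordinated omission). In k6, use the constant-arrival-rate executor instead of --vus.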
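
How much the arrival model changes the reported tail can be seen in a small simulation. A minimal sketch with made-up numbers (1 ms service time, one 1000 ms stall, one request every 10 ms); every function name here is illustrative, and the model is a single FIFO server, not a real load generator:

```python
import math


def percentile(values, p):
    """Nearest-rank percentile (p in 0..100) of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def service_times(n, stall_at, normal_ms=1, stall_ms=1000):
    """Per-request service time in ms; exactly one request hits a stall."""
    return [stall_ms if i == stall_at else normal_ms for i in range(n)]


def closed_loop(services):
    """One virtual user: each request is sent only after the previous one returns,
    so recorded latency equals service time and queueing is never observed."""
    return list(services)


def open_loop(services, interval_ms=10):
    """Constant arrival rate: latency is measured from the intended send time."""
    latencies = []
    server_free_at = 0
    for i, service in enumerate(services):
        intended = i * interval_ms
        start = max(intended, server_free_at)  # wait if the server is still busy
        server_free_at = start + service
        latencies.append(server_free_at - intended)
    return latencies


services = service_times(n=1000, stall_at=500)
closed_p99 = percentile(closed_loop(services), 99)
open_p99 = percentile(open_loop(services), 99)
print(f"closed-loop p99: {closed_p99} ms, open-loop p99: {open_p99} ms")
```

With these inputs the closed-loop run reports a p99 of 1 ms, because the stall touches one sample out of 1000, while the open-loop run reports 910 ms, because every request that arrived during the stall is charged its time in the queue.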

DB profiling

PostgreSQL:

EXPLAIN (ANALYZE, BUFFERS) SELECT ...;

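-- Top queries by total time (requires the pg_stat_statements extension;
-- before PostgreSQL 13 the columns are total_time and mean_time)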
SELECT calls, total_exec_time::int AS total, mean_exec_time::int AS mean,
       substr(query, 1, 80) AS q
FROM pg_stat_statements ORDER BY total_exec_time DESC LIMIT 10;

MongoDB:

db.coll.explain('executionStats').find({...});
db.setProfilingLevel(1, { slowms: 50 });
db.system.profile.find().sort({millis:-1}).limit(10);

HTTP timing

curl -w '@-' -o /dev/null -s https://api/x <<'EOF'
namelookup:  %{time_namelookup}s
connect:     %{time_connect}s
appconnect:  %{time_appconnect}s
pretransfer: %{time_pretransfer}s
starttransfer: %{time_starttransfer}s
total:       %{time_total}s
EOF
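
curl's timers are cumulative, each measured from the start of the request, so the cost of a single phase is the difference between neighbouring values. A minimal sketch of that arithmetic; the sample numbers are hypothetical, not measurements, and phase_durations is an illustrative helper:

```python
import math


def phase_durations(t):
    """Convert curl's cumulative timers (seconds) into per-phase durations."""
    # appconnect is 0 for plain HTTP, so fall back to connect as the boundary.
    tls_done = t["appconnect"] or t["connect"]
    return {
        "dns": t["namelookup"],
        "tcp_connect": t["connect"] - t["namelookup"],
        "tls_handshake": tls_done - t["connect"],
        "request_setup": t["pretransfer"] - tls_done,
        "server_wait": t["starttransfer"] - t["pretransfer"],
        "download": t["total"] - t["starttransfer"],
    }


# Hypothetical values in the shape curl prints them.
sample = {
    "namelookup": 0.012, "connect": 0.030, "appconnect": 0.095,
    "pretransfer": 0.096, "starttransfer": 0.240, "total": 0.260,
}
phases = phase_durations(sample)
for name, seconds in phases.items():
    print(f"{name:14s} {seconds * 1000:6.1f} ms")
```

A large server_wait points at the backend, while large dns, tcp_connect, or tls_handshake values point at the network path or at missing connection reuse.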

Optimization checklist

From metric to profile: which tool to reach for

Metric	Likely tool
p99 latency up	tracing, then CPU profile of hot service
memory growth	heap snapshot, allocation profile
CPU saturation	sampling CPU profile
Event loop lag	perf_hooks (monitorEventLoopDelay), clinic bubbleprof
GC time high	runtime stats, heap profile
DB slow	DB explain + indexes
Cold-start latency	per-step timing, lazy-load profile
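
Per-step timing for the cold-start row needs no profiler. A minimal sketch, assuming hypothetical start-up steps load_config and connect_db that are simulated here with sleeps; step and timings are illustrative names:

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def step(name):
    """Record the wall-clock duration of the enclosed block under `name`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[name] = time.perf_counter() - start


# Hypothetical start-up sequence; replace the sleeps with real initialisation.
with step("load_config"):
    time.sleep(0.01)
with step("connect_db"):
    time.sleep(0.02)

# Slowest step first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:12s} {seconds * 1000:7.1f} ms")
```

The slowest step is the first candidate for lazy loading or for moving out of the start-up path.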