Observability — Practical patterns
OpenTelemetry SDK (Node)

```typescript
// Preloaded before the application so auto-instrumentation can patch modules as they load.
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes as A } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';

// Fall back to the default local OTLP/HTTP port so an unset variable
// does not produce the URL "undefined/v1/traces".
const endpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318';

new NodeSDK({
  resource: new Resource({
    [A.SERVICE_NAME]: 'api',
    [A.SERVICE_VERSION]: process.env.GIT_SHA ?? 'dev',
    [A.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({ url: endpoint + '/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: endpoint + '/v1/metrics' }),
    exportIntervalMillis: 60_000, // export metrics once per minute
  }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();
```

Run with `node -r ./otel.js dist/server.js`; `-r` preloads the compiled file before the server starts.
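Joining a base endpoint and a signal path by hand is easy to get subtly wrong when the variable is unset or ends with a slash. A minimal sketch of a helper that handles both cases follows. `otlpSignalUrl` is a hypothetical name, not part of the OpenTelemetry SDK, and the `http://localhost:4318` fallback assumes a collector listening on the default OTLP/HTTP port.

```typescript
type Signal = 'traces' | 'metrics' | 'logs';

// Hypothetical helper (not an OpenTelemetry API): builds the per-signal OTLP/HTTP URL.
function otlpSignalUrl(base: string | undefined, signal: Signal): string {
  // Assumed fallback: a local collector on the default OTLP/HTTP port.
  const root = base && base.trim() !== '' ? base.trim() : 'http://localhost:4318';
  // Drop trailing slashes so the result never contains "//v1".
  return `${root.replace(/\/+$/, '')}/v1/${signal}`;
}

console.log(otlpSignalUrl(undefined, 'traces'));                 // http://localhost:4318/v1/traces
console.log(otlpSignalUrl('http://collector:4318/', 'metrics')); // http://collector:4318/v1/metrics
```

It could be passed to an exporter as `new OTLPTraceExporter({ url: otlpSignalUrl(process.env.OTEL_EXPORTER_OTLP_ENDPOINT, 'traces') })`.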
Custom span around critical code

```typescript
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api');

await tracer.startActiveSpan('process_payment', async (span) => {
  try {
    span.setAttribute('payment.amount', amount);
    span.setAttribute('user.id', userId);
    const r = await processor(amount);
    span.setStatus({ code: SpanStatusCode.OK });
    return r;
  } catch (e: any) {
    // Attach the exception as a span event and mark the span as failed.
    span.recordException(e);
    span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
    throw e;
  } finally {
    // Always end the span; a span that is never ended is never exported.
    span.end();
  }
});
```

Custom metrics

```typescript
import { metrics } from '@opentelemetry/api';

const m = metrics.getMeter('api');
const requestCounter = m.createCounter('http_requests_total');
const latencyHist = m.createHistogram('http_request_duration_seconds', {
  advice: { explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5] },
});

// Express-style middleware: record one count and one latency sample per finished response.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const dur = Number(process.hrtime.bigint() - start) / 1e9; // nanoseconds to seconds
    // Label by route template rather than raw URL to keep cardinality bounded.
    const labels = { method: req.method, route: req.route?.path ?? '?', status: String(res.statusCode) };
    requestCounter.add(1, labels);
    latencyHist.record(dur, labels);
  });
  next();
});
```

Structured logging with trace correlation (pino)

```typescript
import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: 'info',
  // Emit the level as its label ("info") instead of pino's numeric value.
  formatters: { level: (l) => ({ level: l }) },
  // mixin runs on every log call and merges the active span's IDs into the record.
  mixin: () => {
    const span = trace.getActiveSpan();
    const ctx = span?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});

logger.info({ user_id: u.id }, 'user_login');
```

Prometheus metrics endpoint (alternative to OTLP)

```typescript
import { Registry, Counter, collectDefaultMetrics } from 'prom-client';

const reg = new Registry();
collectDefaultMetrics({ register: reg }); // process and Node.js runtime metrics
const reqs = new Counter({
  name: 'http_requests_total',
  help: '...',
  labelNames: ['method', 'route', 'status'],
  registers: [reg], // register on `reg` only, not on the global default registry
});

// Scrape endpoint in the Prometheus text exposition format.
app.get('/metrics', async (_, res) => {
  res.set('Content-Type', reg.contentType);
  res.end(await reg.metrics());
});
```

Prometheus alert rule

```yaml
groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) > 0.02
        for: 10m
        labels: { severity: page, team: platform }
        annotations:
          summary: "API error rate > 2% for 10m"
          runbook: https://wiki/runbooks/api-errors

      - alert: SLOErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels: { severity: page }
        annotations: { summary: "burning 2% of budget per hour" }
```

A burn rate of 14.4 against a 99.9% SLO consumes 2% of a 30-day error budget per hour. Multi-window burn-rate alerting (Google SRE Workbook) additionally requires a short window, such as 5m, to exceed the same threshold; this reduces flapping while still paging fast on real burns.
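
The `14.4` factor in the second rule comes from burn-rate arithmetic: a burn rate of 1 spends the error budget exactly over the SLO period. A minimal sketch of that arithmetic and of a two-window paging condition follows, assuming a 30-day SLO period. The function names are illustrative and not part of any library.

```typescript
const HOURS_PER_30_DAYS = 30 * 24; // assumed SLO period: 720 hours

// Error-ratio threshold that corresponds to a given burn rate.
function errorRatioThreshold(slo: number, burnRate: number): number {
  return (1 - slo) * burnRate;
}

// Fraction of the period's error budget consumed by burning at `burnRate` for `windowHours`.
function budgetConsumed(burnRate: number, windowHours: number, periodHours: number = HOURS_PER_30_DAYS): number {
  return (burnRate * windowHours) / periodHours;
}

// Multi-window condition: page only if both the long and the short window exceed the threshold.
function shouldPage(longWindowRatio: number, shortWindowRatio: number, threshold: number): boolean {
  return longWindowRatio > threshold && shortWindowRatio > threshold;
}

const threshold = errorRatioThreshold(0.999, 14.4);
console.log(threshold.toFixed(4));               // 0.0144
console.log(budgetConsumed(14.4, 1).toFixed(2)); // 0.02
console.log(shouldPage(0.03, 0.001, threshold)); // false: the short window shows the burn has stopped
console.log(shouldPage(0.03, 0.05, threshold));  // true: both windows are above the threshold
```

In PromQL the same condition would typically be written as an `and` between the 1h expression and an equivalent expression over the short window.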
Grafana dashboard — golden signals (PromQL)

```promql
# rate
sum by (route) (rate(http_requests_total[5m]))

# errors
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (route) (rate(http_requests_total[5m]))

# duration p99
histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))

# saturation: queue depth
max by (queue) (rabbitmq_queue_messages_ready)
```

OTel Collector pipeline (collector.yaml)

```yaml
receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  batch: { timeout: 5s, send_batch_size: 1024 }
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
  tail_sampling:
    decision_wait: 10s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 500 } }
      - { name: probabilistic, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
exporters:
  otlphttp/tempo: { endpoint: http://tempo:4318 }
  prometheusremotewrite: { endpoint: http://mimir:9009/api/v1/push }
  loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [memory_limiter, tail_sampling, batch], exporters: [otlphttp/tempo] }
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
    logs: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [loki] }
```

Continuous profiling (Pyroscope)

```typescript
import Pyroscope from '@pyroscope/nodejs';

Pyroscope.init({
  serverAddress: 'http://pyroscope:4040',
  appName: 'api',
});
Pyroscope.start();
```

On-call runbook template (per alert)

```markdown
# HighErrorRate

## Signal
http error rate > 2% for 10m on api service

## Likely causes
- Recent deploy bug
- Downstream DB/Redis outage
- Surge in traffic

## Investigation
1. Check Grafana "API errors by route" panel.
2. Filter logs: service="api" AND level="ERROR" last 30m.
3. Look at deploys near alert time.
4. Check DB CPU + connection pool.

## Mitigations
- Roll back: `kubectl rollout undo deploy/api`
- Scale up if saturation: `kubectl scale deploy/api --replicas=20`
- Disable feature flag: ...

## Escalation
Page on-call DBA if DB issue.
```

Tools cheat-sheet
- OTel SDK + Collector — instrument + route.
- Prometheus + Alertmanager + Grafana — open source classic.
- Loki + Promtail / Vector — logs.
- Tempo / Jaeger — traces.
- Pyroscope / Parca — profiles.
- Honeycomb / Datadog / New Relic — managed.
- Grafana k6 — load test → metrics.
- stern / kubectl-tail — multi-pod logs.