
Observability — Practical

otel.ts
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-http';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { Resource } from '@opentelemetry/resources';
import { SemanticResourceAttributes as A } from '@opentelemetry/semantic-conventions';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
// Fall back to the conventional local OTLP/HTTP endpoint (an assumed default) so the
// exporter URLs never become "undefined/v1/..." when the variable is unset.
const otlpEndpoint = process.env.OTEL_EXPORTER_OTLP_ENDPOINT ?? 'http://localhost:4318';

new NodeSDK({
  resource: new Resource({
    [A.SERVICE_NAME]: 'api',
    [A.SERVICE_VERSION]: process.env.GIT_SHA ?? 'dev',
    [A.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV,
  }),
  traceExporter: new OTLPTraceExporter({ url: otlpEndpoint + '/v1/traces' }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: otlpEndpoint + '/v1/metrics' }),
    exportIntervalMillis: 60_000, // push metrics once per minute
  }),
  instrumentations: [getNodeAutoInstrumentations()],
}).start();

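Run with node -r ./otel.js dist/server.js so the SDK is loaded before any application module; auto-instrumentation patches libraries as they are required, so it has to start first.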
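Building the exporter URLs by string concatenation is fragile when the endpoint variable is unset or ends with a slash. The sketch below shows one way to guard against both. `otlpUrl` and `DEFAULT_OTLP_ENDPOINT` are hypothetical names that are not part of the OpenTelemetry SDK, and the local default address is an assumption.

```typescript
// Hypothetical helper, not part of the OpenTelemetry SDK.
type OtlpSignal = 'traces' | 'metrics' | 'logs';

// Assumed default: the conventional local OTLP/HTTP collector address.
const DEFAULT_OTLP_ENDPOINT = 'http://localhost:4318';

function otlpUrl(base: string | undefined, signal: OtlpSignal): string {
  // Treat an unset or blank value as "not configured".
  const root = base && base.trim() !== '' ? base.trim() : DEFAULT_OTLP_ENDPOINT;
  // Strip trailing slashes so the result never contains "//v1/...".
  return `${root.replace(/\/+$/, '')}/v1/${signal}`;
}
```

In otel.ts this could be called as otlpUrl(process.env.OTEL_EXPORTER_OTLP_ENDPOINT, 'traces').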

Custom spans

import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api');

// amount, userId and processor come from the surrounding request handler.
await tracer.startActiveSpan('process_payment', async (span) => {
  try {
    span.setAttribute('payment.amount', amount);
    span.setAttribute('user.id', userId);
    const r = await processor(amount);
    span.setStatus({ code: SpanStatusCode.OK });
    return r;
  } catch (e: any) {
    span.recordException(e);
    span.setStatus({ code: SpanStatusCode.ERROR, message: e.message });
    throw e;
  } finally {
    span.end(); // always end the span, on success and on failure
  }
});

Custom metrics

import { metrics } from '@opentelemetry/api';

const m = metrics.getMeter('api');
const requestCounter = m.createCounter('http_requests_total');
const latencyHist = m.createHistogram('http_request_duration_seconds', {
  advice: { explicitBucketBoundaries: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5] },
});

// Express middleware: count every request and record its latency in seconds.
app.use((req, res, next) => {
  const start = process.hrtime.bigint();
  res.on('finish', () => {
    const dur = Number(process.hrtime.bigint() - start) / 1e9; // ns -> s
    // Label with the matched route template, not the raw URL, to bound cardinality.
    const labels = { method: req.method, route: req.route?.path ?? '?', status: String(res.statusCode) };
    requestCounter.add(1, labels);
    latencyHist.record(dur, labels);
  });
  next();
});
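To make the effect of explicitBucketBoundaries concrete, here is a small sketch of how recorded durations fall into buckets. It assumes upper-inclusive bounds, as in the OpenTelemetry explicit-bucket histogram model. `bucketCounts` is a hypothetical illustration, not an SDK function.

```typescript
// Boundaries copied from the histogram definition above (seconds).
const boundaries = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5];

// Returns one count per bucket: one for each boundary plus a final overflow bucket.
function bucketCounts(bounds: number[], observations: number[]): number[] {
  const counts = new Array<number>(bounds.length + 1).fill(0);
  for (const value of observations) {
    // First bucket whose upper bound is >= value; -1 means the value overflows.
    const i = bounds.findIndex((upper) => value <= upper);
    counts[i === -1 ? bounds.length : i] += 1;
  }
  return counts;
}

// Five sample request durations in seconds (made-up values).
const sampleCounts = bucketCounts(boundaries, [0.003, 0.02, 0.02, 0.3, 7]);
```

A request slower than the last boundary (5 s here) lands in the overflow bucket, which is why the top boundary should sit above the latency you still care to distinguish.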

Structured logging with trace correlation (pino)

import pino from 'pino';
import { trace } from '@opentelemetry/api';

const logger = pino({
  level: 'info',
  // Emit the level as a label ("info") instead of pino's numeric value.
  formatters: { level: (l) => ({ level: l }) },
  // Called on every log call: attach the active span's IDs when there is one.
  mixin: () => {
    const span = trace.getActiveSpan();
    const ctx = span?.spanContext();
    return ctx ? { trace_id: ctx.traceId, span_id: ctx.spanId } : {};
  },
});

logger.info({ user_id: u.id }, 'user_login');
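The correlation logic in the mixin can be exercised without pino or the SDK. The sketch below uses a stand-in `SpanContextLike` type and hypothetical helpers `traceFields` and `logLine` to show the JSON a log line might carry. Skipping the all-zero trace ID (the invalid trace ID in OpenTelemetry) is an extra guard the snippet above does not have.

```typescript
// Minimal stand-in for the span context fields the mixin reads.
interface SpanContextLike {
  traceId: string;
  spanId: string;
}

// OpenTelemetry treats a trace ID of 32 zeros as invalid.
const INVALID_TRACE_ID = '0'.repeat(32);

// Returns the correlation fields, or nothing when there is no usable span.
function traceFields(ctx: SpanContextLike | undefined): Record<string, string> {
  if (!ctx || ctx.traceId === INVALID_TRACE_ID) return {};
  return { trace_id: ctx.traceId, span_id: ctx.spanId };
}

// Builds a structured log line roughly the way a JSON logger would.
function logLine(msg: string, fields: Record<string, unknown>, ctx?: SpanContextLike): string {
  return JSON.stringify({ level: 'info', ...fields, ...traceFields(ctx), msg });
}
```

With trace_id present on every line, a log backend query can pivot from a log entry to the matching trace and back.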

Prometheus metrics endpoint (alternative to OTLP)

import { Registry, Counter, collectDefaultMetrics } from 'prom-client';

const reg = new Registry();
collectDefaultMetrics({ register: reg });

// Register on the custom registry only; without `registers`, the counter is
// also added to prom-client's global default registry.
const reqs = new Counter({
  name: 'http_requests_total',
  help: '...',
  labelNames: ['method', 'route', 'status'],
  registers: [reg],
});

app.get('/metrics', async (_, res) => {
  res.set('Content-Type', reg.contentType);
  res.end(await reg.metrics());
});
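For orientation, this is roughly what a scrape of /metrics returns for the counter. The sketch renders the Prometheus text exposition format by hand. `renderCounter` is a hypothetical helper (prom-client does this for you), and the help text and sample values are made up.

```typescript
type Labels = Record<string, string>;

interface Sample {
  labels: Labels;
  value: number;
}

// Escape backslash, double quote and newline, as the text format requires for label values.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, '\\\\').replace(/"/g, '\\"').replace(/\n/g, '\\n');
}

// Renders HELP and TYPE lines followed by one line per label combination.
function renderCounter(name: string, help: string, samples: Sample[]): string {
  const lines = [`# HELP ${name} ${help}`, `# TYPE ${name} counter`];
  for (const s of samples) {
    const labelText = Object.keys(s.labels)
      .map((k) => `${k}="${escapeLabelValue(s.labels[k])}"`)
      .join(',');
    lines.push(labelText ? `${name}{${labelText}} ${s.value}` : `${name} ${s.value}`);
  }
  return lines.join('\n') + '\n';
}
```

Each distinct label combination becomes its own time series, which is why the route label should be a route template rather than a raw URL.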

Prometheus alert rules

groups:
  - name: api
    rules:
      - alert: HighErrorRate
        expr: |
          (sum(rate(http_requests_total{status=~"5.."}[5m]))
            / sum(rate(http_requests_total[5m]))) > 0.02
        for: 10m
        labels: { severity: page, team: platform }
        annotations:
          summary: "API error rate > 2% for 10m"
          runbook: https://wiki/runbooks/api-errors
      - alert: SLOErrorBudgetBurn
        expr: |
          (
            sum(rate(http_requests_total{status=~"5.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          ) > (1 - 0.999) * 14.4
        for: 5m
        labels: { severity: page }
        annotations: { summary: "burning 2% of budget per hour" }

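The SLOErrorBudgetBurn rule checks a single 1h window at a 14.4x burn rate, which corresponds to 2% of a 30-day error budget per hour. The multi-window burn-rate approach from the Google SRE Workbook additionally requires a short window (for example 5m) to exceed the same threshold; this reduces flapping while still paging fast on real burns.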
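The numbers in the burn-rate rule can be checked with a little arithmetic. The sketch below assumes a 30-day SLO period and a 99.9% target, and the function names are hypothetical. It shows where the "2% per hour" figure comes from and how a two-window check would gate a page.

```typescript
// Fraction of the error budget consumed when burning at `burnRate` for `windowHours`.
function budgetConsumed(burnRate: number, windowHours: number, sloPeriodHours: number): number {
  return (burnRate * windowHours) / sloPeriodHours;
}

// Error-ratio threshold that corresponds to a burn rate for a given SLO target.
function burnThreshold(sloTarget: number, burnRate: number): number {
  return (1 - sloTarget) * burnRate;
}

// Multi-window check: page only when both the long and the short window exceed the threshold.
function shouldPage(longRatio: number, shortRatio: number, sloTarget: number, burnRate: number): boolean {
  const threshold = burnThreshold(sloTarget, burnRate);
  return longRatio > threshold && shortRatio > threshold;
}

// 14.4x for 1h out of 30 days (720h) is about 0.02, i.e. 2% of the budget.
const consumedPerHour = budgetConsumed(14.4, 1, 30 * 24);
```

The short window is what lets the alert clear quickly once the error rate recovers, instead of staying red until the 1h average drops.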

Grafana dashboard — golden signals (PromQL)

# rate
sum by (route) (rate(http_requests_total[5m]))
# errors
sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (route) (rate(http_requests_total[5m]))
# duration p99
histogram_quantile(0.99, sum by (le, route) (rate(http_request_duration_seconds_bucket[5m])))
# saturation: queue depth
max by (queue) (rabbitmq_queue_messages_ready)
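The p99 query relies on histogram_quantile, which estimates a quantile by linear interpolation inside the bucket that contains the target rank. The following is a simplified sketch of that idea, not Prometheus's actual implementation: it ignores native histograms and several edge cases, and `histogramQuantile` and `Bucket` are hypothetical names.

```typescript
// One cumulative bucket: `count` observations were <= `le` (Infinity for the +Inf bucket).
interface Bucket {
  le: number;
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1) return NaN; // simplification: Prometheus returns +/-Inf here
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const last = sorted.length - 1;
  if (sorted.length < 2 || sorted[last].le !== Infinity) return NaN;
  const total = sorted[last].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: the best estimate is the highest finite bound.
  if (i === last) return sorted[last - 1].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const below = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - below;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly within the bucket.
  return lower + (sorted[i].le - lower) * ((rank - below) / inBucket);
}
```

Because of the interpolation, the estimate can only be as precise as the bucket boundaries around the quantile, so boundaries should be dense near the latency targets you alert on.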

OpenTelemetry Collector config

receivers:
  otlp:
    protocols: { grpc: {}, http: {} }
processors:
  batch: { timeout: 5s, send_batch_size: 1024 }
  memory_limiter: { check_interval: 1s, limit_mib: 512 }
  tail_sampling:
    decision_wait: 10s
    policies:
      - { name: errors, type: status_code, status_code: { status_codes: [ERROR] } }
      - { name: slow, type: latency, latency: { threshold_ms: 500 } }
      - { name: probabilistic, type: probabilistic, probabilistic: { sampling_percentage: 5 } }
exporters:
  otlphttp/tempo: { endpoint: http://tempo:4318 }
  prometheusremotewrite: { endpoint: http://mimir:9009/api/v1/push }
  loki: { endpoint: http://loki:3100/loki/api/v1/push }
service:
  pipelines:
    traces: { receivers: [otlp], processors: [memory_limiter, tail_sampling, batch], exporters: [otlphttp/tempo] }
    metrics: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [prometheusremotewrite] }
    logs: { receivers: [otlp], processors: [memory_limiter, batch], exporters: [loki] }
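The three tail_sampling policies combine as a logical OR: a trace is kept if any policy selects it. The sketch below models that decision under stated assumptions. `keepTrace` and `TraceSummary` are hypothetical, the random draw stands in for the collector's probabilistic policy (which, as far as I know, decides from the trace ID rather than a random number), and treating the latency threshold as inclusive is an assumption.

```typescript
// What the sampler knows about a trace once decision_wait has elapsed.
interface TraceSummary {
  hasError: boolean; // at least one span ended with status ERROR
  durationMs: number; // end-to-end duration of the trace
}

// `rand` is injectable so the decision can be tested deterministically.
function keepTrace(t: TraceSummary, rand: () => number = Math.random): boolean {
  if (t.hasError) return true; // "errors" policy
  if (t.durationMs >= 500) return true; // "slow" policy, threshold_ms: 500
  return rand() * 100 < 5; // "probabilistic" policy, sampling_percentage: 5
}
```

The effect is that every failed or slow trace survives, while healthy fast traffic is thinned to a small baseline.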

Continuous profiling (Pyroscope)

import Pyroscope from '@pyroscope/nodejs';

Pyroscope.init({
  serverAddress: 'http://pyroscope:4040',
  appName: 'api',
});
Pyroscope.start();

Runbook template

# HighErrorRate
## Signal
http error rate > 2% for 10m on api service
## Likely causes
- Recent deploy bug
- Downstream DB/Redis outage
- Surge in traffic
## Investigation
1. Check Grafana "API errors by route" panel.
2. Filter logs: service="api" AND level="ERROR" last 30m.
3. Look at deploys near alert time.
4. Check DB CPU + connection pool.
## Mitigations
- Roll back: `kubectl rollout undo deploy/api`
- Scale up if saturation: `kubectl scale deploy/api --replicas=20`
- Disable feature flag: ...
## Escalation
Page on-call DBA if DB issue.

Tools

  • OTel SDK + Collector — instrument + route.
  • Prometheus + Alertmanager + Grafana — open source classic.
  • Loki + Promtail / Vector — logs.
  • Tempo / Jaeger — traces.
  • Pyroscope / Parca — profiles.
  • Honeycomb / Datadog / NewRelic — managed.
  • Grafana k6 — load test → metrics.
  • stern / kubectl-tail — multi-pod logs.