Observability — Basics

  • Logs — structured records of discrete events.
  • Metrics — numeric time series (counter, gauge, histogram).
  • Traces — distributed request flow across services.

Modern view: also profiles (continuous CPU/memory) + events (deploys, alerts) tied together.

OpenTelemetry (OTel): the CNCF standard for telemetry. One SDK, one wire format (OTLP), many backends.

  • APIs — instrumentation surface for app code.
  • SDK — implementations (Node, Python, Go, Java, .NET, Ruby, Rust).
  • Auto-instrumentation — wraps common libs (HTTP, DB, queues) without code changes.
  • Collector — receive → process → export. Buffer, batch, route, sample.
  • Wire format: OTLP over gRPC or HTTP.
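
The Collector's receive → process → export flow can be illustrated with a toy batching stage. This is a minimal sketch in plain Python, not the Collector's real implementation or configuration; `BatchProcessor`, `max_batch`, and the `span_id` field are hypothetical names chosen for the example.

```python
from typing import Callable, List


class BatchProcessor:
    """Buffers received items and hands them to an exporter in batches."""

    def __init__(self, export: Callable[[List[dict]], None], max_batch: int = 3):
        self._export = export
        self._max_batch = max_batch
        self._buffer: List[dict] = []

    def receive(self, item: dict) -> None:
        # Receive: accept one telemetry item and buffer it
        self._buffer.append(item)
        if len(self._buffer) >= self._max_batch:
            self.flush()

    def flush(self) -> None:
        # Export: send whatever is buffered, then start a fresh buffer
        if self._buffer:
            self._export(self._buffer)
            self._buffer = []


exported: List[List[dict]] = []  # stands in for a real backend
pipeline = BatchProcessor(export=exported.append, max_batch=3)
for i in range(7):
    pipeline.receive({"span_id": i})
pipeline.flush()  # drain the remainder, as a shutdown hook would
```

Seven items with a batch size of three leave the exporter with two full batches and one partial batch. A real pipeline would also flush on a timer so that a quiet service does not hold data indefinitely.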

OpenTelemetry replaced the fragmented OpenCensus and OpenTracing projects, which merged to form it.

  • Counter — monotonically increasing (requests, errors).
  • Gauge — current value (memory, queue depth, active connections).
  • Histogram — bucketed observations (latency).
  • Summary — quantiles computed client-side (rare).
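
To make the difference between the types concrete, here is a minimal sketch of a counter, a gauge, and a histogram in plain Python. It is not a real client library; the class names and the bucket bounds are assumptions for illustration. The histogram reports cumulative counts, matching the `le` ("less than or equal") bucket convention.

```python
import bisect
from typing import List, Sequence


class Counter:
    """Only ever goes up; resets happen only on process restart."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters cannot decrease")
        self.value += amount


class Gauge:
    """Holds the current value; can go up or down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the last slot is the +Inf bucket."""

    def __init__(self, bounds: Sequence[float]) -> None:
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bound >= value, i.e. the smallest
        # "le" bucket that contains the observation
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.sum += value

    def cumulative(self) -> List[int]:
        total, out = 0, []
        for count in self.counts:
            total += count
            out.append(total)
        return out


requests = Counter()
requests.inc()
requests.inc()

queue_depth = Gauge()
queue_depth.set(7)
queue_depth.set(4)

latency = Histogram([0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    latency.observe(seconds)
```

Here `latency.cumulative()` gives one count per bucket for `le` = 0.1, 0.5, 1.0, and +Inf. Each bucket includes everything in the smaller buckets, which is what lets a backend estimate quantiles server-side.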

Pull model: Prometheus scrapes /metrics endpoint. Each metric: name + labels (key=value). Cardinality matters — high-cardinality labels (user id, request id) blow up time series count.

http_requests_total{method="GET", route="/users/:id", status="200"} 1234
http_request_duration_seconds_bucket{route="/users/:id", le="0.5"} 1100
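
The cardinality warning is easiest to see as arithmetic: in the worst case, one metric produces a time series for every combination of label values. The sketch below estimates that upper bound; `series_count` and the label values are hypothetical, chosen only for illustration.

```python
from math import prod
from typing import Dict, Sequence


def series_count(label_values: Dict[str, Sequence[str]]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(len(set(values)) for values in label_values.values())


bounded = {
    "method": ["GET", "POST"],
    "route": ["/users/:id", "/orders/:id"],
    "status": ["200", "404", "500"],
}
low = series_count(bounded)  # 2 * 2 * 3 combinations

# Hypothetical mistake: a user_id label with 10,000 distinct values
with_user_id = dict(bounded, user_id=[str(i) for i in range(10_000)])
high = series_count(with_user_id)
```

Adding a single unbounded label multiplies the series count by the number of distinct values. This is why route templates such as `/users/:id` are used as label values rather than raw paths.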

Structured (JSON) > plaintext. Fields: level, time, service, msg, trace_id, request_id, app-specific.

Levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.

Patterns:

  • One log line per logical event.
  • Include correlation IDs.
  • Don’t log secrets / PII.
  • Sample DEBUG/INFO in prod if volume hurts.
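
A minimal sketch of the structured-logging pattern, using only Python's standard `logging` module. The `JsonFormatter` class, the `checkout` service name, and the ID values are assumptions for illustration. The handler writes to an in-memory stream here so the output can be inspected; a real service would typically log to stdout.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Renders each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "level": record.levelname,
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service,
            "msg": record.getMessage(),
            # Correlation IDs arrive via the `extra` argument, if provided
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(entry)


stream = io.StringIO()  # in-memory sink for the example
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="checkout"))

logger = logging.getLogger("checkout")
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # avoid duplicate output through the root logger

logger.info(
    "order placed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-123"},
)
logger.debug("cart contents loaded")  # below INFO, so it is dropped

lines = [json.loads(raw) for raw in stream.getvalue().splitlines()]
```

One logical event produces one line, and the correlation IDs travel as fields rather than being embedded in the message text, so a log backend can filter on them directly.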

A trace = tree of spans. Each span = one operation (HTTP call, DB query, function).

Propagation via headers (W3C Trace Context):

  • traceparent: 00-<trace_id>-<span_id>-<flags>
  • tracestate: vendor data
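
The `traceparent` layout above can be parsed with a short function. This sketch handles version `00` only, where the trace ID is 32 lowercase hex characters, the span ID is 16, and the flags field is 2. The names `TraceParent`, `parse_traceparent`, and `child_traceparent` are hypothetical, not part of any library.

```python
import re
import secrets
from typing import NamedTuple, Optional

_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    flags: str

    @property
    def sampled(self) -> bool:
        # The lowest bit of the flags byte is the "sampled" flag
        return bool(int(self.flags, 16) & 0x01)


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Returns the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    parsed = TraceParent(*match.groups())
    if parsed.version != "00":
        return None  # this sketch does not handle other versions
    # All-zero trace or span IDs are invalid
    if set(parsed.trace_id) == {"0"} or set(parsed.span_id) == {"0"}:
        return None
    return parsed


def child_traceparent(parent: TraceParent) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID."""
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{parent.flags}"


incoming = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
outgoing = child_traceparent(incoming)
```

Every service along the request path keeps the trace ID and replaces the span ID with its own, which is what lets a backend reassemble the spans into one tree.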

Spans carry: name, kind (server/client/internal/producer/consumer), times, attrs, events, status.

Sampling:

  • Head-based (decide at root): cheap, but may miss errors and slow requests because the outcome is not yet known.
  • Tail-based (decide after seeing whole trace): keeps the interesting traces (slow, errored). Needs Collector support.
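
The two sampling strategies can be contrasted in a few lines. This is a simplified sketch, not any SDK's or Collector's actual sampler: `head_sample` derives a deterministic decision from the trace ID alone, while `tail_sample` inspects a fully buffered trace. The `Span` shape and the 500 ms threshold are assumptions.

```python
from typing import List, TypedDict


class Span(TypedDict):
    trace_id: str
    duration_ms: float
    error: bool


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the root from the trace ID alone, so every service agrees."""
    # Treat the low 64 bits of the hex trace ID as a number in [0, 2**64)
    return int(trace_id[-16:], 16) < ratio * 2**64


def tail_sample(spans: List[Span], slow_ms: float = 500.0) -> bool:
    """Decide after the whole trace is buffered: keep errored or slow traces."""
    return any(s["error"] or s["duration_ms"] >= slow_ms for s in spans)


fast_ok: List[Span] = [
    {"trace_id": "a" * 32, "duration_ms": 20.0, "error": False},
    {"trace_id": "a" * 32, "duration_ms": 35.0, "error": False},
]
with_error: List[Span] = [
    {"trace_id": "b" * 32, "duration_ms": 20.0, "error": False},
    {"trace_id": "b" * 32, "duration_ms": 15.0, "error": True},
]

keep_low_id = head_sample("0" * 32, ratio=0.1)
keep_high_id = head_sample("f" * 32, ratio=0.1)
keep_fast = tail_sample(fast_ok)
keep_error = tail_sample(with_error)
```

The head decision costs one comparison and needs no buffering, but it cannot know that a trace will fail. The tail decision sees the failure, at the price of holding every span until the trace is complete.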

Methods for choosing what to measure:

  • RED — Rate, Errors, Duration. For request-driven services.
  • USE — Utilization, Saturation, Errors. For resources.
  • Four golden signals (Google SRE) — Latency, Traffic, Errors, Saturation.
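
As a worked example of RED, the sketch below derives rate, error ratio, and a p95 duration from a window of request records. In practice these numbers usually come from counters and histograms queried in the metrics backend; the `Request` type, the `red` function, and the choice of treating 5xx statuses as errors are assumptions for illustration.

```python
import math
from typing import Dict, List, NamedTuple


class Request(NamedTuple):
    duration_s: float
    status: int


def red(requests: List[Request], window_s: float) -> Dict[str, float]:
    """Rate, error ratio, and p95 duration for one time window."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    durations = sorted(r.duration_s for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)
    # Nearest-rank percentile: the smallest observation with at least
    # 95% of all observations at or below it
    rank = max(1, math.ceil(0.95 * len(durations)))
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_s": durations[rank - 1],
    }


window = [Request(0.1, 200)] * 8 + [Request(0.3, 200), Request(2.0, 500)]
signals = red(window, window_s=10.0)
```

With ten requests in a ten-second window, one of them a slow 500, the error ratio is 10% and the p95 lands on the slow request. An average latency would hide that request almost entirely.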

Type        Examples
Metrics     Prometheus, VictoriaMetrics, Thanos, Mimir, Datadog, CloudWatch
Logs        Loki, Elasticsearch, OpenSearch, Splunk, Datadog Logs
Traces      Jaeger, Tempo, Honeycomb, Datadog APM, AWS X-Ray
All-in-one  Grafana Cloud, Datadog, New Relic, Honeycomb, Dynatrace

  • Dashboards: Grafana is the most common choice. Build per-service dashboards with golden signals + business metrics.
  • Alerts: route to Alertmanager / PagerDuty / OpsGenie. Page on user-impacting symptoms (error rate up, latency up), not just resource usage.

The point: jump from a metric anomaly → relevant logs → relevant traces. Make it possible by including trace ID in logs and exemplars in metrics.
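
The jump from metric to logs only works if both sides share an ID. The sketch below shows the idea with in-memory data: an exemplar attached to a latency observation carries a trace ID, and that ID selects the matching log lines. The record shapes and the `logs_for_trace` helper are hypothetical; real backends expose this through their own query languages.

```python
from typing import Dict, List

# Hypothetical exemplar: one slow observation, tagged with the trace that produced it
exemplar = {
    "metric": "http_request_duration_seconds",
    "value": 2.4,
    "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
}

# Hypothetical structured log lines from the same service
logs: List[Dict[str, str]] = [
    {"level": "INFO", "msg": "order placed", "trace_id": "0af7651916cd43dd8448eb211c80319c"},
    {"level": "WARN", "msg": "payment retry", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
    {"level": "ERROR", "msg": "payment timeout", "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736"},
]


def logs_for_trace(entries: List[Dict[str, str]], trace_id: str) -> List[Dict[str, str]]:
    """Selects the log lines that belong to one trace."""
    return [entry for entry in entries if entry.get("trace_id") == trace_id]


related = logs_for_trace(logs, exemplar["trace_id"])
```

Without the `trace_id` field in the logs, the same investigation falls back to guessing by timestamp and service name.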

  • OpenTelemetry SDK / Collector — instrument + ship.
  • Prometheus + Alertmanager + Grafana — open source standard.
  • Loki — Prometheus-style (label-indexed) log aggregation.
  • Tempo — Grafana's trace backend, built on object storage.
  • Jaeger — popular open source tracing backend and UI.
  • Pyroscope / Parca — continuous profiling.
  • Vector / Fluent Bit — log shippers.
  • kube-prometheus-stack — Helm chart bundling Prometheus, Alertmanager, Grafana, and common exporters for Kubernetes.