Observability — Basics

Three pillars

Logs — structured records of discrete events.
Metrics — numeric time series (counter, gauge, histogram).
Traces — distributed request flow across services.

Modern view: also profiles (continuous CPU/memory) + events (deploys, alerts) tied together.

OpenTelemetry (OTel)

CNCF project, vendor-neutral. One API and SDK per language, one wire format (OTLP), many backends.

APIs — instrumentation surface for app code.
SDK — implementations (Node, Python, Go, Java, .NET, Ruby, Rust).
Auto-instrumentation — wraps common libs (HTTP, DB, queues) without code changes.
Collector — receive → process → export. Buffer, batch, route, sample.
Wire format: OTLP over gRPC or HTTP.
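The receive → process → export flow can be sketched in a few lines. This is a toy illustration only: the class name, the /healthz filter, and the batch size are assumptions for the example, and none of it is the Collector's real API or configuration.

```python
from typing import Callable, Dict, List

Record = Dict[str, object]


class BatchingPipeline:
    """Toy receive -> process -> export pipeline (hypothetical, not the Collector API)."""

    def __init__(self, export: Callable[[List[Record]], None], batch_size: int = 3) -> None:
        self.export = export
        self.batch_size = batch_size
        self.buffer: List[Record] = []

    def receive(self, record: Record) -> None:
        # Process step: drop noisy health-check records before buffering.
        if record.get("route") == "/healthz":
            return
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Export step: hand the whole batch to the backend in one call.
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []


exported: List[List[Record]] = []
pipeline = BatchingPipeline(export=exported.append, batch_size=2)
for route in ["/users", "/healthz", "/orders", "/cart"]:
    pipeline.receive({"route": route})
pipeline.flush()  # drain the final partial batch on shutdown
```

Batching trades a little latency for far fewer export calls, which is the main reason to put a collector between apps and backends.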

Formed in 2019 by merging OpenTracing and OpenCensus; supersedes both.

Metrics — Prometheus model

Counter — monotonically increasing (requests, errors); resets to zero only on process restart.
Gauge — current value (memory, queue depth, active connections).
Histogram — bucketed observations (latency).
Summary — quantiles computed client-side (rare).
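Histogram buckets are cumulative: each le bucket counts every observation at or below its bound. A minimal stdlib sketch of that bookkeeping follows; the Histogram class here is hypothetical and is not the Prometheus client library.

```python
from bisect import bisect_left
from typing import List


class Histogram:
    """Minimal Prometheus-style histogram: cumulative buckets plus sum and count."""

    def __init__(self, bounds: List[float]) -> None:
        self.bounds = sorted(bounds)
        # One slot per upper bound, plus a final slot for +Inf.
        self.counts = [0] * (len(self.bounds) + 1)
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left finds the first bound >= value, i.e. the smallest matching "le".
        self.counts[bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self, name: str) -> List[str]:
        lines: List[str] = []
        running = 0
        labels = [str(b) for b in self.bounds] + ["+Inf"]
        for le, n in zip(labels, self.counts):
            running += n  # buckets are cumulative in the exposition format
            lines.append(f'{name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{name}_sum {self.total}")
        lines.append(f"{name}_count {running}")
        return lines


h = Histogram([0.1, 0.5, 1.0])
for seconds in [0.05, 0.3, 0.5, 2.0]:
    h.observe(seconds)
lines = h.render("http_request_duration_seconds")
```

Because the buckets are plain counters, they can be summed across instances before quantiles are estimated, which client-side summaries cannot offer.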

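Pull model: Prometheus scrapes each target's /metrics endpoint on an interval. Each metric: name + labels (key=value). Cardinality matters: every distinct label combination is its own time series, so high-cardinality labels (user id, request id) multiply the series count, and storage and query cost with it.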

http_requests_total{method="GET", route="/users/:id", status="200"} 1234
http_request_duration_seconds_bucket{route="/users/:id", le="0.5"} 1100
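The cardinality problem is simple arithmetic: the worst-case series count for one metric is the product of the distinct values of each label. A quick sketch, with made-up label counts chosen only for illustration:

```python
from math import prod
from typing import Dict


def series_count(label_values: Dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of distinct values per label."""
    return prod(label_values.values())


# Hypothetical counts: 5 methods, 40 routes, 8 status codes.
bounded = series_count({"method": 5, "route": 40, "status": 8})

# Same metric with a user_id label covering 100,000 users.
with_user_id = series_count({"method": 5, "route": 40, "status": 8, "user_id": 100_000})
```

Here bounded is 1,600 series while with_user_id is 160,000,000, which is why per-user and per-request identifiers belong in logs and traces rather than metric labels.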

Logs

Structured (JSON) > plaintext. Fields: level, time, service, msg, trace_id, request_id, app-specific.

Levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.

Patterns:

One log line per logical event.
Include correlation IDs.
Don’t log secrets / PII.
Sample DEBUG/INFO in prod if volume hurts.
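A minimal sketch of one-JSON-object-per-event logging with correlation IDs, using only the Python standard library. The JsonFormatter class, the "checkout" service name, and the ID values are assumptions for the example. Note that Python's logging names its levels WARNING and CRITICAL rather than WARN and FATAL.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "time": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "msg": record.getMessage(),
            # Correlation IDs arrive through the `extra=` argument, if present.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stand-in for stdout so the output can be inspected
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="checkout"))
log = logging.getLogger("checkout")
log.setLevel(logging.INFO)
log.addHandler(handler)
log.propagate = False  # avoid duplicate plaintext output from the root logger

log.info(
    "order placed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-42"},
)
entry = json.loads(stream.getvalue())
```

In a real service the trace ID would come from the active span context instead of being passed by hand on every call.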

Distributed tracing

A trace = tree of spans. Each span = one operation (HTTP call, DB query, function).

Propagation via headers (W3C Trace Context):

traceparent: 00-<trace_id>-<span_id>-<flags>
tracestate: vendor data
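To make the header layout concrete, here is a hedged sketch of parsing a version-00 traceparent and building the header for a downstream call. The function names are hypothetical, and a real service would normally rely on an OpenTelemetry propagator rather than hand-rolled parsing.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version(2 hex) - trace_id(32 hex) - span_id(16 hex) - flags(2 hex), lowercase only
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceParent(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # This sketch handles only version 00; all-zero IDs are invalid.
    if version != "00" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    # The lowest flag bit is the "sampled" flag.
    return TraceParent(trace_id, span_id, bool(int(flags, 16) & 0x01))


def child_traceparent(parent: TraceParent) -> str:
    """Build the header for an outgoing call: same trace ID, fresh span ID."""
    flags = "01" if parent.sampled else "00"
    return f"00-{parent.trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
parsed = parse_traceparent(incoming)
outgoing = child_traceparent(parsed) if parsed else None
```

The trace ID stays constant across every hop, while each service substitutes its own span ID so the receiver knows which span is its parent.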

Spans carry: name, kind (server/client/internal/producer/consumer), times, attrs, events, status.
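The trace-as-tree idea can be sketched with a small data structure. The Span dataclass below is a simplified, hypothetical model (span events are omitted) and is not the OpenTelemetry SDK's span type; the span names and IDs are invented for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Span:
    name: str
    kind: str  # server / client / internal / producer / consumer
    span_id: str
    parent_id: Optional[str]  # None marks the root span
    start_ms: int
    end_ms: int
    status: str = "ok"
    attrs: Dict[str, str] = field(default_factory=dict)


def render_tree(spans: List[Span]) -> List[str]:
    """Indent each span under its parent, ordered by start time."""
    children: Dict[Optional[str], List[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_id, []).append(span)

    lines: List[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            duration = span.end_ms - span.start_ms
            lines.append(f"{'  ' * depth}{span.name} [{span.kind}] {duration}ms {span.status}")
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


trace = [
    Span("GET /users/:id", "server", "a1", None, 0, 120),
    Span("SELECT users", "client", "b2", "a1", 10, 60, attrs={"db.system": "postgresql"}),
    Span("GET /avatars", "client", "c3", "a1", 65, 115, status="error"),
]
tree = render_tree(trace)
```

Tracing UIs draw essentially this structure as a waterfall, using the start and end times to position each bar.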

Sampling:

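Head-based (decide at root): cheap, but the decision is made before the outcome is known, so it may drop the traces that later error.
Tail-based (decide after seeing whole trace): keeps the interesting ones (slow, errored). Needs a collector that buffers all spans of a trace before deciding.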
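The two strategies differ in what information the decision can use. A rough sketch, where the function names, the 1000 ms threshold, and the dict-based span shape are assumptions; the head sampler is in the spirit of trace-ID ratio sampling but is not claimed to match any SDK's exact algorithm.

```python
from typing import Dict, List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide at the root from the trace ID alone, so every service agrees."""
    # Treat the low 64 bits of the hex trace ID as a uniform number in [0, 2**64).
    return int(trace_id[-16:], 16) < ratio * (1 << 64)


def tail_sample(spans: List[Dict], slow_ms: int = 1000) -> bool:
    """Decide after the trace is complete: keep it if any span errored or ran slow."""
    return any(s["status"] == "error" or s["duration_ms"] >= slow_ms for s in spans)


keep_low = head_sample("0" * 31 + "1", ratio=0.1)   # tiny ID value falls under the threshold
keep_high = head_sample("f" * 32, ratio=0.1)        # largest ID value falls above it
keep_errored = tail_sample([
    {"status": "ok", "duration_ms": 40},
    {"status": "error", "duration_ms": 12},
])
keep_boring = tail_sample([{"status": "ok", "duration_ms": 40}])
```

The head decision needs no state, whereas the tail decision requires holding every span of a trace in memory until the trace is judged complete.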

Service-level signals

RED — Rate, Errors, Duration. For request-driven services.
USE — Utilization, Saturation, Errors. For resources.
Four golden signals (Google SRE) — Latency, Traffic, Errors, Saturation.
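As an illustration of RED, the three numbers can be computed from raw request records for one time window. This is a sketch under assumptions: the record shape, the rule that status >= 500 counts as an error, and the nearest-rank p95 are choices made for the example. In practice these values usually come from counters and histograms queried in the metrics backend.

```python
from typing import Dict, List


def red_summary(requests: List[Dict], window_s: float) -> Dict[str, float]:
    """Rate, error ratio and p95 duration for one window of request records."""
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_ms": 0.0}
    durations = sorted(r["duration_ms"] for r in requests)
    errors = sum(1 for r in requests if r["status"] >= 500)
    # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * len(durations) + 99) // 100
    return {
        "rate_per_s": len(requests) / window_s,
        "error_ratio": errors / len(requests),
        "p95_ms": durations[rank - 1],
    }


# 20 hypothetical requests over a 10 second window: 10 ms, 20 ms, ... 200 ms.
sample = [{"status": 200, "duration_ms": 10 * (i + 1)} for i in range(20)]
sample[3]["status"] = 503
summary = red_summary(sample, window_s=10)
```

For this sample the summary is 2 requests per second, a 5% error ratio, and a p95 of 190 ms.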

Common backends

Type	Examples
Metrics	Prometheus, VictoriaMetrics, Thanos, Mimir, Datadog, CloudWatch
Logs	Loki, Elasticsearch, OpenSearch, Splunk, Datadog Logs
Traces	Jaeger, Tempo, Honeycomb, Datadog APM, AWS X-Ray
All-in-one	Grafana Cloud, Datadog, New Relic, Honeycomb, Dynatrace

Dashboards & alerts

Dashboards: Grafana most common. Per-service dashboards with golden signals + business metrics.
Alerts: route through Alertmanager to PagerDuty / Opsgenie. Page on user-impacting symptoms (error rate up, latency up), not on resource usage alone.

Correlation

The point: jump from a metric anomaly → relevant logs → relevant traces. Make it possible by including the trace ID in every log line and attaching exemplars (trace IDs recorded alongside individual metric observations) to metrics.
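A small sketch of the metric → logs hop: an exemplar carries the trace ID of one concrete observation, and that ID selects the matching log lines. The exemplar dict, the log lines, and the logs_for_trace helper are all invented for illustration; real backends do this lookup through their own query languages.

```python
import json
from typing import Dict, List


def logs_for_trace(log_lines: List[str], trace_id: str) -> List[Dict]:
    """Return the parsed JSON log entries that carry the given trace ID."""
    entries = (json.loads(line) for line in log_lines)
    return [e for e in entries if e.get("trace_id") == trace_id]


# A slow latency observation stored together with the trace that produced it.
exemplar = {"metric": "http_request_duration_seconds", "value": 2.4, "trace_id": "abc123"}

log_lines = [
    '{"level": "INFO", "msg": "request started", "trace_id": "abc123"}',
    '{"level": "INFO", "msg": "request started", "trace_id": "def456"}',
    '{"level": "ERROR", "msg": "db timeout", "trace_id": "abc123"}',
]
related = logs_for_trace(log_lines, exemplar["trace_id"])
```

The same trace ID then opens the full trace in the tracing backend, completing the hop from anomaly to root cause.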

Tools

OpenTelemetry SDK / Collector — instrument + ship.
Prometheus + Alertmanager + Grafana — open source standard.
Loki — log store that indexes labels only, Prometheus-style.
Tempo — Grafana Labs' tracing backend, built on object storage.
Jaeger — open source tracing backend and UI.
Pyroscope / Parca — continuous profiling.
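Vector / Fluent Bit — log shippers.
kube-prometheus-stack — Helm chart bundling Prometheus Operator, Prometheus, Alertmanager, Grafana, and default exporters, dashboards, and rules for Kubernetes.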