Observability — Basics
Three pillars
- Logs — structured records of discrete events.
- Metrics — numeric time series (counter, gauge, histogram).
- Traces — distributed request flow across services.
Modern view: also profiles (continuous CPU/memory) + events (deploys, alerts) tied together.
OpenTelemetry (OTel)
CNCF project. One set of APIs and SDKs, one wire format (OTLP), many backends.
- APIs — instrumentation surface for app code.
- SDK — implementations (Node, Python, Go, Java, .NET, Ruby, Rust).
- Auto-instrumentation — wraps common libs (HTTP, DB, queues) without code changes.
- Collector — receive → process → export. Buffer, batch, route, sample.
- Wire format: OTLP over gRPC or HTTP.
Formed by merging the previously fragmented OpenCensus and OpenTracing projects, which it replaces.
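The Collector's receive → process → export flow can be sketched with a toy pipeline. This is a minimal illustration only: `BatchingPipeline`, `max_batch`, and the record fields are hypothetical names, not the Collector's real API or configuration.

```python
from typing import Callable, List


class BatchingPipeline:
    """Toy receive -> process -> export pipeline (hypothetical, not the Collector API)."""

    def __init__(self, export: Callable[[List[dict]], None], max_batch: int = 3):
        self.export = export
        self.max_batch = max_batch
        self.buffer: List[dict] = []

    def receive(self, record: dict) -> None:
        # Process step: drop DEBUG records, tag everything else.
        if record.get("level") == "DEBUG":
            return
        self.buffer.append({**record, "pipeline": "demo"})
        # Batch step: flush once the buffer reaches the batch size.
        if len(self.buffer) >= self.max_batch:
            self.flush()

    def flush(self) -> None:
        # Export step: hand the whole batch to the backend in one call.
        if self.buffer:
            self.export(self.buffer)
            self.buffer = []


batches: List[List[dict]] = []
pipeline = BatchingPipeline(export=batches.append, max_batch=2)
for i, level in enumerate(["INFO", "DEBUG", "ERROR", "INFO"]):
    pipeline.receive({"id": i, "level": level})
pipeline.flush()  # flush the partial batch left in the buffer
```

Batching trades a little latency for far fewer export calls, which is one reason to put a collector between services and the backend.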
Metrics — Prometheus model
- Counter — monotonically increasing (requests, errors).
- Gauge — current value (memory, queue depth, active connections).
- Histogram — bucketed observations (latency).
- Summary — quantiles computed client-side (rare).
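To see what "bucketed observations" means in practice, here is a minimal sketch of a histogram with cumulative `le` buckets, rendered as Prometheus-style text lines. The `Histogram` class is a hypothetical stand-in written for illustration; a real service would use a Prometheus client library instead.

```python
import bisect
from typing import List


class Histogram:
    """Hypothetical minimal histogram with cumulative buckets."""

    def __init__(self, name: str, bounds: List[float]):
        self.name = name
        self.bounds = sorted(bounds)
        self.counts = [0] * (len(self.bounds) + 1)  # last slot is the +Inf bucket
        self.total = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the first bound >= value, i.e. the smallest matching `le`.
        self.counts[bisect.bisect_left(self.bounds, value)] += 1
        self.total += value

    def render(self) -> List[str]:
        lines, running = [], 0
        for bound, count in zip(self.bounds + [float("inf")], self.counts):
            running += count  # buckets are cumulative: each includes all smaller ones
            le = "+Inf" if bound == float("inf") else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {running}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {running}")
        return lines


h = Histogram("http_request_duration_seconds", [0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.5, 2.0):
    h.observe(seconds)
lines = h.render()
```

Because buckets are cumulative, the `+Inf` bucket always equals the total observation count, and quantiles can be estimated server-side from the bucket counts.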
Pull model: Prometheus scrapes each target's /metrics endpoint. Each metric: name + labels (key=value). Cardinality matters: every distinct label combination is a separate time series, so high-cardinality labels (user id, request id) blow up the series count.
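The cardinality warning can be made concrete with a little arithmetic. The sketch below computes the worst-case number of time series for one metric as the product of distinct values per label; the label sets and the `series_count` helper are made up for illustration.

```python
from math import prod
from typing import Dict, List


def series_count(label_values: Dict[str, List[str]]) -> int:
    """Upper bound on time series: one per combination of label values."""
    return prod(len(values) for values in label_values.values())


# Hypothetical label sets for a single metric.
bounded = {
    "method": ["GET", "POST"],
    "route": ["/users/:id", "/orders"],
    "status": ["200", "404", "500"],
}
with_user_id = {**bounded, "user_id": [str(i) for i in range(10_000)]}

bounded_series = series_count(bounded)         # 2 * 2 * 3
exploded_series = series_count(with_user_id)   # 2 * 2 * 3 * 10_000
```

Adding a single unbounded label multiplies the series count by the number of distinct values, which is why ids belong in logs and traces rather than in metric labels.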
    http_requests_total{method="GET", route="/users/:id", status="200"} 1234
    http_request_duration_seconds_bucket{route="/users/:id", le="0.5"} 1100
Logs
Structured (JSON) > plaintext. Fields: level, time, service, msg, trace_id, request_id, app-specific.
Levels: TRACE, DEBUG, INFO, WARN, ERROR, FATAL.
Patterns:
- One log line per logical event.
- Include correlation IDs.
- Don’t log secrets / PII.
- Sample DEBUG/INFO in prod if volume hurts.
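As a rough sketch of these patterns, the standard `logging` module can emit one JSON object per event with correlation IDs attached. The `JsonFormatter` class, the `checkout` service name, and the ID values are assumptions chosen for the example.

```python
import io
import json
import logging
import time


class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log record."""

    def __init__(self, service: str):
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "time": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "service": self.service,
            "msg": record.getMessage(),
            # Correlation IDs arrive via `extra=` and become record attributes.
            "trace_id": getattr(record, "trace_id", None),
            "request_id": getattr(record, "request_id", None),
        }
        return json.dumps(payload)


stream = io.StringIO()  # stands in for stdout or a log file
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter(service="checkout"))

logger = logging.getLogger("demo.structured")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

logger.info(
    "order placed",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-42"},
)
entry = json.loads(stream.getvalue())
```

Keeping the IDs as top-level fields lets the log backend index and filter on them directly.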
Distributed tracing
A trace = tree of spans. Each span = one operation (HTTP call, DB query, function).
Propagation via headers (W3C Trace Context):
    traceparent: 00-<trace_id>-<span_id>-<flags>
    tracestate: vendor data
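A minimal sketch of parsing the `traceparent` header follows. It assumes the version `00` layout (2 hex digits of version, 32 of trace ID, 16 of span ID, 2 of flags) and rejects anything else; `parse_traceparent` and `TraceContext` are hypothetical names, and real services would normally rely on an OpenTelemetry propagator instead.

```python
import re
from typing import NamedTuple, Optional

# version - trace_id - span_id - flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid per W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceContext(version, trace_id, span_id, bool(int(flags, 16) & 0x01))


ctx = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
```

A service that receives this header keeps the trace ID, creates a new span ID for its own work, and forwards the updated header on outgoing calls.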
Spans carry: name, kind (server/client/internal/producer/consumer), times, attrs, events, status.
Sampling:
- Head-based (decide at the root, before the outcome is known): cheap, but may drop traces that later turn out slow or errored.
- Tail-based (decide after seeing the whole trace): keeps the interesting traces (slow, errored). Needs collector support.
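The difference between the two strategies can be sketched in a few lines. This is an illustration under stated assumptions, not any SDK's actual algorithm: `head_sample` makes a deterministic decision from the trace ID alone, while `tail_sample` inspects finished spans. The span fields (`status`, `start_ms`, `end_ms`) and the 500 ms threshold are hypothetical.

```python
from typing import List


def head_sample(trace_id: str, ratio: float) -> bool:
    """Decide up front: keep a fixed fraction of traces, chosen by trace ID."""
    # Compare the low 64 bits of the ID against a threshold, so every
    # service makes the same decision for the same trace.
    return int(trace_id[-16:], 16) < int(ratio * 2**64)


def tail_sample(spans: List[dict], slow_ms: float = 500.0) -> bool:
    """Decide afterwards: keep traces that errored or ran slowly."""
    errored = any(span["status"] == "ERROR" for span in spans)
    duration = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return errored or duration >= slow_ms


fast_ok = [{"status": "OK", "start_ms": 0, "end_ms": 40}]
fast_error = [
    {"status": "OK", "start_ms": 0, "end_ms": 40},
    {"status": "ERROR", "start_ms": 5, "end_ms": 30},
]
slow_ok = [{"status": "OK", "start_ms": 0, "end_ms": 900}]
```

The head decision needs no buffering, which is why it is cheap; the tail decision needs every span of a trace gathered in one place first, which is the collector support mentioned above.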
Service-level signals
- RED — Rate, Errors, Duration. For request-driven services.
- USE — Utilization, Saturation, Errors. For resources.
- Four golden signals (Google SRE) — Latency, Traffic, Errors, Saturation.
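As a small worked example, the RED signals can be derived from raw request records. In this sketch the `Request` type, the choice of HTTP 5xx as "error", and the nearest-rank p95 are all assumptions made for illustration; in production these numbers would usually come from counters and histograms rather than per-request lists.

```python
import math
from typing import Dict, List, NamedTuple


class Request(NamedTuple):
    duration_s: float
    status: int


def red_summary(requests: List[Request], window_s: float) -> Dict[str, float]:
    if not requests:
        return {"rate_per_s": 0.0, "error_ratio": 0.0, "p95_s": 0.0}
    errors = sum(1 for r in requests if r.status >= 500)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank percentile: the smallest value covering 95% of requests.
    p95 = durations[math.ceil(0.95 * len(durations)) - 1]
    return {
        "rate_per_s": len(requests) / window_s,    # Rate
        "error_ratio": errors / len(requests),     # Errors
        "p95_s": p95,                              # Duration
    }


window = [Request(0.1, 200), Request(0.2, 200), Request(0.3, 500), Request(1.5, 200)]
summary = red_summary(window, window_s=2.0)
```

Reporting duration as a high percentile rather than a mean keeps slow outliers visible.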
Common backends
| Type | Examples |
|---|---|
| Metrics | Prometheus, VictoriaMetrics, Thanos, Mimir, Datadog, CloudWatch |
| Logs | Loki, Elasticsearch, OpenSearch, Splunk, Datadog Logs |
| Traces | Jaeger, Tempo, Honeycomb, Datadog APM, AWS X-Ray |
| All-in-one | Grafana Cloud, Datadog, New Relic, Honeycomb, Dynatrace |
Dashboards & alerts
- Dashboards: Grafana most common. Per-service dashboards with golden signals + business metrics.
- Alerts: route to Alertmanager / PagerDuty / OpsGenie. Page on user-impacting symptoms (error rate up, latency up), not just resource usage.
Correlation
The point: jump from a metric anomaly → relevant logs → relevant traces. Make it possible by including the trace ID in logs and attaching exemplars to metrics.
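The jump from metric to logs can be sketched as a lookup keyed on trace ID. Everything here is hypothetical sample data: the `exemplar` dict stands in for an exemplar attached to a slow histogram observation, and `logs_for_trace` stands in for a query against the log backend.

```python
from typing import List

# Hypothetical structured log entries, each carrying a trace_id.
logs = [
    {"trace_id": "aaa111", "level": "INFO", "msg": "request started"},
    {"trace_id": "bbb222", "level": "INFO", "msg": "request started"},
    {"trace_id": "aaa111", "level": "ERROR", "msg": "db timeout"},
]

# Hypothetical exemplar: one sample observation that records its trace.
exemplar = {
    "metric": "http_request_duration_seconds_bucket",
    "value": 2.3,
    "trace_id": "aaa111",
}


def logs_for_trace(entries: List[dict], trace_id: str) -> List[dict]:
    """Return every log entry that belongs to the given trace."""
    return [entry for entry in entries if entry.get("trace_id") == trace_id]


related = logs_for_trace(logs, exemplar["trace_id"])
```

The same trace ID then opens the full trace in the tracing backend, completing the metric → logs → traces path.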
- OpenTelemetry SDK / Collector — instrument + ship.
- Prometheus + Alertmanager + Grafana — open source standard.
- Loki — Prometheus-style logs.
- Tempo — Grafana's trace backend, built on object storage.
- Jaeger — popular tracing backend and UI.
- Pyroscope / Parca — continuous profiling.
- Vector / Fluent Bit — log shippers.
- kube-prometheus-stack — Helm chart bundling Prometheus, Alertmanager, Grafana, and the standard Kubernetes exporters.