Observability — Theory (interview deep-dive)

  • Monitoring — known unknowns: things you predicted to track.
  • Observability — unknown unknowns: ability to ask new questions of the system without redeploying.

Modern systems demand observability. You can’t predict every failure mode of a microservices mesh; design for inspection.

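Time series count for a metric is bounded by the product of its label cardinalities. Putting user_id as a label on a request counter creates one series per user per combination of the other labels: millions of series, and the backend runs out of memory and index capacity.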
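The multiplication is easy to check by hand. The sketch below uses made-up label names and cardinalities (`method`, `status_class`, `route` and their sizes are assumptions for illustration, not values from any real service) to show how one unbounded label changes the upper bound.

```python
from math import prod

def max_series(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of its label cardinalities."""
    return prod(label_cardinalities.values())

# Hypothetical label sets for a request counter.
bounded = {"method": 5, "status_class": 5, "route": 40}
with_user_id = {**bounded, "user_id": 1_000_000}

print(max_series(bounded))       # 1000
print(max_series(with_user_id))  # 1000000000
```

The real count is usually lower than the bound because not every combination occurs, but the bound is what you have to provision for.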

Rules:

  • Don’t put high-cardinality dimensions on metrics (user id, request id, full URL with ids).
  • Use traces or logs for high-cardinality investigation; metrics for aggregates.
  • Bucket numeric values before using them as label values (e.g., bin: "1-10" rather than the raw count).
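The bucketing rule can be sketched as a small mapping from a raw number to one of a fixed set of label values. The bin edges and the idea of a "cart size" value are hypothetical; pick edges that match the questions you expect to ask.

```python
import bisect

# Hypothetical inclusive upper bounds and the label value for each bin.
UPPER_BOUNDS = [0, 10, 100, 1000]
LABELS = ["0", "1-10", "11-100", "101-1000", ">1000"]

def bin_label(value: int) -> str:
    """Map a non-negative integer to one of five fixed label values."""
    if value < 0:
        raise ValueError("value must be non-negative")
    # bisect_left returns the index of the first bound >= value.
    return LABELS[bisect.bisect_left(UPPER_BOUNDS, value)]

print(bin_label(7))     # 1-10
print(bin_label(5000))  # >1000
```

The label now has five possible values no matter how large the raw number grows.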

Prometheus best practice: use labels only for the dimensions you’d group or filter by in a query.

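  • RED (Rate, Errors, Duration) for request-driven services (APIs, RPC).
  • USE (Utilization, Saturation, Errors) for resources (CPU, disk, memory, queue, thread pool, connection pool).

Use both, not either: RED shows what callers experience, USE helps explain why.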

  • Histogram: the client emits cumulative bucket counts. These can be summed across instances, and quantiles are estimated at query time. Default choice.
  • Summary: the client computes quantiles per instance. Those quantiles cannot be aggregated across instances. Avoid for distributed services.

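Pre-define histogram buckets to cover your latency range, in seconds (0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10). Put a boundary at or near your SLO threshold so the estimate is accurate where it matters.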
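To see why bucket counts aggregate and per-instance quantiles do not, here is a minimal sketch using the bucket bounds listed above. The two pods and their counts are invented, and `estimate_quantile` is a simplified linear interpolation in the spirit of what a query engine does, not a copy of any particular implementation.

```python
import math

# Bucket upper bounds in seconds (the list above) plus the implicit +Inf bucket.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]

def merge(*instances: list[int]) -> list[int]:
    """Sum cumulative bucket counts element-wise across instances."""
    return [sum(column) for column in zip(*instances)]

def estimate_quantile(q: float, cumulative: list[int]) -> float:
    """Interpolate linearly inside the bucket that contains the target rank."""
    total = cumulative[-1]
    if not 0 < q <= 1 or total == 0:
        raise ValueError("need 0 < q <= 1 and at least one observation")
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            if math.isinf(BOUNDS[i]):
                return BOUNDS[i - 1]  # cannot interpolate into +Inf
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            return lower + (BOUNDS[i] - lower) * (rank - below) / (count - below)
    raise AssertionError("unreachable: the +Inf bucket holds every observation")

# Hypothetical cumulative counts: one fast pod, one slow pod, 100 requests each.
pod_a = [10, 30, 60, 80, 95, 99, 100, 100, 100, 100, 100, 100]
pod_b = [0, 0, 5, 10, 20, 40, 60, 80, 95, 100, 100, 100]

fleet_p99 = estimate_quantile(0.99, merge(pod_a, pod_b))
averaged_p99 = (estimate_quantile(0.99, pod_a) + estimate_quantile(0.99, pod_b)) / 2
print(fleet_p99, averaged_p99)
```

With these made-up counts the merged histogram puts the fleet p99 at roughly 4.0 s, while averaging the two per-pod p99 values gives roughly 2.4 s. The averaged figure is the kind of number you are stuck with when each instance exports a summary.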

  • SLI (Service Level Indicator) — measured reality (e.g. p99 latency, error rate).
  • SLO (Objective) — target for SLI (e.g. 99.9% requests < 300ms over 30 days).
  • SLA (Agreement) — contractual SLO with consequences (refunds).
  • Error budget = 1 - SLO. If the SLO is 99.9%, the budget is 0.1% of requests over the SLO window; once it is spent, freeze risky changes until it recovers.
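The budget arithmetic is worth doing once with concrete numbers. The sketch below assumes a 99.9% SLO and a hypothetical volume of 10 million requests in the window; the downtime figure is the time-based reading of the same budget and only applies if the SLO is defined over time rather than requests.

```python
def error_budget(slo: float) -> float:
    """Fraction of requests (or time) allowed to be bad."""
    return 1.0 - slo

def allowed_bad_requests(slo: float, total_requests: int) -> float:
    return error_budget(slo) * total_requests

def allowed_downtime_minutes(slo: float, window_days: int) -> float:
    return error_budget(slo) * window_days * 24 * 60

# Hypothetical traffic: 10 million requests in a 30-day window.
print(round(allowed_bad_requests(0.999, 10_000_000)))  # 10000
print(round(allowed_downtime_minutes(0.999, 30), 1))   # 43.2
```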

Design alerts on burn rate of error budget, not raw thresholds.
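Burn rate is the observed error ratio divided by the budget: a burn rate of 1 spends the budget exactly over the SLO window, and 10 spends it ten times faster. The sketch below is one possible way to turn that into an alert decision with a fast and a slow window. The function names and both thresholds (14.4 and 3.0) are illustrative assumptions, not values taken from this document.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    return error_ratio / (1.0 - slo)

def hours_to_exhaustion(rate: float, window_days: int = 30) -> float:
    """Time until the whole budget is gone if this burn rate continues."""
    return window_days * 24 / rate

def classify(fast_window_errors: float, slow_window_errors: float, slo: float,
             fast_threshold: float = 14.4, slow_threshold: float = 3.0) -> str:
    """Assumed policy: page on a fast burn, open a ticket on a slow burn."""
    if burn_rate(fast_window_errors, slo) > fast_threshold:
        return "page"
    if burn_rate(slow_window_errors, slo) > slow_threshold:
        return "ticket"
    return "ok"

# 1% errors against a 99.9% SLO burns the budget about 10x too fast.
print(round(burn_rate(0.01, 0.999), 1))  # 10.0
print(hours_to_exhaustion(10))           # 72.0
```

A raw threshold such as "error rate above 1%" says nothing about how long the service can keep going; the burn rate ties the alert to the time left before the SLO is missed.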

Tracing is expensive: keeping every trace can multiply telemetry volume 10-100×, so sample.

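  • Head-based: decide at the root span, before the outcome is known. Cheap. Loses interesting traces (you don’t know yet that they’ll be slow).
  • Tail-based: the collector buffers spans and decides after the trace ends. Keeps slow / errored traces. Needs a collector with enough memory to buffer.
  • Probabilistic: keep each trace with a fixed probability.
  • Adaptive: adjust the sampling probability to hit a target trace rate per service.

Common setup: 1-10% probabilistic sampling plus always keeping traces with errors.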
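The difference between head-based and tail-based decisions can be sketched in a few lines. This is a toy model, not any collector's API: `FinishedTrace`, the 1000 ms slow threshold, and the choice to derive the probabilistic decision from the last 16 hex digits of the trace ID are all assumptions, and it presumes trace IDs are random.

```python
from dataclasses import dataclass

def head_sample(trace_id: str, rate: float) -> bool:
    """Probabilistic decision at the root span, deterministic per trace ID."""
    # Treat the low 64 bits of the hex trace ID as a uniform random number.
    return int(trace_id[-16:], 16) < int(rate * 2**64)

@dataclass
class FinishedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool

def tail_sample(trace: FinishedTrace, rate: float, slow_ms: float = 1000.0) -> bool:
    """Decision after the trace ends: always keep errors and slow traces."""
    if trace.has_error or trace.duration_ms >= slow_ms:
        return True
    return head_sample(trace.trace_id, rate)  # baseline sample of normal traffic

slow = FinishedTrace("f" * 32, duration_ms=2400.0, has_error=False)
print(head_sample(slow.trace_id, 0.05))  # False: dropped before it turned out slow
print(tail_sample(slow, 0.05))           # True: kept because it was slow
```

Because the head decision depends only on the trace ID, every service that applies the same rate reaches the same verdict for a given trace without coordinating.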

A high-volume service can spend 5-20% of its infrastructure budget on telemetry. Levers:

  • Sampling traces.
  • Reducing log volume (drop debug in prod).
  • Limiting log retention.
  • Aggregating metrics per pod (avoid per-request series).

Structured logging:

  • Always JSON.
  • Stable keys: ts, level, service, env, version, host, msg, trace_id, span_id, request_id.
  • Errors: include error.type, error.message, error.stack.
  • Avoid logging the same event from multiple layers.
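The logging rules above can be sketched with the standard library alone. `JsonFormatter` is a hypothetical class written for this example, and the service name, version, and IDs in the usage lines are placeholders; in a real service the trace and span IDs would come from the tracing library rather than being passed by hand.

```python
import json
import logging
from datetime import datetime, timezone

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the stable keys listed above."""

    def __init__(self, service: str, env: str, version: str, host: str):
        super().__init__()
        self.static = {"service": service, "env": env, "version": version, "host": host}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            **self.static,
            "msg": record.getMessage(),
        }
        # Correlation IDs arrive through logging's `extra=` mechanism.
        for key in ("trace_id", "span_id", "request_id"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        if record.exc_info:
            exc_type, exc, _tb = record.exc_info
            entry["error.type"] = exc_type.__name__
            entry["error.message"] = str(exc)
            entry["error.stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)

# Usage with placeholder values.
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("checkout", "prod", "1.4.2", "pod-7"))
logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order placed", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                                   "span_id": "00f067aa0ba902b7",
                                   "request_id": "req-123"})
```

Keeping the key set fixed in one formatter is what makes the logs queryable: every line can be filtered by trace_id or service without per-team parsing rules.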

Interview questions:

  1. Why are histograms preferred over summaries in distributed systems? Bucket counts can be summed across instances and quantiles estimated at query time; per-instance summary quantiles cannot be combined meaningfully.
  2. Cardinality explosion: example and fix. A user_id label on a counter; remove the label and use traces or logs for per-user investigation.
  3. You see a p99 spike but no change in error rate. What does that mean? A subset of requests is hitting a slow path (slow shard, contention). Investigate via traces filtered to high latency.
  4. How do you design alerting? SLO-based, with multi-window burn rate (1h fast, 24h slow). Alert on symptoms, not causes.
  5. Why can the Prometheus pull model be problematic? Short-lived jobs (use Pushgateway), large NAT’d fleets, security boundaries. An OTel Collector with push-based export can address these.
  6. How does a trace ID flow across services? Via W3C Trace Context headers (traceparent, tracestate), injected and extracted by the instrumentation library.
  7. You need to debug a one-off slow request from yesterday. Find its trace ID in the logs and look up the trace. If the trace was not sampled, fall back to the logs and improve tail sampling so slow requests are kept.
  8. What does OpenTelemetry add over using Prometheus + Jaeger directly? One instrumentation layer, multiple backends, vendor-neutral, easier to switch backends later.
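For question 6, it helps to know what the header actually looks like. The sketch below handles only version 00 of the W3C `traceparent` header (version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags) and ignores `tracestate`. `SpanContext`, `parse_traceparent`, and `outgoing_headers` are hypothetical helpers written for illustration; in practice the instrumentation library does this for you.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")

@dataclass
class SpanContext:
    trace_id: str
    span_id: str
    sampled: bool

def parse_traceparent(header: str) -> Optional[SpanContext]:
    """Return the caller's context, or None if the header is missing or invalid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, span_id, flags = match.groups()
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return SpanContext(trace_id, span_id, bool(int(flags, 16) & 0x01))

def outgoing_headers(incoming: dict[str, str], sample_new_root: bool = True) -> dict[str, str]:
    """Keep the caller's trace ID, mint a new span ID, and pass the sampled flag on."""
    parent = parse_traceparent(incoming.get("traceparent", ""))
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    sampled = parent.sampled if parent else sample_new_root
    span_id = secrets.token_hex(8)
    return {"traceparent": f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"}

incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
print(outgoing_headers(incoming)["traceparent"])
```

The trace ID stays constant across every hop while the span ID changes, which is what lets the backend stitch spans from different services into one trace.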

Anti-patterns:

  • Alerting on every metric: causes alert fatigue.
  • Logging stack traces for handled errors: reduces signal.
  • No correlation IDs: you can’t follow a request across services.
  • Span attributes that contain full request payloads.
  • Dashboards no one looks at.
  • Alerts that fire during deploys (fix with deploy-aware silencing).
  • High-cardinality labels.

Checklist:

  • OpenTelemetry SDK with auto-instrumentation enabled.
  • Trace ID + request ID in every log line.
  • RED/USE dashboards per service.
  • SLO burn-rate alerts routed to PagerDuty.
  • Centralized log store with 7-30d retention.
  • Tail-based sampling + always-on for errors.
  • Deploy markers on dashboards.
  • Runbook link in every alert.
  • Post-incident reviews record the observability gaps they found.