Observability — Theory (interview deep-dive)

Monitoring — known unknowns: things you predicted to track.
Observability — unknown unknowns: ability to ask new questions of the system without redeploying.

Modern systems demand observability. You can’t predict every failure mode of a microservices mesh; design for inspection.

Time series count for a metric is bounded by the product of its label cardinalities. Put user_id as a label on a request counter and you get millions of series: backend memory and query time explode.

Rules:

Don’t put high-cardinality dimensions on metrics (user id, request id, full URL with ids).
Use traces or logs for high-cardinality investigation; metrics for aggregates.
Bucket numeric values before using them as labels (e.g., bin: "1-10" rather than the raw count).

Prometheus best practice: labels for the dimensions you’d group by in a query.
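
The arithmetic behind the cardinality rule can be sketched in a few lines. This is an illustration only: the label names, their cardinalities, and the bin edges in `bucket_label` are made-up assumptions, not values from any real system.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def bucket_label(n: int) -> str:
    """Map a raw count to a coarse bin so the label stays low-cardinality.

    The bin edges here are hypothetical; pick ones that match your data.
    """
    if n <= 0:
        return "0"
    if n <= 10:
        return "1-10"
    if n <= 100:
        return "11-100"
    return "101+"


# Hypothetical label sets for a request counter.
safe = series_count({"method": 5, "status": 6, "route": 40})
risky = series_count({"method": 5, "status": 6, "route": 40, "user_id": 1_000_000})

print(safe)   # 1200
print(risky)  # 1200000000
print(bucket_label(7))  # 1-10
```

One extra high-cardinality label multiplies every existing combination, which is why the fix is to drop the label rather than to tune the backend.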

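RED (Rate, Errors, Duration) for request-driven services (APIs, RPC).
USE (Utilization, Saturation, Errors) for resources (CPU, disk, memory, queue, thread pool, connection pool).

Use both: RED on the service's requests, USE on the resources that serve them.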
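
As a rough sketch of what each method measures, the functions below compute RED signals from a window of request records and USE signals from a pool snapshot. The `Request` type, the field names, and the choice of median as the duration statistic are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_s: float
    ok: bool


def red(requests: list[Request], window_s: float) -> dict:
    """Rate, Errors, Duration for a request-driven service over one window."""
    n = len(requests)
    errors = sum(1 for r in requests if not r.ok)
    durations = sorted(r.duration_s for r in requests)
    return {
        "rate_per_s": n / window_s,
        "error_ratio": errors / n if n else 0.0,
        # Median here for brevity; real dashboards usually track p95/p99 too.
        "duration_p50_s": durations[n // 2] if n else None,
    }


def use(busy_workers: int, pool_size: int, queued: int, errors: int = 0) -> dict:
    """Utilization, Saturation, Errors for a resource such as a thread pool."""
    return {
        "utilization": busy_workers / pool_size,
        "saturation": queued,  # work waiting because the resource is full
        "errors": errors,
    }


window = [Request(0.1, True)] * 8 + [Request(0.5, False)] * 2
print(red(window, window_s=10.0))
print(use(busy_workers=6, pool_size=8, queued=3))
```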

Histogram: client emits bucket counts. Aggregable across instances. Quantiles computed at query time. Default choice.
Summary: client computes quantiles per instance. Precomputed quantiles cannot be merged or averaged meaningfully across instances. Avoid for distributed services.

Pre-define histogram buckets around the latencies you care about. The Prometheus client defaults, in seconds: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10.
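
To see why bucket counts aggregate while per-instance quantiles do not, here is a minimal sketch: two instances record into the same buckets, the counts are summed, and the quantile is estimated from the merged counts by linear interpolation inside the bucket. It resembles what a query-time quantile function does, but it is a simplified illustration, not Prometheus's actual implementation; all function names are hypothetical.

```python
# Bucket upper bounds in seconds; the final +Inf bucket catches everything else.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, float("inf")]


def new_histogram() -> list[int]:
    return [0] * len(BOUNDS)


def observe(counts: list[int], value: float) -> None:
    """Increment the first bucket whose upper bound covers the value."""
    for i, upper in enumerate(BOUNDS):
        if value <= upper:
            counts[i] += 1
            return


def merge(*instances: list[int]) -> list[int]:
    """Aggregation across instances is just element-wise addition."""
    return [sum(column) for column in zip(*instances)]


def quantile(q: float, counts: list[int]) -> float:
    """Estimate the q-quantile, interpolating linearly within the bucket."""
    total = sum(counts)
    if total == 0:
        return float("nan")
    rank = q * total
    cumulative = 0
    for i, count in enumerate(counts):
        if count and cumulative + count >= rank:
            lower = BOUNDS[i - 1] if i > 0 else 0.0
            upper = BOUNDS[i]
            if upper == float("inf"):
                return lower  # cannot interpolate into an unbounded bucket
            return lower + (upper - lower) * (rank - cumulative) / count
        cumulative += count
    return BOUNDS[-2]


a, b = new_histogram(), new_histogram()
for _ in range(90):
    observe(a, 0.08)   # instance a: mostly fast
for _ in range(10):
    observe(a, 0.4)    # instance a: a slow tail
for _ in range(100):
    observe(b, 0.07)   # instance b: all fast

fleet_p99 = quantile(0.99, merge(a, b))
print(fleet_p99)  # roughly 0.45 s, driven by instance a's slow tail
```

The estimate is only as precise as the bucket boundaries, which is why buckets should be chosen around the latency targets you alert on.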

SLI (Service Level Indicator) — measured reality (e.g. p99 latency, error rate).
SLO (Objective) — target for SLI (e.g. 99.9% requests < 300ms over 30 days).
SLA (Agreement) — contractual SLO with consequences (refunds).
Error budget = 1 - SLO. If SLO is 99.9%, budget is 0.1% of requests; once spent, freeze risky changes.
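
The budget arithmetic is easy to get wrong under interview pressure, so here is a small worked sketch. The request volume and failure counts are invented for the example, and the helper names are hypothetical.

```python
def allowed_failures(slo: float, total_requests: int) -> float:
    """Requests that may fail in the window without breaching the SLO."""
    return (1 - slo) * total_requests


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent).

    Assumes slo < 1; a 100% SLO has no budget to divide by.
    """
    return 1 - failed_requests / allowed_failures(slo, total_requests)


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Time-based view of the same budget."""
    return (1 - slo) * window_days * 24 * 60


print(round(allowed_failures(0.999, 10_000_000)))            # 10000
print(round(budget_remaining(0.999, 10_000_000, 2_500), 2))  # 0.75
print(round(allowed_downtime_minutes(0.999), 1))             # 43.2
```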

Design alerts on burn rate of error budget, not raw thresholds.
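
A burn-rate alert can be sketched as follows. Burn rate is the observed error ratio divided by the budget ratio (1 - SLO), so a burn rate of 1 spends the budget exactly over the SLO period. The windows (1h and 24h) and the budget fractions (2% and 10%) below are assumptions chosen for illustration; tune them to your own paging tolerance.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumed 30-day SLO period


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1 - slo)


def threshold(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that would spend `budget_fraction` of the budget in one window."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def should_page(err_1h: float, err_24h: float, slo: float) -> bool:
    # Fast window: catches sharp outages quickly (2% of budget in 1h -> 14.4x).
    fast = burn_rate(err_1h, slo) >= threshold(0.02, 1)
    # Slow window: catches steady leaks (10% of budget in 24h -> 3x).
    slow = burn_rate(err_24h, slo) >= threshold(0.10, 24)
    return fast or slow


print(should_page(err_1h=0.02, err_24h=0.002, slo=0.999))   # True: fast burn
print(should_page(err_1h=0.001, err_24h=0.001, slo=0.999))  # False: on budget
```

Unlike a raw threshold such as "error rate > 1%", the same rule adapts automatically when the SLO changes.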

Tracing is expensive — full traces multiply data 10-100×. Sample.

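Head-based: decide at the root span and propagate the decision. Cheap. Loses interesting traces (you don’t know yet that they’ll be slow or fail).
Tail-based: collector buffers spans and decides after the trace ends. Keeps slow / errored traces. Needs a collector with enough memory to hold in-flight traces.
Probabilistic: keep each trace with a fixed probability.
Adaptive: adjust the probability to hit a target trace rate per service.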

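Common: 1-10% probabilistic sampling + always keep errored traces (the error rule needs a tail-based decision, since errors are unknown at the root).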
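
The difference between the two decision points can be sketched as below. This is a toy model, not a real collector: the `Span` type, the 1000 ms slow threshold, and the modulo-based sampling rule are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str        # 32 hex characters
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based, probabilistic: decided from the trace id alone.

    Deriving the decision from the id (instead of random()) means every
    service that sees the same trace id reaches the same verdict.
    """
    return int(trace_id, 16) % 10_000 < round(rate * 10_000)


def tail_sample(spans: list[Span], rate: float, slow_ms: float = 1000.0) -> bool:
    """Tail-based: runs once the whole trace has been buffered."""
    if any(s.error for s in spans):
        return True                      # always keep errored traces
    if max(s.duration_ms for s in spans) >= slow_ms:
        return True                      # always keep slow traces
    return head_sample(spans[0].trace_id, rate)  # otherwise sample normally


boring_id = format(9_999, "032x")  # falls outside a 5% sample
slow_trace = [Span(boring_id, 40.0), Span(boring_id, 2300.0)]

print(head_sample(boring_id, 0.05))   # False: dropped at the root
print(tail_sample(slow_trace, 0.05))  # True: kept because one span was slow
```

The same trace that head-based sampling drops is kept by the tail-based rule, at the cost of buffering every span until the trace completes.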

A high-volume service can spend 5-20% of infra on telemetry. Levers:

Always log structured JSON, one object per line.
Stable keys: ts, level, service, env, version, host, msg, trace_id, span_id, request_id.
Errors: include error.type, error.message, error.stack.
Avoid logging the same event from multiple layers.
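
A minimal sketch of these logging rules with Python's standard `logging` module follows. The `JsonFormatter` and `make_logger` helpers, the service name, and the env/version values are assumptions for the example; production setups usually pull the trace and span ids from the active tracing context rather than passing them by hand.

```python
import json
import logging
import socket
import sys
import traceback
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per record with a stable set of keys."""

    def __init__(self, service: str, env: str, version: str):
        super().__init__()
        self.static = {"service": service, "env": env, "version": version,
                       "host": socket.gethostname()}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            **self.static,
            "msg": record.getMessage(),
        }
        # Correlation ids arrive via `extra=`; keep the keys present even if unset.
        for key in ("trace_id", "span_id", "request_id"):
            entry[key] = getattr(record, key, None)
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, tb = record.exc_info
            entry["error.type"] = exc_type.__name__
            entry["error.message"] = str(exc)
            entry["error.stack"] = "".join(traceback.format_exception(exc_type, exc, tb))
        return json.dumps(entry)


def make_logger(name: str, stream=sys.stdout) -> logging.Logger:
    handler = logging.StreamHandler(stream)
    # env and version are placeholder values for this sketch.
    handler.setFormatter(JsonFormatter(service=name, env="dev", version="0.0.0"))
    logger = logging.getLogger(name)
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid the same event being emitted by the root logger
    return logger


log = make_logger("checkout")
try:
    1 / 0
except ZeroDivisionError:
    # Log once, at the layer that handles the error.
    log.exception("payment failed", extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
                                           "span_id": "00f067aa0ba902b7",
                                           "request_id": "req-123"})
```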

Why are histograms preferred over summaries in distributed systems? Aggregable across instances.
Cardinality explosion — example and fix. user_id label on a counter; remove and trace instead.
You see p99 spike but no error rate change: what does that mean? A subset of requests is hitting a slow path (slow shard, contention) but still succeeding. Investigate via traces filtered to high latency.
How do you design alerting? SLO-based, multi-window burn rate (1h fast, 24h slow). Alert on symptoms, not causes.
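Why might Prometheus's pull model be problematic? Short-lived jobs can exit before they are scraped (use Pushgateway), large NAT’d fleets are hard to reach, and security boundaries may block inbound scrapes. Pushing to an OTel Collector addresses these cases.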
Trace ID flows across services — how? W3C Trace Context headers. Lib-managed.
You need to debug a one-off slow request from yesterday. Find its trace id in the logs and look up the trace. If the trace was not sampled, work from the logs and improve tail sampling so slow requests are kept.
OpenTelemetry value over Prometheus + Jaeger directly? One instrumentation, multi-backend, future-proof, vendor-neutral.
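
For the trace propagation question above, it helps to know what the header actually looks like. A W3C `traceparent` value has four dash-separated hex fields: version, trace id (32 chars), parent span id (16 chars), and flags (2 chars, lowest bit means sampled). The sketch below parses a version-00 header and builds the header a service would send downstream. In practice the instrumentation library does this for you; the function names here are hypothetical and the sketch ignores `tracestate` and future versions.

```python
import re
import secrets

# Version 00 only: 00-<trace-id>-<parent-id>-<flags>, all lowercase hex.
_TRACEPARENT_V00 = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def parse_traceparent(header: str):
    """Return the trace context as a dict, or None if the header is invalid."""
    match = _TRACEPARENT_V00.match(header.strip())
    if match is None:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero ids are invalid per the spec
    return {
        "trace_id": trace_id,
        "parent_id": parent_id,
        "sampled": bool(int(flags, 16) & 0x01),
    }


def child_traceparent(ctx: dict) -> str:
    """Header for an outgoing call: same trace id, fresh span id, same sampling flag."""
    span_id = secrets.token_hex(8)  # 8 random bytes -> 16 hex chars
    flags = "01" if ctx["sampled"] else "00"
    return f"00-{ctx['trace_id']}-{span_id}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
ctx = parse_traceparent(incoming)
outgoing = child_traceparent(ctx)
print(ctx["trace_id"])
print(outgoing)  # same trace id, new span id
```

Because the trace id is preserved hop by hop, logging it on every request is what makes "find by trace id from log" possible.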