Observability — Theory (interview deep-dive)
Monitoring vs observability
- Monitoring covers known unknowns: the failure modes you predicted and chose to track.
- Observability covers unknown unknowns: the ability to ask new questions of the system without redeploying.
Modern systems demand observability. You can’t predict every failure mode of a microservices mesh; design for inspection.
Metrics — cardinality is your enemy
The number of time series for a metric is, at worst, the product of its label cardinalities. Putting user_id as a label on a request counter creates millions of series, which can overwhelm the metrics backend.
Rules:
- Don’t put high-cardinality dimensions on metrics (user id, request id, full URL with ids).
- Use traces or logs for high-cardinality investigation; metrics for aggregates.
- Bucket numeric values (e.g., bin: "1-10", not the raw count).
Prometheus best practice: use labels only for the dimensions you would group or filter by in a query.
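The multiplication rule above is easy to check with a few lines of arithmetic. The sketch below is illustrative only: the label names, their cardinalities, and the bucket boundaries are assumptions, not values from any real system.

```python
from math import prod


def series_count(label_cardinalities: dict[str, int]) -> int:
    """Worst-case number of time series for one metric: the product of
    the number of distinct values each label can take."""
    return prod(label_cardinalities.values())


def bucket_label(n: int) -> str:
    """Map a raw count to a small, fixed set of label values
    (hypothetical boundaries) so the label stays low-cardinality."""
    if n <= 0:
        return "0"
    if n <= 10:
        return "1-10"
    if n <= 100:
        return "11-100"
    return "101+"


# Hypothetical label sets for a request counter.
bounded = {"method": 5, "status_class": 5, "route": 40}
with_user_id = {**bounded, "user_id": 1_000_000}

print(series_count(bounded))       # 1000
print(series_count(with_user_id))  # 1000000000
print(bucket_label(7))             # 1-10
```

One unbounded label multiplies every other dimension, which is why removing it (or bucketing it) is usually the first fix.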
RED / USE — when each
- RED (Rate, Errors, Duration) for request-driven services (APIs, RPC).
- USE (Utilization, Saturation, Errors) for resources (CPU, disk, memory, queue, thread pool, connection pool).
Use both, not one or the other.
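To make the two views concrete, here is a minimal sketch that computes RED numbers from a batch of request records and USE numbers for a connection pool. The record shape, the field names, and the choice of "status >= 500 is an error" are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP-style status code (assumed)
    duration_s: float  # wall-clock duration in seconds


def red_summary(requests: list[Request], window_s: float) -> dict[str, float]:
    """RED view of a request-driven service over one time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.status >= 500)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "max_duration_s": max((r.duration_s for r in requests), default=0.0),
    }


def use_summary(in_use: int, capacity: int, waiting: int, errors: int) -> dict[str, float]:
    """USE view of a resource such as a connection pool."""
    return {
        "utilization": in_use / capacity,  # fraction of the resource that is busy
        "saturation": waiting,             # work queued because the resource is full
        "errors": errors,                  # e.g. failed checkouts
    }
```

In practice the duration part of RED would be a histogram rather than a single maximum; the maximum is used here only to keep the sketch short.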
Histograms vs summaries
- Histogram: the client emits bucket counts, which can be aggregated across instances; quantiles are computed at query time. This is the default choice.
- Summary: the client computes quantiles per instance, and those quantiles cannot be aggregated across instances. Avoid for distributed services.
Pre-define histogram buckets, for example in seconds: 0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10 (these match the Prometheus client default buckets).
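The reason histograms aggregate and summaries do not can be shown in a few lines: bucket counts from different instances can simply be added, and a quantile is then estimated from the merged counts. The sketch below assumes cumulative bucket counts and uses linear interpolation within a bucket, similar in spirit to what a query-time quantile function does; it is an approximation, not a reproduction of any backend's exact algorithm.

```python
import math

# Bucket upper bounds in seconds; the final +Inf bucket catches everything else.
BOUNDS = [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10, math.inf]


def merge(histograms: list[list[int]]) -> list[int]:
    """Sum cumulative bucket counts across instances, bucket by bucket."""
    return [sum(column) for column in zip(*histograms)]


def quantile(q: float, cumulative: list[int], bounds: list[float] = BOUNDS) -> float:
    """Estimate quantile q (0..1) from cumulative bucket counts by
    interpolating linearly inside the bucket that contains the target rank."""
    total = cumulative[-1]
    if total == 0:
        return math.nan
    rank = q * total
    for i, count in enumerate(cumulative):
        if count >= rank:
            lower = bounds[i - 1] if i > 0 else 0.0
            prev = cumulative[i - 1] if i > 0 else 0
            # Cannot interpolate inside the +Inf bucket or an empty bucket.
            if math.isinf(bounds[i]) or count == prev:
                return lower
            return lower + (bounds[i] - lower) * (rank - prev) / (count - prev)
    return bounds[-2]  # only reached if q > 1


# Usage: two instances report cumulative counts for the same bounds.
instance_a = [0, 5, 20, 60, 90, 98, 100, 100, 100, 100, 100, 100]
instance_b = [0, 0, 10, 30, 70, 95, 99, 100, 100, 100, 100, 100]
fleet_p99 = quantile(0.99, merge([instance_a, instance_b]))
```

Averaging per-instance p99 values from summaries would not give the fleet-wide p99; merging bucket counts first does, up to bucket resolution.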
SLI, SLO, SLA, error budget
- SLI (Service Level Indicator): the measured reality (e.g. p99 latency, error rate).
- SLO (Service Level Objective): the target for an SLI (e.g. 99.9% of requests complete in under 300 ms over 30 days).
- SLA (Service Level Agreement): a contractual SLO with consequences (e.g. refunds).
- Error budget = 1 - SLO. If the SLO is 99.9%, the budget is 0.1% of requests; once it is spent, freeze risky changes.
Design alerts on the burn rate of the error budget, not on raw thresholds.
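The arithmetic behind burn-rate alerting is short enough to sketch. A burn rate of 1.0 means the budget would be used up exactly at the end of the SLO window; higher values mean it runs out sooner. The function names and the example threshold are assumptions for illustration, and real alert rules would read the error ratios from a metrics backend.

```python
def error_budget(slo: float) -> float:
    """Allowed failure fraction, e.g. roughly 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are being spent."""
    return error_ratio / error_budget(slo)


def days_until_exhausted(error_ratio: float, slo: float, window_days: float = 30.0) -> float:
    """Time to spend the whole budget if the current error ratio persists."""
    rate = burn_rate(error_ratio, slo)
    return float("inf") if rate == 0 else window_days / rate


def should_page(long_window_ratio: float, short_window_ratio: float,
                slo: float, threshold: float) -> bool:
    """Multi-window check: page only if both the long and the short window
    burn faster than the threshold, so a brief blip that has already
    recovered does not page anyone."""
    return (burn_rate(long_window_ratio, slo) >= threshold
            and burn_rate(short_window_ratio, slo) >= threshold)


# Usage: 2% of requests failing against a 99.9% SLO burns budget about 20x
# too fast, which would exhaust a 30-day budget in roughly a day and a half.
rate = burn_rate(0.02, 0.999)
```

Compared with a raw threshold such as "error rate above 1%", this ties the alert to how much of the agreed budget is actually at risk.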
Trace sampling
Tracing is expensive: keeping full traces can multiply telemetry volume 10-100×, so sample.
- Head-based: decide at the root span. Cheap, but it loses interesting traces, because at that point you don’t know they will be slow.
- Tail-based: the collector buffers spans and decides after the trace ends. Keeps slow and errored traces, but needs a collector with enough memory.
- Probabilistic: keep each trace with a fixed percentage chance.
- Adaptive: adjust the rate to hit a target volume per service.
A common setup: 1-10% sampling plus always keeping traces with errors.
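The difference between head-based and tail-based decisions can be sketched as two small functions. This is a simplified illustration under stated assumptions: the `Span` shape, the slow-trace threshold, and deriving the head decision from the low 64 bits of the trace ID are choices made for the example, not a description of any particular SDK or collector.

```python
import random
from dataclasses import dataclass
from typing import Callable


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at the root span, before anything is known about
    the outcome. Deriving the decision from the trace ID (32 hex chars)
    means every service reaches the same answer for the same trace."""
    return int(trace_id[-16:], 16) < rate * 2**64


@dataclass
class Span:
    duration_ms: float
    error: bool = False


def tail_keep(spans: list[Span], slow_ms: float, baseline_rate: float,
              rng: Callable[[], float] = random.random) -> bool:
    """Tail-based: decide after the trace has ended and all spans are
    buffered. Always keep errored or slow traces; keep a small random
    share of the rest as a baseline."""
    if any(s.error for s in spans):
        return True
    if max((s.duration_ms for s in spans), default=0.0) >= slow_ms:
        return True
    return rng() < baseline_rate
```

The head decision needs no state, which is why it is cheap; the tail decision needs the whole trace in memory, which is where the collector cost comes from.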
Cost of observability
A high-volume service can spend 5-20% of its infrastructure budget on telemetry. Levers:
- Sampling traces.
- Aggregating logs and dropping debug-level logs in production.
- Limiting log retention.
- Aggregating metrics per pod (avoid per-request metrics).
Log structure recommendations
- Always JSON.
- Stable keys: ts, level, service, env, version, host, msg, trace_id, span_id, request_id.
- Errors: include error.type, error.message, error.stack.
- Avoid logging the same event from multiple layers.
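The key layout above can be sketched with the standard library `logging` module. This is one possible shape, not a prescribed implementation: the `JsonFormatter` class and the convention of passing trace_id, span_id, and request_id through `extra=` are assumptions for the example.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render every record as one JSON object with a stable key set."""

    def __init__(self, service: str, env: str, version: str, host: str):
        super().__init__()
        self.static = {"service": service, "env": env, "version": version, "host": host}

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            **self.static,
            "msg": record.getMessage(),
        }
        # Correlation IDs are expected as record attributes, e.g. set via
        # logger.info("...", extra={"trace_id": ...}).
        for key in ("trace_id", "span_id", "request_id"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        # Error details only when an exception is attached to the record.
        if record.exc_info and record.exc_info[0] is not None:
            exc_type, exc, _ = record.exc_info
            entry["error.type"] = exc_type.__name__
            entry["error.message"] = str(exc)
            entry["error.stack"] = self.formatException(record.exc_info)
        return json.dumps(entry)
```

A handler would use it with `handler.setFormatter(JsonFormatter("checkout", "prod", "1.4.2", "pod-7"))`, where all four values are placeholders.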
Common interview Qs
- Why are histograms preferred over summaries in distributed systems? Their bucket counts can be aggregated across instances.
- Cardinality explosion: example and fix. A user_id label on a counter; remove the label and use traces instead.
- You see a p99 spike but no change in error rate. What does that mean? Some users are hitting a slow path (slow shard, contention). Investigate via traces filtered to high latency.
- How do you design alerting? SLO-based, with multi-window burn rates (1h fast, 24h slow). Alert on symptoms, not causes.
- Why might Prometheus’s pull model be problematic? Short-lived jobs (use Pushgateway), large NAT’d fleets, security boundaries. An OTel Collector with push addresses these.
- How does a trace ID flow across services? Via W3C Trace Context headers, managed by the instrumentation library.
- You need to debug a one-off slow request from yesterday. Find it by the trace ID in the logs. If the trace was not sampled, improve tail sampling.
- What is the value of OpenTelemetry over using Prometheus + Jaeger directly? One instrumentation, multiple backends, vendor-neutral, future-proof.
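For the trace propagation question, it helps to know what the W3C `traceparent` header looks like: `version-traceid-parentid-flags`, all lowercase hex. Instrumentation libraries normally handle this for you; the hand-rolled sketch below only handles version 00 and is meant to show the mechanics, with `TraceContext`, `parse_traceparent`, and `child_traceparent` being names invented for the example.

```python
import re
import secrets
from dataclasses import dataclass
from typing import Optional

# version (2 hex) - trace id (32 hex) - parent span id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


@dataclass
class TraceContext:
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if it is invalid."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if not match:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs are invalid; this sketch only understands version 00.
    if version != "00" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return TraceContext(trace_id, parent_id, sampled=bool(int(flags, 16) & 0x01))


def child_traceparent(ctx: TraceContext) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID, and the
    sampling flag carried over unchanged."""
    span_id = secrets.token_hex(8)  # 16 hex characters
    return f"00-{ctx.trace_id}-{span_id}-{'01' if ctx.sampled else '00'}"
```

The trace ID stays constant across every hop while the span ID changes, which is what lets a backend stitch the spans into one trace.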
Anti-patterns
- Alerting on every metric, which causes alert fatigue.
- Logging stack traces for handled errors, which reduces signal.
- No correlation IDs, so you can’t follow a request across services.
- Per-request span attributes containing full payloads.
- Dashboards no one looks at.
- Alerts firing during deploys (fix with deploy-aware silencing).
- High-cardinality labels.
Production observability checklist
- OpenTelemetry SDK with auto-instrumentation enabled.
- Trace ID + request ID in every log line.
- RED/USE dashboards per service.
- SLO burn-rate alerts in PagerDuty.
- Centralized log store with 7-30 days of retention.
- Tail-based sampling + always-on for errors.
- Deploy markers on dashboards.
- Runbook link in every alert.
- Post-incident review captures missing observability.