SRE / SLO — Theory

SRE / SLO — Theory (interview deep-dive)

An SLO bridges product and engineering on reliability:

  • Product: how reliable does it need to be?
  • Engineering: how much risk can we take to ship features?

Without SLOs you either over-invest in reliability (slow features) or under-invest (constant incidents).

100% is the wrong target. Cost grows nonlinearly with each additional nine past roughly 99.9%, and users cannot perceive availability better than that of their own network path (mobile networks are typically around 99.5%).

Tier example:

  • Tier 1 (payments, auth): 99.95-99.99%.
  • Tier 2 (most APIs): 99.9%.
  • Tier 3 (analytics, batch): 99.5%.
  • Internal-only: ~99%.
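To make the tiers concrete, it helps to translate each target into the downtime it permits. The sketch below assumes a 30-day window and treats availability as time-based; the function name and the `TIERS` mapping are illustrative, not part of any standard tool.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage that fit inside the SLO over the window."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


# Illustrative targets taken from the tier list above.
TIERS = {
    "tier 1 (upper bound)": 99.99,
    "tier 1 (lower bound)": 99.95,
    "tier 2": 99.9,
    "tier 3": 99.5,
    "internal-only": 99.0,
}

for name, slo in TIERS.items():
    print(f"{name:22s} {slo:6.2f}%  {allowed_downtime_minutes(slo):7.1f} min / 30d")
```

For a 99.9% target this comes to roughly 43 minutes per 30 days, which is why each additional nine gets expensive quickly.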

If service A depends on B and C:

  • A’s availability ≤ B × C when both are hard dependencies (both must work) and their failures are independent.
  • Mitigate via fallbacks, retries, caching, async paths.
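A small worked example of the dependency arithmetic, assuming independent failures (real outages are often correlated, so treat the numbers as an approximation). The helper names are hypothetical.

```python
from math import prod


def serial_availability(*dependencies: float) -> float:
    """Upper bound for a service that needs every dependency to work."""
    return prod(dependencies)


def with_fallback(primary: float, fallback: float) -> float:
    """Availability of a dependency that only fails when primary and fallback both fail."""
    return 1 - (1 - primary) * (1 - fallback)


b = 0.999
c = 0.999
print(f"A bounded by B x C:           {serial_availability(b, c):.6f}")

# Assumption for illustration: a cache can serve C's data 99% of the time.
c_with_cache = with_fallback(c, 0.99)
print(f"A with a cache in front of C: {serial_availability(b, c_with_cache):.6f}")
```

Two 99.9% dependencies already cap A near 99.8%, while a modest fallback for one of them recovers most of the loss.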

A critical user journey spans many services, so define journey-level SLOs (e.g. “complete purchase < 3s, 99.9% success”), not just per-service ones.
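One way to evaluate such a journey SLO is to count an attempt as good only when it both succeeds and finishes in time. This is a minimal sketch; the `JourneyAttempt` record and its fields are assumptions about how the journey might be instrumented, not an existing schema.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class JourneyAttempt:
    succeeded: bool    # the whole journey completed end to end
    duration_s: float  # wall-clock time across all services involved


def journey_sli(attempts: Sequence[JourneyAttempt], max_duration_s: float = 3.0) -> float:
    """Fraction of attempts that succeeded within the latency limit."""
    if not attempts:
        raise ValueError("no attempts in the window")
    good = sum(1 for a in attempts if a.succeeded and a.duration_s < max_duration_s)
    return good / len(attempts)


def meets_slo(attempts: Sequence[JourneyAttempt], target: float = 0.999) -> bool:
    return journey_sli(attempts) >= target


window = [JourneyAttempt(True, 1.2)] * 998 + [
    JourneyAttempt(True, 4.5),   # succeeded but too slow: counts as bad
    JourneyAttempt(False, 0.8),  # fast but failed: counts as bad
]
print(f"SLI = {journey_sli(window):.4f}, meets 99.9% target: {meets_slo(window)}")
```

Folding latency and success into a single good/bad decision keeps the journey SLO to one number that product and engineering can both reason about.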

Document the error budget policy before an incident:

  • Crossed 50% budget burn? Flag. PM/Eng agree priorities.
  • Crossed 100% budget burn? Freeze new feature deploys. Reliability-only.
  • Spent 2x budget? Senior leadership decision.
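The thresholds above can be sketched as a small lookup. The error budget is taken to be the failures the SLO allows over the window, (1 - SLO) × total requests; the function names and action strings are illustrative only.

```python
def budget_consumed(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget spent so far (1.0 means fully spent)."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return failed_requests / allowed_failures


def policy_action(consumed: float) -> str:
    """Map budget consumption to the documented policy step."""
    if consumed >= 2.0:
        return "escalate: senior leadership decision"
    if consumed >= 1.0:
        return "freeze new feature deploys; reliability-only"
    if consumed >= 0.5:
        return "flag; PM/Eng agree priorities"
    return "normal operations"


# Hypothetical month so far: 1M requests at a 99.9% SLO allows about 1,000 failures.
spent = budget_consumed(slo=0.999, total_requests=1_000_000, failed_requests=600)
print(f"{spent:.0%} of budget spent -> {policy_action(spent)}")
```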

Blameless = focus on systems, not individuals. The question isn’t “who caused this?” but “what conditions allowed this to slip through?”

Standard postmortem sections:

  1. Summary.
  2. Impact (customer-facing duration, error rate, requests affected).
  3. Detection (how we found out).
  4. Timeline (UTC, all events).
  5. Contributing factors (multiple — never one root cause).
  6. What went well.
  7. What went poorly.
  8. Action items (owner, due date, P0/P1/P2).

Tools: PagerDuty, FireHydrant, incident.io, Grafana OnCall, plain Confluence/Notion.

  • Rotation of suitable size (e.g., 6+ engineers minimum).
  • Compensation / time-off-in-lieu.
  • Handoff every shift.
  • Page only on actionable, customer-impacting issues.
  • Reduce after-hours pages over time.
  • Game days quarterly.

Inject controlled failures to verify resilience. Steady-state hypothesis: “If I shut down a pod, p99 latency stays < 500ms”.

Tools: Chaos Mesh, Litmus, Gremlin, AWS FIS, Toxiproxy (network).

Start in staging, then move carefully to prod: run off-hours and with blast-radius controls.
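A steady-state hypothesis is only useful if it can be checked mechanically before and during the experiment. The sketch below uses a nearest-rank p99 over collected latency samples; the sample data is made up, and in practice the numbers would come from your metrics system rather than a list.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def steady_state_holds(latencies_ms: Sequence[float], p99_limit_ms: float = 500.0) -> bool:
    """Hypothesis from the text: p99 latency stays below 500 ms."""
    return percentile(latencies_ms, 99) < p99_limit_ms


baseline = [100.0] * 99 + [450.0]           # before injecting the failure
during_fault = [100.0] * 95 + [900.0] * 5   # hypothetical samples after killing a pod

print("baseline ok:    ", steady_state_holds(baseline))
print("during fault ok:", steady_state_holds(during_fault))
```

If the baseline check already fails, the experiment should not start; if the check fails during the fault, that is the signal to abort and roll back.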

  • Load test new endpoints, plot throughput vs latency.
  • Set scaling targets at 50-60% utilization at peak (headroom for failover, hot spots, deploys).
  • Forecast growth (N% MoM).
  • Annual capacity review per service.
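The capacity bullets above combine into simple arithmetic: forecast peak load with compound month-over-month growth, then size the fleet so that peak sits at the target utilization. All numbers below are hypothetical; per-instance throughput would come from the load test.

```python
import math


def forecast_peak(current_peak_rps: float, monthly_growth: float, months: int) -> float:
    """Peak load after compounding N% month-over-month growth."""
    return current_peak_rps * (1 + monthly_growth) ** months


def instances_needed(peak_rps: float, per_instance_rps: float,
                     target_utilization: float = 0.6) -> int:
    """Instances required so that peak load uses only the target share of capacity."""
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    return math.ceil(peak_rps / (per_instance_rps * target_utilization))


today = 1000.0       # assumed current peak, requests per second
per_instance = 100.0 # assumed sustainable rps per instance from a load test

print("instances today:       ", instances_needed(today, per_instance))
in_a_year = forecast_peak(today, monthly_growth=0.05, months=12)
print("instances in 12 months:", instances_needed(in_a_year, per_instance))
```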
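  1. What’s an error budget and how do you use it? The allowed unreliability (100% minus the SLO) over the window. Spend it on risk while it is healthy; apply the budget policy above as it burns down.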
  2. Define an SLO for a payment API. Availability + latency over rolling 30d, e.g. 99.95% requests succeed and 99% complete < 500ms.
  3. You’re over budget mid-month — what do you do? Freeze risky changes; reliability-only sprint; investigate top contributors.
  4. A postmortem found a junior engineer ran the bad command — what’s the right outcome? Focus on missing safeguards; system improvements (review, dry-run, RBAC, runbook), not punishment.
  5. Difference between SLA and SLO? SLA contractual; SLO internal target (typically tighter).
  6. How do you alert on SLO? Multi-window burn-rate alerts, not raw thresholds.
  7. Toil reduction example. Manual restart → liveness probe + auto restart; manual scaling → HPA; runbook → GitHub Action.
  8. SRE vs DevOps? SRE is one implementation of the DevOps philosophy, with an explicit reliability framework.
  9. How to balance reliability and feature speed? Error budget. When budget is healthy, take risks. When burning, slow down.
  10. What makes a good runbook? Symptoms, likely causes, diagnostic queries, mitigation steps, escalation path. Tested in game days.
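For question 6, a multi-window burn-rate alert can be sketched as follows. Burn rate is the observed error ratio divided by the budgeted error ratio (1 - SLO); paging requires both a long and a short window to exceed the threshold, so the alert fires on sustained burn and clears soon after recovery. The 14.4 threshold (about 2% of a 30-day budget spent in one hour) is a commonly cited starting point, used here as an assumption rather than a rule.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are arriving."""
    if not 0 < slo < 1:
        raise ValueError("slo must be strictly between 0 and 1")
    return error_ratio / (1 - slo)


def budget_fraction_spent(rate: float, window_hours: float, slo_window_days: int = 30) -> float:
    """Share of the whole budget consumed by burning at `rate` for `window_hours`."""
    return rate * window_hours / (slo_window_days * 24)


def should_page(long_window_errors: float, short_window_errors: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows (e.g. 1h and 5m) burn faster than the threshold."""
    return (burn_rate(long_window_errors, slo) >= threshold
            and burn_rate(short_window_errors, slo) >= threshold)


SLO = 0.999
print("sustained 2% errors:   ", should_page(0.02, 0.02, SLO))
print("already recovered:     ", should_page(0.02, 0.0005, SLO))
print("brief spike only:      ", should_page(0.001, 0.05, SLO))
print(f"budget spent in 1h at 14.4x: {budget_fraction_spent(14.4, 1):.1%}")
```

Compared with a raw error-rate threshold, this ties the page directly to how quickly the error budget is disappearing.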

Production incident response loop (3-phase)

  1. Detect — alerts, customer reports.
  2. Mitigate — restore service first; root cause later. Roll back, scale up, drain bad node, disable feature.
  3. Resolve — fix root cause, postmortem, action items.

Communication roles: incident commander, scribe, comms (status page). For background, search for “incident command system”.

  • 100% availability SLO.
  • Same SLO for tier 1 and internal tooling.
  • No documented escalation path.
  • Page on CPU not user-impact.
  • Root-cause = “person X did Y”.
  • Action items never closed.
  • Postmortems only on Sev1 → small incidents accumulate.
  • Treating SRE as a separate team that’s “operations” while devs throw services over the wall.