SRE / SLO — Theory (interview deep-dive)
Why SLOs
SLOs bridge product and engineering on reliability:
- Product: how reliable does it need to be?
- Engineering: how much risk can we take to ship features?
Without SLOs you either over-invest in reliability (slow features) or under-invest (constant incidents).
Picking the right number (not 100%)
100% is the wrong target. Costs grow nonlinearly past 99.9%, and users cannot perceive reliability beyond that of their own network (~99.5% is typical on mobile).
Tier example:
- Tier 1 (payments, auth): 99.95-99.99%.
- Tier 2 (most APIs): 99.9%.
- Tier 3 (analytics, batch): 99.5%.
- Internal-only: ~99%.
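To make the tiers concrete, the following minimal sketch converts an availability target into the minutes of full outage it tolerates. The 30-day window and the function name are assumptions chosen for illustration, not part of the tier list above.

```python
# Allowed downtime implied by an availability target.
WINDOW_DAYS = 30  # assumption: 30-day rolling window


def allowed_downtime_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Minutes of complete outage the target tolerates within the window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (1 - slo_percent / 100)


for slo in (99.0, 99.5, 99.9, 99.95, 99.99):
    minutes = allowed_downtime_minutes(slo)
    print(f"{slo}% -> {minutes:.1f} min per {WINDOW_DAYS} days")
```

Under these assumptions, 99.9% leaves about 43 minutes per 30 days, while 99.99% leaves about 4 minutes, which is why each extra nine costs so much more to defend.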
Composing SLOs across services
If service A depends on B and C:
- A’s availability ≤ B × C when both must work and their failures are independent (e.g. 99.9% × 99.9% ≈ 99.8%).
- Mitigate via fallbacks, retries, caching, async paths.
A critical user journey spans many services, so its SLO is composed across all of them. Define journey SLOs (e.g. “complete purchase < 3s, 99.9% success”), not just per-service SLOs.
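The composition arithmetic can be sketched as follows. This is a simplified model that assumes independent failures; the function names and the 99% fallback figure are hypothetical values picked for illustration.

```python
from math import prod


def serial_availability(*deps: float) -> float:
    """Upper bound when every dependency must succeed (independent failures assumed)."""
    return prod(deps)


def with_fallback(primary: float, fallback: float) -> float:
    """Availability when a request succeeds if either the primary or the fallback works."""
    return 1 - (1 - primary) * (1 - fallback)


b, c = 0.999, 0.999
print(f"A is at most {serial_availability(b, c):.6f}")

# Hypothetical: a cache that can answer 99% of requests while C is down.
c_mitigated = with_fallback(c, 0.99)
print(f"A with cached C is at most {serial_availability(b, c_mitigated):.6f}")
```

In this model the fallback raises C's effective availability from 0.999 to 0.99999, so A's upper bound moves back toward B's own number. Real failures are often correlated, so treat the result as an optimistic bound.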
Error budget policy
Document the policy before an incident, not during one:
- Crossed 50% budget burn? Flag it. PM and Eng agree on priorities.
- Crossed 100% budget burn? Freeze new feature deploys; reliability work only.
- Spent 2x budget? Escalate for a senior leadership decision.
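The thresholds above can be expressed as a small lookup. This is a sketch that assumes a request-based SLO (good requests divided by total requests); the function names and the action strings are illustrative, not a standard API.

```python
def budget_burned(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the error budget consumed so far (1.0 means 100%)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures <= 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return failed_requests / allowed_failures


def policy_action(burned: float) -> str:
    """Map budget consumption to the documented policy step."""
    if burned >= 2.0:
        return "escalate to senior leadership"
    if burned >= 1.0:
        return "freeze feature deploys; reliability work only"
    if burned >= 0.5:
        return "flag; PM and Eng agree on priorities"
    return "normal operations"


# Hypothetical month so far: 1,000,000 requests, 600 failures, 99.9% SLO.
burned = budget_burned(0.999, 1_000_000, 600)
print(f"{burned:.0%} of budget burned -> {policy_action(burned)}")
```

With a 99.9% target, one million requests allow roughly 1,000 failures, so 600 failures is about 60% of the budget and lands on the first policy step.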
Postmortem culture
Blameless = focus on systems, not individuals. The question isn’t “who caused this?” but “what conditions allowed this to slip through?”
Standard postmortem sections:
- Summary.
- Impact (customer-facing duration, error rate, requests affected).
- Detection (how we found out).
- Timeline (UTC, all events).
- Contributing factors (multiple — never one root cause).
- What went well.
- What went poorly.
- Action items (owner, due date, P0/P1/P2).
Tools: PagerDuty, FireHydrant, incident.io, Grafana OnCall, plain Confluence/Notion.
On-call practices
- Rotation large enough to be sustainable (e.g., at least 6 engineers).
- Compensation / time-off-in-lieu.
- Handoff every shift.
- Page only on actionable, customer-impacting issues.
- Reduce after-hours pages over time.
- Game days quarterly.
Chaos engineering
Inject controlled failures to verify resilience. Steady-state hypothesis: “If I shut down a pod, p99 latency stays < 500ms”.
Tools: Chaos Mesh, Litmus, Gremlin, AWS FIS, Toxiproxy (network).
Start in staging, then move carefully to prod, off-hours at first and with blast-radius controls.
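A steady-state hypothesis is only useful if it can be checked mechanically. The sketch below evaluates the p99 example against latency samples collected during an experiment. The sample data, function names, and the nearest-rank percentile method are assumptions for illustration; none of the tools listed above is being modelled.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples collected")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def steady_state_holds(latencies_ms: list[float], threshold_ms: float = 500.0) -> bool:
    """True if p99 latency stayed under the threshold during the experiment."""
    return percentile(latencies_ms, 99) < threshold_ms


# Hypothetical samples recorded while one pod was shut down.
during_experiment = [120.0] * 99 + [900.0]
print("hypothesis holds:", steady_state_holds(during_experiment))
```

A single slow request out of 100 does not break the hypothesis here, because p99 under nearest-rank is the 99th smallest sample. Two slow requests out of 100 would.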
Capacity planning
- Load test new endpoints, plot throughput vs latency.
- Target 50-60% utilization at peak (headroom for failover, hot spots, deploys).
- Forecast growth (N% MoM).
- Annual capacity review per service.
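The growth forecast and the utilization target combine into a simple question: how many months until peak utilization exceeds the target if nothing is added? The sketch below assumes compounding month-over-month growth and fixed capacity; all names and numbers are illustrative.

```python
from typing import Optional


def months_until_target(current_util: float, target_util: float,
                        growth_mom: float, max_months: int = 120) -> Optional[int]:
    """Months of compounding growth until peak utilization exceeds the target.

    Returns None if the target is not exceeded within max_months.
    """
    util = current_util
    months = 0
    while util <= target_util:
        if months >= max_months:
            return None
        util *= 1 + growth_mom
        months += 1
    return months


# Hypothetical service: 40% peak utilization today, 60% target, 5% MoM growth.
print(months_until_target(0.40, 0.60, 0.05))
```

With these numbers the target is exceeded in month 9, so an annual review alone would arrive too late and capacity needs to be scheduled earlier.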
Common interview Qs
- What’s an error budget and how do you use it? The budget is 1 − SLO, the amount of unreliability you are allowed to spend; see the error budget policy above for how to act on it.
- Define an SLO for a payment API. Availability + latency over a rolling 30d window, e.g. 99.95% of requests succeed and 99% complete in < 500ms.
- You’re over budget mid-month — what do you do? Freeze risky changes; reliability-only sprint; investigate top contributors.
- A postmortem found a junior engineer ran the bad command — what’s the right outcome? Focus on missing safeguards; system improvements (review, dry-run, RBAC, runbook), not punishment.
- Difference between SLA and SLO? SLA contractual; SLO internal target (typically tighter).
- How do you alert on SLO? Multi-window burn-rate alerts, not raw thresholds.
- Toil reduction example. Manual restart → liveness probe + auto restart; manual scaling → HPA; runbook → GitHub Action.
- SRE vs DevOps? SRE is one implementation of the DevOps philosophy, with an explicit reliability framework (SLOs, error budgets).
- How to balance reliability and feature speed? Error budget. When budget is healthy, take risks. When burning, slow down.
- What makes a good runbook? Symptoms, likely causes, diagnostic queries, mitigation steps, escalation path. Tested in game days.
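The multi-window burn-rate answer above can be sketched as follows. Burn rate is the observed error rate divided by the budgeted error rate (1 − SLO), so 1.0 means the budget would last exactly the SLO window. The 14.4 threshold is an assumption taken from a commonly cited example (spending 2% of a 30-day budget in one hour: 0.02 × 720 h = 14.4), and the window choices are illustrative.

```python
def burn_rate(error_rate: float, slo: float) -> float:
    """How fast the error budget is being spent relative to the sustainable pace."""
    return error_rate / (1 - slo)


def should_page(long_window_error_rate: float, short_window_error_rate: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Page only if both windows burn fast.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening right now.
    """
    return (burn_rate(long_window_error_rate, slo) >= threshold
            and burn_rate(short_window_error_rate, slo) >= threshold)


SLO = 0.999
# Ongoing incident: 2% errors over the last hour, 3% over the last 5 minutes.
print(should_page(0.02, 0.03, SLO))
# Already recovered: the hour still looks bad, the last 5 minutes are clean.
print(should_page(0.02, 0.0005, SLO))
```

Requiring both windows avoids paging for an incident that has already recovered, which a raw threshold on the one-hour error rate would still do.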
Production incident response loop (3-phase)
- Detect: alerts, customer reports.
- Mitigate: restore service first; root cause later. Roll back, scale up, drain bad node, disable feature.
- Resolve: fix root cause, postmortem, action items.
Communication roles: incident commander, scribe, comms (status page). For background, search for “incident command system”.
Anti-patterns
- 100% availability SLO.
- Same SLO for tier 1 and internal tooling.
- No documented escalation path.
- Paging on CPU rather than user impact.
- Root-cause = “person X did Y”.
- Action items never closed.
- Postmortems only on Sev1 → small incidents accumulate.
- Treating SRE as a separate team that’s “operations” while devs throw services over the wall.