SRE / SLO — Theory (interview deep-dive)

Why SLOs

Bridges product and engineering on reliability:

Product: how reliable does it need to be?
Engineering: how much risk can we take to ship features?

Without SLOs you either over-invest in reliability (slow features) or under-invest (constant incidents).

Picking the right number (not 100%)

100% is the wrong target. Costs grow nonlinearly past 99.9%, and users cannot perceive reliability beyond that of their own network path (~99.5% is typical on mobile).

Tier example:

Tier 1 (payments, auth): 99.95-99.99%.
Tier 2 (most APIs): 99.9%.
Tier 3 (analytics, batch): 99.5%.
Internal-only: ~99%.
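
To make the tiers concrete, each target can be converted into allowed bad minutes. This is a minimal sketch that assumes a 30-day window and a time-based budget; the function name and the window length are illustrative, not taken from any tool.

```python
# Convert an availability SLO into a downtime budget.
# Assumption: 30-day window, budget measured in minutes of unavailability.

WINDOW_DAYS = 30  # assumed window length


def error_budget_minutes(slo_percent: float, window_days: int = WINDOW_DAYS) -> float:
    """Return the allowed bad minutes: (1 - SLO) * window length."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_percent / 100) * window_minutes


if __name__ == "__main__":
    tiers = [
        ("Tier 1 (upper)", 99.99),
        ("Tier 1 (lower)", 99.95),
        ("Tier 2", 99.9),
        ("Tier 3", 99.5),
        ("Internal-only", 99.0),
    ]
    for name, slo in tiers:
        print(f"{name:15s} {slo:6.2f}% -> {error_budget_minutes(slo):6.1f} min per 30d")
```

Under these assumptions 99.9% leaves about 43 minutes per 30 days and 99.99% leaves about 4, which is why each extra nine changes how the service has to be operated.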

Composing SLOs across services

If service A depends on B and C:

A’s availability ≤ availability(B) × availability(C), assuming both are hard dependencies that fail independently (both must work).
Mitigate via fallbacks, retries, caching, async paths.

A critical user journey is composed across many services, so define journey-level SLOs (e.g. “complete purchase < 3s, 99.9% success”), not just per-service ones.
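
The composition bound can be sketched numerically. The example assumes independent failures and made-up availability figures; `serial_availability` and `with_fallback` are hypothetical helper names.

```python
from math import prod


def serial_availability(*availabilities: float) -> float:
    """Availability of a request that needs every component to succeed.

    Assumes failures are independent.
    """
    return prod(availabilities)


def with_fallback(primary: float, fallback: float) -> float:
    """Availability when a fallback can serve the request if the primary fails.

    Assumes failures are independent: the request fails only if both fail.
    """
    return 1 - (1 - primary) * (1 - fallback)


# Illustrative numbers only: A's own availability plus two dependencies.
a_own, b, c = 0.9995, 0.999, 0.999

hard = serial_availability(a_own, b, c)                       # B and C both required
soft = serial_availability(a_own, b, with_fallback(c, 0.99))  # C has a cache fallback

print(f"hard dependencies: {hard:.4%}")
print(f"C with fallback:   {soft:.4%}")
```

With these inputs the hard-dependency chain lands below every individual component, and adding a modest fallback for one dependency recovers most of that loss.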

Error budget policy

Document before an incident:

Crossed 50% budget burn? Flag. PM/Eng agree priorities.
Crossed 100% budget burn? Freeze new feature deploys. Reliability-only.
Spent 2x budget? Senior leadership decision.
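
The thresholds above can be written down as a small decision function. This is a sketch assuming a request-based budget; the function names and the returned strings are illustrative.

```python
def budget_consumed(total_requests: int, failed_requests: int, slo: float) -> float:
    """Fraction of the error budget spent in the window (1.0 means 100%)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        # No budget at all (no traffic, or a 100% SLO): any failure overspends it.
        return float("inf") if failed_requests else 0.0
    return failed_requests / allowed_failures


def policy_action(consumed: float) -> str:
    """Map budget consumption to the documented policy step."""
    if consumed >= 2.0:
        return "escalate: senior leadership decision"
    if consumed >= 1.0:
        return "freeze: reliability-only, no new feature deploys"
    if consumed >= 0.5:
        return "flag: PM/Eng agree priorities"
    return "normal: ship as usual"


# Example: 1M requests in the window, 600 failures, 99.9% SLO.
consumed = budget_consumed(1_000_000, 600, slo=0.999)
print(f"{consumed:.0%} of budget spent -> {policy_action(consumed)}")
```

Encoding the policy this way keeps the decision mechanical during an incident; the human judgment goes into choosing the thresholds beforehand.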

Postmortem culture

Blameless = focus on systems, not individuals. The question isn’t “who caused this?” but “what conditions allowed this to slip through?”

Standard postmortem sections:

Summary.
Impact (customer-facing duration, error rate, requests affected).
Detection (how we found out).
Timeline (UTC, all events).
Contributing factors (multiple — never one root cause).
What went well.
What went poorly.
Action items (owner, due date, P0/P1/P2).

Tools: PagerDuty, FireHydrant, incident.io, Grafana OnCall, plain Confluence/Notion.

On-call practices

Rotation large enough to be sustainable (e.g. at least 6 engineers).
Compensation / time-off-in-lieu.
Handoff every shift.
Page only on actionable, customer-impacting issues.
Reduce after-hours pages over time.
Game days quarterly.

Chaos engineering

Inject controlled failures to verify resilience. Steady-state hypothesis: “If I shut down a pod, p99 latency stays < 500ms”.

Tools: Chaos Mesh, Litmus, Gremlin, AWS FIS, Toxiproxy (network).

Start in staging, then move carefully to prod: off-hours at first, with blast-radius controls.
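
A steady-state hypothesis is only useful if it can be checked automatically during the experiment. The sketch below assumes latency samples (in ms) have already been collected from your metrics system; `p99` uses the nearest-rank method and both function names are hypothetical.

```python
from typing import Sequence


def p99(latencies_ms: Sequence[float]) -> float:
    """Nearest-rank 99th percentile of the samples."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    # 1-based rank = ceil(0.99 * n), computed with integers to avoid float error.
    rank = (99 * len(ordered) + 99) // 100
    return ordered[rank - 1]


def steady_state_holds(latencies_ms: Sequence[float], threshold_ms: float = 500.0) -> bool:
    """True if the hypothesis 'p99 latency stays < threshold' holds."""
    return p99(latencies_ms) < threshold_ms


# Illustrative samples: 1% slow requests passes, 2% slow requests fails.
during_experiment_ok = [100.0] * 990 + [900.0] * 10
during_experiment_bad = [100.0] * 980 + [900.0] * 20

print(steady_state_holds(during_experiment_ok))
print(steady_state_holds(during_experiment_bad))
```

In practice the same check would run before the fault (to confirm the baseline) and during it, and the experiment would abort as soon as the check fails.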

Capacity planning

Load test new endpoints, plot throughput vs latency.
Set scaling targets at 50-60% utilization at peak, leaving headroom for failover, hot spots, and deploys.
Forecast growth (N% MoM).
Annual capacity review per service.
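
The headroom and growth rules combine into a simple sizing calculation. This is a rough sketch assuming compound month-over-month growth and a per-instance throughput limit measured in a load test; all names and numbers are illustrative.

```python
import math


def forecast_peak(peak_rps: float, monthly_growth: float, months: int) -> float:
    """Project peak traffic forward assuming compound N% MoM growth."""
    return peak_rps * (1 + monthly_growth) ** months


def required_instances(peak_rps: float, per_instance_rps: float,
                       target_utilization: float = 0.5) -> int:
    """Instances needed so each one runs at target_utilization at peak."""
    usable_rps = per_instance_rps * target_utilization
    return math.ceil(peak_rps / usable_rps)


# Illustrative inputs: 3000 rps peak today, 500 rps per instance at saturation,
# 10% month-over-month growth, planning 12 months ahead.
today = required_instances(3000, 500, target_utilization=0.5)
next_year = required_instances(forecast_peak(3000, 0.10, 12), 500, target_utilization=0.5)

print(f"instances today: {today}, in 12 months: {next_year}")
```

The per-instance limit should come from the throughput-vs-latency plot (the point where latency starts to climb), not from the vendor's spec sheet.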

Common interview Qs

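What’s an error budget and how do you use it? The allowed unreliability, 1 − SLO over the window; spend it according to the error budget policy above.
Define an SLO for a payment API. Availability + latency over rolling 30d, e.g. 99.95% of requests succeed and 99% complete in < 500ms.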
You’re over budget mid-month — what do you do? Freeze risky changes; reliability-only sprint; investigate top contributors.
A postmortem found a junior engineer ran the bad command — what’s the right outcome? Focus on missing safeguards; system improvements (review, dry-run, RBAC, runbook), not punishment.
Difference between SLA and SLO? SLA contractual; SLO internal target (typically tighter).
How do you alert on SLO? Multi-window burn-rate alerts, not raw thresholds.
Toil reduction example. Manual restart → liveness probe + auto restart; manual scaling → HPA; runbook → GitHub Action.
SRE vs DevOps? SRE is one implementation of the DevOps philosophy, with an explicit reliability framework.
How to balance reliability and feature speed? Error budget. When budget is healthy, take risks. When burning, slow down.
What makes a good runbook? Symptoms, likely causes, diagnostic queries, mitigation steps, escalation path. Tested in game days.
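
For the alerting question, a multi-window burn-rate check can be sketched as follows. Burn rate is the observed error ratio divided by the budgeted error ratio (1 − SLO). The 14.4 threshold is a commonly cited starting point for a 30-day window (at that rate, one hour spends 2% of the budget); the window choices and function names here are assumptions.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How fast the budget is being spent; 1.0 means exactly on budget."""
    return error_ratio / (1 - slo)


def should_page(long_window_error_ratio: float,
                short_window_error_ratio: float,
                slo: float,
                threshold: float = 14.4) -> bool:
    """Page only if both windows burn above the threshold.

    The long window (e.g. 1h) shows the burn is significant; the short
    window (e.g. 5m) shows it is still happening, so the alert clears
    quickly after mitigation.
    """
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)


SLO = 0.999  # illustrative target

print(should_page(0.02, 0.03, SLO))     # sustained, ongoing burn
print(should_page(0.02, 0.0005, SLO))   # burn already stopped in the short window
print(should_page(0.001, 0.001, SLO))   # errors at roughly the budgeted rate
```

Compared with a raw threshold such as "error rate > 1%", this scales with the SLO and avoids paging for a spike that has already ended.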

Production incident response loop (3-phase)

Detect — alerts, customer reports.
Mitigate — restore service first; root cause later. Roll back, scale up, drain bad node, disable feature.
Resolve — fix root cause, postmortem, action items.

Communication roles: incident commander, scribe, comms (status page). Search for “incident command system” for background.

Anti-patterns

100% availability SLO.
Same SLO for tier 1 and internal tooling.
No documented escalation path.
Paging on CPU rather than user impact.
Root-cause = “person X did Y”.
Action items never closed.
Postmortems only on Sev1 → small incidents accumulate.
Treating SRE as a separate team that’s “operations” while devs throw services over the wall.