SRE / SLO — Basics
Site Reliability Engineering (Google)
“Class SRE implements interface DevOps.”
SRE is the engineering practice that takes operating production seriously: it applies software engineering to operations work. It was born at Google and popularized by the book Site Reliability Engineering.
Key concepts
- SLI (Service Level Indicator): a measurable signal about service health.
- SLO (Service Level Objective): target for an SLI over a window.
- SLA (Service Level Agreement): contractual commitment with consequences.
- Error budget: 1 - SLO. Capacity to fail before customers care.
- Toil: manual, repetitive, automatable, no enduring value. Should be < 50% of work.
- Runbook: step-by-step response for an alert/incident.
- Postmortem: blameless writeup after incident.
Picking SLIs
Categories:
- Request-driven: availability (% non-5xx), latency (% requests < threshold), correctness.
- Pipeline / batch: freshness, completeness, throughput.
- Storage: durability, availability, latency.
Make SLIs reflect user experience, not internal metrics. CPU isn’t an SLI; “checkout request succeeds < 500ms” is.
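As a rough illustration of request-driven SLIs, the sketch below computes an availability SLI and a "succeeds in under 500 ms" SLI from a list of request records. The `Request` type, the function names, and the sample data are hypothetical and chosen only for this example; a real system would usually derive these ratios from its metrics backend.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests: list[Request]) -> float:
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return 1.0  # assumption: no traffic counts as fully available
    good = sum(1 for r in requests if r.status < 500)
    return good / len(requests)


def fast_success_sli(requests: list[Request], threshold_ms: float = 500.0) -> float:
    """Fraction of requests that were non-5xx AND faster than threshold_ms."""
    if not requests:
        return 1.0
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms < threshold_ms)
    return good / len(requests)


# Hypothetical sample: one slow success, one fast 5xx, two fast non-5xx responses.
sample = [Request(200, 120.0), Request(200, 640.0), Request(503, 80.0), Request(404, 95.0)]
print(availability_sli(sample))   # 0.75
print(fast_success_sli(sample))   # 0.5
```

Both functions return a "good events / total events" ratio, which is the shape an SLO target can be compared against directly.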
SLO math
If SLO = 99.9% over 30 days:
- Allowed downtime: 30 d × 24 h × 60 min × 0.001 = 43.2 minutes.
- Allowed errors: 0.1% of requests.
Error budget = 0.1% of requests, or 43.2 minutes of full downtime. Spend it on risk-taking; it replenishes each window.
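The arithmetic above can be sketched in a few lines. The function names are made up for this example, and the request counts in the usage lines are illustrative, not measurements.

```python
def error_budget_fraction(slo: float) -> float:
    """Error budget as a fraction of events: 1 - SLO."""
    return 1.0 - slo


def allowed_downtime_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window."""
    return window_days * 24 * 60 * error_budget_fraction(slo)


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * error_budget_fraction(slo)
    if allowed_failures == 0:
        return 0.0  # a 100% SLO leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


print(round(allowed_downtime_minutes(0.999), 1))          # 43.2
print(round(budget_remaining(0.999, 1_000_000, 250), 2))  # 0.75
```

With 1,000,000 requests in the window, a 99.9% SLO allows about 1,000 failures, so 250 failures leave roughly three quarters of the budget.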
When budget is nearly burned
- Freeze risky changes.
- Focus on reliability backlog.
- Improve test coverage and lengthen canary periods.
- Investigate biggest contributors.
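One way to make "freeze risky changes" mechanical is a deploy gate that checks the remaining budget. This is a minimal sketch: the function name and the 10% freeze threshold are assumptions made for illustration, not values taken from the text, and each team would pick its own policy.

```python
def allow_risky_deploy(budget_remaining: float, freeze_below: float = 0.10) -> bool:
    """Return True when enough error budget is left to accept deploy risk.

    budget_remaining: fraction of the window's error budget still unspent (0.0 to 1.0).
    freeze_below: hypothetical policy threshold; at or below it, risky changes freeze.
    """
    return budget_remaining > freeze_below


print(allow_risky_deploy(0.75))  # True
print(allow_risky_deploy(0.05))  # False
```

A gate like this would typically sit in the release pipeline, with reliability fixes exempt from the freeze.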
Burn-rate alerts (multi-window)
Don’t alert on raw error rate. Alert on how fast budget is being consumed.
| Burn rate | Time to exhaust 30-day budget | Alert |
|---|---|---|
| 14.4× | ~2 days (50 hours) | page (within 5min) |
| 6× | 5 days | page (within 30min) |
| 1× | 30 days | ticket |
Use two windows (e.g. 5min + 1h) and require both to exceed the burn-rate threshold before alerting; this reduces flapping.
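A burn rate is the observed error ratio divided by the error budget (1 - SLO), so a rate of 1× spends the budget exactly over the SLO window. The sketch below shows that calculation and a two-window page condition. The function names are hypothetical, and the error ratios are assumed to come from your metrics system for the short and long windows.

```python
def burn_rate(error_ratio: float, slo: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)


def hours_to_exhaust(rate: float, window_days: int = 30) -> float:
    """Hours until the whole window's budget is gone at a constant burn rate."""
    return window_days * 24 / rate


def should_page(short_window_ratio: float, long_window_ratio: float,
                slo: float, threshold: float = 14.4) -> bool:
    """Fire only when BOTH windows burn faster than the threshold."""
    return (burn_rate(short_window_ratio, slo) >= threshold
            and burn_rate(long_window_ratio, slo) >= threshold)


print(round(hours_to_exhaust(14.4), 1))   # 50.0
print(should_page(0.02, 0.016, 0.999))    # True: both windows burn at 16x or more
print(should_page(0.02, 0.005, 0.999))    # False: the long window burns at about 5x
```

Requiring the long window keeps a brief spike from paging, and requiring the short window lets the alert clear soon after the problem stops.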
Toil
- Manual, repetitive, no engineering output, scales linearly with service.
- Track time spent. Target < 50% of any SRE’s time.
- Automate via runbooks → scripts → controllers → operators.
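Tracking time against the 50% target can be as simple as summing logged hours by category. The sketch below assumes a hypothetical weekly time log keyed by category name; the categories and hours are invented for the example.

```python
def toil_fraction(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Share of logged hours that fall into categories classified as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(h for cat, h in hours_by_category.items() if cat in toil_categories)
    return toil / total


# Hypothetical week for one engineer.
week = {"tickets": 10, "manual deploys": 6, "project work": 20, "on-call": 4}
fraction = toil_fraction(week, {"tickets", "manual deploys"})
print(fraction)        # 0.4
print(fraction < 0.5)  # True: under the 50% target
```

Which categories count as toil is a team decision; the point is to measure the share consistently so that trends are visible.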
Reliability practices
- Postmortems: blameless. Capture the timeline, contributing factors (plural, rather than a single “root cause”), and action items.
- Incident review meeting: share lessons.
- Game days / chaos engineering: practice failures.
- DiRT (Disaster Recovery Testing): a Google practice.
- Capacity planning: load tests + headroom.
- Auto-scaling + auto-healing: treat instances as cattle, not pets.
- Rollback path: every deploy must be reversible.
Severity levels
| Sev | Customer impact | Response |
|---|---|---|
| 1 | major outage | page 24/7, all-hands |
| 2 | degraded | page during business hours |
| 3 | minor | ticket, fix in days |
| 4 | cosmetic | backlog |
Alert fatigue
If on-call is paged > N times/week without action, alerts are noise.
- Tune thresholds.
- Auto-remediate where possible.
- Move noisy ones to ticket-only.
- Demand actionable runbook for every page-level alert.
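To find candidates for tuning or demotion to ticket-only, one option is to count pages that needed no action, per alert. The sketch below assumes a hypothetical weekly page log of `(alert_name, action_taken)` pairs. The default `max_unactioned=3` is only a stand-in for the unspecified N above.

```python
from collections import Counter


def noisy_alerts(pages: list[tuple[str, bool]], max_unactioned: int = 3) -> list[str]:
    """Alerts that paged more than max_unactioned times without any action taken.

    pages: one (alert_name, action_taken) entry per page in the review period.
    """
    unactioned = Counter(name for name, acted in pages if not acted)
    return sorted(name for name, count in unactioned.items() if count > max_unactioned)


# Hypothetical week of pages.
log = (
    [("disk-usage-warn", False)] * 5
    + [("checkout-5xx-burn", True)] * 2
    + [("cert-expiry", False)] * 1
)
print(noisy_alerts(log))  # ['disk-usage-warn']
```

Reviewing this list weekly gives a concrete agenda: tune the threshold, automate the fix, or move the alert off the pager.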