SRE / SLO — Basics

Site Reliability Engineering (Google)

“class SRE implements DevOps.”

SRE is an engineering practice that treats running production as a software problem: it applies software engineering to operations. It originated at Google and was popularized by the book Site Reliability Engineering.

Key concepts

SLI (Service Level Indicator): a measurable signal about service health.
SLO (Service Level Objective): target for an SLI over a window.
SLA (Service Level Agreement): contractual commitment with consequences.
Error budget: 1 - SLO. The amount of unreliability the service can absorb within the window before the objective is missed.
Toil: manual, repetitive, automatable, no enduring value. Should be < 50% of work.
Runbook: step-by-step response for an alert/incident.
Postmortem: blameless writeup after incident.

Picking SLIs

Categories:

Request-driven: availability (% non-5xx), latency (% requests < threshold), correctness.
Pipeline / batch: freshness, completeness, throughput.
Storage: durability, availability, latency.

Make SLIs reflect user experience, not internal metrics. CPU isn’t an SLI; “checkout request succeeds < 500ms” is.
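As a minimal sketch of the request-driven SLIs above, the following computes availability, latency, and the combined "succeeds < 500ms" indicator from a list of request records. The `Request` type, its field names, and the sample data are assumptions made for illustration; a real system would read these from logs or metrics.

```python
from dataclasses import dataclass

@dataclass
class Request:
    status: int        # HTTP status code
    latency_ms: float  # time taken to serve the request

def _ratio(good, total):
    # With no traffic the SLI is undefined, so return None rather than 100%.
    return good / total if total else None

def availability_sli(requests):
    """Fraction of requests that did not fail with a 5xx status."""
    return _ratio(sum(1 for r in requests if r.status < 500), len(requests))

def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests served faster than the threshold."""
    return _ratio(sum(1 for r in requests if r.latency_ms < threshold_ms), len(requests))

def success_latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests that both succeeded and were fast."""
    good = sum(1 for r in requests if r.status < 500 and r.latency_ms < threshold_ms)
    return _ratio(good, len(requests))

sample = [Request(200, 120.0), Request(200, 480.0), Request(503, 30.0), Request(200, 900.0)]
print(availability_sli(sample))     # 0.75
print(latency_sli(sample))          # 0.75
print(success_latency_sli(sample))  # 0.5
```

Note that the failed request counts as "fast" in the pure latency SLI, which is one reason to prefer a combined indicator for user-facing flows.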

SLO math

If SLO = 99.9% over 30 days:

Allowed downtime: 30 d × 24 h × 60 min × 0.001 = 43.2 minutes.
Allowed errors: 0.1% of requests.

Error budget = 0.1% of requests, or 43.2 min of full downtime. Spend it on risk-taking (launches, migrations, experiments); it replenishes as the window moves forward.
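The arithmetic above can be sketched as a small helper. The function name `error_budget` and its parameters are illustrative, and the SLO is assumed to be given as a fraction such as 0.999.

```python
def error_budget(slo, window_days=30, total_requests=None):
    """Return the error budget implied by an SLO given as a fraction (e.g. 0.999)."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be a fraction between 0 and 1")
    budget = 1.0 - slo                      # e.g. 0.001 for 99.9%
    window_minutes = window_days * 24 * 60  # 43,200 minutes in 30 days
    result = {
        "budget_fraction": budget,
        "downtime_minutes": window_minutes * budget,
    }
    if total_requests is not None:
        result["allowed_errors"] = total_requests * budget
    return result

b = error_budget(0.999, window_days=30, total_requests=1_000_000)
print(round(b["downtime_minutes"], 1))  # 43.2
print(round(b["allowed_errors"]))       # 1000
```

The results are rounded for display because 1 - 0.999 is not exactly 0.001 in floating point.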

When budget is nearly burned

Freeze risky changes.
Focus on reliability backlog.
Improve test coverage / canary length.
Investigate biggest contributors.
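One way to make "nearly burned" concrete is to compute the fraction of budget left and gate releases on it. This is a sketch only: the 10% freeze threshold and the function names are assumptions, not a standard policy.

```python
def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent)."""
    allowed = (1.0 - slo) * total_requests
    if allowed == 0:
        return 0.0
    return 1.0 - failed_requests / allowed

def release_policy(remaining, freeze_below=0.10):
    # freeze_below is a hypothetical threshold; pick one that fits the team.
    if remaining < freeze_below:
        return "freeze risky changes"
    return "ship normally"

remaining = budget_remaining(0.999, total_requests=1_000_000, failed_requests=950)
print(round(remaining, 3))        # 0.05
print(release_policy(remaining))  # freeze risky changes
```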

Burn-rate alerts (multi-window)

Don’t alert on raw error rate. Alert on how fast budget is being consumed.

Burn rate	Time to exhaust 30-day budget	Alert
14.4×	~2 days (50 hours)	page (within 5min)
6×	5 days	page (within 30min)
1×	30 days	ticket

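Use 2 windows (e.g. a 1h long window + a 5min short window). Fire only when both exceed the burn-rate threshold: the long window shows the burn is significant, the short window shows it is still happening, and together they reduce flapping.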
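The two-window rule can be sketched as follows. Burn rate is the observed error ratio divided by the budget fraction (1 - SLO). The function names and the example error ratios are assumptions, and a real setup would evaluate this in the monitoring system rather than in application code.

```python
def burn_rate(error_ratio, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo)

def should_page(long_window_error_ratio, short_window_error_ratio, slo, threshold=14.4):
    """Page only if BOTH the long and the short window exceed the threshold."""
    return (burn_rate(long_window_error_ratio, slo) >= threshold
            and burn_rate(short_window_error_ratio, slo) >= threshold)

def budget_consumed(rate, hours, window_days=30):
    """Fraction of the window's budget spent by burning at `rate` for `hours`."""
    return rate * hours / (window_days * 24)

# Ongoing incident: 2% errors over the last hour and the last 5 minutes.
print(should_page(0.02, 0.02, slo=0.999))    # True
# Incident already over: the last 5 minutes are back to 0.05% errors.
print(should_page(0.02, 0.0005, slo=0.999))  # False
# Burning at 14.4x for one hour spends about 2% of a 30-day budget.
print(round(budget_consumed(14.4, hours=1), 4))  # 0.02
```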

Toil

Manual, repetitive, no engineering output, scales linearly with service.
Track time spent. Target < 50% of any SRE’s time.
Automate via runbooks → scripts → controllers → operators.
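Tracking toil can start as simply as tagging time entries. The sketch below assumes a hypothetical log of `(task, hours, is_toil)` tuples and checks the result against the 50% target.

```python
def toil_fraction(entries):
    """entries: iterable of (task, hours, is_toil). Returns toil share of total hours."""
    entries = list(entries)
    total = sum(hours for _, hours, _ in entries)
    toil = sum(hours for _, hours, is_toil in entries if is_toil)
    return toil / total if total else 0.0

# Illustrative week for one engineer (made-up data).
week = [
    ("manual cert rotation", 6, True),
    ("restart stuck jobs", 10, True),
    ("build autoscaler", 16, False),
    ("design review", 8, False),
]

fraction = toil_fraction(week)
print(fraction)        # 0.4
print(fraction < 0.5)  # True: within the 50% target
```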

Reliability practices

Postmortems: blameless. Capture the timeline, contributing factors (usually several, rather than a single “root cause”), and action items.
Incident review meeting: share lessons across teams.
Game days / chaos engineering: practice failures before they happen for real.
DiRT (Disaster Recovery Testing): Google's disaster recovery exercise.
Capacity planning: load tests + headroom.
Auto-scaling + auto-healing: treat instances as cattle (replaceable), not pets (hand-maintained).
Rollback path: every deploy must be reversible.

Severity levels

Sev	Customer impact	Response
1	major outage	page 24/7, all-hands
2	degraded	page during business hours
3	minor	ticket, fix in days
4	cosmetic	backlog

Alert fatigue

If on-call is paged > N times/week without action, alerts are noise.

Tune thresholds.
Auto-remediate where possible.
Move noisy ones to ticket-only.
Demand actionable runbook for every page-level alert.
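The "paged > N times/week without action" check can be sketched from a page log. The log format (alert name plus a flag for whether the responder took any action), the alert names, and the value of N are all assumptions here.

```python
from collections import Counter

def noisy_alerts(pages, max_unactioned_per_week):
    """pages: iterable of (alert_name, action_taken) for one week.

    Returns the alerts that paged without action more than the allowed count.
    """
    unactioned = Counter(name for name, action_taken in pages if not action_taken)
    return sorted(name for name, count in unactioned.items()
                  if count > max_unactioned_per_week)

# Illustrative week of pages (made-up data).
pages = (
    [("disk_80pct", False)] * 5
    + [("checkout_burn_rate", True)] * 2
    + [("cert_expiry", False)]
)

print(noisy_alerts(pages, max_unactioned_per_week=3))  # ['disk_80pct']
```

Alerts returned by a check like this are candidates for retuning, auto-remediation, or demotion to ticket-only.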