
SRE / SLO — Basics

“Class SRE implements interface DevOps.”

SRE is the engineering practice that takes operating production seriously: it applies software engineering to operations. It originated at Google and was popularized by the book Site Reliability Engineering.

  • SLI (Service Level Indicator): a measurable signal about service health.
  • SLO (Service Level Objective): target for an SLI over a window.
  • SLA (Service Level Agreement): contractual commitment with consequences.
  • Error budget: 1 - SLO. Capacity to fail before customers care.
  • Toil: manual, repetitive, automatable, no enduring value. Should be < 50% of work.
  • Runbook: step-by-step response for an alert/incident.
  • Postmortem: blameless writeup after incident.

Categories:

  • Request-driven: availability (% non-5xx), latency (% requests < threshold), correctness.
  • Pipeline / batch: freshness, completeness, throughput.
  • Storage: durability, availability, latency.

Make SLIs reflect user experience, not internal metrics. CPU isn’t an SLI; “checkout request succeeds < 500ms” is.
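As a rough illustration of the request-driven SLIs above, the sketch below computes availability, latency, and a combined "good event" ratio from a list of request records. The `Request` type, the function names, and the sample data are assumptions made for this example; the 500 ms threshold comes from the checkout example.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code returned to the user
    latency_ms: float  # time taken to serve the request


def availability_sli(requests):
    """Fraction of requests that did not return a 5xx status."""
    if not requests:
        return None  # no traffic: the SLI is undefined, not 100%
    return sum(1 for r in requests if r.status < 500) / len(requests)


def latency_sli(requests, threshold_ms=500.0):
    """Fraction of requests served faster than threshold_ms."""
    if not requests:
        return None
    return sum(1 for r in requests if r.latency_ms < threshold_ms) / len(requests)


def good_event_sli(requests, threshold_ms=500.0):
    """Fraction of requests that both succeeded and were fast enough."""
    if not requests:
        return None
    good = sum(
        1 for r in requests
        if r.status < 500 and r.latency_ms < threshold_ms
    )
    return good / len(requests)


# Hypothetical sample: two good requests, one slow success, one fast failure.
sample = [
    Request(200, 120.0),
    Request(200, 480.0),
    Request(200, 910.0),
    Request(503, 30.0),
]
print(availability_sli(sample))  # 0.75
print(latency_sli(sample))       # 0.75
print(good_event_sli(sample))    # 0.5
```

The combined ratio is lower than either individual SLI here because a fast failure and a slow success are both bad experiences for the user.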

If SLO = 99.9% over 30 days:

  • Allowed downtime: 30d × 0.001 = 43.2 minutes.
  • Allowed errors: 0.1% of requests.

Error budget = 0.1% of requests or 43.2 min of downtime. Spend it on risk-taking (releases, experiments, migrations); it replenishes each window.
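The arithmetic above can be sketched in a few lines. The function names and the request counts in the usage example are assumptions chosen for illustration.

```python
from datetime import timedelta


def error_budget(slo):
    """Fraction of events allowed to fail, e.g. about 0.001 for a 99.9% SLO."""
    return 1.0 - slo


def allowed_downtime(slo, window):
    """Total downtime the SLO tolerates over the given window."""
    return window * error_budget(slo)


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the request-based budget still unspent (negative if overspent)."""
    allowed_failures = total_requests * error_budget(slo)
    return 1.0 - failed_requests / allowed_failures


downtime = allowed_downtime(0.999, timedelta(days=30))
print(f"{downtime.total_seconds() / 60:.1f} min")  # 43.2 min

# Hypothetical month: 1,000,000 requests, 250 of them failed.
print(f"{budget_remaining(0.999, 1_000_000, 250):.0%}")  # 75%
```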

When the error budget is exhausted:

  • Freeze risky changes.
  • Focus on reliability backlog.
  • Improve test coverage / canary length.
  • Investigate biggest contributors.

Don’t alert on raw error rate. Alert on how fast budget is being consumed.

| Burn rate | Time to exhaust 30-day budget | Alert |
|---|---|---|
| 14.4× | 50 hours (~2 days) | page (within 5min) |
| 6× | 5 days | page (within 30min) |
| 1× | 30 days | ticket |

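Use 2 windows (e.g. a short 5min window and a long 1h window) and alert only when both exceed the burn-rate threshold. The long window confirms the burn is significant; the short window confirms it is still happening, which reduces flapping.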
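A minimal sketch of burn-rate alerting, assuming burn rate is defined as the observed error rate divided by the error budget (1 - SLO), so that a burn rate of 1 spends exactly one budget per window. The function names, the default threshold, and the example error rates are assumptions for illustration.

```python
def burn_rate(error_rate, slo):
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_rate / (1.0 - slo)


def hours_to_exhaust(rate, window_days=30):
    """Hours until a full budget is gone if the given burn rate is sustained."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page_multiwindow(short_error_rate, long_error_rate, slo, threshold=14.4):
    """Page only if BOTH the short and the long window exceed the threshold."""
    return (
        burn_rate(short_error_rate, slo) >= threshold
        and burn_rate(long_error_rate, slo) >= threshold
    )


print(hours_to_exhaust(1.0))  # 720.0 (the full 30 days)

# 99.9% SLO: 2% errors over the last 5min and 1.6% over the last 1h.
print(should_page_multiwindow(0.02, 0.016, 0.999))  # True

# Short spike only: the 1h window is still far below the threshold.
print(should_page_multiwindow(0.02, 0.005, 0.999))  # False
```

Requiring both windows means a brief spike that has already ended, or a long-window average inflated by an incident that has since recovered, does not page anyone.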

  • Toil is manual, repetitive work with no engineering output that scales linearly with the service.
  • Track time spent on toil. Target < 50% of any SRE’s time.
  • Automate via runbooks → scripts → controllers → operators.

  • Postmortems: blameless. Capture the timeline, contributing factors (usually several, rather than a single “root cause”), and action items.
  • Incident review meeting: share lessons.
  • Game days / chaos engineering: practice failures.
  • DiRT (Disaster Recovery Testing): a Google practice.
  • Capacity planning: load tests + headroom.
  • Auto-scaling + auto-healing: treat instances as cattle, not pets.
  • Rollback path: every deploy must be reversible.

| Sev | Customer impact | Response |
|---|---|---|
| 1 | major outage | page 24/7, all-hands |
| 2 | degraded | page during business hours |
| 3 | minor | ticket, fix in days |
| 4 | cosmetic | backlog |
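One way to encode the severity levels above so that alert routing stays consistent with them is sketched below. The enum and function names are hypothetical, and the definition of business hours (Monday to Friday, 09:00 to 17:00 local time) is an assumption that is not stated in the original.

```python
from datetime import datetime
from enum import IntEnum


class Sev(IntEnum):
    MAJOR_OUTAGE = 1
    DEGRADED = 2
    MINOR = 3
    COSMETIC = 4


# Response text per severity level.
RESPONSE = {
    Sev.MAJOR_OUTAGE: "page 24/7, all-hands",
    Sev.DEGRADED: "page during business hours",
    Sev.MINOR: "ticket, fix in days",
    Sev.COSMETIC: "backlog",
}


def is_business_hours(now):
    # Assumption: Monday-Friday, 09:00-17:00 local time.
    return now.weekday() < 5 and 9 <= now.hour < 17


def pages_now(sev, now):
    """Whether an incident of this severity should page someone right now."""
    if sev == Sev.MAJOR_OUTAGE:
        return True
    if sev == Sev.DEGRADED:
        return is_business_hours(now)
    return False  # Sev 3 and Sev 4 never page


saturday_3am = datetime(2024, 1, 6, 3, 0)
print(pages_now(Sev.MAJOR_OUTAGE, saturday_3am))  # True
print(pages_now(Sev.DEGRADED, saturday_3am))      # False
print(RESPONSE[Sev.MINOR])                        # ticket, fix in days
```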

If on-call is paged > N times/week without action, alerts are noise.

  • Tune thresholds.
  • Auto-remediate where possible.
  • Move noisy ones to ticket-only.
  • Require an actionable runbook for every page-level alert.
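The noise check above can be sketched as a small report over one week of paging history. The input format (pairs of alert name and whether anyone acted on the page), the function name, and the value chosen for N are assumptions for this example; the original leaves N unspecified.

```python
from collections import Counter


def noisy_alerts(pages, max_unactioned_per_week):
    """Return alert names that paged without action more than N times.

    pages: iterable of (alert_name, action_taken) pairs for one week.
    """
    unactioned = Counter(name for name, action_taken in pages if not action_taken)
    return sorted(
        name for name, count in unactioned.items()
        if count > max_unactioned_per_week
    )


# Hypothetical week of pages.
week = [
    ("disk-usage-high", False),
    ("disk-usage-high", False),
    ("disk-usage-high", False),
    ("checkout-burn-rate", True),
    ("checkout-burn-rate", False),
    ("cert-expiry", True),
]
print(noisy_alerts(week, max_unactioned_per_week=2))  # ['disk-usage-high']
```

Alerts returned by a report like this are candidates for threshold tuning, auto-remediation, or demotion to ticket-only.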