SRE / SLO — Practical patterns
Defining an SLO (PromQL)
Availability SLI

    # good = 2xx/3xx (and 4xx that aren't your fault)
    sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
      /
    sum(rate(http_requests_total[5m]))

Latency SLI

    # fraction of requests under 500ms
    sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
      /
    sum(rate(http_request_duration_seconds_count[5m]))

Multi-window burn-rate alert

    groups:
      - name: slo-burn
        rules:
          # Fast burn: 14.4x over 1h, confirmed by a 5m window.
          # good = 2xx/3xx, matching the availability SLI.
          - alert: BurnRate2hFast
            expr: |
              (
                1 - (
                  sum(rate(http_requests_total{status=~"2..|3.."}[1h]))
                  /
                  sum(rate(http_requests_total[1h]))
                )
              ) > 14.4 * (1 - 0.999)
              and
              (
                1 - (
                  sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
                  /
                  sum(rate(http_requests_total[5m]))
                )
              ) > 14.4 * (1 - 0.999)
            for: 2m
            labels: { severity: page }
            annotations:
              summary: "API SLO: burning budget at 14.4× (2% of a 30-day budget per hour)"
              runbook: https://wiki/runbooks/api-slo-burn

          # Slow burn: 6x over 6h. Single window here; the SRE Workbook
          # pairs the 6h window with a 30m short window.
          - alert: BurnRate6hSlow
            expr: |
              (
                1 - (
                  sum(rate(http_requests_total{status=~"2..|3.."}[6h]))
                  /
                  sum(rate(http_requests_total[6h]))
                )
              ) > 6 * (1 - 0.999)
            for: 15m
            labels: { severity: page }

Requiring both a long and a short window to exceed the threshold cuts false alarms: the alert fires only while the budget is still burning and clears soon after the errors stop. Source: Google SRE Workbook.
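
The thresholds in these rules follow from simple arithmetic. The sketch below assumes a 30-day SLO window (the rules themselves do not state one) and a 99.9% target; the function names are illustrative, not part of any library.

```python
# Burn-rate arithmetic for a 99.9% SLO.
# Assumption: the SLO is measured over a 30-day window.
SLO_TARGET = 0.999
SLO_WINDOW_HOURS = 30 * 24


def error_rate_threshold(burn_rate: float, slo_target: float = SLO_TARGET) -> float:
    """Error ratio above which the alert expression becomes true."""
    return burn_rate * (1 - slo_target)


def budget_consumed(burn_rate: float, alert_window_hours: float,
                    slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Fraction of the error budget spent if the burn rate holds for the alert window."""
    return burn_rate * alert_window_hours / slo_window_hours


def hours_to_exhaustion(burn_rate: float,
                        slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return slo_window_hours / burn_rate


if __name__ == "__main__":
    for name, rate, window_hours in [("fast", 14.4, 1), ("slow", 6.0, 6)]:
        print(
            name,
            f"threshold={error_rate_threshold(rate):.4f}",
            f"budget_spent={budget_consumed(rate, window_hours):.0%}",
            f"exhausted_in={hours_to_exhaustion(rate):.0f}h",
        )
```

Under these assumptions, a 14.4× burn sustained for 1 hour spends 2% of the budget and would exhaust it in about 50 hours; a 6× burn sustained for 6 hours spends 5% and would exhaust it in 120 hours.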
SLO recording rules (faster queries)

    groups:
      - name: slo
        interval: 30s
        rules:
          # Counts every non-5xx response as good (4xx included).
          - record: slo:request_availability:ratio_rate5m
            expr: |
              sum(rate(http_requests_total{status!~"5.."}[5m]))
              /
              sum(rate(http_requests_total[5m]))

Sloth (SLO generator)

    version: prometheus/v1
    service: api
    slos:
      - name: requests-availability
        objective: 99.9
        description: API availability
        sli:
          events:
            error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
            total_query: sum(rate(http_requests_total[{{.window}}]))
        alerting:
          page_alert: { labels: { severity: page } }
          ticket_alert: { labels: { severity: ticket } }

Generate the Prometheus rules with `sloth generate -i slo.yaml -o prometheus-rules.yaml`.
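
To make `objective: 99.9` concrete, it helps to translate it into an error budget. The following is a minimal sketch that assumes a 30-day window and treats the budget as minutes of full outage; `error_budget_minutes` is a hypothetical helper, not part of Sloth.

```python
def error_budget_minutes(objective_percent: float, window_days: int = 30) -> float:
    """Minutes of full outage allowed by an availability objective.

    Assumption: the SLO window is `window_days` long (30 days by default).
    """
    if not 0 < objective_percent < 100:
        raise ValueError("objective_percent must be between 0 and 100 (exclusive)")
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - objective_percent) / 100


if __name__ == "__main__":
    for objective in (99.0, 99.9, 99.99):
        print(f"{objective}% -> {error_budget_minutes(objective):.1f} min per 30 days")
```

With these assumptions, 99.9% leaves roughly 43 minutes of full outage per 30 days, which is why a single bad deploy can consume most of a month's budget.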
Runbook template

    # Alert: API HighErrorRate

    ## Severity
    P1 / page

    ## Summary
    API 5xx rate > 2% for 10m.

    ## Diagnostic
    1. Check Grafana "API – Errors by Route".
    2. `kubectl logs -l app=api --tail=500 | grep ERROR`.
    3. Check recent deploys: `kubectl rollout history deploy/api`.
    4. Check downstream: DB CPU, Redis latency.

    ## Mitigations
    - Recent deploy bug → `kubectl rollout undo deploy/api`.
    - Downstream slow → enable read-replica fallback; disable risky feature flag.
    - Saturation → `kubectl scale deploy/api --replicas=20`.

    ## Communication
    - Update https://status.example.com.
    - Post in #incident-X.

    ## Escalation
    - DBA on-call if DB.
    - Platform on-call if K8s.

Postmortem template

    # Postmortem: API outage 2026-04-12

    ## Status
    Resolved.

    ## Impact
    - 23 minutes of partial outage (2026-04-12 14:03–14:26 UTC)
    - 8% of /checkout requests returned 503
    - Estimated affected: ~12k users
    - Revenue impact: $45k

    ## Detection
    SLO burn-rate alert at 14:05.

    ## Timeline (UTC)
    - 14:03 — deploy of api@abc123 begins
    - 14:05 — alert fires; on-call paged
    - 14:11 — incident commander begins triage
    - 14:18 — rollback initiated
    - 14:26 — error rate normal

    ## Contributing factors
    1. New code path didn't handle empty product list.
    2. Canary stage too short to catch it (only 5min).
    3. No unit test for empty-cart edge case.

    ## What went well
    - Burn-rate alert fired within 2min of the deploy.
    - Rollback restored normal error rates 8min after it was initiated.

    ## What went poorly
    - Canary didn't catch the regression.
    - 6min from alert to start of triage; the runbook was unclear.

    ## Action items
    - [P0, owner: jdoe, due: 2026-04-19] Add empty-cart unit tests.
    - [P1, owner: smith, due: 2026-04-26] Extend canary to 30min minimum.
    - [P1, owner: jdoe, due: 2026-04-26] Update runbook with one-line rollback command.

On-call rotation tips
- Tools: PagerDuty, OpsGenie, Grafana OnCall (free), incident.io.
- Round-robin or follow-the-sun.
- Pre-recorded handoff at shift change (notable items, ongoing).
- Page only when human action is needed within minutes.
- “If it can be auto-remediated, automate it. If it can be fixed at business hours, ticket it.”
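
The paging rule in the last two bullets can be written down as a small decision function. This is an illustrative sketch only: `Route` and `route_alert` are hypothetical names, and real routing would live in the alerting tool's configuration rather than in application code.

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"  # handled by auto-remediation, nobody is notified
    PAGE = "page"          # wake a human now
    TICKET = "ticket"      # handle during business hours


def route_alert(auto_remediable: bool, needs_action_within_minutes: bool) -> Route:
    """Decide where an alert goes, following the on-call tips above."""
    if auto_remediable:
        return Route.AUTOMATE
    if needs_action_within_minutes:
        return Route.PAGE
    return Route.TICKET


if __name__ == "__main__":
    print(route_alert(auto_remediable=False, needs_action_within_minutes=True))
    print(route_alert(auto_remediable=False, needs_action_within_minutes=False))
```

Checking auto-remediation first is a deliberate choice in this sketch: an alert that a machine can fix should not page, however urgent it is.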
Game day exercise
Pick a service. Schedule 60min. Inject a failure (kill a pod, throttle the DB, fake a slow downstream). Observe:
- Did alerts fire?
- Was the right team paged?
- Was the runbook usable?
- What was the recovery time?
- Any cascading failures?
Document findings, fix gaps.
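
One way to document findings consistently is to record each exercise in a small structure that mirrors the questions above. The sketch below is a hypothetical format, not a standard; every field name is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayResult:
    """Observations from one game day exercise (hypothetical record format)."""
    service: str
    injected_failure: str
    alerts_fired: bool
    right_team_paged: bool
    runbook_usable: bool
    recovery_minutes: float
    cascading_failures: list[str] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List the findings that need a follow-up fix."""
        found = []
        if not self.alerts_fired:
            found.append("alerts did not fire")
        if not self.right_team_paged:
            found.append("wrong team was paged")
        if not self.runbook_usable:
            found.append("runbook was not usable")
        found.extend(f"cascading failure: {name}" for name in self.cascading_failures)
        return found


if __name__ == "__main__":
    result = GameDayResult(
        service="api",
        injected_failure="kill a pod",
        alerts_fired=True,
        right_team_paged=True,
        runbook_usable=False,
        recovery_minutes=12,
        cascading_failures=["worker queue backlog"],
    )
    for gap in result.gaps():
        print("-", gap)
```

Each entry returned by `gaps()` maps to one action item, which keeps the "fix gaps" step from being skipped.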
Common SLI ideas
| Type | SLI |
|---|---|
| User-facing API | success rate, p99 latency |
| Pipeline | freshness (now − last record), throughput |
| Storage | durability events, availability of read |
| Queue worker | processing latency, retry rate |
| DB | query latency, replica lag, error rate |
| Search | query latency, results relevance regression |
| Auth | login success rate, token validation latency |
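
As an example of turning one row into a number, the pipeline freshness SLI (now − last record) can be expressed as the fraction of checks where the data was fresh enough. This is a minimal sketch; the 300-second threshold and the function names are assumptions for illustration.

```python
from datetime import datetime, timedelta, timezone


def freshness_seconds(last_record_at: datetime, now: datetime) -> float:
    """Age of the newest record: now minus the last record's timestamp."""
    return (now - last_record_at).total_seconds()


def freshness_sli(samples: list[float], threshold_seconds: float) -> float:
    """Fraction of freshness samples at or under the threshold."""
    if not samples:
        raise ValueError("need at least one freshness sample")
    good = sum(1 for age in samples if age <= threshold_seconds)
    return good / len(samples)


if __name__ == "__main__":
    now = datetime(2026, 4, 12, 14, 0, tzinfo=timezone.utc)
    # Hypothetical ages of the newest record at four successive checks.
    ages = [
        freshness_seconds(now - timedelta(seconds=s), now)
        for s in (30, 90, 400, 45)
    ]
    print(freshness_sli(ages, threshold_seconds=300))
```

The same good-over-total shape applies to the other rows: pick a threshold that defines a good event, then report the ratio.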

Tools
- Sloth, Pyrra, Slothfile — SLO config → Prometheus rules.
- OpenSLO — vendor-neutral SLO spec.
- Nobl9 — SLO management SaaS.
- PagerDuty / OpsGenie / Grafana OnCall — paging.
- incident.io / FireHydrant — incident management.
- Confluence / Notion — runbooks.