SRE / SLO — Practical

# Availability SLI: good = 2xx/3xx (widen the matcher if 4xx that aren't your fault should also count as good)
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
/
sum(rate(http_requests_total[5m]))

# Latency SLI: fraction of requests served in 500ms or less
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
/
sum(rate(http_request_duration_seconds_count[5m]))
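
To make the two ratios above concrete, here is a minimal Python sketch of the same arithmetic on raw counts. The function names and the sample numbers are illustrative assumptions, not part of any Prometheus client API.

```python
def availability_sli(good: float, total: float) -> float:
    """Fraction of good requests; an idle window counts as fully available."""
    if total == 0:
        return 1.0
    return good / total


def latency_sli(bucket_counts: dict, threshold: float, total: float) -> float:
    """Fraction of requests at or under `threshold` seconds.

    `bucket_counts` maps a histogram upper bound (the `le` label) to its
    cumulative count, mirroring how Prometheus histogram buckets work.
    """
    if total == 0:
        return 1.0
    if threshold not in bucket_counts:
        raise KeyError(f"no bucket with le={threshold}")
    return bucket_counts[threshold] / total


# Hypothetical counts for one 5-minute window.
print(availability_sli(good=9_990, total=10_000))
buckets = {0.1: 7_000, 0.5: 9_500, 1.0: 9_900}
print(latency_sli(buckets, threshold=0.5, total=10_000))
```

The latency ratio is only exact when the threshold matches a bucket boundary, which is why the query above uses `le="0.5"`.
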
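groups:
  - name: slo-burn
    rules:
      # Fast burn: 14.4x spends 2% of a 30-day budget in 1h (full budget in ~50h).
      # Good events use the same matcher as the availability SLI.
      - alert: BurnRate2hFast
        expr: |
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )) > 14.4 * (1 - 0.999)
          and
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          )) > 14.4 * (1 - 0.999)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "API SLO: burning budget at 14.4× (2% of the 30-day budget per hour)"
          runbook: https://wiki/runbooks/api-slo-burn
      # Slow burn: 6x spends 5% of a 30-day budget in 6h; the 30m window confirms it is ongoing.
      - alert: BurnRate6hSlow
        expr: |
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          )) > 6 * (1 - 0.999)
          and
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          )) > 6 * (1 - 0.999)
        for: 15m
        labels: { severity: page }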

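Multi-window alerting cuts false alarms: the long window shows that a meaningful share of the budget has burned, and the short window confirms the burn is still happening. Source: Google SRE Workbook.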
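
The factors 14.4 and 6 in the rules above are not arbitrary. A minimal sketch of where they come from, assuming a 30-day SLO period (the source does not state the period) and the budget fractions used in the SRE Workbook (2% in 1h, 5% in 6h); the function names are illustrative:

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: 30-day SLO period


def burn_rate_for(budget_fraction: float, window_hours: float) -> float:
    """Burn rate that spends `budget_fraction` of the budget in `window_hours`."""
    return budget_fraction * SLO_PERIOD_HOURS / window_hours


def hours_to_exhaustion(burn_rate: float) -> float:
    """Hours until the whole budget is gone at a constant burn rate."""
    return SLO_PERIOD_HOURS / burn_rate


def should_alert(long_error_ratio: float, short_error_ratio: float,
                 burn_rate: float, slo: float = 0.999) -> bool:
    """Multi-window check: both windows must exceed burn_rate * (1 - slo)."""
    threshold = burn_rate * (1 - slo)
    return long_error_ratio > threshold and short_error_ratio > threshold


print(burn_rate_for(0.02, 1))     # fast-burn factor
print(burn_rate_for(0.05, 6))     # slow-burn factor
print(hours_to_exhaustion(14.4))  # time to spend the full budget at 14.4x
print(should_alert(0.02, 0.001, burn_rate=14.4))  # long window high, short window recovered
```

At a 14.4x burn rate the full 30-day budget lasts about 50 hours, so the fast alert marks roughly two days to exhaustion, not two hours.
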

groups:
  - name: slo
    interval: 30s
    rules:
      - record: slo:request_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))

version: prometheus/v1
service: api
slos:
  - name: requests-availability
    objective: 99.9
    description: API availability
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      page_alert: { labels: { severity: page } }
      ticket_alert: { labels: { severity: ticket } }

Generate the Prometheus rules: sloth generate -i slo.yaml -o prometheus-rules.yaml
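
An objective such as 99.9 is easier to reason about as an error budget. The sketch below converts it to minutes of allowed full downtime and to a remaining-budget fraction from event counts. It assumes a 30-day period; the function names and sample counts are hypothetical.

```python
def error_budget_minutes(objective_percent: float, period_days: int = 30) -> float:
    """Minutes of total outage the objective tolerates over the period."""
    return (1 - objective_percent / 100) * period_days * 24 * 60


def budget_remaining(objective_percent: float, good: int, total: int) -> float:
    """Fraction of the error budget left, from event counts (negative if overspent)."""
    allowed_bad = (1 - objective_percent / 100) * total
    if allowed_bad == 0:
        return 0.0
    return 1 - (total - good) / allowed_bad


print(round(error_budget_minutes(99.9), 1))                     # minutes per 30 days
print(round(budget_remaining(99.9, 999_500, 1_000_000), 3))     # hypothetical counts
```

For 99.9% over 30 days the budget is about 43.2 minutes of full outage.
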

# Alert: API HighErrorRate
## Severity
P1 / page
## Summary
API 5xx rate > 2% for 10m.
## Diagnostic
1. Check Grafana "API – Errors by Route".
2. `kubectl logs -l app=api --tail=500 | grep ERROR`.
3. Check recent deploys: `kubectl rollout history deploy/api`.
4. Check downstream: DB CPU, Redis latency.
## Mitigations
- Recent deploy bug → `kubectl rollout undo deploy/api`.
- Downstream slow → enable read-replica fallback; disable risky feature flag.
- Saturation → `kubectl scale deploy/api --replicas=20`.
## Communication
- Update https://status.example.com.
- Post in #incident-X.
## Escalation
- DBA on-call if DB.
- Platform on-call if K8s.

# Postmortem: API outage 2026-04-12
## Status
Resolved.
## Impact
- 23 minutes of partial outage (2026-04-12 14:03–14:26 UTC)
- 8% of /checkout requests returned 503
- Estimated affected: ~12k users
- Revenue impact: $45k
## Detection
SLO burn-rate alert at 14:05.
## Timeline (UTC)
- 14:03 — deploy of api@abc123 begins
- 14:05 — alert fires; on-call paged
- 14:11 — incident commander begins triage
- 14:18 — rollback initiated
- 14:26 — error rate normal
## Contributing factors
1. New code path didn't handle empty product list.
2. Canary stage too short to catch (only 5min).
3. No unit test for empty-cart edge case.
## What went well
- Burn-rate alert fired fast.
- Rollback took 8min from initiation to normal error rate.
## What went poorly
- Canary didn't catch.
- 6min from alert to first action (14:05 to 14:11): runbook was unclear.
## Action items
- [P0, owner: jdoe, due: 2026-04-19] Add empty-cart unit tests.
- [P1, owner: smith, due: 2026-04-26] Extend canary to 30min minimum.
- [P1, owner: jdoe, due: 2026-04-26] Update runbook with one-line rollback command.
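
The response intervals quoted in a postmortem can be derived directly from the timeline instead of estimated from memory. A small sketch using the timestamps from the example above; the metric names are illustrative, not a standard:

```python
from datetime import datetime

# Timestamps copied from the example timeline (UTC, same day).
timeline = {
    "deploy_start": "14:03",
    "alert_fired": "14:05",
    "triage_start": "14:11",
    "rollback_start": "14:18",
    "recovered": "14:26",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


metrics = {
    "time_to_detect": minutes_between(timeline["deploy_start"], timeline["alert_fired"]),
    "alert_to_triage": minutes_between(timeline["alert_fired"], timeline["triage_start"]),
    "rollback_to_recovery": minutes_between(timeline["rollback_start"], timeline["recovered"]),
    "total_impact": minutes_between(timeline["deploy_start"], timeline["recovered"]),
}

for name, value in metrics.items():
    print(f"{name}: {value} min")
```

For this timeline the result is 2 minutes to detect, 6 minutes from alert to triage, 8 minutes from rollback to recovery, and 23 minutes of total impact.
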

  • Tools: PagerDuty, OpsGenie, Grafana OnCall (free), incident.io.
  • Rotation: round-robin or follow-the-sun.
  • Handoff at shift change, written up in advance (notable items, ongoing incidents).
  • Page only when human action needed within minutes.
  • “If it can be auto-remediated, automate it. If it can be fixed at business hours, ticket it.”
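
The paging rule in the last two bullets can be written down as a tiny decision function. This is a sketch of that policy only; the `Route` enum and the two boolean inputs are hypothetical names, not fields of any paging tool.

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"
    PAGE = "page"
    TICKET = "ticket"


def route_alert(auto_remediable: bool, needs_action_within_minutes: bool) -> Route:
    """Automate what can be automated, page only for urgent human action, ticket the rest."""
    if auto_remediable:
        return Route.AUTOMATE
    if needs_action_within_minutes:
        return Route.PAGE
    return Route.TICKET


print(route_alert(auto_remediable=False, needs_action_within_minutes=True))
print(route_alert(auto_remediable=False, needs_action_within_minutes=False))
```
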

Pick a service. Schedule 60min. Inject failure (kill a pod, throttle DB, fake slow downstream). Observe:

  • Did alerts fire?
  • Was the right team paged?
  • Was the runbook usable?
  • How long did recovery take?
  • Any cascading failures?

Document findings, fix gaps.
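
One way to keep game-day findings comparable across runs is to record the answers to the questions above in a fixed structure. A minimal sketch; the class and field names are illustrative and not a standard schema.

```python
from dataclasses import dataclass, field


@dataclass
class GameDayResult:
    """Answers to the game-day questions for one exercise."""
    service: str
    failure_injected: str
    alerts_fired: bool
    right_team_paged: bool
    runbook_usable: bool
    recovery_minutes: float
    cascading_failures: list = field(default_factory=list)

    def gaps(self) -> list:
        """Return the failed checks so each can become a follow-up item."""
        found = []
        if not self.alerts_fired:
            found.append("alerts did not fire")
        if not self.right_team_paged:
            found.append("wrong team paged")
        if not self.runbook_usable:
            found.append("runbook not usable")
        found.extend(f"cascading failure: {c}" for c in self.cascading_failures)
        return found


# Hypothetical run: a pod was killed and the page went to the wrong team.
result = GameDayResult(
    service="api", failure_injected="kill a pod",
    alerts_fired=True, right_team_paged=False, runbook_usable=True,
    recovery_minutes=12.0,
)
print(result.gaps())
```
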

| Type | SLI |
|---|---|
| User-facing API | success rate, p99 latency |
| Pipeline | freshness (now − last record), throughput |
| Storage | durability events, read availability |
| Queue worker | processing latency, retry rate |
| DB | query latency, replica lag, error rate |
| Search | query latency, result relevance regressions |
| Auth | login success rate, token validation latency |

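
The pipeline freshness SLI listed above, now minus the timestamp of the last record, can be sketched as follows. The function names and the 10-minute target are assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def freshness_seconds(last_record_at: datetime, now: Optional[datetime] = None) -> float:
    """Age of the newest record in seconds (now - last record)."""
    now = now or datetime.now(timezone.utc)
    return (now - last_record_at).total_seconds()


def is_fresh(last_record_at: datetime, max_age: timedelta,
             now: Optional[datetime] = None) -> bool:
    """True while the newest record is no older than `max_age`."""
    return freshness_seconds(last_record_at, now) <= max_age.total_seconds()


# Hypothetical check: last record landed 5 minutes before `now`.
now = datetime(2026, 1, 1, 12, 5, tzinfo=timezone.utc)
last = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)
print(freshness_seconds(last, now=now))
print(is_fresh(last, max_age=timedelta(minutes=10), now=now))
```

Sampling `is_fresh` at a fixed interval and taking the fraction of passing samples gives a ratio that fits the same good/total form as the availability SLI.
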
  • Sloth, Pyrra, Slothfile — SLO config → Prometheus rules.
  • OpenSLO — vendor-neutral SLO spec.
  • Nobl9 — SLO management SaaS.
  • PagerDuty / OpsGenie / Grafana OnCall — paging.
  • incident.io / FireHydrant — incident management.
  • Confluence / Notion — runbooks.