SRE / SLO — Practical patterns

Defining an SLO (PromQL)

Availability SLI

# good = 2xx/3xx; extend the matcher with any 4xx codes that are client errors rather than service faults
sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
  /
sum(rate(http_requests_total[5m]))
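
An availability ratio becomes an SLO once it is compared against a target, and the gap between the target and 100% is the error budget. The following is a minimal sketch of that arithmetic, assuming a 99.9% target and a 30-day rolling window; the function names `error_budget` and `budget_remaining` are hypothetical and not part of any library.

```python
def error_budget(slo_target: float, window_days: int = 30) -> dict:
    """Return the error budget implied by an availability target.

    slo_target is a fraction (0.999 for 99.9%).
    window_days is an assumed rolling SLO window, not taken from any tool.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    budget_fraction = 1.0 - slo_target
    window_minutes = window_days * 24 * 60
    return {
        "budget_fraction": budget_fraction,
        # minutes of total outage that would spend the whole budget
        "full_outage_minutes": budget_fraction * window_minutes,
    }


def budget_remaining(good: int, total: int, slo_target: float) -> float:
    """Fraction of the budget left: 1.0 is untouched, below 0 is overspent."""
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    budget = error_budget(0.999)
    print(f"full-outage budget: {budget['full_outage_minutes']:.1f} minutes")
    print(f"budget left: {budget_remaining(999_500, 1_000_000, 0.999):.2f}")
```

Under these assumptions a 99.9% target allows roughly 43 minutes of full outage per 30 days, and 500 failed requests out of a million leave about half of the budget.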

Latency SLI

# fraction of requests under 500ms
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[5m]))
  /
sum(rate(http_request_duration_seconds_count[5m]))
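
The query above works because Prometheus histogram buckets are cumulative: the `le="0.5"` bucket already counts every request at or under 500ms, and `_count` equals the `+Inf` bucket. A small sketch of the same ratio over plain numbers, assuming a hypothetical `latency_sli` helper and made-up bucket counts:

```python
def latency_sli(buckets: dict[float, int], threshold_s: float) -> float:
    """Fraction of requests at or under threshold_s.

    buckets maps each upper bound (seconds, float('inf') for +Inf) to its
    cumulative count. The threshold must be an existing bucket boundary,
    which is also true of the PromQL version.
    """
    if threshold_s not in buckets:
        raise KeyError(f"no bucket with le={threshold_s}; pick an existing boundary")
    total = buckets[float("inf")]
    if total == 0:
        return 1.0  # no traffic: treat the SLI as met
    return buckets[threshold_s] / total


if __name__ == "__main__":
    # Illustrative counts only, not real data.
    sample = {0.1: 600, 0.5: 950, 1.0: 990, float("inf"): 1000}
    print(latency_sli(sample, 0.5))
```

If the latency target does not line up with a bucket boundary, the histogram buckets in the instrumented service need to be adjusted first.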

Multi-window burn-rate alert

groups:
  - name: slo-burn
    rules:
      - alert: BurnRate1hFast
        expr: |
          # long window (1h): the burn is significant
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[1h]))
            / sum(rate(http_requests_total[1h]))
          )) > 14.4 * (1 - 0.999)
          and
          # short window (5m): the burn is still happening
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[5m]))
            / sum(rate(http_requests_total[5m]))
          )) > 14.4 * (1 - 0.999)
        for: 2m
        labels: { severity: page }
        annotations:
          summary: "API SLO: burning budget at 14.4x (2% of the 30-day budget per hour, exhausted in ~50h)"
          runbook: https://wiki/runbooks/api-slo-burn

      - alert: BurnRate6hSlow
        expr: |
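          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[6h]))
            / sum(rate(http_requests_total[6h]))
          )) > 6 * (1 - 0.999)
          and
          ( 1 - (
            sum(rate(http_requests_total{status=~"2..|3.."}[30m]))
            / sum(rate(http_requests_total[30m]))
          )) > 6 * (1 - 0.999)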
        for: 15m
        labels: { severity: page }

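Requiring both a long and a short window to exceed the threshold cuts false alarms: the long window shows the burn is significant, and the short window shows it is still happening, so the alert clears soon after recovery. Source: Google SRE Workbook, "Alerting on SLOs".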
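
The 14.4 and 6 multipliers are not arbitrary. Each is the burn rate that spends a chosen share of the error budget within the long alert window. The sketch below reproduces them, assuming a 30-day SLO window and budget shares of 2% in 1h and 5% in 6h; the function names are hypothetical.

```python
SLO_TARGET = 0.999          # objective used in the alert expressions
SLO_WINDOW_HOURS = 30 * 24  # assumption: 30-day rolling SLO window


def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Burn rate that spends budget_fraction of the budget in one alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def error_ratio_threshold(rate: float, slo_target: float = SLO_TARGET) -> float:
    """Error ratio that corresponds to a given burn rate."""
    return rate * (1.0 - slo_target)


def hours_to_exhaustion(rate: float,
                        slo_window_hours: float = SLO_WINDOW_HOURS) -> float:
    """Time until the whole budget is gone if the burn rate holds."""
    return slo_window_hours / rate


if __name__ == "__main__":
    for fraction, window in [(0.02, 1), (0.05, 6)]:
        rate = burn_rate(fraction, window)
        print(f"{fraction:.0%} of budget in {window}h -> burn rate {rate:.1f}, "
              f"error ratio > {error_ratio_threshold(rate):.4f}, "
              f"budget gone in {hours_to_exhaustion(rate):.0f}h")
```

Under these assumptions a sustained 14.4x burn corresponds to an error ratio of 1.44% and exhausts the budget in about 50 hours; a 6x burn takes about 120 hours.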

SLO recording rules (faster queries)

groups:
  - name: slo
    interval: 30s
    rules:
      - record: slo:request_availability:ratio_rate5m
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[5m]))
          / sum(rate(http_requests_total[5m]))
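
Burn-rate alerts need the same ratio over several windows, so one recording rule per window keeps the alert expressions cheap. The following sketch generates those rules as text. The window list and the helper names are assumptions, and the output should be checked with `promtool check rules` before use.

```python
# Assumed windows: the ones used by the burn-rate alerts in this document.
WINDOWS = ["5m", "30m", "1h", "6h"]


def recording_rule(window: str) -> str:
    """Render one availability recording rule for the given rate window."""
    return (
        f"      - record: slo:request_availability:ratio_rate{window}\n"
        f"        expr: |\n"
        f'          sum(rate(http_requests_total{{status!~"5.."}}[{window}]))\n'
        f"          / sum(rate(http_requests_total[{window}]))\n"
    )


def rule_group(windows: list[str] = WINDOWS) -> str:
    """Render a complete rule group with one rule per window."""
    header = "groups:\n  - name: slo\n    interval: 30s\n    rules:\n"
    return header + "".join(recording_rule(w) for w in windows)


if __name__ == "__main__":
    print(rule_group())
```

An alert can then compare `1 - slo:request_availability:ratio_rate1h` against the threshold instead of recomputing the ratio on every evaluation.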

Sloth (SLO generator)

version: prometheus/v1
service: api
slos:
  - name: requests-availability
    objective: 99.9
    description: API availability
    sli:
      events:
        error_query: sum(rate(http_requests_total{status=~"5.."}[{{.window}}]))
        total_query: sum(rate(http_requests_total[{{.window}}]))
    alerting:
      page_alert:   { labels: { severity: page } }
      ticket_alert: { labels: { severity: ticket } }

Generate the Prometheus rules:

sloth generate -i slo.yaml -o prometheus-rules.yaml

Runbook template

# Alert: API HighErrorRate

## Severity
P1 / page

## Summary
API 5xx rate > 2% for 10m.

## Diagnostic
1. Check Grafana "API – Errors by Route".
2. `kubectl logs -l app=api --tail=500 | grep ERROR`.
3. Check recent deploys: `kubectl rollout history deploy/api`.
4. Check downstream: DB CPU, Redis latency.

## Mitigations
- Recent deploy bug → `kubectl rollout undo deploy/api`.
- Downstream slow → enable read-replica fallback; disable risky feature flag.
- Saturation → `kubectl scale deploy/api --replicas=20`.

## Communication
- Update https://status.example.com.
- Post in #incident-X.

## Escalation
- DBA on-call if DB.
- Platform on-call if K8s.

Postmortem template

# Postmortem: API outage 2026-04-12

## Status
Resolved.

## Impact
- 23 minutes of partial outage (2026-04-12 14:03–14:26 UTC)
- 8% of /checkout requests returned 503
- Estimated affected: ~12k users
- Revenue impact: $45k

## Detection
SLO burn-rate alert at 14:05.

## Timeline (UTC)
- 14:03 — deploy of api@abc123 begins
- 14:05 — alert fires; on-call paged
- 14:11 — incident commander begins triage
- 14:18 — rollback initiated
- 14:26 — error rate normal

## Contributing factors
1. New code path didn't handle empty product list.
2. Canary stage too short to catch it (only 5min).
3. No unit test for empty-cart edge case.

## What went well
- Burn-rate alert fired fast.
- Rollback restored service within 8min of being initiated.

## What went poorly
- Canary didn't catch the regression.
- 6min from alert to first action; the runbook was unclear.

## Action items
- [P0, owner: jdoe, due: 2026-04-19] Add empty-cart unit tests.
- [P1, owner: smith, due: 2026-04-26] Extend canary to 30min minimum.
- [P1, owner: jdoe, due: 2026-04-26] Update runbook with one-line rollback command.
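
The durations quoted in a postmortem are easier to keep consistent when they are derived from the timeline instead of typed by hand. A minimal sketch using the sample timeline above, assuming all events fall on the same day; the labels and the `incident_metrics` helper are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the sample timeline (UTC, same day assumed).
TIMELINE = [
    ("14:03", "deploy begins"),
    ("14:05", "alert fires"),
    ("14:11", "triage begins"),
    ("14:18", "rollback initiated"),
    ("14:26", "error rate normal"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two HH:MM timestamps on the same day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def incident_metrics(timeline: list[tuple[str, str]]) -> dict[str, int]:
    """Derive the usual response intervals from labelled timeline events."""
    at = {label: ts for ts, label in timeline}
    return {
        "time_to_detect": minutes_between(at["deploy begins"], at["alert fires"]),
        "alert_to_triage": minutes_between(at["alert fires"], at["triage begins"]),
        "triage_to_rollback": minutes_between(at["triage begins"], at["rollback initiated"]),
        "rollback_to_recovery": minutes_between(at["rollback initiated"], at["error rate normal"]),
        "total_impact": minutes_between(at["deploy begins"], at["error rate normal"]),
    }


if __name__ == "__main__":
    for name, minutes in incident_metrics(TIMELINE).items():
        print(f"{name}: {minutes} min")
```

For the sample timeline this gives 2 minutes to detect, 6 minutes from alert to triage, 7 minutes from triage to rollback, 8 minutes from rollback to recovery, and 23 minutes of total impact.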

On-call rotation tips

Tools: PagerDuty, OpsGenie, Grafana OnCall (free), incident.io.
Rotation: round-robin or follow-the-sun.
Prepare a written handoff before each shift change (notable items, ongoing incidents).
Page only when human action is needed within minutes.
“If it can be auto-remediated, automate it. If it can be fixed at business hours, ticket it.”
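
The paging rule and the quote above amount to a small decision procedure. One way to write it down, as a sketch with hypothetical names (`Route`, `route_alert`) rather than any paging tool's API:

```python
from enum import Enum


class Route(Enum):
    AUTOMATE = "automate"  # a remediation job handles it, nobody is woken up
    PAGE = "page"          # a human must act within minutes
    TICKET = "ticket"      # can wait for business hours


def route_alert(auto_remediable: bool, needs_human_within_minutes: bool) -> Route:
    """Pick a destination for an alert following the rule of thumb above."""
    if auto_remediable:
        return Route.AUTOMATE
    if needs_human_within_minutes:
        return Route.PAGE
    return Route.TICKET


if __name__ == "__main__":
    print(route_alert(auto_remediable=False, needs_human_within_minutes=True))
```

Checking automation first is a design choice in this sketch: an alert that can fix itself should not page even when it is urgent.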

Game day exercise

Pick a service. Schedule 60min. Inject a failure (kill a pod, throttle the DB, fake a slow downstream). Observe:

Did alerts fire?
Was the right team paged?
Was the runbook usable?
What was the recovery time?
Were there any cascading failures?

Document findings, fix gaps.

Common SLI ideas

Type	SLI
User-facing API	success rate, p99 latency
Pipeline	freshness (now − last record), throughput
Storage	durability events, read availability
Queue worker	processing latency, retry rate
DB	query latency, replica lag, error rate
Search	query latency, result relevance regressions
Auth	login success rate, token validation latency
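
As one worked case from the table, pipeline freshness (now − last record) can be turned into a ratio SLI by sampling it periodically and counting the checks that meet a maximum age. A minimal sketch, assuming Unix timestamps and hypothetical helper names:

```python
def freshness_seconds(now_ts: float, last_record_ts: float) -> float:
    """Age of the newest record; clamped at 0 to tolerate small clock skew."""
    return max(0.0, now_ts - last_record_ts)


def freshness_sli(samples: list[tuple[float, float]], max_age_s: float) -> float:
    """Fraction of checks where the newest record was at most max_age_s old.

    samples holds (check time, newest record time) pairs as Unix timestamps.
    """
    if not samples:
        return 1.0  # no checks yet: treat the SLI as met
    fresh = sum(
        1 for now_ts, last_ts in samples
        if freshness_seconds(now_ts, last_ts) <= max_age_s
    )
    return fresh / len(samples)


if __name__ == "__main__":
    # Illustrative values only: ages are 10s, 100s, 5s and 60s.
    checks = [(100.0, 90.0), (200.0, 100.0), (300.0, 295.0), (400.0, 340.0)]
    print(freshness_sli(checks, max_age_s=60.0))
```

With a 60-second limit, three of the four illustrative checks count as fresh, so the SLI is 0.75.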

Tools

Sloth, Pyrra, Slothfile — SLO config → Prometheus rules.
OpenSLO — vendor-neutral SLO spec.
Nobl9 — SLO management SaaS.
PagerDuty / OpsGenie / Grafana OnCall — paging.
incident.io / FireHydrant — incident management.
Confluence / Notion — runbooks.