Define the SLO target (e.g., 99.9% over 30 days) and derive the error budget (1 - SLO, here 0.1% of requests); a burn rate of 1x spends that budget evenly across the 720 hours of the window
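Create recording rules for the error ratio over every window the alerts use (5m, 30m, 1h, 2h, 6h, 1d, 3d), each computed as rate() of your error counter divided by rate() of your request counter
Write four alerting rules, each requiring both a long window and a short window (1/12 of the long one) to exceed the same burn-rate threshold. The Google SRE Workbook table recommends (1h+5m, 14.4x), (6h+30m, 6x), and (3d+6h, 1x); a common four-rule variant adds (1d+2h, 3x). A rule fires when the error ratio exceeds factor x error budget in both of its windows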
Label each alert with severity and page/ticket routing metadata, then configure Alertmanager routes that send page-level alerts to PagerDuty and ticket-level alerts to a webhook
Validate the rule files with promtool check rules, then simulate a burn-rate spike by feeding synthetic series into a promtool test rules unit test
Document the silence strategy so on-call engineers know how to defer non-critical burn-rate alerts without muting the fast-burn critical alert
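The threshold arithmetic behind the alerting rules can be sketched in a few lines of Python. This is a minimal sketch, not a definitive implementation: the recording-rule naming scheme (`job:slo_errors_per_request:ratio_rate<window>`), the four-rule window table, and the `page`/`ticket` severity values are assumptions made for illustration, so substitute your own rule names and routing labels.

```python
# Sketch: build multiwindow, multi-burn-rate alert expressions for a 99.9% SLO.
# Rule names, the window table, and severity values are illustrative assumptions.

SLO_TARGET = 0.999
ERROR_BUDGET = 1 - SLO_TARGET  # fraction of requests allowed to fail

# (long window, short window, burn-rate factor, severity)
WINDOWS = [
    ("1h", "5m", 14.4, "page"),
    ("6h", "30m", 6.0, "page"),
    ("1d", "2h", 3.0, "ticket"),
    ("3d", "6h", 1.0, "ticket"),
]


def ratio_rule(window: str) -> str:
    """Name of the (hypothetical) recording rule holding the error ratio."""
    return f"job:slo_errors_per_request:ratio_rate{window}"


def alert_expr(long_w: str, short_w: str, factor: float,
               budget: float = ERROR_BUDGET) -> str:
    """PromQL condition: both windows must exceed factor * error budget."""
    threshold = f"{factor * budget:.6g}"
    return (f"{ratio_rule(long_w)} > {threshold} and "
            f"{ratio_rule(short_w)} > {threshold}")


def build_alerts() -> list[dict]:
    """One alert definition per (long, short) window pair."""
    return [
        {"expr": alert_expr(long_w, short_w, factor),
         "labels": {"severity": severity, "long_window": long_w}}
        for long_w, short_w, factor, severity in WINDOWS
    ]


if __name__ == "__main__":
    for alert in build_alerts():
        print(alert["labels"]["severity"], "|", alert["expr"])
```

Requiring the short window to exceed the same threshold is what lets an alert reset quickly once errors stop, while the long window keeps brief spikes from paging anyone.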
Known gotchas
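Using only a single window forces a bad trade-off: a short window alone fires on brief spikes that barely touch the budget, while a long window alone detects slowly and keeps firing long after the problem is fixed. A single high burn-rate threshold also misses slow burns that exhaust the budget over days; always pair a long and a short window, and use more than one burn rate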
The multiplication factors in the Google Workbook assume a 30-day compliance window (factor = budget fraction consumed x SLO period / long window, e.g. 2% x 720h / 1h = 14.4); recalculate if your SLO window is 7 or 28 days
Alert rule evaluation intervals must be shorter than the smallest window (5m); set evaluation_interval to 1m in Prometheus to avoid delayed alerting
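The recalculation for a different compliance window can be sketched as below. The budget fractions (2%, 5%, 10%) and long windows (1h, 6h, 3d) follow the Workbook's three-row table; the function and variable names are hypothetical, and rounding to two decimals is an illustrative choice.

```python
# Sketch: recompute burn-rate factors when the SLO window is not 30 days.
# burn rate = budget fraction consumed * SLO period / alert (long) window

# (fraction of the error budget consumed, long alert window in hours)
ALERT_POLICY = [(0.02, 1), (0.05, 6), (0.10, 72)]


def burn_rate(budget_fraction: float, slo_period_hours: float,
              alert_window_hours: float) -> float:
    """Burn rate that consumes budget_fraction within the alert window."""
    return budget_fraction * slo_period_hours / alert_window_hours


def factors_for(slo_period_days: int) -> list[float]:
    """Burn-rate factors for each policy row, rounded to two decimals."""
    period_hours = slo_period_days * 24
    return [round(burn_rate(fraction, period_hours, window), 2)
            for fraction, window in ALERT_POLICY]


if __name__ == "__main__":
    for days in (30, 28, 7):
        print(f"{days}d SLO window:", factors_for(days))
```

For a 7-day window the 3-day row drops below 1x, meaning it would alert on consumption that never exhausts the budget, so the budget fractions or windows probably need to be chosen afresh rather than scaled.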