Define the SLO target and an error-ratio recording rule: record the ratio of bad requests to total requests over each alerting window, both short and long
Compute the burn rate by dividing the error ratio by (1 - SLO_target), which expresses how fast the error budget is being consumed; for a 99.9% target, a 1% error ratio is a 10x burn rate
Create a multi-window alert with a fast-burn rule (5m and 1h windows) for paging: the alert fires only when burn_rate exceeds roughly 14x in both windows at once (14.4x corresponds to spending 2% of a 30-day budget in one hour)
Add a slow-burn rule (1h and 6h windows) that raises a ticket-severity alert when burn_rate is between 2x and 6x in both windows
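Annotate alerts with error_budget_remaining and time_to_exhaustion values computed from the SLO parameters, so the on-call engineer gets actionable context; carry them as annotations rather than labels, because a label whose value changes between evaluations changes the alert's identity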
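The arithmetic behind the steps above can be sketched in a few lines of Python. This is only an illustration, not a Prometheus rule file: the 99.9% target, the 30-day SLO period, and the function names burn_rate, fires, and hours_to_exhaustion are all assumptions made for the example.

```python
# Assumed SLO parameters for illustration only.
SLO_TARGET = 0.999          # hypothetical 99.9% objective
SLO_PERIOD_HOURS = 30 * 24  # hypothetical 30-day SLO period


def burn_rate(error_ratio: float, slo_target: float = SLO_TARGET) -> float:
    """Error ratio divided by the allowed error ratio (1 - SLO target)."""
    return error_ratio / (1.0 - slo_target)


def fires(short_ratio: float, long_ratio: float, threshold: float,
          slo_target: float = SLO_TARGET) -> bool:
    """Multi-window condition: BOTH windows must exceed the threshold."""
    return (burn_rate(short_ratio, slo_target) > threshold
            and burn_rate(long_ratio, slo_target) > threshold)


def hours_to_exhaustion(budget_remaining: float, current_burn: float,
                        period_hours: float = SLO_PERIOD_HOURS) -> float:
    """Hours until the remaining budget fraction is gone at the current burn.

    A burn rate of 1 spends the whole budget in exactly one SLO period,
    so the remaining fraction lasts (fraction * period / burn) hours.
    """
    if current_burn <= 0:
        return float("inf")  # budget is not being consumed
    return budget_remaining * period_hours / current_burn
```

Under these assumptions, a sustained 14.4x burn with half the budget left gives hours_to_exhaustion(0.5, 14.4) of about 25 hours, which is the kind of figure worth surfacing in the alert annotation.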
Known gotchas
Multi-window alerts require both window conditions to hold at the same time; combining them with OR instead of AND lets the short window fire on its own and causes excessive false positives on transient spikes
Recording rules for burn rates should run on the Prometheus instance that holds the raw request metrics; a federated instance usually receives only pre-aggregated series, which may not be enough to recompute the ratios
Low-traffic services produce noisy burn-rate estimates because of small denominators; add a minimum request-rate guard (and on(service) vector(0) unless ...) to suppress alerts when traffic is too low to be meaningful
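The first and last gotchas can be illustrated together with a small Python sketch. It is not PromQL and does not reproduce the guard expression quoted above; the Window type, the should_alert function, and the min_rps cutoff of 1 request per second are hypothetical names and values chosen for the example.

```python
from typing import NamedTuple


class Window(NamedTuple):
    bad: float      # bad requests observed in the window
    total: float    # total requests observed in the window
    seconds: float  # window length in seconds


def window_burn(w: Window, slo_target: float) -> float:
    """Burn rate for one window; zero traffic is treated as zero burn."""
    if w.total == 0:
        return 0.0
    return (w.bad / w.total) / (1.0 - slo_target)


def should_alert(short: Window, long: Window, threshold: float,
                 slo_target: float = 0.999, min_rps: float = 1.0) -> bool:
    """AND of both windows, gated by a minimum request rate."""
    # Guard: with too few requests, the ratio is dominated by noise.
    if long.total / long.seconds < min_rps:
        return False
    # Both windows must agree; OR would let a brief spike fire alone.
    return (window_burn(short, slo_target) > threshold
            and window_burn(long, slo_target) > threshold)
```

In this sketch a service that saw 3 requests in an hour never alerts, even if one of them failed, and a 5-minute spike does not alert unless the 1-hour window is also burning above the threshold.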