Lookout
2 min read Tom Shafer

SLOs and error budgets

Defining service level objectives and tracking attainment, error budget, and burn rate from existing uptime checks and request traces.

Next: SLOs and error budgets, in a new Quality section.

"Is the app healthy?" is a vibe. An SLO turns it into a number you can argue with: 99.9% of requests succeed, measured over 30 days. Below that, you've spent your error budget and it's time to slow down and stabilize. Above it, you have budget to spend on shipping.
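To put a size on that budget, here is a rough sketch of the arithmetic in Python. It is an illustration, not Lookout's code: the function names are made up for this example, and it assumes the budget is simply the share of the window (or of requests) that the target allows to fail.

```python
from datetime import timedelta
from fractions import Fraction


def budget_fraction(target_percent: str) -> Fraction:
    """Share of time or requests allowed to fail, e.g. "99.9" -> 1/1000."""
    return 1 - Fraction(target_percent) / 100


def allowed_downtime(target_percent: str, window: timedelta) -> timedelta:
    """Time-based budget: how long the service may be down within the window."""
    return window * float(budget_fraction(target_percent))


def allowed_failures(target_percent: str, total_requests: int) -> int:
    """Request-based budget: how many requests may fail, rounded down."""
    return int(budget_fraction(target_percent) * total_requests)


print(allowed_downtime("99.9", timedelta(days=30)))  # 0:43:12
print(allowed_failures("99.9", 1_000_000))           # 1000
```

So 99.9% over 30 days leaves about 43 minutes of downtime, or 1,000 failed requests per million served.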

What I built

  • ServiceLevelObjective — define an objective: an SLI (uptime availability, HTTP availability, or HTTP latency under a threshold), a target percentage, and a window.
  • SloAttainmentEvaluator — computes attainment, error budget remaining, and burn rate from data already on hand: synthetic uptime checks and inbound request traces. No new collection — it reuses the uptime monitors and the http.server spans.
  • Burn-rate alerting — when you're burning budget fast enough to blow the window, it fires through the alert engine (a new slo.burn_rate event — and notice the engine needed zero changes to support it).
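The three pieces above fit together roughly as in the sketch below. This is not the actual ServiceLevelObjective or SloAttainmentEvaluator code: the class names, fields, SLI label, event payload shape, and alert threshold are all assumptions for illustration. It also assumes a common definition of burn rate, the recent error rate divided by the budget fraction, so a burn rate of 1.0 spends exactly the budget by the end of the window.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Objective:
    """Hypothetical stand-in for an SLO definition."""
    sli: str          # e.g. "http_availability" (label invented for this sketch)
    target: float     # fraction of good events, 0.999 for 99.9%
    window_days: int  # tells the caller which events to count; unused below


@dataclass(frozen=True)
class Evaluation:
    attainment: float        # good / total over the whole window
    budget_remaining: float  # 1.0 = untouched, 0.0 = spent, negative = overspent
    burn_rate: float         # 1.0 = on pace to spend the budget exactly


def error_rate(good: int, total: int) -> float:
    """Fraction of bad events; no traffic counts as no errors."""
    return 0.0 if total == 0 else (total - good) / total


def evaluate(slo: Objective, window_good: int, window_total: int,
             recent_good: int, recent_total: int) -> Evaluation:
    """Window counts drive attainment and budget; recent counts drive burn rate."""
    budget = 1.0 - slo.target
    if budget <= 0:
        raise ValueError("target must be below 100%")
    window_errors = error_rate(window_good, window_total)
    return Evaluation(
        attainment=1.0 - window_errors,
        budget_remaining=1.0 - window_errors / budget,
        burn_rate=error_rate(recent_good, recent_total) / budget,
    )


def burn_rate_event(slo: Objective, result: Evaluation, threshold: float):
    """Return an alert payload (assumed shape) if the burn rate is too high."""
    if result.burn_rate <= threshold:
        return None
    return {"type": "slo.burn_rate", "sli": slo.sli, "burn_rate": result.burn_rate}


slo = Objective(sli="http_availability", target=0.999, window_days=30)
result = evaluate(slo, window_good=999_600, window_total=1_000_000,
                  recent_good=9_850, recent_total=10_000)
print(f"{result.attainment:.2%} attained, "
      f"{result.budget_remaining:.0%} budget left, "
      f"burn rate {result.burn_rate:.1f}")
# threshold=10.0 is an arbitrary value picked for the example
print(burn_rate_event(slo, result, threshold=10.0))
```

With these made-up counts, 400 failures in a million requests use 40% of the budget, while a recent 1.5% error rate burns it about 15 times faster than the target allows, which is what trips the alert.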

Error budget is the useful part

Attainment is a scoreboard. The error budget is a decision tool. "We're at 99.95% against a 99.9% target with 60% of the budget left" tells a team they can ship. "Budget's gone and we're four days into the window" tells them to freeze and fix. It reframes reliability from a binary (up/down) into a resource you manage — which is the whole point of SRE.

Two more pieces rounded out this stretch: PII redaction and retention, and anomaly detection.

build-in-public slo reliability