Lookout
3 min read Tom Shafer Deep dive

An escalation state machine with signed-link acknowledgement

A deep dive on building native on-call escalation in Laravel — the incident state machine, a per-minute scheduler that advances tiers, repeat-until-acknowledged, incident dedup, and stopping escalation with a signed URL.

PagerDuty and Opsgenie own escalation when you use them. Plenty of small teams don't, and for them a single Slack ping everyone ignores isn't on-call — it's hope. This is a deep dive into a lightweight escalation engine built right into the app.

The data model

Three tables:

  • escalation_policies — org-scoped (optionally project-scoped), a set of event_keys it covers, and a repeat_count.
  • escalation_policy_steps — ordered tiers, each with delay_minutes and a set of channel ids.
  • escalation_incidents — a live run of a policy: status, current_step, repeats_done, and a next_run_at timestamp.

The incident is a small state machine: open → acknowledged | resolved | exhausted. The whole engine is just "advance open incidents whose next step is due, until something stops them."
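As a rough sketch, that state machine fits in a PHP 8.1 enum. The enum name and the transition() helper are hypothetical, and I'm assuming every non-open status is terminal, which is stricter than anything stated above.

```php
<?php
declare(strict_types=1);

// Hypothetical enum; the case values mirror the statuses named above.
enum IncidentStatus: string
{
    case Open = 'open';
    case Acknowledged = 'acknowledged';
    case Resolved = 'resolved';
    case Exhausted = 'exhausted';

    /** Assumption: only an open incident can move; every other status is terminal. */
    public function canTransitionTo(self $target): bool
    {
        return $this === self::Open && $target !== self::Open;
    }
}

/** Returns the new status, or throws if the move is not allowed. */
function transition(IncidentStatus $from, IncidentStatus $to): IncidentStatus
{
    if (!$from->canTransitionTo($to)) {
        throw new DomainException("Cannot move from {$from->value} to {$to->value}");
    }

    return $to;
}
```

Keeping the rule in one place means a late acknowledge click on an already-resolved incident fails loudly instead of quietly rewriting history.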

Trigger: open an incident, fire step one

When a routed event fires, the manager opens an incident and immediately processes the first step. The dedup here is its own table, not the alert cache: never open a second incident for a (policy, dedup_key) pair that already has an active incident, or one created within the dedup window. Without that check, a flapping alert spawns incidents endlessly.
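The dedup rule is easy to sketch as a pure function over existing incident rows. This is plain PHP rather than the real query; the function name, the array shape, and the reading of "active" as status = open are all assumptions of mine.

```php
<?php
declare(strict_types=1);

/**
 * Decide whether a new incident may be opened for a (policy, dedup_key) pair.
 *
 * @param array<int, array{policy_id:int, dedup_key:string, status:string, created_at:int}> $incidents
 *        Existing incident rows; created_at is a Unix timestamp.
 */
function shouldOpenIncident(
    array $incidents,
    int $policyId,
    string $dedupKey,
    int $now,
    int $windowSeconds
): bool {
    foreach ($incidents as $row) {
        if ($row['policy_id'] !== $policyId || $row['dedup_key'] !== $dedupKey) {
            continue; // a different policy or key never blocks this one
        }

        $stillActive = $row['status'] === 'open';                     // assumption: active means open
        $recent      = ($now - $row['created_at']) < $windowSeconds;  // created inside the window

        if ($stillActive || $recent) {
            return false; // suppress the duplicate
        }
    }

    return true;
}
```

With a 15-minute window (900 seconds), an incident resolved one minute ago still blocks a new one for the same key, while the same key under a different policy opens normally.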

The advance loop

A per-minute command — lookout:process-escalations — drives everything:

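EscalationIncident::query()
    ->where('status', 'open')
    ->whereNotNull('next_run_at')
    ->where('next_run_at', '<=', now())
    // eachById pages by primary key, so rows updated mid-run are not skipped
    ->eachById(function ($incident) {
        // no return value: returning false from the callback would stop the loop
        $this->processIncident($incident);
    });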

processIncident is the heart of the state machine. Fire the next step's channels, then schedule the one after it:

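$next = $incident->current_step + 1;

if ($next >= $steps->count()) {                 // ran out of steps
    if ($incident->repeats_done < $policy->repeat_count) {
        $incident->repeats_done++;              // loop the whole sequence
        $next = 0;
    } else {
        return $incident->update(['next_run_at' => null]); // done; await ack
    }
}

$this->fireStep($incident, $steps[$next]);

// get() returns null past the last step instead of raising an undefined-key error
$following = $steps->get($next + 1);

if ($following !== null) {
    // the step that follows decides how long to wait before it fires
    $nextRunAt = now()->addMinutes($following->delay_minutes);
} elseif ($incident->repeats_done < $policy->repeat_count) {
    // assumption: a repeat pass waits the first step's delay before restarting
    $nextRunAt = now()->addMinutes($steps[0]->delay_minutes);
} else {
    $nextRunAt = null;                          // nothing left; park until acked
}

$incident->update([
    'current_step' => $next,
    'next_run_at'  => $nextRunAt,
]);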

Three behaviors fall out of that one function: tiered delays (each step schedules the next by its delay), repeat-until-acknowledged (loop back to step 0 while repeats_done < repeat_count), and natural termination (next_run_at = null parks the incident, still open, awaiting a human).
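To see those three behaviors on a timeline, here is a small in-memory replay of the loop in plain PHP, with minutes as integers and no database. simulateEscalation and nextRunAt are hypothetical helpers, and I'm assuming a repeat pass waits the first step's delay_minutes before restarting; the real engine may pick a different gap.

```php
<?php
declare(strict_types=1);

/** Minute at which the loop should run again after firing $step, or null to park the incident. */
function nextRunAt(array $delays, int $step, int $now, bool $repeatsLeft): ?int
{
    if ($step + 1 < count($delays)) {
        return $now + $delays[$step + 1]; // tiered delay: the following step sets the wait
    }
    if ($repeatsLeft) {
        return $now + $delays[0];         // assumption: wait step 0's delay before repeating
    }

    return null;                          // natural termination: still open, awaiting ack
}

/**
 * Replay the advance loop and return [minute, step] pairs in firing order.
 *
 * @param int[] $delays delay_minutes for each step, in order
 * @return array<int, array{0:int, 1:int}>
 */
function simulateEscalation(array $delays, int $repeatCount): array
{
    $fired       = [[0, 0]]; // the trigger fires step 0 immediately at minute 0
    $current     = 0;
    $repeatsDone = 0;
    $runAt       = nextRunAt($delays, 0, 0, $repeatsDone < $repeatCount);

    while ($runAt !== null) {
        $now  = $runAt;
        $next = $current + 1;

        if ($next >= count($delays)) {
            if ($repeatsDone >= $repeatCount) {
                break;                    // out of steps and out of repeats
            }
            $repeatsDone++;               // loop the whole sequence
            $next = 0;
        }

        $fired[] = [$now, $next];
        $current = $next;
        $runAt   = nextRunAt($delays, $next, $now, $repeatsDone < $repeatCount);
    }

    return $fired;
}
```

For delays of 10, 5 and 15 minutes and repeat_count = 1, the replay fires steps 0, 1, 2 at minutes 0, 5, 20, then repeats them at minutes 30, 35, 50 and parks.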

Stopping it: a signed URL

Every escalation message carries an acknowledge link. The trick is it needs no login — a responder on their phone at 2am shouldn't hit an auth wall. Laravel signed URLs are exactly right: the signature is the proof.

URL::signedRoute('escalations.ack', ['incident' => $incident->id]);

The route sits behind the signed middleware, so a tampered or unsigned link is rejected automatically. Acknowledging flips status to acknowledged and nulls next_run_at — the advance loop skips it forever after. There's an in-app ack on the live incident list too, but the signed link is the one that actually gets used.
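For intuition, here is the idea behind a signed link in plain PHP: an HMAC over the URL, checked with a constant-time comparison. This illustrates the principle only and is not Laravel's internal code; signUrl, hasValidSignature, the example.test host, and the secret are all made up for the sketch.

```php
<?php
declare(strict_types=1);

/** Append an HMAC-SHA256 signature of the URL as a query parameter. */
function signUrl(string $url, string $secret): string
{
    $signature = hash_hmac('sha256', $url, $secret);
    $separator = str_contains($url, '?') ? '&' : '?';

    return $url . $separator . 'signature=' . $signature;
}

/** Recompute the signature over the URL without its signature parameter and compare. */
function hasValidSignature(string $signedUrl, string $secret): bool
{
    $pos = strrpos($signedUrl, 'signature=');
    if ($pos === false || $pos === 0) {
        return false; // unsigned link
    }

    $original = substr($signedUrl, 0, $pos - 1);               // drop the "?" or "&" too
    $given    = substr($signedUrl, $pos + strlen('signature='));

    // hash_equals compares in constant time, so timing reveals nothing about the expected value
    return hash_equals(hash_hmac('sha256', $original, $secret), $given);
}
```

Changing the incident id in the path changes the expected HMAC, so a link signed for incident 42 cannot be edited to acknowledge incident 43.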

The design call: don't over-build

The interesting decision was restraint. PagerDuty exists and is excellent; I'm not out to clone it. No on-call calendars, no rotations, no override schedules. Just tiers, delays, repeat, and ack — for teams whose current alternative is nothing. It reuses the existing alert channels and notifier and only adds the "keep going until someone responds" loop on top. Right-sized beats feature-complete.

One capstone feature left: getting errors off the web and onto phones with mobile SDKs.

deep-dive escalation on-call alerting laravel