Automation sensing layer
The sensing layer lets an automation read live system health and reason about durations - “open an incident if a system stays unhealthy for 30 minutes”, “page only when latency has been high for 10 minutes”. The template engine is strictly synchronous with no call syntax, so live state is never queried inline. Instead it is pre-resolved into scope once per evaluation and read as plain data.
State in scope
Before a run starts (and again on resume, and at the trigger-gate sites), the engine resolves live health state and folds it into scope under a health namespace. Templates and conditions then read it as ordinary data.
| Path | Type | Meaning |
|---|---|---|
| health.system | object | State of the system named by the trigger’s context key |
| health.system.status | "healthy" \| "degraded" \| "unhealthy" | Aggregate status |
| health.system.in_status_since | string \| null | ISO timestamp the system entered its current status |
| health.system.in_status_for_ms | number | Milliseconds in the current status |
| health.system.latency_ms | number | Newest run latency |
| health.system.avg_latency_ms | number | Windowed average latency |
| health.system.p95_latency_ms | number | Windowed p95 latency |
| health.system.success_rate | number | Windowed success rate in [0, 1] |
| health.system.in_maintenance | boolean | Whether the system is in an active maintenance window |
| health.system.transitions_in_window | number | Status changes in the trailing window (generalized flapping count) |
| health.system.transition_window_minutes | number | The window (minutes) transitions_in_window was counted over |
| health.systems[<id>] | object | State of any system listed in uses_state |
Resolution policy
By default the engine resolves only the system named by the trigger’s context key (one batched query, the common single-system case). To reason about other systems, list their ids in the automation’s uses_state field; they surface under health.systems[<id>].
    triggers:
      - event: time.interval
        config: { seconds: 60 }
    uses_state:
      - "payments-api"
      - "checkout-api"
    conditions:
      - "health.systems['payments-api'].status == 'unhealthy'"

Trigger for: dwell
A trigger can declare a for: dwell - “fire only if the matched state still holds after this duration”. This is the precise, event-driven dwell that gates a trigger on sustained state, and it is restart-safe.
    triggers:
      - event: healthcheck.system_degraded
        for: { minutes: 30 }   # only if still in the same status after 30 min
    actions:
      - action: incident.create
        config:
          title: "{{ trigger.payload.systemName }} is critical"
          severity: critical
          systemIds: ["{{ trigger.payload.systemId }}"]

for: accepts a single-unit duration ({ seconds }, { minutes }, or { hours }) or { template } rendering to a number of seconds.
How it works:
- When the trigger fires and its filter passes, the engine arms a row in automation_dwell_timers (unique on (automationId, triggerId, contextKey)), snapshotting the system’s current status, and enqueues an automation-dwell wake job with the matching delay. No run starts yet. Arming is idempotent: a re-fire while the dwell is still armed PRESERVES the original deadline rather than pushing it, so the window measures “continuously matched since first arm” (HA semantics). This is essential for continuously-firing triggers like a level-triggered numeric_state - pushing the deadline on every check would mean it never elapses. A genuine recover-then-recur deletes the row first (re-confirm / inverse-cancel), so a fresh window starts then.
- At expiry the dwell re-confirms the system is STILL in the status it was in when armed (via the health-state provider). Only then does it evaluate the automation’s pre-run conditions and start the run. A recovery within the window therefore cancels the pending fire even without an explicit inverse event.
- Cancellation is DB-side: deleting the dwell row makes the queue job no-op when it pops (queue jobs are not cancellable). A state-change event that contradicts an armed dwell (e.g. system.healthy after a system.degraded dwell on the same system) eagerly deletes the row.
The dwell row is the source of truth; the queue job is just the wake signal. If the job is lost (process restart, queue backend loss), the stalled sweeper catches the expired row and fires it. Both paths are idempotent via delete-on-fire, so a dwell fires at most once.
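The arm / re-confirm / cancel rules above can be sketched with an in-memory map standing in for the automation_dwell_timers table. This is a minimal illustration under stated assumptions, not the engine's implementation: the DwellTimers class, its method names, and the explicit nowMs parameter are all hypothetical, and the real engine persists rows and relies on a queue job or the sweeper to call the expiry path.

```typescript
// Hypothetical in-memory stand-in for the dwell table described above.
type Status = "healthy" | "degraded" | "unhealthy";

interface DwellKey {
  automationId: string;
  triggerId: string;
  contextKey: string;
}

interface DwellRow {
  armedStatus: Status; // status snapshot taken when the dwell was first armed
  deadlineMs: number;  // absolute expiry, fixed at first arm
}

class DwellTimers {
  private rows = new Map<string, DwellRow>();

  private keyOf(k: DwellKey): string {
    // Mirrors the unique (automationId, triggerId, contextKey) constraint.
    return [k.automationId, k.triggerId, k.contextKey].join("|");
  }

  // Idempotent arm: a re-fire while armed keeps the original deadline.
  arm(k: DwellKey, status: Status, nowMs: number, dwellMs: number): DwellRow {
    const id = this.keyOf(k);
    const existing = this.rows.get(id);
    if (existing) return existing;
    const row: DwellRow = { armedStatus: status, deadlineMs: nowMs + dwellMs };
    this.rows.set(id, row);
    return row;
  }

  // A contradicting state change deletes the row, so a later wake is a no-op.
  cancel(k: DwellKey): boolean {
    return this.rows.delete(this.keyOf(k));
  }

  // Called by the wake job or the sweeper. Returns true only when the dwell
  // should proceed to pre-run conditions; delete-on-fire gives at-most-once.
  expire(k: DwellKey, currentStatus: Status, nowMs: number): boolean {
    const id = this.keyOf(k);
    const row = this.rows.get(id);
    if (!row || nowMs < row.deadlineMs) return false; // cancelled, fired, or not due
    this.rows.delete(id);
    return currentStatus === row.armedStatus; // re-confirm the armed status
  }
}
```

Because both the wake job and the sweeper would go through the same expire path in this sketch, whichever arrives second finds no row and does nothing.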
numeric_state trigger
The built-in numeric_state trigger fires off healthcheck.check.completed when a numeric field crosses an above / below threshold. Pair it with a trigger-level for: for “above X for Y minutes”.
    triggers:
      - event: automation.numeric_state
        config:
          field: p95LatencyMs   # or latencyMs, or collectors.<id>.<field>
          above: 500
        for: { minutes: 10 }
    actions:
      - action: notification.send
        config: { title: "p95 latency high", body: "{{ trigger.payload.systemId }}" }

- field supports latencyMs (top-level), p95LatencyMs, and dotted collector paths like collectors.http.responseTimeMs (resolved under result.metadata).
- above / below are strict bounds; at least one is required. With both, the value must fall in the open band between them.
- The threshold is enforced per-automation by a structured config gate that runs before the operator’s template filter - so the trigger only fires for automations whose above / below the completed check actually crossed.
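The strict-bound rule can be illustrated with a small predicate. This is a sketch only: the withinBounds name and the Bounds type are hypothetical, and treating a non-finite value as a non-match is an assumption the text does not state.

```typescript
// Hypothetical shape of the above / below config; at least one must be set.
interface Bounds {
  above?: number;
  below?: number;
}

// Returns true when the value satisfies every configured bound.
// Bounds are strict, so a value equal to a bound does not match.
function withinBounds(value: number, bounds: Bounds): boolean {
  const { above, below } = bounds;
  if (above === undefined && below === undefined) {
    throw new Error("at least one of above / below is required");
  }
  if (!Number.isFinite(value)) return false; // assumption: NaN / Infinity never match
  if (above !== undefined && !(value > above)) return false;
  if (below !== undefined && !(value < below)) return false;
  return true;
}
```

With above: 500 a reading of exactly 500 does not match, while 501 does; with both above: 500 and below: 600 only values strictly between the two match.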
Flapping and the windowed transition count
For rules that need the count as a NUMBER in scope (e.g. to branch a choose on how badly a system is flapping, or to combine with other conditions), the health provider also folds a windowed transition count into scope. health.system.transitions_in_window is the number of aggregate status changes for the system over the trailing window; the window defaults to 60 minutes and is set per-automation via the top-level state_window_minutes. Counting all aggregate transitions (not just unhealthy) is a superset of the unhealthy-transition flapping count.
Author such a rule as a numeric_state condition over that field - no new condition variant, no editor change:
    triggers:
      - event: healthcheck.system_health_changed
    state_window_minutes: 30
    conditions:
      - numeric_state:
          value: "health.system.transitions_in_window"
          above: 4   # 5+ status changes in the last 30 min
    actions:
      - action: incident.create
        config: { title: "{{ trigger.payload.systemId }} is flapping", severity: warning }

Duration filters
Duration helpers are pure, synchronous template filters - transforms over already-resolved values, never database calls. They compute against real time at call time, so “now” is fresh per evaluation rather than the frozen run-start timestamp.
| Filter | Form | Result |
|---|---|---|
| minutes | 30 \| minutes | A number of minutes as milliseconds |
| hours | 2 \| hours | A number of hours as milliseconds |
| duration_since | iso \| duration_since | Milliseconds elapsed since an ISO timestamp |
| older_than | iso \| older_than(thresholdMs) | True when the timestamp is at least thresholdMs in the past |
A duration-aware condition reads the pre-resolved in_status_since and compares it with the duration filters:
    // "unhealthy for at least 30 minutes"
    health.system.status == 'unhealthy' && (health.system.in_status_since | older_than(30 | minutes))

duration_since returns 0 for null / unparseable input (never negative); older_than returns false for an unknown timestamp (an unknown age is never “older than” a threshold).
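The documented behaviour of the four filters, including the null and unparseable cases, can be sketched as plain functions. The function names and the optional nowMs parameter are assumptions made so the sketch can be exercised deterministically; the built-in filters read the clock themselves at call time.

```typescript
// Unit helpers: convert a count into milliseconds.
const minutes = (n: number): number => n * 60 * 1000;
const hours = (n: number): number => n * 60 * 60 * 1000;

// Parse an ISO timestamp, returning null for missing or unparseable input.
function parseIso(iso: string | null | undefined): number | null {
  if (!iso) return null;
  const t = Date.parse(iso);
  return Number.isNaN(t) ? null : t;
}

// Milliseconds elapsed since the timestamp; 0 for unknown input, never negative.
function durationSince(iso: string | null | undefined, nowMs: number = Date.now()): number {
  const t = parseIso(iso);
  return t === null ? 0 : Math.max(0, nowMs - t);
}

// True when the timestamp is at least thresholdMs in the past.
// An unknown timestamp is never "older than" anything.
function olderThan(
  iso: string | null | undefined,
  thresholdMs: number,
  nowMs: number = Date.now(),
): boolean {
  const t = parseIso(iso);
  return t === null ? false : nowMs - t >= thresholdMs;
}
```

In this sketch the example condition corresponds to olderThan(health.system.in_status_since, minutes(30)), which is false whenever in_status_since is null.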
Filter extension point
Plugins can contribute their own pure filters through automationFilterExtensionPoint. Filters MUST be pure and synchronous - no I/O, no async, no database access - because the engine evaluates them inline during rendering.
    import { automationFilterExtensionPoint } from "@checkstack/automation-backend";

    const ext = env.getExtensionPoint(automationFilterExtensionPoint);
    ext.registerFilter(
      {
        name: "percent",
        signature: "percent(decimals)",
        description: "Format a 0-1 ratio as a percentage string.",
        filter: (value, decimals) => {
          const n = typeof value === "number" ? value : Number(value);
          if (!Number.isFinite(n)) return value;
          const d = typeof decimals === "number" ? decimals : 0;
          return `${(n * 100).toFixed(d)}%`;
        },
      },
      pluginMetadata,
    );

A filter whose name collides with a built-in is skipped with a warning rather than overwriting the built-in.
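The collision rule can be pictured with a toy registry. Everything here is hypothetical (the FilterRegistry class, its fields, and the warning text are not the engine's API); it only illustrates "skip and warn" as opposed to "overwrite". What happens when two plugins register the same name is not covered by the text, so this sketch arbitrarily lets the later one win.

```typescript
type FilterFn = (value: unknown, ...args: unknown[]) => unknown;

// Hypothetical registry: built-ins are fixed at construction, plugins add on top.
class FilterRegistry {
  private builtIns: ReadonlyMap<string, FilterFn>;
  private plugins = new Map<string, FilterFn>();
  readonly warnings: string[] = [];

  constructor(builtIns: ReadonlyMap<string, FilterFn>) {
    this.builtIns = builtIns;
  }

  // Returns false (and records a warning) when the name is a built-in.
  register(name: string, fn: FilterFn): boolean {
    if (this.builtIns.has(name)) {
      this.warnings.push(`filter "${name}" collides with a built-in; skipped`);
      return false;
    }
    this.plugins.set(name, fn); // assumption: later plugin registration wins
    return true;
  }

  // Built-ins always take precedence on lookup.
  get(name: string): FilterFn | undefined {
    return this.builtIns.get(name) ?? this.plugins.get(name);
  }
}
```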
Structured conditions
Beyond raw template strings and the and / or / not combinators, conditions support three typed variants. Each evaluates over the pre-resolved scope plus a fresh now (the time variant recomputes now per evaluation - never the frozen scope timestamp). The raw template string stays the escape hatch for anything these don’t cover.
numeric_state
Compare a numeric value (a literal number, or a template/path string resolved against scope) to above / below bounds.
    conditions:
      - numeric_state:
          value: "health.system.p95_latency_ms"
          above: 500

time

On-call / quiet-hours gating. after / before are HH:mm (24h) local to timezone (IANA, defaults to UTC); weekday is a list of 0-6 (Sunday = 0). An after greater than before is treated as an overnight window wrapping midnight.
    conditions:
      - time:
          after: "22:00"
          before: "06:00"
          weekday: [1, 2, 3, 4, 5]
          timezone: "Europe/Berlin"

state

A condition-side dwell: true when entity (a system id) is in status, optionally held for at least for. It reads the pre-resolved health.systems[entity].in_status_for_ms, so it adds no new timer; it reuses the duration the provider already computed.
    conditions:
      - state:
          entity: "payments-api"
          status: unhealthy
          for: { minutes: 30 }

wait_until action
wait_until suspends a running automation until a condition becomes true, with an optional timeout. It is the condition counterpart to wait_for_trigger (which waits for an event): instead of waiting for something to happen, it waits for live state to satisfy a condition. It is fully reactive - the suspended run is woken by a relevant entity change, never re-checked on a timer.
    actions:
      - action: incident.create
        config: { title: "{{ trigger.payload.systemName }} down", severity: critical }
      - wait_until:
          condition: "health.system.status == 'healthy'"
          timeout_seconds: 3600       # wait up to 1h
          continue_on_timeout: true   # default; false = fail the run on timeout
      - action: incident.resolve
        config: { incidentId: "{{ artifacts.incident.id }}" }

- condition accepts any condition shape - a template string or a structured numeric_state / time / state variant.
- If the condition is ALREADY true when reached, the run continues inline without suspending.
- Otherwise the run suspends with a durable kind: "until" wait lock carrying the condition + timeout policy. At suspend time the engine extracts the state.* refs the condition reads and inserts wake-index rows keyed by ${kind}:${id}, plus a single durable timeout timer at the deadline. A relevant entity change wakes the run, re-enriches scope, and re-evaluates the full condition: true resumes the run; the timeout deadline resumes (continue) or fails per continue_on_timeout (default true, matching HA’s wait_template); still-false stays suspended.
- Works nested inside choose / parallel / repeat - the engine resumes through the same remainder mechanism as every other suspend.
Like every suspend, wait_until survives a restart: the wait lock + wake-index rows are the source of truth, an ENTITY_CHANGED is the wake signal, and the stalled sweeper applies the timeout policy as a backstop if the timer job is lost. Resumes take the per-run advisory lock, so a wake and a sweep can’t double-resume. The poll_seconds field is now inert (waits are woken by change, not polled). For the full mechanism, see the reactive dispatch pipeline.
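The three outcomes of a wake (resume, apply the timeout policy, stay suspended) can be written as one pure decision function. This is a sketch under assumptions: the UntilLock shape, the WakeOutcome labels, and the decideWake name are hypothetical, and checking the condition before the deadline (so a condition that is true at the deadline counts as a normal resume) is a choice the text does not spell out.

```typescript
type WakeOutcome = "resume" | "resume_after_timeout" | "fail" | "stay_suspended";

// Hypothetical subset of what a kind: "until" wait lock would carry.
interface UntilLock {
  deadlineMs: number | null;   // null when no timeout was configured
  continueOnTimeout: boolean;  // documented default is true
}

// Decide what to do when a suspended run is woken by an entity change,
// the timeout timer, or the sweeper.
function decideWake(lock: UntilLock, conditionHolds: boolean, nowMs: number): WakeOutcome {
  // Assumption: a satisfied condition wins even if the deadline has also passed.
  if (conditionHolds) return "resume";
  if (lock.deadlineMs !== null && nowMs >= lock.deadlineMs) {
    return lock.continueOnTimeout ? "resume_after_timeout" : "fail";
  }
  return "stay_suspended";
}
```

Keeping the decision free of side effects means the entity-change path and the sweeper path can share it; the per-run advisory lock described above is what would prevent the two from acting on the same outcome twice.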