Automation sensing layer

The sensing layer lets an automation read live system health and reason about durations - “open an incident if a system stays unhealthy for 30 minutes”, “page only when latency has been high for 10 minutes”. The template engine is strictly synchronous with no call syntax, so live state is never queried inline. Instead it is pre-resolved into scope once per evaluation and read as plain data.

State in scope

Before a run starts (and again on resume, and at the trigger-gate sites), the engine resolves live health state and folds it into scope under a health namespace. Templates and conditions then read it as ordinary data.

| Path | Type | Meaning |
| --- | --- | --- |
| `health.system` | object | State of the system named by the trigger’s context key |
| `health.system.status` | `"healthy" \| "degraded" \| "unhealthy"` | Aggregate status |
| `health.system.in_status_since` | `string \| null` | ISO timestamp the system entered its current status |
| `health.system.in_status_for_ms` | number | Milliseconds in the current status |
| `health.system.latency_ms` | number | Newest run latency |
| `health.system.avg_latency_ms` | number | Windowed average latency |
| `health.system.p95_latency_ms` | number | Windowed p95 latency |
| `health.system.success_rate` | number | Windowed success rate in [0, 1] |
| `health.system.in_maintenance` | boolean | Whether the system is in an active maintenance window |
| `health.system.transitions_in_window` | number | Status changes in the trailing window (generalized flapping count) |
| `health.system.transition_window_minutes` | number | The window (minutes) `transitions_in_window` was counted over |
| `health.systems[<id>]` | object | State of any system listed in `uses_state` |
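
For orientation, the scope shape in the table above can be written down as TypeScript types. This is only a sketch: the type names (`HealthStatus`, `SystemHealthState`, `HealthScope`) and the sample values are hypothetical and not taken from the engine; only the field paths and their meanings come from the table.

```typescript
// Hypothetical type names; only the field names mirror the documented scope paths.
type HealthStatus = "healthy" | "degraded" | "unhealthy";

interface SystemHealthState {
  status: HealthStatus;
  in_status_since: string | null; // ISO timestamp, null when unknown
  in_status_for_ms: number;
  latency_ms: number;
  avg_latency_ms: number;
  p95_latency_ms: number;
  success_rate: number; // in [0, 1]
  in_maintenance: boolean;
  transitions_in_window: number;
  transition_window_minutes: number;
}

interface HealthScope {
  system: SystemHealthState; // the system named by the trigger's context key
  systems: Record<string, SystemHealthState>; // systems listed in uses_state
}

// Made-up sample values, purely to show how a condition would read the data.
const sampleState: SystemHealthState = {
  status: "unhealthy",
  in_status_since: "2024-01-01T00:00:00Z",
  in_status_for_ms: 1_800_000,
  latency_ms: 820,
  avg_latency_ms: 640,
  p95_latency_ms: 910,
  success_rate: 0.72,
  in_maintenance: false,
  transitions_in_window: 3,
  transition_window_minutes: 60,
};

const exampleScope: { health: HealthScope } = {
  health: { system: sampleState, systems: { "payments-api": sampleState } },
};
```

A template such as `health.systems['payments-api'].status == 'unhealthy'` is then a plain property read against this object.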

Resolution policy

By default the engine resolves only the system named by the trigger’s context key (one batched query, the common single-system case). To reason about other systems, list their ids in the automation’s uses_state field; they surface under health.systems[<id>].

triggers:
  - event: time.interval
    config: { seconds: 60 }
uses_state:
  - "payments-api"
  - "checkout-api"
conditions:
  - "health.systems['payments-api'].status == 'unhealthy'"
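
The policy above amounts to building one de-duplicated id list per evaluation. A minimal sketch, assuming a hypothetical helper name `systemIdsToResolve` and that the trigger's context key may be absent (for example on a `time.interval` trigger):

```typescript
// Hypothetical helper: which system ids a single batched health query would cover.
function systemIdsToResolve(
  contextKey: string | undefined,
  usesState: readonly string[] = [],
): string[] {
  const ids = new Set<string>();
  if (contextKey) ids.add(contextKey); // default: only the trigger's own system
  for (const id of usesState) ids.add(id); // plus anything opted in via uses_state
  return Array.from(ids);
}
```

Whether the engine de-duplicates exactly this way is an assumption; the point is that `uses_state` widens the set, it does not replace the default.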

Trigger `for:` dwell

A trigger can declare a for: dwell - “fire only if the matched state still holds after this duration”. This is the precise, event-driven dwell that gates a trigger on sustained state, and it is restart-safe.

triggers:
  - event: healthcheck.system_degraded
    for: { minutes: 30 }   # only if still in the same status after 30 min
actions:
  - action: incident.create
    config:
      title: "{{ trigger.payload.systemName }} is critical"
      severity: critical
      systemIds: ["{{ trigger.payload.systemId }}"]

for: accepts a single-unit duration ({ seconds }, { minutes }, or { hours }) or { template } rendering to a number of seconds.
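
The accepted shapes can be normalized to milliseconds as in the sketch below. The names `ForDuration` and `forToMs` are hypothetical, `renderTemplate` is a stand-in for the engine's renderer, and rejecting negative or non-numeric results is an assumption rather than documented behavior.

```typescript
// Hypothetical union mirroring the documented for: shapes.
type ForDuration =
  | { seconds: number }
  | { minutes: number }
  | { hours: number }
  | { template: string };

// renderTemplate is a placeholder for the real template renderer (assumption).
function forToMs(
  spec: ForDuration,
  renderTemplate: (template: string) => unknown = () => NaN,
): number {
  let seconds: number;
  if ("seconds" in spec) seconds = spec.seconds;
  else if ("minutes" in spec) seconds = spec.minutes * 60;
  else if ("hours" in spec) seconds = spec.hours * 3600;
  else seconds = Number(renderTemplate(spec.template)); // template renders to seconds

  // Assumption: anything that is not a finite, non-negative number is rejected.
  if (!Number.isFinite(seconds) || seconds < 0) {
    throw new Error("for: must resolve to a non-negative number of seconds");
  }
  return seconds * 1000;
}
```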

How it works:

When the trigger fires and its filter passes, the engine arms a row in automation_dwell_timers (unique on (automationId, triggerId, contextKey)), snapshots the system’s current status, and enqueues an automation-dwell wake job with the matching delay. No run starts yet. Arming is idempotent: a re-fire while the dwell is still armed PRESERVES the original deadline rather than pushing it, so the window measures “continuously matched since first arm” (Home Assistant semantics). This is essential for continuously firing triggers such as a level-triggered numeric_state, where pushing the deadline on every check would mean it never elapses. A genuine recover-then-recur deletes the row first (through the expiry re-confirm or the inverse-event cancel), so a fresh window starts then.
At expiry the dwell re-confirms the system is STILL in the status it was in when armed (via the health-state provider). Only then does it evaluate the automation’s pre-run conditions and start the run. A recovery within the window therefore cancels the pending fire even without an explicit inverse event.
Cancellation is DB-side: deleting the dwell row makes the queue job no-op when it pops (queue jobs are not cancellable). A state-change event that contradicts an armed dwell (e.g. system.healthy after a system.degraded dwell on the same system) eagerly deletes the row.

The dwell row is the source of truth; the queue job is just the wake signal. If the job is lost (process restart, queue backend loss), the stalled sweeper catches the expired row and fires it. Both paths are idempotent via delete-on-fire, so a dwell fires at most once.
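
The arm / cancel / expire lifecycle described above can be modelled in memory as follows. This is a sketch under stated assumptions, not the engine's code: `DwellTimers` and its method names are hypothetical, a `Map` stands in for the automation_dwell_timers table, and the pre-run condition check that follows a successful re-confirm is left out.

```typescript
type Status = "healthy" | "degraded" | "unhealthy";

interface DwellRow {
  key: string; // stands in for (automationId, triggerId, contextKey)
  armedStatus: Status; // status snapshot taken when the dwell was armed
  deadlineMs: number;
}

class DwellTimers {
  private rows = new Map<string, DwellRow>();

  // Idempotent arm: a re-fire keeps the original deadline instead of pushing it.
  arm(key: string, status: Status, nowMs: number, dwellMs: number): DwellRow {
    const existing = this.rows.get(key);
    if (existing) return existing;
    const row: DwellRow = { key, armedStatus: status, deadlineMs: nowMs + dwellMs };
    this.rows.set(key, row);
    return row;
  }

  // Cancellation is just a row delete; a later wake finds nothing and no-ops.
  cancel(key: string): void {
    this.rows.delete(key);
  }

  // Called by the wake job or the sweeper. Returns true when a run may start.
  expire(key: string, nowMs: number, currentStatus: Status): boolean {
    const row = this.rows.get(key);
    if (!row || nowMs < row.deadlineMs) return false; // cancelled, fired, or not due
    this.rows.delete(key); // delete-on-fire: a second caller sees no row
    return currentStatus === row.armedStatus; // re-confirm the armed status
  }
}
```

Because both the wake job and the sweeper go through `expire`, whichever arrives second finds no row, which is what makes the fire at-most-once in this model.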

numeric_state trigger

The built-in numeric_state trigger is driven by healthcheck.check.completed events and fires when a numeric field crosses an above / below threshold. Pair it with a trigger-level for: to express “above X for Y minutes”.

triggers:
  - event: automation.numeric_state
    config:
      field: p95LatencyMs      # or latencyMs, or collectors.<id>.<field>
      above: 500
    for: { minutes: 10 }
actions:
  - action: notification.send
    config: { title: "p95 latency high", body: "{{ trigger.payload.systemId }}" }

field supports latencyMs (top-level), p95LatencyMs, and dotted collector paths like collectors.http.responseTimeMs (resolved under result.metadata).
above / below are strict bounds; at least one is required. With both, the value must fall in the open band between them.
The threshold is enforced per automation by a structured config gate that runs before the operator’s template filter, so the trigger fires only for automations whose own above / below bounds the completed check actually crossed.
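
The strict-bound rule above can be sketched as a small predicate. The names `Bounds` and `withinBounds` are hypothetical, and treating a non-finite value as a non-match is an assumption.

```typescript
interface Bounds {
  above?: number;
  below?: number;
}

// Strict comparison: a value equal to a bound does not match.
function withinBounds(value: number, { above, below }: Bounds): boolean {
  if (above === undefined && below === undefined) {
    throw new Error("at least one of above / below is required");
  }
  if (!Number.isFinite(value)) return false; // assumption: NaN / Infinity never match
  if (above !== undefined && !(value > above)) return false;
  if (below !== undefined && !(value < below)) return false;
  return true; // with both bounds set, the value lies in the open band between them
}
```

With `above: 500`, a p95 of exactly 500 ms does not match, while 501 ms does.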

Flapping and the windowed transition count

For rules that need a flapping count as a NUMBER in scope (e.g. to branch a choose on how badly a system is flapping, or to combine it with other conditions), the health provider folds a windowed transition count into scope. health.system.transitions_in_window is the number of aggregate status changes for the system over the trailing window; the window defaults to 60 minutes and is set per automation via the top-level state_window_minutes. Because it counts all aggregate transitions, not just unhealthy ones, the value is a superset of the unhealthy-transition flapping count.

Author such a rule as a numeric_state condition over that field - no new condition variant, no editor change:

triggers:
  - event: healthcheck.system_health_changed
state_window_minutes: 30
conditions:
  - numeric_state:
      value: "health.system.transitions_in_window"
      above: 4          # 5+ status changes in the last 30 min
actions:
  - action: incident.create
    config: { title: "{{ trigger.payload.systemId }} is flapping", severity: warning }
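
To make the windowed count concrete, here is a sketch of how such a value could be derived from a status history. The `StatusSample` shape and the `transitionsInWindow` name are hypothetical, and the sketch assumes the history is sorted by time ascending; how the provider actually stores and queries transitions is not described in this document.

```typescript
interface StatusSample {
  atMs: number; // when this aggregate status was observed
  status: string;
}

// Counts status changes whose timestamp falls inside the trailing window.
function transitionsInWindow(
  history: readonly StatusSample[],
  nowMs: number,
  windowMinutes: number = 60, // documented default window
): number {
  const cutoff = nowMs - windowMinutes * 60_000;
  let count = 0;
  for (let i = 1; i < history.length; i++) {
    const prev = history[i - 1];
    const cur = history[i];
    if (!prev || !cur) continue;
    const inWindow = cur.atMs >= cutoff && cur.atMs <= nowMs;
    if (inWindow && cur.status !== prev.status) count++;
  }
  return count;
}
```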

Duration filters

Duration helpers are pure, synchronous template filters - transforms over already-resolved values, never database calls. They compute against real time at call time, so “now” is fresh per evaluation rather than the frozen run-start timestamp.

| Filter | Form | Result |
| --- | --- | --- |
| `minutes` | `30 \| minutes` | A number of minutes as milliseconds |
| `hours` | `2 \| hours` | A number of hours as milliseconds |
| `duration_since` | `iso \| duration_since` | Milliseconds elapsed since an ISO timestamp |
| `older_than` | `iso \| older_than(thresholdMs)` | True when the timestamp is at least `thresholdMs` in the past |

A duration-aware condition reads the pre-resolved in_status_since and compares it with the duration filters:

// "unhealthy for at least 30 minutes"
health.system.status == 'unhealthy' && (health.system.in_status_since | older_than(30 | minutes))

duration_since returns 0 for null / unparseable input (never negative); older_than returns false for an unknown timestamp (an unknown age is never “older than” a threshold).
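
The documented semantics of the four filters can be expressed as plain functions. This is a sketch, not the engine's implementation: the camelCase names and the explicit `nowMs` parameter (added so the behavior is deterministic) are assumptions.

```typescript
const minutes = (n: number): number => n * 60_000;
const hours = (n: number): number => n * 3_600_000;

// Parses an ISO timestamp; NaN signals null or unparseable input.
function parseIso(iso: string | null | undefined): number {
  return typeof iso === "string" ? Date.parse(iso) : NaN;
}

// 0 for null / unparseable input, never negative.
function durationSince(iso: string | null | undefined, nowMs: number = Date.now()): number {
  const t = parseIso(iso);
  if (Number.isNaN(t)) return 0;
  return Math.max(0, nowMs - t);
}

// false for an unknown timestamp; true once the age reaches the threshold.
function olderThan(
  iso: string | null | undefined,
  thresholdMs: number,
  nowMs: number = Date.now(),
): boolean {
  const t = parseIso(iso);
  if (Number.isNaN(t)) return false;
  return nowMs - t >= thresholdMs;
}
```

In these terms, the “unhealthy for at least 30 minutes” condition is `status === "unhealthy" && olderThan(in_status_since, minutes(30))`.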

Filter extension point

Plugins can contribute their own pure filters through automationFilterExtensionPoint. Filters MUST be pure and synchronous - no I/O, no async, no database access - because the engine evaluates them inline during rendering.

import { automationFilterExtensionPoint } from "@checkstack/automation-backend";

const ext = env.getExtensionPoint(automationFilterExtensionPoint);
ext.registerFilter(
  {
    name: "percent",
    signature: "percent(decimals)",
    description: "Format a 0-1 ratio as a percentage string.",
    filter: (value, decimals) => {
      const n = typeof value === "number" ? value : Number(value);
      if (!Number.isFinite(n)) return value;
      const d = typeof decimals === "number" ? decimals : 0;
      return `${(n * 100).toFixed(d)}%`;
    },
  },
  pluginMetadata,
);

A filter whose name collides with a built-in is skipped with a warning rather than overwriting the built-in.
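
The collision rule can be pictured with a toy registry. Everything here is hypothetical (`FilterRegistry`, its constructor shape, the injectable `warn` callback) and is not the `@checkstack/automation-backend` API; how the real extension point treats collisions between two plugins is not covered above, so this sketch only guards built-in names.

```typescript
type FilterFn = (value: unknown, ...args: unknown[]) => unknown;

// Toy registry: built-ins are seeded up front and can never be overwritten.
class FilterRegistry {
  private readonly filters: Map<string, FilterFn>;
  private readonly builtInNames: Set<string>;

  constructor(builtIns: Array<[string, FilterFn]>) {
    this.filters = new Map(builtIns);
    this.builtInNames = new Set(builtIns.map(([name]) => name));
  }

  // Returns false (and warns) when the name belongs to a built-in.
  register(name: string, fn: FilterFn, warn: (msg: string) => void = console.warn): boolean {
    if (this.builtInNames.has(name)) {
      warn(`filter "${name}" collides with a built-in and was skipped`);
      return false;
    }
    this.filters.set(name, fn);
    return true;
  }

  get(name: string): FilterFn | undefined {
    return this.filters.get(name);
  }
}
```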

Structured conditions

Beyond raw template strings and the and / or / not combinators, conditions support three typed variants. Each evaluates over the pre-resolved scope plus a fresh now (the time variant recomputes now per evaluation - never the frozen scope timestamp). The raw template string stays the escape hatch for anything these don’t cover.

numeric_state

Compare a numeric value (a literal number, or a template/path string resolved against scope) to above / below bounds.

conditions:
  - numeric_state:
      value: "health.system.p95_latency_ms"
      above: 500

time

On-call / quiet-hours gating. after / before are HH:mm (24h) local to timezone (IANA, defaults to UTC); weekday is a list of 0-6 (Sunday = 0). An after greater than before is treated as an overnight window wrapping midnight.

conditions:
  - time:
      after: "22:00"
      before: "06:00"
      weekday: [1, 2, 3, 4, 5]
      timezone: "Europe/Berlin"
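
A sketch of how such a window could be evaluated, using the standard `Intl.DateTimeFormat` API to get the local clock time and weekday. The names `TimeCondition` and `matchesTime` are hypothetical. Two details are assumptions the text above does not settle: the lower bound is treated as inclusive and the upper bound as exclusive, and for an overnight window the weekday is checked against the current local day.

```typescript
interface TimeCondition {
  after?: string; // "HH:mm", 24h
  before?: string; // "HH:mm", 24h
  weekday?: number[]; // 0-6, Sunday = 0
  timezone?: string; // IANA name, defaults to UTC
}

const WEEKDAY_NAMES = ["Sun", "Mon", "Tue", "Wed", "Thu", "Fri", "Sat"];

function toMinutes(hhmm: string): number {
  const parts = hhmm.split(":");
  return Number(parts[0]) * 60 + Number(parts[1]);
}

function matchesTime(cond: TimeCondition, now: Date = new Date()): boolean {
  // Local wall-clock parts in the condition's timezone.
  const parts = new Intl.DateTimeFormat("en-US", {
    timeZone: cond.timezone ?? "UTC",
    weekday: "short",
    hour: "2-digit",
    minute: "2-digit",
    hour12: false,
  }).formatToParts(now);
  const part = (type: string): string => parts.find((p) => p.type === type)?.value ?? "";

  const day = WEEKDAY_NAMES.indexOf(part("weekday"));
  // Some engines report midnight as "24" with hour12: false, hence the modulo.
  const nowMin = (Number(part("hour")) % 24) * 60 + Number(part("minute"));

  if (cond.weekday && cond.weekday.indexOf(day) === -1) return false;

  const after = cond.after !== undefined ? toMinutes(cond.after) : undefined;
  const before = cond.before !== undefined ? toMinutes(cond.before) : undefined;
  if (after !== undefined && before !== undefined) {
    return after > before
      ? nowMin >= after || nowMin < before // overnight window wrapping midnight
      : nowMin >= after && nowMin < before;
  }
  if (after !== undefined) return nowMin >= after;
  if (before !== undefined) return nowMin < before;
  return true;
}
```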

state

A condition-side dwell: true when entity (a system id) is in status, optionally held for at least for. It arms no new timer; it reads the pre-resolved health.systems[entity].in_status_for_ms that the provider already computed.

conditions:
  - state:
      entity: "payments-api"
      status: unhealthy
      for: { minutes: 30 }
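
Evaluated against pre-resolved scope, the state variant reduces to a status comparison plus a duration comparison. A minimal sketch with hypothetical names (`StateCondition`, `matchesState`); returning false for an entity that is missing from scope is an assumption.

```typescript
interface ResolvedSystem {
  status: string;
  in_status_for_ms: number;
}

interface StateCondition {
  entity: string;
  status: string;
  for?: { seconds?: number; minutes?: number; hours?: number };
}

function matchesState(
  cond: StateCondition,
  systems: Record<string, ResolvedSystem | undefined>,
): boolean {
  const state = systems[cond.entity];
  if (!state || state.status !== cond.status) return false; // assumption: unknown entity is false
  if (!cond.for) return true; // no dwell requested, status match is enough
  const requiredMs =
    ((cond.for.seconds ?? 0) + (cond.for.minutes ?? 0) * 60 + (cond.for.hours ?? 0) * 3600) * 1000;
  return state.in_status_for_ms >= requiredMs; // "held for at least for"
}
```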

wait_until action

wait_until suspends a running automation until a condition becomes true, with an optional timeout. It is the condition counterpart to wait_for_trigger (which waits for an event): instead of waiting for something to happen, it waits for live state to satisfy a condition. It is fully reactive - the suspended run is woken by a relevant entity change, never re-checked on a timer.

actions:
  - action: incident.create
    config: { title: "{{ trigger.payload.systemName }} down", severity: critical }
  - wait_until:
      condition: "health.system.status == 'healthy'"
      timeout_seconds: 3600       # wait up to 1h
      continue_on_timeout: true   # default; false = fail the run on timeout
  - action: incident.resolve
    config: { incidentId: "{{ artifacts.incident.id }}" }

condition accepts any condition shape - a template string or a structured numeric_state / time / state variant.
If the condition is ALREADY true when reached, the run continues inline without suspending.
Otherwise the run suspends with a durable kind: "until" wait lock carrying the condition and its timeout policy. At suspend time the engine extracts the state.* refs the condition reads and inserts wake-index rows keyed by ${kind}:${id}, plus a single durable timeout timer at the deadline. A relevant entity change wakes the run, re-enriches scope, and re-evaluates the full condition. If the condition is now true, the run resumes. If it is still false, the run stays suspended. If the timeout deadline passes first, the run either continues or fails according to continue_on_timeout (default true, matching Home Assistant’s wait_template).
Works nested inside choose / parallel / repeat - the engine resumes through the same remainder mechanism as every other suspend.
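
The three outcomes of a wake can be summarized as a small decision function. This is a sketch with hypothetical names (`UntilLock`, `decideWake`, the outcome labels); checking the condition before the deadline is an assumption about ordering that the text above does not spell out.

```typescript
type WakeOutcome = "resume" | "fail" | "stay_suspended";

interface UntilLock {
  deadlineMs?: number; // absent when no timeout was configured
  continueOnTimeout?: boolean; // documented default: true
}

// Decides what happens when an entity change or the timeout timer wakes the run.
function decideWake(lock: UntilLock, conditionTrue: boolean, nowMs: number): WakeOutcome {
  if (conditionTrue) return "resume"; // condition satisfied: continue the run
  const timedOut = lock.deadlineMs !== undefined && nowMs >= lock.deadlineMs;
  if (timedOut) {
    return (lock.continueOnTimeout ?? true) ? "resume" : "fail";
  }
  return "stay_suspended"; // still false and not timed out: keep waiting
}
```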

Like every suspend, wait_until survives a restart: the wait lock + wake-index rows are the source of truth, an ENTITY_CHANGED is the wake signal, and the stalled sweeper applies the timeout policy as a backstop if the timer job is lost. Resumes take the per-run advisory lock, so a wake and a sweep can’t double-resume. The poll_seconds field is now inert (waits are woken by change, not polled). For the full mechanism, see the reactive dispatch pipeline.