Automation primitives reference

This page is the canonical per-primitive reference for an automation definition. Every action, trigger, and condition is listed with its shape and a runnable YAML snippet. The visual editor and the YAML view round-trip losslessly, so anything here is also editable in the builder. For the runtime semantics behind these shapes, see the automation platform overview and the sensing layer.

An automation definition consists of triggers, optional conditions, and ordered actions, plus run-control fields (mode, concurrency_scope, max_runs). Actions are discriminated by which key is present (action, choose, parallel, delay, repeat, variables, condition, stop, wait_for_trigger, wait_until, sequence).

name: "Page on sustained latency"
mode: single # single | parallel | queued | restart
concurrency_scope: context # automation | context
max_runs: 10 # queue depth for mode: queued
triggers:
  - event: automation.numeric_state
    config: { field: p95LatencyMs, above: 500 }
    for: { minutes: 10 }
conditions:
  - time:
      after: "08:00"
      before: "20:00"
actions:
  - action: notification.send
    config: { title: "p95 latency high", body: "{{ trigger.payload.systemId }}" }

Every action also accepts these shared fields:

  • id - a stable identifier, used to reference the action’s artifacts (artifacts.<id>.<name>) and to identify the action in run logs. Auto-filled if left blank.
  • description - an optional operator note.
  • enabled - set to false to skip the action without deleting it. Defaults to true.
  • continue_on_error - when true, a failure in this action does not halt the run. Defaults to false.
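
As an illustration of the shared fields, the sketch below sets each of them at least once. Placing them next to the action key is an assumption, as are the id value open_incident, the description text, and the artifact name id; the artifacts an action exposes depend on the action itself.

```yaml
actions:
  - action: incident.create
    id: open_incident # hypothetical id; later read as artifacts.open_incident.<name>
    description: "Open a ticket before notifying anyone"
    config: { title: "Investigate", severity: warning }
  - action: notification.send
    enabled: false # skipped on every run, but kept in the definition
    config: { title: "Noisy heads-up" }
  - action: notification.send
    continue_on_error: true # a failed send does not halt the run
    # assumes incident.create exposes an artifact named "id"
    config: { title: "Ticket {{ artifacts.open_incident.id }} opened" }
```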

The action primitive calls a registered action by its namespaced id (plugin.action_name) and renders its templated config.

actions:
  - action: incident.create
    config:
      title: "{{ trigger.payload.systemName }} is down"
      severity: critical
      systemIds: ["{{ trigger.payload.systemId }}"]

The choose primitive provides if / elif / else branching. Each clause has a when condition and a sequence; only the first clause whose when matches runs its sequence. The optional else runs when no clause matches.

actions:
  - choose:
      - when: "trigger.payload.severity == 'critical'"
        sequence:
          - action: notification.send
            config: { title: "Critical", body: "Paging on-call" }
      - when: "trigger.payload.severity == 'warning'"
        sequence:
          - action: notification.send
            config: { title: "Warning", body: "FYI" }
    else:
      - action: notification.send
        config: { title: "Info", body: "Logged only" }

The parallel primitive fans out actions concurrently and waits for all of them to finish. Each branch is itself an action - wrap multi-step branches in a sequence.

actions:
  - parallel:
      - action: notification.send
        config: { title: "Notify ops" }
      - sequence:
          - action: incident.create
            config: { title: "Investigate", severity: warning }
          - action: notification.send
            config: { title: "Ticket opened" }

The sequence primitive wraps an ordered list of actions as a single action. It is useful as a multi-action branch inside parallel / choose, or to apply one id / continue_on_error to a group atomically.

actions:
  - sequence:
      id: triage
      continue_on_error: true
      sequence:
        - action: incident.create
          config: { title: "Triage", severity: warning }
        - action: notification.send
          config: { title: "Triage started" }

The delay primitive sleeps for a fixed or templated number of seconds (max 86400, i.e. 24 hours). The run suspends durably and resumes when the delay elapses.

actions:
  - delay: { seconds: 300 } # five minutes
  - delay: { template: "{{ trigger.payload.cooldownSeconds }}" }

The repeat primitive loops a sequence in one of four modes. repeat.index is exposed in every mode; for_each also exposes repeat.item.

  • count - run the sequence a fixed number of times.
  • for_each - a template rendering to a JSON array; run once per item.
  • while - evaluate a condition before each iteration; stop when false.
  • until - evaluate a condition after each iteration; stop when true.

while / until accept an optional max_iterations (defaults to 1000) as a safety net.

actions:
  - repeat:
      count: 3
      sequence:
        - action: notification.send
          config: { title: "Reminder {{ repeat.index }}" }
  - repeat:
      for_each: "{{ trigger.payload.affectedSystems }}"
      sequence:
        - action: incident.create
          config: { title: "{{ repeat.item }} affected", severity: warning }
  - repeat:
      while: "health.system.status != 'healthy'"
      max_iterations: 20
      sequence:
        - wait_until:
            condition: "health.system.status == 'healthy'"
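
The examples above cover count, for_each, and while. A minimal sketch of the remaining until mode follows, assuming the same health.system.status path used elsewhere on this page; the 60-second delay and the cap of 10 iterations are illustrative values.

```yaml
actions:
  - repeat:
      until: "health.system.status == 'healthy'" # checked after each iteration
      max_iterations: 10 # safety net; defaults to 1000 when omitted
      sequence:
        - action: notification.send
          config: { title: "Still unhealthy (iteration {{ repeat.index }})" }
        - delay: { seconds: 60 }
```

Because until is evaluated after the sequence, the body runs at least once, even if the condition is already true when the loop is reached.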

The variables primitive defines locally scoped values for downstream actions, which read them as var.<name>. Values can be literals or templates; templates render at execution time and the rendered value is stored under the variable name.

actions:
  - variables:
      threshold: 500
      summary: "{{ trigger.payload.systemName }} at {{ trigger.payload.p95 }}ms"
  - action: notification.send
    config: { title: "{{ var.summary }}" }

The condition primitive is a mid-run guard. If the condition is false, the run halts (unless continue_on_error: true). It accepts any condition shape - a template string or a structured variant.

actions:
  - condition: "health.system.status == 'unhealthy'"
  - action: incident.create
    config: { title: "Confirmed unhealthy", severity: critical }
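
The guard can also take a structured variant instead of a template string. The sketch below assumes a structured condition nests under the condition key in the same shape it has in the top-level conditions list; the threshold is illustrative.

```yaml
actions:
  - condition:
      numeric_state: # assumed nesting, mirroring the top-level conditions shape
        value: "health.system.p95_latency_ms"
        above: 500
  - action: incident.create
    config: { title: "Latency confirmed high", severity: warning }
```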

The stop primitive explicitly halts the run, with an optional reason and an error flag. Setting error: true marks the run as failed.

actions:
  - choose:
      - when: "health.system.in_maintenance"
        sequence:
          - stop: { reason: "System in maintenance window" }
  - action: incident.create
    config: { title: "Real outage", severity: critical }
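
The example above ends the run without marking it as failed. A sketch of a failing stop follows, assuming error sits alongside reason in the same mapping; the severity check and the reason text are illustrative.

```yaml
actions:
  - choose:
      - when: "trigger.payload.severity != 'critical'"
        sequence:
          - stop:
              reason: "Unexpected severity for this automation"
              error: true # marks the run as failed
  - action: incident.create
    config: { title: "Critical outage", severity: critical }
```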

The wait_for_trigger primitive suspends the run until a matching event arrives, with an optional timeout (max 30 days). context_key defaults to the triggering event’s context key, so a wait inside an incident.created run matches the incident.resolved event for the same incident.

actions:
  - action: incident.create
    config: { title: "Outage", severity: critical }
  - wait_for_trigger:
      event: incident.resolved
      timeout_seconds: 86400
      filter: "trigger.payload.id == artifacts.incident.id"
  - action: notification.send
    config: { title: "Resolved within SLA" }

The wait_until primitive suspends the run until a condition becomes true, woken reactively by a relevant entity change (no polling). It is the condition counterpart to wait_for_trigger. If the condition is already true when reached, the run continues without suspending.

actions:
  - action: incident.create
    config: { title: "{{ trigger.payload.systemName }} down", severity: critical }
  - wait_until:
      condition: "health.system.status == 'healthy'"
      timeout_seconds: 3600 # wait up to 1h
      continue_on_timeout: true # default; false = fail the run on timeout
  - action: incident.resolve
    config: { incidentId: "{{ artifacts.incident.id }}" }
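
For contrast, a sketch of the strict variant, where a timeout fails the run instead of letting it continue. The 900-second timeout and the notification text are illustrative.

```yaml
actions:
  - wait_until:
      condition: "health.system.status == 'healthy'"
      timeout_seconds: 900 # 15 minutes
      continue_on_timeout: false # the run fails if the condition is still false at the timeout
  - action: notification.send
    config: { title: "Recovered within 15 minutes" }
```

With continue_on_timeout: false, the notification is only reached when the system actually recovered inside the timeout.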

A trigger is the entry point. Every trigger has an event; built-in triggers also take config. Optional per-trigger fields: an id (a discriminator for trigger.id in choose clauses), a gating filter expression (a bare expression, no {{ }}), a for: dwell, and a window: rate gate.

triggers:
  - event: healthcheck.system_degraded
    id: payments_degraded
    filter: "trigger.payload.systemId == 'payments-api'"
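
One way to use the id discriminator is to route several triggers through a single choose. This sketch assumes trigger.id can be compared as a string in a when expression, as the description above suggests; the second trigger, its id, and its filter are illustrative.

```yaml
triggers:
  - event: healthcheck.system_degraded
    id: payments_degraded
    filter: "trigger.payload.systemId == 'payments-api'"
  - event: healthcheck.system_degraded
    id: other_degraded # hypothetical second trigger
    filter: "trigger.payload.systemId != 'payments-api'"
actions:
  - choose:
      - when: "trigger.id == 'payments_degraded'"
        sequence:
          - action: incident.create
            config: { title: "Payments degraded", severity: critical }
      - when: "trigger.id == 'other_degraded'"
        sequence:
          - action: notification.send
            config: { title: "{{ trigger.payload.systemId }} degraded" }
```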

The for: dwell fires the trigger only if the matched state still holds after a duration. It accepts a single-unit duration ({ seconds }, { minutes }, or { hours }) or a { template } that renders to seconds. The dwell is restart-safe and idempotent - a re-fire while armed preserves the original deadline.

triggers:
  - event: healthcheck.system_degraded
    for: { minutes: 30 }

The window: rate gate fires the trigger only after it has fired (post-filter) at least count times within the trailing minutes, scoped per context key (e.g. per system). The engine records each qualifying occurrence in a durable append log and counts rows within the sliding window. The refire option controls what happens once the threshold is reached:

  • refire: every (default) fires on every occurrence at/over the threshold - it re-fires while the window stays over threshold, so debounce in the automation (mode: single + for:) if you want “page once”.
  • refire: once fires only on the crossing edge (when the count first reaches count) and re-arms naturally as old occurrences age out of the window and the count re-crosses.

The window gate runs AFTER filter (so only qualifying occurrences count) and BEFORE for: (so the two can compose). The count is read from shared Postgres, so it is identical on every pod; the single occurrence INSERT happens on the one pod that claims the emission from the work queue.

triggers:
  - event: healthcheck.check_failed
    window: { count: 5, minutes: 10 } # 5 check failures in 10 min, per system
actions:
  - action: incident.create
    config: { title: "{{ trigger.payload.systemId }} failing repeatedly", severity: warning }

A single automation covers ALL systems: the count is bucketed per partition key and fires per key, so one automation pages independently for every flapping system without enumerating them. By default the partition is the trigger’s built-in context key (systemId for health triggers, incidentId for incident triggers, and so on).

partitionBy overrides the dimension the count is bucketed by. It is a bare expression (same flavour as filter - no {{ }}) evaluated against the trigger scope; the result is coerced to a string and used as the partition key.

triggers:
  - event: healthcheck.check_failed
    window:
      count: 5
      minutes: 10
      partitionBy: trigger.payload.severity # per-severity rate, across all systems

  • Omitted (default): the trigger’s built-in context key (e.g. systemId) - one window per system. Existing automations are unaffected.
  • An explicit equivalent of the default is partitionBy: trigger.payload.systemId.
  • A composite key is just an expression: partitionBy: trigger.payload.systemId + ":" + trigger.payload.checkId (per system-and-check).
  • Fallback: if the expression evaluates to null / undefined / empty, OR it fails to evaluate, the gate falls back to the built-in context key rather than counting globally (so a typo never collapses every partition into one bucket). Evaluation errors are logged.
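
Written out in full, the composite key from the list above might look like the sketch below. Wrapping the expression in single quotes is a choice made here so the embedded ":" literal cannot be misread by a YAML parser; the count and minutes values are carried over from the earlier example.

```yaml
triggers:
  - event: healthcheck.check_failed
    window:
      count: 5
      minutes: 10
      # one window per system-and-check pair
      partitionBy: 'trigger.payload.systemId + ":" + trigger.payload.checkId'
```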

The automation.numeric_state trigger fires on a completed health check when a numeric field crosses an above / below threshold. Pair it with for: to express “above X for Y minutes”. field supports latencyMs, p95LatencyMs, and dotted collector paths like collectors.http.responseTimeMs.

triggers:
  - event: automation.numeric_state
    config:
      field: p95LatencyMs
      above: 500
    for: { minutes: 10 }

Flapping detection (windowed transition count)

Flapping is just the window: rate gate over the raw healthcheck.system_health_changed change event, filtered to unhealthy transitions. There is no dedicated flapping trigger or hook - healthcheck emits only the raw per-system aggregated-health change, and the engine does the counting. Use refire: once so a flapping system pages on the crossing edge rather than on every subsequent transition.

triggers:
  - event: healthcheck.system_health_changed
    id: flapping
    filter: 'trigger.payload.newStatus != "healthy"' # count unhealthy transitions
    window: { count: 3, minutes: 60, refire: once } # 3 in 60 min, per system
actions:
  - action: incident.create
    config: { title: "{{ trigger.payload.systemId }} is flapping", severity: warning }

Conditions are pre-run gates (top-level conditions) or mid-run guards (the condition action). Beyond raw template strings and the and / or / not combinators, three structured variants are available. The raw template string stays the escape hatch for anything they do not cover.

conditions:
  - and:
      - "health.system.status == 'unhealthy'"
      - or:
          - "trigger.payload.severity == 'critical'"
          - not: "health.system.in_maintenance"

The numeric_state condition compares a numeric value (a literal, or a template/path resolved against scope) to above / below bounds. With both set, the value must fall in the open band between them (above < value < below).

conditions:
  - numeric_state:
      value: "health.system.p95_latency_ms"
      above: 500
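
A sketch of the two-bound form follows, using the same path as above. The upper bound of 2000 is an illustrative value.

```yaml
conditions:
  - numeric_state:
      value: "health.system.p95_latency_ms"
      above: 500
      below: 2000 # passes only when 500 < value < 2000
```

Since the band is open, a value of exactly 500 or exactly 2000 does not pass.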

The time condition provides on-call / quiet-hours gating. after / before are HH:mm (24h) local to timezone (an IANA name, defaults to UTC); weekday is a list of 0-6 (Sunday = 0). An after later than before describes an overnight window that wraps midnight.

conditions:
  - time:
      after: "22:00"
      before: "06:00"
      weekday: [1, 2, 3, 4, 5]
      timezone: "Europe/Berlin"

The state condition is true when entity (a catalog system id) is in status, optionally held for at least for. It reads the pre-resolved health.systems[entity].in_status_for_ms, so no new timer is created.

conditions:
  - state:
      entity: "payments-api"
      status: unhealthy
      for: { minutes: 30 }