Automation platform

The automation platform lets operators wire triggers to ordered actions with full control flow (choose / parallel / repeat / delay / wait), and lets any plugin contribute triggers, actions, and artifact types. The engine is fully reactive: each plugin makes its own domain state reactive through the defineEntity wrapper (the framework records change history but never owns the state), and state changes drive triggers and waiting runs through a two-stage work-queue pipeline rather than polling. This page is the subsystem overview; see the sensing-layer and plugin-author pages for the building blocks and the extension API.

Building blocks

Trigger - an entry point. Either hook-backed (subscribes to a plugin hook) or setup-backed (manages its own schedule, e.g. cron). Declares a contextKey extractor that scopes artifact lookups and waits to a domain entity (a system, an incident).
Action - a unit of work. Renders its templated config, runs, and optionally produces a typed artifact and/or consumes upstream ones.
Artifact - a typed payload an action produced (e.g. a Jira issue, an incident), persisted so later actions in the run - or the run that resumes after a wait - can act on it.
Condition - a pre-run gate or mid-run guard. A template string, an and / or / not combinator, or a structured numeric_state / time / state variant.
Control flow - choose, parallel, repeat, sequence, delay, wait_for_trigger, wait_until, variables, stop.

An automation’s definition (triggers + conditions + actions + mode) is stored as JSON, validated by AutomationDefinitionSchema, and round-trips losslessly to YAML.

Row fields vs the definition

A handful of operator-facing fields live as their own columns on the automations row rather than inside the definition JSON: name, description, group, status, and runAs. They are part of AutomationSchema (and the create / update inputs), not AutomationDefinitionSchema, so they are NOT round-tripped to YAML.

runAs is the id of the application (service account) the automation runs as - see Running as a service account.

group is a single optional free-text label (a Home Assistant-style “category”) used purely to organise the list view into collapsible sections. An empty / absent group means the automation lives in the implicit “Ungrouped” bucket. listAutomations accepts an optional group filter, and listAutomationGroups returns the distinct non-null group values (sorted) that power the editor’s group picker. Because group is a row column, the list query can group and filter without parsing every definition blob.
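
The grouping rules above can be sketched as pure functions. This is an illustration under stated assumptions, not the platform's query code: the AutomationRow shape and the names distinctGroups and bucketByGroup are hypothetical. Only the "distinct non-null, sorted" rule and the implicit "Ungrouped" bucket come from the text.

```typescript
// Hypothetical row shape: only the fields needed for grouping.
interface AutomationRow {
  id: string;
  name: string;
  group?: string | null;
}

const UNGROUPED = "Ungrouped";

// Distinct, non-empty group labels in sorted order (what a group picker needs).
function distinctGroups(rows: AutomationRow[]): string[] {
  const seen = new Set<string>();
  for (const row of rows) {
    if (row.group) seen.add(row.group);
  }
  return Array.from(seen).sort();
}

// Bucket rows by group; an empty or absent group falls into the implicit bucket.
function bucketByGroup(rows: AutomationRow[]): Map<string, AutomationRow[]> {
  const buckets = new Map<string, AutomationRow[]>();
  for (const row of rows) {
    const key = row.group ? row.group : UNGROUPED;
    const bucket = buckets.get(key) ?? [];
    bucket.push(row);
    buckets.set(key, bucket);
  }
  return buckets;
}
```

Because both functions read only the group column, nothing here needs the definition JSON, which is the point of keeping group on the row.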

For declaratively managed automations, the group is expressed in GitOps metadata - see the GitOps entity kinds reference.

Running as a service account

Every automation declares a runAs application (a service account). EVERY data-access call an action makes authenticates as that bounded identity, so an automation can never do more than its service account is allowed - the same access rules and team scope apply as for a human user.

How it works:

At run time the dispatch engine builds a per-run RPC client bound to runAs (a backend-minted, short-lived, signed app-principal token) and threads it into every action’s execute as context.rpcClient. The token resolves to an application principal, so it flows through the full access-rule and team-scope enforcement - it is never treated as a trusted service (which would bypass all checks). Neither the AI model nor any action can therefore escalate beyond the service account, even when fed untrusted input.
Bind authority. A user may only set runAs to an application whose access rules are a subset of their own (isApplicationBindable); a user can never grant an automation more authority than they hold. The editor’s picker lists only bindable applications (getBindableApplications); the create / update handlers re-enforce it.
Required. runAs is required - an automation with no service account fails to run with a clear error rather than falling back to a trusted client. (Legacy rows created before this field must be assigned a service account before they run again.)
GitOps. Declarative automations set the service account via the run-as metadata label (admin/declarative, so no interactive bind check).
Service-account permissions. Grant the service account the access rules its actions need via roles - e.g. notification.send for the built-in notify_user action. Trusted backend services still bypass access checks; only principal-typed callers (users and service accounts) are gated.
Per-action access rules. An action declares the rules its runAs must hold via requiredAccessRules, and the dispatch engine enforces them against the resolved runAs principal BEFORE the action runs, failing the step if any is missing. This is the ONLY authorization point for actions that resolve credentials through a trusted service rather than the bounded rpcClient - notably the integration actions (integration-jira.create_issue / search_issues / transition_issue / add_comment, integration-teams.post_message, integration-webex.post_message). Without it, merely being able to author automations would grant the ability to act on any configured connection. Creating a Jira ticket therefore requires the matching rule (e.g. integration-jira.create_issue.manage) to be granted to the service account’s role. automation.propose performs the same check up front, so an author sees a missing-permission error on the review card instead of a first-run failure.
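
The bind-authority and per-action checks in the list above both reduce to set comparisons over access rules. The following is a minimal sketch, assuming an access rule can be modelled as a plain string id and ignoring team scope. The Principal shape and the names canBind and checkRequiredRules are hypothetical and stand in for whatever isApplicationBindable and the dispatch engine actually do.

```typescript
// Hypothetical shape: a principal is an id plus the rule ids it holds.
interface Principal {
  id: string;
  accessRules: string[];
}

// Bind authority: every rule the application holds must also be held by the user.
function canBind(user: Principal, app: Principal): boolean {
  return app.accessRules.every((rule) => user.accessRules.indexOf(rule) !== -1);
}

type GateResult = { ok: true } | { ok: false; error: string };

// Per-action gate, evaluated before the action runs; a miss fails the step.
function checkRequiredRules(runAs: Principal, requiredAccessRules: string[]): GateResult {
  const missing = requiredAccessRules.filter(
    (rule) => runAs.accessRules.indexOf(rule) === -1,
  );
  if (missing.length > 0) {
    return { ok: false, error: `runAs "${runAs.id}" is missing: ${missing.join(", ")}` };
  }
  return { ok: true };
}
```

With this model, a user holding only notification.send cannot bind an application that also holds integration-jira.create_issue.manage, and an action requiring that rule fails its gate for any runAs that lacks it.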

The AI action

The built-in ai_analyze action runs a bounded AI agent on the run context (the trigger payload + upstream artifacts, injected into the prompt). The agent can investigate and act through the SAME tools the chat assistant uses - but as the automation’s runAs service account, so it can never exceed that identity’s permissions. Destructive tools are never offered (no human to confirm); mutating tools auto-apply through the service account’s own client.

Config: connectionId (an OpenAI-compatible AI integration, chosen via a connection picker in the editor), optional model / systemPrompt / maxSteps, the prompt, and optional outputFields - an author-defined set of typed fields (string/number/boolean/enum) the agent fills.
System prompt: the runner always injects a non-negotiable baseline that tells the unattended agent its boundaries - it runs as a bounded service account, any change it makes takes effect immediately and irreversibly (so it only changes state when the task requires it), an empty result may be a permission boundary rather than “nothing exists”, and it must ground concepts in the docs and never fabricate values. An author-supplied systemPrompt APPENDS role/task framing on top of this baseline; it never replaces it, so an override can add context but can never silently drop a safety line.
Artifact (automation.analysis): { summary, data, toolCalls } where data is the author-defined structured object. Downstream actions consume it and branch on a field (e.g. a “severity” enum). When outputFields are set, the structured-output pass validates the model’s object against the derived schema; on a schema miss the runner feeds the validation error back and retries a bounded number of times before failing the step, so a recoverable near-miss self-corrects but a malformed object never reaches a downstream choose/condition.
Engine: the agent loop lives in @checkstack/ai-backend and is exposed as the aiAgentRunnerRef service; automation-backend (which already depends on ai-backend) drives it. The runner resolves the allowed tools for the run’s principal and executes them through the run’s rpcClient.
Notifying subscribers: the health-check plugin contributes healthcheck.notifySystemSubscribers / notifySystemGroupSubscribers tools (and the chat assistant can use them too). They call notification.notifyForSubscription, gated by the notification.send access rule, so the service account must hold it.
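
The validate-and-retry behaviour described for outputFields can be sketched as a bounded loop. This is an illustration, not the runner's code: the OutputField shape mirrors the key / type / options keys used in the YAML on this page, while validateOutput, runStructuredPass, and the synchronous generate callback are hypothetical (a real model call would be asynchronous).

```typescript
type FieldType = "string" | "number" | "boolean" | "enum";

interface OutputField {
  key: string;
  type: FieldType;
  description?: string;
  options?: string[]; // only meaningful for enum fields
}

// Returns validation errors; an empty list means the object matches the fields.
function validateOutput(fields: OutputField[], value: unknown): string[] {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return ["expected an object"];
  }
  const record = value as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of fields) {
    const actual = record[field.key];
    if (field.type === "enum") {
      const options = field.options ?? [];
      if (typeof actual !== "string" || options.indexOf(actual) === -1) {
        errors.push(`${field.key}: expected one of ${options.join(", ")}`);
      }
    } else if (typeof actual !== field.type) {
      errors.push(`${field.key}: expected ${field.type}`);
    }
  }
  return errors;
}

// Hypothetical model call: receives the previous validation error, if any.
type Generate = (feedback: string | undefined) => unknown;

type StructuredResult =
  | { ok: true; data: Record<string, unknown>; attempts: number }
  | { ok: false; error: string; attempts: number };

// Retry a bounded number of times, feeding the validation error back each time.
function runStructuredPass(fields: OutputField[], generate: Generate, maxAttempts: number): StructuredResult {
  let feedback: string | undefined = undefined;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const candidate = generate(feedback);
    const errors = validateOutput(fields, candidate);
    if (errors.length === 0) {
      return { ok: true, data: candidate as Record<string, unknown>, attempts: attempt };
    }
    feedback = errors.join("; ");
  }
  return { ok: false, error: feedback ?? "no attempts made", attempts: maxAttempts };
}
```

The failure branch is what keeps a malformed object away from a downstream choose: the step fails instead of emitting data that does not match the declared fields.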

Example: a “system degraded” trigger -> an ai_analyze action that reads the incident/system state, opens an incident, notifies the system’s subscribers, and emits a severity field a following action branches on.

The agent is offered the service account’s read + mutate tools - both hand-authored tools (run via their own handler) and projected *.list read tools (routed through the live router as the principal). Destructive tools are never offered. Every tool call the agent makes is written to the AI tool-call audit log (ai_tool_calls) under the automation transport.
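
One way to picture the tool selection described above is a filter over a tool catalog. This sketch assumes each tool carries an effect classification and a list of required access rules; the ToolDescriptor shape, the offeredTools name, and the example tools other than healthcheck.notifySystemSubscribers are hypothetical.

```typescript
type ToolEffect = "read" | "mutate" | "destructive";

// Hypothetical descriptor: the real tool registry may look different.
interface ToolDescriptor {
  name: string;
  effect: ToolEffect;
  requiredAccessRules: string[];
}

// Offer read and mutate tools the principal is allowed to use; never destructive ones.
function offeredTools(catalog: ToolDescriptor[], principalRules: string[]): ToolDescriptor[] {
  return catalog.filter(
    (tool) =>
      tool.effect !== "destructive" &&
      tool.requiredAccessRules.every((rule) => principalRules.indexOf(rule) !== -1),
  );
}
```

The destructive check is unconditional in this sketch: holding the matching rule does not bring a destructive tool back, because an unattended run has nobody to confirm it.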

Failing or stopping the run from an AI action

The AI action plugs into the engine’s normal control flow - it needs no special “AI can stop the run” power (which would hand a prompt-injectable model an availability lever and hide the decision from the automation graph).

If the AI step itself fails (a bad connection, a missing service account, a model error), it returns failure and - like any action - halts the run by default. To make it best-effort so the rest of the run still proceeds, set continue_on_error:

actions:
  - id: analyze
    action: automation.ai_analyze
    continue_on_error: true # AI is best-effort; a failure here doesn't halt
    config:
      connectionId: my-openai
      prompt: "Summarise the failure."
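
The effect of continue_on_error on the walk can be sketched in a few lines. This is a simplified model under stated assumptions: a flat action list rather than the real action tree, and hypothetical names (StepSpec, runActions, the halted flag) that are not taken from the engine.

```typescript
interface StepResult {
  success: boolean;
  error?: string;
}

// Hypothetical step shape; continueOnError mirrors continue_on_error in the YAML.
interface StepSpec {
  id: string;
  continueOnError?: boolean;
  run: () => StepResult;
}

interface WalkOutcome {
  halted: boolean;
  executed: string[];
  errors: string[];
}

// Walk steps in order; a failure halts the walk unless the step opted out.
function runActions(steps: StepSpec[]): WalkOutcome {
  const executed: string[] = [];
  const errors: string[] = [];
  for (const step of steps) {
    executed.push(step.id);
    const result = step.run();
    if (result.success) continue;
    errors.push(`${step.id}: ${result.error ?? "failed"}`);
    if (!step.continueOnError) {
      return { halted: true, executed, errors };
    }
  }
  return { halted: false, executed, errors };
}
```

In this model the error is still recorded when the step opts out, so a best-effort AI step stays visible in the run history even though later actions proceed.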

To let the AI’s judgement stop the run, have it emit a structured field and gate the flow on it with the normal choose / stop primitives - the decision stays operator-authored and visible:

actions:
  - id: analyze
    action: automation.ai_analyze
    config:
      connectionId: my-openai
      prompt: "Analyse the failure. Set actionable=false for a known false positive."
      outputFields:
        - { key: actionable, type: boolean, description: "Should a human be paged" }
        - { key: severity, type: enum, description: "Severity", options: [low, high] }
  # Stop before paging when the AI judged this a non-actionable false positive.
  - choose:
      - when: "artifacts.analyze.analysis.data.actionable == false"
        sequence:
          - stop: { reason: "AI judged this a non-actionable false positive" }
  - action: incident.create
    config:
      title: "{{ artifacts.analyze.analysis.summary }}"

The AI’s output lives at artifacts.<id>.analysis.data.<field> (here artifacts.analyze.analysis.data.actionable); any condition, choose, or stop can read it.

Dispatch lifecycle

A trigger fires - a setup-backed schedule tick, or a reactive entity change. A change to a domain’s entity state is routed through the two-stage dispatch pipeline: Stage 1 (one instance claims) derives the qualified trigger event id(s) and the waiting runs to wake, and Stage 2 fans a per-run job out across instances.
The trigger fan-in resolves the contextKey, pre-resolves live state into scope (the state.<kind>.<id> namespace, with health.* kept as a back-compat alias), and evaluates the trigger filter and pre-run conditions.
Concurrency is applied per mode (single / parallel / queued / restart), scoped per concurrency_scope (whole automation, or per context key). The check-then-create is serialized under a transaction-scoped advisory lock keyed on (automationId, scope), so two concurrent fires (or a dwell-fire racing a fresh fire, or two pods) can’t both pass a single-mode “no active run” check and both start a run.
The engine walks the action tree, persisting a step row per action and a durable scope snapshot after each step.
Suspending actions persist a durable lock, and the run resumes under a per-run advisory lock when woken. A delay, a wait_for_trigger, or a for: dwell enqueues a wake job (or waits for a matching event); a reactive wait_until instead records wake-index rows and a single timeout timer, and is woken by a relevant entity change. A stalled-run sweeper is the restart-safety backstop.

Every suspend survives a process restart: the durable row is the source of truth, the queue job is just the wake signal, and resumes take an advisory lock so no run double-fires.
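
The concurrency step of the lifecycle can be sketched as a pure decision that the engine would evaluate while holding the advisory lock. Everything here is an assumption modelled on the mode names: the text confirms only the single-mode "no active run" check and that restart mode cancels active runs, so the queued behaviour, the scope literals "automation" and "context_key", and the names lockKey and decideFire are hypothetical.

```typescript
type Mode = "single" | "parallel" | "queued" | "restart";

type Decision =
  | { kind: "start" }
  | { kind: "skip" }
  | { kind: "enqueue" }
  | { kind: "restart"; cancelRunIds: string[] };

// Key the lock (and the active-run lookup) on the automation, or on
// automation + context key when concurrency is scoped per context key.
function lockKey(
  automationId: string,
  scope: "automation" | "context_key",
  contextKey: string | undefined,
): string {
  return scope === "context_key" ? `${automationId}:${contextKey ?? ""}` : automationId;
}

// Decide what a new fire does, given the runs already active in the same scope.
// Must run inside the lock so the check and the create cannot interleave.
function decideFire(mode: Mode, activeRunIds: string[]): Decision {
  if (mode === "parallel" || activeRunIds.length === 0) {
    return { kind: "start" };
  }
  switch (mode) {
    case "single":
      return { kind: "skip" };
    case "queued":
      return { kind: "enqueue" };
    case "restart":
      return { kind: "restart", cancelRunIds: activeRunIds.slice() };
  }
}
```

Two fires that compute the same lockKey serialize on the same lock, so the second one sees the run the first one created and cannot also pass the "no active run" check.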

Suspend / resume invariants

These guarantees keep suspended runs from being re-run or resurrected:

A suspended (waiting) run is owned by the wait-lock / queue resume paths, never the sweeper. findStalledRunIds returns only runs with status = 'running' (it joins automation_runs), and the suspend-finalisation does NOT clobber the run’s lastActionPath checkpoint to null. Together these stop the sweeper from re-walking an intentional wait from the top (which would re-fire pre-wait side effects and leak a second wait lock).
Stalled recovery refuses a run that still holds a live wait lock. recoverStalledRun only recovers a genuinely-running run with no wait lock; if a lock exists it leaves the run to the wait/resume paths and deletes nothing - so a crash-mid-wait recovery can’t create a duplicate lock or duplicate delay job.
A cancelled / terminal run can never resume. resumeRun guards on status === 'waiting' (mirroring checkWaitUntil): any other status drops the stale wait lock and returns without resuming. Cancellation (operator cancelRun or restart-mode cancelActiveRuns) deletes the affected runs’ wait locks and run-state in the same operation, so a later trigger / delay-expiry / racing queue job can’t wake a cancelled run.
A run resuming on a different pod re-seeds its output-mask set. The run-wide secret-masking registry is in-memory and per-process, so a pod that did not originally run a suspended automation starts with an empty mask set. Before walking or persisting, both resumeRun and recoverStalledRun re-resolve the automation’s declared secret refs (the secretEnv mappings and connectionId references in its action configs) through the run’s wrapped getService (which auto-registers each resolved value), so the resuming pod re-populates the same least-privilege, by-value mask set. Without this, a carried-over scope value / artifact / error persisted on the new pod could leak a credential resolved on the original pod.
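
The first three invariants can be modelled with a small in-memory store. This is a sketch of the guards only, not of the durable tables: the RunStore shape, the status names other than 'running' and 'waiting', and the *Sketch function names are hypothetical, and it assumes that a successful resume consumes the wait lock.

```typescript
type RunStatus = "running" | "waiting" | "success" | "failed" | "cancelled";

// Hypothetical in-memory stand-in for the durable run and wait-lock rows.
interface RunStore {
  status: Record<string, RunStatus>;
  waitLocks: Record<string, boolean>;
}

// The sweeper only ever considers runs that are actively running.
function stalledCandidates(store: RunStore): string[] {
  return Object.keys(store.status).filter((id) => store.status[id] === "running");
}

// Resume only a waiting run; any other status drops the stale lock.
function resumeRunSketch(store: RunStore, runId: string): "resumed" | "dropped" {
  const wasWaiting = store.status[runId] === "waiting";
  delete store.waitLocks[runId]; // assumption: a resume consumes the lock too
  if (!wasWaiting) return "dropped";
  store.status[runId] = "running";
  return "resumed";
}

// Recover only a running run with no wait lock; otherwise touch nothing.
function recoverStalledSketch(
  store: RunStore,
  runId: string,
): "recovered" | "left-to-wait-path" | "ignored" {
  if (store.status[runId] !== "running") return "ignored";
  if (store.waitLocks[runId]) return "left-to-wait-path";
  return "recovered";
}

// Cancellation removes the wait lock in the same operation.
function cancelRunSketch(store: RunStore, runId: string): void {
  store.status[runId] = "cancelled";
  delete store.waitLocks[runId];
}
```

A late wake job for a cancelled run therefore hits the status guard in resumeRunSketch and returns without walking any actions.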

Extension points

Plugins register into these in their register() phase:

automationTriggerExtensionPoint.registerTrigger(...)
automationActionExtensionPoint.registerAction(...)
automationArtifactTypeExtensionPoint.registerArtifactType(...)
automationFilterExtensionPoint.registerFilter(...) - pure template filters.
entityExtensionPoint.defineEntity(...) / declareNonReactiveState(...) / onEntityChanged(...) / registerChangeDeriver(...) - declare reactive entity state and react to cross-plugin changes. See the entity state machine.

The automation backend also exposes read-only service refs (automationRegistriesRef, automationArtifactStoreRef) for cross-plugin introspection and artifact lookups, and a GitOps Automation entity kind so automations can be declared in Git.

See the primitives reference for the shape and a runnable YAML example of every action, trigger, and condition; extending the automation platform for the registration API; the sensing layer for live state, duration filters, dwells, and structured conditions; the entity state machine for exposing reactive state; and the reactive dispatch pipeline for how a state change becomes a run.