Automation platform
The automation platform lets operators wire triggers to ordered actions with full control flow (choose / parallel / repeat / delay / wait), and lets any plugin contribute triggers, actions, and artifact types. The engine is fully reactive: each plugin makes its own domain state reactive through the defineEntity wrapper (the framework records change history but never owns the state), and state changes drive triggers and waiting runs through a two-stage work-queue pipeline rather than polling. This page is the subsystem overview; see the sensing-layer and plugin-author pages for the building blocks and the extension API.
Building blocks
- Trigger - an entry point. Either hook-backed (subscribes to a plugin hook) or setup-backed (manages its own schedule, e.g. cron). Declares a contextKey extractor that scopes artifact lookups and waits to a domain entity (a system, an incident).
- Action - a unit of work. Renders its templated config, runs, and optionally produces a typed artifact and/or consumes upstream ones.
- Artifact - a typed payload an action produced (e.g. a Jira issue, an incident), persisted so later actions in the run - or the run that resumes after a wait - can act on it.
- Condition - a pre-run gate or mid-run guard. A template string, an and / or / not combinator, or a structured numeric_state / time / state variant.
- Control flow - choose, parallel, repeat, sequence, delay, wait_for_trigger, wait_until, variables, stop.
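To make the condition variants above concrete, here is a minimal TypeScript sketch of how an and / or / not combinator could be evaluated. The SketchCondition type, the evaluateCondition function, and the field names on the numeric_state variant (key, above, below) are assumptions for illustration, not the platform's actual schema; template strings are stubbed as a lookup of pre-evaluated results instead of a real template engine.

```typescript
// Hypothetical condition shapes; the real AutomationDefinitionSchema may differ.
type SketchCondition =
  | { kind: "template"; expr: string }
  | { kind: "and" | "or"; conditions: SketchCondition[] }
  | { kind: "not"; condition: SketchCondition }
  | { kind: "numeric_state"; key: string; above?: number; below?: number };

interface SketchScope {
  // Stand-in for a template engine: pre-evaluated results keyed by expression.
  templates: Record<string, boolean>;
  // Numeric state values keyed by an illustrative state key.
  numbers: Record<string, number>;
}

function evaluateCondition(c: SketchCondition, scope: SketchScope): boolean {
  switch (c.kind) {
    case "template":
      return scope.templates[c.expr] === true;
    case "and":
      return c.conditions.every((x) => evaluateCondition(x, scope));
    case "or":
      return c.conditions.some((x) => evaluateCondition(x, scope));
    case "not":
      return !evaluateCondition(c.condition, scope);
    case "numeric_state": {
      const value = scope.numbers[c.key];
      if (value === undefined) return false; // unknown state never passes
      if (c.above !== undefined && !(value > c.above)) return false;
      if (c.below !== undefined && !(value < c.below)) return false;
      return true;
    }
  }
}

// Gate: latency above 500 AND not in maintenance.
const sketchGate: SketchCondition = {
  kind: "and",
  conditions: [
    { kind: "numeric_state", key: "latency_ms", above: 500 },
    { kind: "not", condition: { kind: "template", expr: "maintenance" } },
  ],
};
const sketchGateScope: SketchScope = {
  templates: { maintenance: false },
  numbers: { latency_ms: 820 },
};
const sketchGatePasses = evaluateCondition(sketchGate, sketchGateScope);
```

With the values shown, the gate passes; setting maintenance to true or dropping latency_ms to 500 or below makes it fail.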
An automation’s definition (triggers + conditions + actions + mode) is stored as JSON, validated by AutomationDefinitionSchema, and round-trips losslessly to YAML.
Row fields vs the definition
A handful of operator-facing fields live as their own columns on the automations row rather than inside the definition JSON: name, description, group, status, and runAs. They are part of AutomationSchema (and the create / update inputs), not AutomationDefinitionSchema, so they are NOT round-tripped to YAML.
runAs is the id of the application (service account) the automation runs as - see Running as a service account.
group is a single optional free-text label (HA-style “category”) used purely to organise the list view into collapsible sections. An empty / absent group means the automation lives in the implicit “Ungrouped” bucket. listAutomations accepts an optional group filter, and listAutomationGroups returns the distinct non-null group values (sorted) that power the editor’s group picker. Because group is a row column, the list query can group and filter without parsing every definition blob.
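The grouping behaviour described above can be sketched in a few lines of TypeScript. This is an in-memory illustration only: SketchAutomationRow, distinctGroups, and bucketByGroup are hypothetical names, and the real listAutomationGroups presumably runs as a database query over the group column.

```typescript
// Hypothetical row shape: only the two columns this sketch needs.
interface SketchAutomationRow {
  title: string;
  group?: string | null;
}

// Distinct non-empty group values, sorted (what a group picker would show).
function distinctGroups(rows: SketchAutomationRow[]): string[] {
  const seen = new Set<string>();
  for (const row of rows) {
    if (row.group) seen.add(row.group); // skips null, undefined and ""
  }
  return Array.from(seen).sort();
}

// Bucket automations for the list view; empty or absent groups fall into "Ungrouped".
function bucketByGroup(rows: SketchAutomationRow[]): Map<string, string[]> {
  const buckets = new Map<string, string[]>();
  for (const row of rows) {
    const key = row.group ? row.group : "Ungrouped";
    const titles = buckets.get(key) ?? [];
    titles.push(row.title);
    buckets.set(key, titles);
  }
  return buckets;
}

const sketchRows: SketchAutomationRow[] = [
  { title: "a", group: "Network" },
  { title: "b", group: null },
  { title: "c", group: "Backups" },
  { title: "d", group: "Network" },
  { title: "e" },
];
const sketchGroups = distinctGroups(sketchRows);
const sketchBuckets = bucketByGroup(sketchRows);
```

Here sketchGroups holds the two named groups in sorted order, and automations b and e land in the "Ungrouped" bucket.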
For declaratively managed automations, the group is expressed in GitOps metadata - see the GitOps entity kinds reference.
Running as a service account
Every automation declares a runAs application (a service account). EVERY data-access call an action makes authenticates as that bounded identity, so an automation can never do more than its service account is allowed - the same access rules and team scope apply as for a human user.
How it works:
- At run time the dispatch engine builds a per-run RPC client bound to runAs (a backend-minted, short-lived, signed app-principal token) and threads it into every action’s execute as context.rpcClient. The token resolves to an application principal, so it flows through the full access-rule and team-scope enforcement - it is never treated as a trusted service (which would bypass all checks). The model/actions therefore cannot escalate beyond the service account, even when fed untrusted input.
- Bind authority. A user may only set runAs to an application whose access rules are a subset of their own (isApplicationBindable); a user can never grant an automation more authority than they hold. The editor’s picker lists only bindable applications (getBindableApplications); the create / update handlers re-enforce it.
- Required. runAs is required - an automation with no service account fails to run with a clear error rather than falling back to a trusted client. (Legacy rows created before this field must be assigned a service account before they run again.)
- GitOps. Declarative automations set the service account via the run-as metadata label (admin/declarative, so no interactive bind check).
- Service-account permissions. Grant the service account the access rules its actions need via roles - e.g. notification.send for the built-in notify_user action. Trusted backend services still bypass access checks; only principal-typed callers (users and service accounts) are gated.
- Per-action access rules. An action declares the rules its runAs must hold via requiredAccessRules, and the dispatch engine enforces them against the resolved runAs principal BEFORE the action runs (failing the step if missing). This is the ONLY authorization point for actions that resolve credentials through a trusted service rather than the bounded rpcClient - notably the integration actions (integration-jira.create_issue / search_issues / transition_issue / add_comment, integration-teams.post_message, integration-webex.post_message). Without it, merely being able to author automations would grant the ability to act on any configured connection. So creating Jira tickets, etc., requires the matching rule (e.g. integration-jira.create_issue.manage) granted to the service account’s role. automation.propose checks the same up front, so an author sees a missing-permission error on the review card instead of a first-run failure.
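The bind-authority and per-action checks above both reduce to set comparisons. The sketch below assumes access rules can be modelled as flat strings, which may be a simplification (real rules also carry team scope); canBindSketch and missingRulesSketch are hypothetical helpers, not the platform's isApplicationBindable or its dispatch-engine code, and the example.* rule names are made up.

```typescript
type SketchAccessRule = string;

// Bind check: every rule the application holds must also be held by the user.
function canBindSketch(
  userRules: ReadonlySet<SketchAccessRule>,
  applicationRules: ReadonlySet<SketchAccessRule>,
): boolean {
  return Array.from(applicationRules).every((rule) => userRules.has(rule));
}

// Pre-run check: which of an action's required rules is the runAs principal missing?
// A non-empty result would fail the step before the action executes.
function missingRulesSketch(
  required: SketchAccessRule[],
  principalRules: ReadonlySet<SketchAccessRule>,
): SketchAccessRule[] {
  return required.filter((rule) => !principalRules.has(rule));
}

const sketchUserRules = new Set([
  "notification.send",
  "integration-jira.create_issue.manage",
  "example.read", // made-up rule
]);
const sketchAppRules = new Set([
  "notification.send",
  "integration-jira.create_issue.manage",
]);

const sketchCanBind = canBindSketch(sketchUserRules, sketchAppRules);
const sketchMissing = missingRulesSketch(
  ["integration-jira.create_issue.manage", "example.write"], // second rule is made up
  sketchAppRules,
);
```

In this example the user may bind the application because its rules are a subset of their own, while an action requiring example.write would be rejected before it runs.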
The AI action
The built-in ai_analyze action runs a bounded AI agent on the run context
(the trigger payload + upstream artifacts, injected into the prompt). The agent
can investigate and act through the SAME tools the chat assistant uses - but as
the automation’s runAs service account, so it can never exceed that identity’s
permissions. Destructive tools are never offered (no human to confirm); mutating
tools auto-apply through the service account’s own client.
- Config: connectionId (an OpenAI-compatible AI integration, chosen via a connection picker in the editor), optional model / systemPrompt / maxSteps, the prompt, and optional outputFields - an author-defined set of typed fields (string / number / boolean / enum) the agent fills.
- System prompt: the runner always injects a non-negotiable baseline that tells the unattended agent its boundaries - it runs as a bounded service account, any change it makes takes effect immediately and irreversibly (so it only changes state when the task requires it), an empty result may be a permission boundary rather than “nothing exists”, and it must ground concepts in the docs and never fabricate values. An author-supplied systemPrompt APPENDS role/task framing on top of this baseline; it never replaces it, so an override can add context but can never silently drop a safety line.
- Artifact (automation.analysis): { summary, data, toolCalls } where data is the author-defined structured object. Downstream actions consume it and branch on a field (e.g. a “severity” enum). When outputFields are set, the structured-output pass validates the model’s object against the derived schema; on a schema miss the runner feeds the validation error back and retries a bounded number of times before failing the step, so a recoverable near-miss self-corrects but a malformed object never reaches a downstream choose / condition.
- Engine: the agent loop lives in @checkstack/ai-backend and is exposed as the aiAgentRunnerRef service; automation-backend (which already depends on ai-backend) drives it. The runner resolves the allowed tools for the run’s principal and executes them through the run’s rpcClient.
- Notifying subscribers: the health-check plugin contributes healthcheck.notifySystemSubscribers / notifySystemGroupSubscribers tools (and the chat assistant can use them too). They call notification.notifyForSubscription, gated by the notification.send access rule, so the service account must hold it.
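The validate-and-retry behaviour described for outputFields can be sketched as follows. This is an illustration under stated assumptions, not the runner's implementation: SketchOutputField, validateOutput, and runStructuredPass are hypothetical names, the model call is replaced by a synchronous generate callback (a real call would be asynchronous), and the retry budget is a plain parameter.

```typescript
type SketchFieldType = "string" | "number" | "boolean" | "enum";

interface SketchOutputField {
  key: string;
  type: SketchFieldType;
  options?: string[]; // only meaningful for enum fields
}

// Returns human-readable validation errors; an empty array means the object is valid.
function validateOutput(fields: SketchOutputField[], data: unknown): string[] {
  if (typeof data !== "object" || data === null || Array.isArray(data)) {
    return ["expected an object"];
  }
  const obj = data as Record<string, unknown>;
  const errors: string[] = [];
  for (const field of fields) {
    const value = obj[field.key];
    if (field.type === "enum") {
      const options = field.options ?? [];
      if (typeof value !== "string" || options.indexOf(value) === -1) {
        errors.push(field.key + ": expected one of " + options.join(", "));
      }
    } else if (typeof value !== field.type) {
      errors.push(field.key + ": expected " + field.type);
    }
  }
  return errors;
}

// Calls generate, feeds validation errors back on a miss, and gives up after maxRetries.
function runStructuredPass(
  fields: SketchOutputField[],
  generate: (feedback: string[] | null) => unknown,
  maxRetries: number,
): Record<string, unknown> {
  let feedback: string[] | null = null;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const candidate = generate(feedback);
    const errors = validateOutput(fields, candidate);
    if (errors.length === 0) return candidate as Record<string, unknown>;
    feedback = errors; // the next attempt sees what was wrong
  }
  throw new Error("structured output failed validation: " + (feedback ?? []).join("; "));
}

const sketchFields: SketchOutputField[] = [
  { key: "actionable", type: "boolean" },
  { key: "severity", type: "enum", options: ["low", "high"] },
];
let sketchModelCalls = 0;
const sketchAnalysis = runStructuredPass(
  sketchFields,
  (feedback) => {
    sketchModelCalls++;
    // The first attempt is a near-miss (string instead of boolean); the retry corrects it.
    return feedback === null
      ? { actionable: "no", severity: "high" }
      : { actionable: false, severity: "high" };
  },
  2,
);
```

The near-miss self-corrects on the second call, and an object that never validates raises an error instead of being handed to downstream steps.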
Example: a “system degraded” trigger -> an ai_analyze action that reads the
incident/system state, opens an incident, notifies the system’s subscribers, and
emits a severity field a following action branches on.
The agent is offered the service account’s read + mutate tools - both
hand-authored tools (run via their own handler) and projected *.list read
tools (routed through the live router as the principal). Destructive tools are
never offered. Every tool call the agent makes is written to the AI tool-call
audit log (ai_tool_calls) under the automation transport.
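The tool-selection rule above (read and mutate tools the principal is allowed to use, never destructive ones) can be pictured as a simple filter. The SketchTool shape, the kind labels, and the single requiredRule field are assumptions for this sketch; the example.* tool and rule names are made up, and the real runner resolves tools through its own registry.

```typescript
type SketchToolKind = "read" | "mutate" | "destructive";

interface SketchTool {
  id: string;
  kind: SketchToolKind;
  requiredRule: string; // access rule the calling principal must hold
}

// Offer only non-destructive tools whose rule the runAs principal holds.
function offeredTools(all: SketchTool[], principalRules: ReadonlySet<string>): SketchTool[] {
  return all.filter(
    (tool) => tool.kind !== "destructive" && principalRules.has(tool.requiredRule),
  );
}

const sketchTools: SketchTool[] = [
  { id: "example.list", kind: "read", requiredRule: "example.read" },
  { id: "healthcheck.notifySystemSubscribers", kind: "mutate", requiredRule: "notification.send" },
  { id: "example.purge", kind: "destructive", requiredRule: "example.manage" },
  { id: "example.update", kind: "mutate", requiredRule: "example.manage" },
];
const sketchPrincipalRules = new Set(["example.read", "notification.send"]);
const sketchOffered = offeredTools(sketchTools, sketchPrincipalRules).map((tool) => tool.id);
```

In this example the agent is offered example.list and healthcheck.notifySystemSubscribers only; the destructive tool is excluded regardless of permissions, and example.update is excluded because the principal lacks its rule.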
Failing or stopping the run from an AI action
The AI action plugs into the engine’s normal control flow - it needs no special “AI can stop the run” power (which would hand a prompt-injectable model an availability lever and hide the decision from the automation graph).
If the AI step itself fails (bad connection, no service account, the model
errors), it returns failure and - like any action - halts the run by default.
To make it best-effort so the rest of the run still proceeds, set
continue_on_error:

    actions:
      - id: analyze
        action: automation.ai_analyze
        continue_on_error: true # AI is best-effort; a failure here doesn't halt
        config:
          connectionId: my-openai
          prompt: "Summarise the failure."

To let the AI’s judgement stop the run, have it emit a structured field
and gate the flow on it with the normal choose / stop primitives - the
decision stays operator-authored and visible:

    actions:
      - id: analyze
        action: automation.ai_analyze
        config:
          connectionId: my-openai
          prompt: "Analyse the failure. Set actionable=false for a known false positive."
          outputFields:
            - { key: actionable, type: boolean, description: "Should a human be paged" }
            - { key: severity, type: enum, description: "Severity", options: [low, high] }
      # Stop before paging when the AI judged this a non-actionable false positive.
      - choose:
          - when: "artifacts.analyze.analysis.data.actionable == false"
            sequence:
              - stop: { reason: "AI judged this a non-actionable false positive" }
      - action: incident.create
        config:
          title: "{{ artifacts.analyze.analysis.summary }}"

The AI’s output lives at artifacts.<id>.analysis.data.<field> (here
artifacts.analyze.analysis.data.actionable); any condition, choose, or
stop can read it.
Dispatch lifecycle
- A trigger fires - a setup-backed schedule tick, or a reactive entity change. A change to a domain’s entity state is routed through the two-stage dispatch pipeline: Stage 1 (one instance claims) derives the qualified trigger event id(s) and the waiting runs to wake, and Stage 2 fans a per-run job out across instances.
- The trigger fan-in resolves the contextKey, pre-resolves live state into scope (the state.<kind>.<id> namespace, with health.* kept as a back-compat alias), and evaluates the trigger filter and pre-run conditions.
- Concurrency is applied per mode (single / parallel / queued / restart), scoped per concurrency_scope (whole automation, or per context key). The check-then-create is serialized under a transaction-scoped advisory lock keyed on (automationId, scope), so two concurrent fires (or a dwell-fire racing a fresh fire, or two pods) can’t both pass a single-mode “no active run” check and both start a run.
- The engine walks the action tree, persisting a step row per action and a durable scope snapshot after each step.
- Suspending actions persist a durable lock; the run resumes under a per-run advisory lock when woken. delay / wait_for_trigger / a for: dwell enqueue a wake job (or wait for a matching event); a reactive wait_until records wake-index rows and a single timeout timer instead, and is woken by a relevant entity change. A stalled-run sweeper is the restart-safety backstop.
Every suspend survives a process restart: the durable row is the source of truth, the queue job is just the wake signal, and resumes take an advisory lock so no run double-fires.
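The concurrency step in the lifecycle above can be sketched as a pure decision function. The per-mode semantics here are an assumption modelled on Home Assistant-style modes (the text only confirms the single-mode "no active run" check and that restart mode cancels active runs), and decideConcurrency, lockKeySketch, and the "automation" / "context" scope labels are hypothetical. In the real engine this decision and the run creation happen together under the advisory lock; the sketch shows only the decision.

```typescript
type SketchMode = "single" | "parallel" | "queued" | "restart";

interface SketchDecision {
  start: boolean; // start a new run now
  enqueue: boolean; // hold this fire until the active run finishes
  cancel: string[]; // ids of active runs to cancel first
}

// Decide what a new fire does, given the active runs in the same concurrency scope.
function decideConcurrency(mode: SketchMode, activeRunIds: string[]): SketchDecision {
  const busy = activeRunIds.length > 0;
  switch (mode) {
    case "parallel":
      return { start: true, enqueue: false, cancel: [] };
    case "single":
      return { start: !busy, enqueue: false, cancel: [] };
    case "queued":
      return { start: !busy, enqueue: busy, cancel: [] };
    case "restart":
      return { start: true, enqueue: false, cancel: activeRunIds.slice() };
  }
}

// Key the lock on (automationId, scope): one key per automation, or one per context key.
function lockKeySketch(
  automationId: string,
  concurrencyScope: "automation" | "context",
  contextKey: string,
): string {
  return concurrencyScope === "context"
    ? automationId + ":" + contextKey
    : automationId + ":*";
}

const sketchSingleBusy = decideConcurrency("single", ["run-1"]);
const sketchRestartBusy = decideConcurrency("restart", ["run-1", "run-2"]);
const sketchLockKey = lockKeySketch("auto-42", "context", "system-7");
```

Because two fires for the same key serialize on the lock, the second one evaluates decideConcurrency only after the first has created its run, which is what prevents two single-mode runs from starting together.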
Suspend / resume invariants
These guarantees keep suspended runs from being re-run or resurrected:
- A suspended (waiting) run is owned by the wait-lock / queue resume paths, never the sweeper. findStalledRunIds only returns status = 'running' runs (it joins automation_runs), and the suspend-finalisation does NOT clobber the run’s lastActionPath checkpoint to null. Together this stops the sweeper from re-walking an intentional wait from the top (which would re-fire pre-wait side effects and leak a second wait lock).
- Stalled recovery refuses a run that still holds a live wait lock. recoverStalledRun only recovers a genuinely-running run with no wait lock; if a lock exists it leaves the run to the wait/resume paths and deletes nothing - so a crash-mid-wait recovery can’t create a duplicate lock or duplicate delay job.
- A cancelled / terminal run can never resume. resumeRun guards on status === 'waiting' (mirroring checkWaitUntil): any other status drops the stale wait lock and returns without resuming. Cancellation (operator cancelRun or restart-mode cancelActiveRuns) deletes the affected runs’ wait locks and run-state in the same operation, so a later trigger / delay-expiry / racing queue job can’t wake a cancelled run.
- A run resuming on a different pod re-seeds its output-mask set. The run-wide secret-masking registry is in-memory and per-process, so a pod that did not originally run a suspended automation starts with an empty mask set. Before walking or persisting, both resumeRun and recoverStalledRun re-resolve the automation’s declared secret refs (the secretEnv mappings and connectionId references in its action configs) through the run’s wrapped getService (which auto-registers each resolved value), so the resuming pod re-populates the same least-privilege, by-value mask set. Without this, a carried-over scope value / artifact / error persisted on the new pod could leak a credential resolved on the original pod.
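The first three invariants above can be sketched against an in-memory store. This is a model for reasoning, not the engine's code: SketchRunStore, resumeRunSketch, cancelRunSketch, and recoverableStalledRuns are hypothetical, status values other than 'running' and 'waiting' are assumed, and releasing the wait lock on a successful resume is also an assumption.

```typescript
type SketchRunStatus = "running" | "waiting" | "completed" | "failed" | "cancelled";

interface SketchRunStore {
  statusByRun: Map<string, SketchRunStatus>;
  waitLocks: Set<string>; // ids of runs that currently hold a wait lock
}

// Only a run that is still 'waiting' may resume; anything else drops the stale lock.
function resumeRunSketch(store: SketchRunStore, runId: string): "resumed" | "dropped" {
  if (store.statusByRun.get(runId) !== "waiting") {
    store.waitLocks.delete(runId);
    return "dropped";
  }
  store.waitLocks.delete(runId); // assumption: the lock is released once the run resumes
  store.statusByRun.set(runId, "running");
  return "resumed";
}

// Cancellation removes the wait lock together with the status change.
function cancelRunSketch(store: SketchRunStore, runId: string): void {
  store.statusByRun.set(runId, "cancelled");
  store.waitLocks.delete(runId);
}

// The sweeper only touches runs that are 'running' and hold no wait lock.
function recoverableStalledRuns(store: SketchRunStore, candidateIds: string[]): string[] {
  return candidateIds.filter(
    (id) => store.statusByRun.get(id) === "running" && !store.waitLocks.has(id),
  );
}

const sketchStore: SketchRunStore = {
  statusByRun: new Map<string, SketchRunStatus>([
    ["run-a", "waiting"],
    ["run-b", "waiting"],
    ["run-c", "running"],
  ]),
  waitLocks: new Set(["run-a", "run-b"]),
};

const sketchSweepable = recoverableStalledRuns(sketchStore, ["run-a", "run-b", "run-c"]);
cancelRunSketch(sketchStore, "run-a");
const sketchLateWake = resumeRunSketch(sketchStore, "run-a"); // a delay expiry arriving after cancel
const sketchNormalWake = resumeRunSketch(sketchStore, "run-b");
```

The late wake for the cancelled run is dropped, the still-waiting run resumes, and the sweeper sees only run-c as a recovery candidate.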
Extension points
Plugins register into these in their register() phase:
- automationTriggerExtensionPoint.registerTrigger(...)
- automationActionExtensionPoint.registerAction(...)
- automationArtifactTypeExtensionPoint.registerArtifactType(...)
- automationFilterExtensionPoint.registerFilter(...) - pure template filters.
- entityExtensionPoint.defineEntity(...) / declareNonReactiveState(...) / onEntityChanged(...) / registerChangeDeriver(...) - declare reactive entity state and react to cross-plugin changes. See the entity state machine.
The automation backend also exposes read-only service refs (automationRegistriesRef, automationArtifactStoreRef) for cross-plugin introspection and artifact lookups, and a GitOps Automation entity kind so automations can be declared in Git.
See the primitives reference for the shape and a runnable YAML example of every action, trigger, and condition, extending the automation platform for the registration API, the sensing layer for live state, duration filters, dwells, and structured conditions, the entity state machine for exposing reactive state, and the reactive dispatch pipeline for how a state change becomes a run.