Entity state machine
The entity state machine is the framework primitive every plugin uses to make its domain state reactive. defineEntity is a reactive WRAPPER: it owns NO current-state storage of its own. Each plugin keeps its state where it already lives (its own durable table, or an aggregate computed from its own durable data) and supplies a read accessor; every reactive-state write goes through the single handle.mutate entry point, which snapshots the previous state via read, runs the plugin’s own write, and records the change. One defineEntity declaration replaces an ad-hoc change hook, a current-state query helper, and the since/duration bookkeeping that domains used to reimplement on their own - without duplicating the state itself. The automation engine consumes the resulting reactive surface for trigger routing, condition scope, and reactive waits, so state declared this way is automatically visible to automations and state declared off-pattern is structurally invisible to them. This page is the plugin-author guide; for the queue/wake machinery the change events drive, see the reactive dispatch pipeline.
Design principle: no state duplication
The framework NEVER owns a plugin’s current state. There is no framework-managed mirror table that domains copy their state into. Instead:
- The plugin owns its current state - in its own durable table, or a value computed on the fly from its own durable tables. It exposes that state through a required batched read accessor. EVERY kind is plugin-backed: there is no framework-owned current-state storage and no “homeless” kind that falls back to a generic framework table.
- The framework owns only the change HISTORY. For every kind, on every real change, it appends field-level rows to its own entity_transitions table. This history is always platform-kept, so even a compute-on-read kind like health gets durable transition history.
- All reactive-state writes go through one driven entry point - handle.mutate({ id, apply }). The handle snapshots prev via read before the write, runs the plugin’s apply (the real write against the plugin’s own storage), then diffs and records the transition.

A kind’s read resolves one of two ways, both plugin-backed:
- Plugin-table-backed - the reactive subset is projected straight off the plugin’s own table(s). Incident, maintenance, catalog (system + group), and dependency all read this way.
- Compute-on-read - the reactive subset has no stored status row at all; it is derived on demand from the plugin’s own durable data. The per-system health aggregate, the slo budget/streak view, and satellite-connection (status computed from last_heartbeat_at) compute this way, so there is no second materialized copy to drift from the source or get stuck after a pod crash.
When to use defineEntity vs the escape hatch
Default to defineEntity for anything that is genuinely mutable domain state an operator might want to react to: an incident’s status, a maintenance window, a system’s aggregate health, a dependency edge, a satellite’s connection, an SLO budget. The reactive path is the path of least resistance and the only one automations can see.
Reach for the escape hatch (declareNonReactiveState) only for data that looks like state but intentionally is not a reactive entity. There are four allowed reasons:
- raw-sample - high-frequency, high-cardinality samples (e.g. individual health_check_runs). The aggregate is the entity; a firehose of raw rows would melt the wake-index. Raw samples can still be a numeric_state wake source without being an entity.
- sensitive - secret values, which must never enter reactive scope or change events. The metadata may be an entity; the value is excluded.
- externally-owned - state we cannot observe without polling (a live Jira issue status). Model the artifact you created, not a pretend-live entity.
- bookkeeping - internal operational data: cursors, caches, heartbeat timestamps (e.g. satellites.lastHeartbeatAt, the dependency propagation cursor).

    import { entityExtensionPoint } from "@checkstack/automation-backend";

    const entity = env.getExtensionPoint(entityExtensionPoint);
    entity.declareNonReactiveState({
      table: "health_check_runs",
      reason: "raw-sample",
      note: "Aggregate health is the entity; raw runs are a numeric_state wake source only.",
    });

Defining an entity
Resolve the automation.entity extension point and call defineEntity with a globally-unique kind, a zod object describing the reactive state, and the required read accessor that points at wherever the plugin already keeps that state. The zod schema is the single source of truth for typing, validation, scope projection, and the change-event shape, so it MUST be a z.object. The call returns a typed EntityHandle you keep on your service and mutate through.
    import { z } from "zod";
    import {
      entityExtensionPoint,
      type EntityHandle,
    } from "@checkstack/automation-backend";

    export const INCIDENT_ENTITY_KIND = "incident";

    export const IncidentEntityStateSchema = z.object({
      status: z.enum(["open", "acknowledged", "resolved"]),
      severity: z.enum(["warning", "critical"]),
      systemIds: z.array(z.string()),
    });
    export type IncidentEntityState = z.infer<typeof IncidentEntityStateSchema>;

    let incidentEntity: EntityHandle<IncidentEntityState> | undefined;

    export default createBackendPlugin({
      metadata: pluginMetadata,
      register(env) {
        const entity = env.getExtensionPoint(entityExtensionPoint);
        incidentEntity = entity.defineEntity<IncidentEntityState>({
          kind: INCIDENT_ENTITY_KIND,
          state: IncidentEntityStateSchema,
          // PLUGIN-BACKED: read straight from the incident service's own tables.
          // There is no framework copy of this state.
          read: (ids) => incidentService.getManyEntityStates(ids),
        });
        // ...
      },
    });

automation-backend registers the extension-point impl in its register() phase, so other plugins can resolve it and call defineEntity during their own register() or init(); cross-plugin calls are Proxy-buffered until the impl registers. Declare the entity in register() (as above) or in init() once your service and DB deps exist - whichever fits your plugin’s wiring. Keep the returned handle on your service and call it from every site that mutates your state and used to emit an ad-hoc hook.
A malformed registration hard-fails the loader at startup: a non-z.object state, a missing or duplicate kind, or a missing / non-function read.
The handle
EntityHandle<TState> is the only typed path to reactive state. All writes go through mutate / remove; there is no set / patch:
    interface EntityHandle<TState extends Record<string, unknown>> {
      readonly kind: string;
      mutate(input: {
        id: string;
        opts?: EntityMutationOpts;
        apply: () => Promise<TState>;
      }): Promise<TState>;
      remove(input: {
        id: string;
        opts?: EntityMutationOpts;
        apply: () => Promise<void>;
      }): Promise<void>;
      get(id: string): Promise<TState | undefined>;
      getMany(ids: ReadonlyArray<string>): Promise<Record<string, TState>>;
      inStateSince(id: string, field: keyof TState & string): Promise<Date | null>;
      inStateForMs(id: string, field: keyof TState & string): Promise<number>;
      transitionCount(args: {
        id: string;
        field: keyof TState & string;
        windowMs: number;
      }): Promise<number>;
    }

- mutate is the single driven write. It snapshots prev via the kind’s read accessor BEFORE the write, runs your apply (the REAL write against your own storage, committed in your own transaction, returning the resulting state as next), then validates next against the kind’s zod object, structurally diffs prev against next, appends the field-level transition rows, and emits a single change event - all only on a real diff. An unchanged write is a no-op (no event, no wake, no transition row). A throwing apply records nothing and emits nothing: the plugin write is the source of truth.
- remove is the tombstone counterpart. Its apply performs the plugin’s delete and returns void; the handle records the tombstone transition and emits a tombstone change event (next === null).
- get / getMany route to your read accessor; they are also the resolvers the engine uses for scope pre-resolution and wake re-evaluation.
- inStateSince / inStateForMs / transitionCount read the per-field transition log, generalizing the health-transition log to any entity.

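The three log-backed helpers above can be pictured as simple queries over an append-only list of per-field rows. The sketch below is a synchronous, in-memory model only: the TransitionRow shape, the numeric timestamps, the explicit now parameter, and the choice to return 0 when a field has no recorded transition are all assumptions made for illustration, not the real entity_transitions schema or the real (Promise-returning) handle methods.

```typescript
// Hypothetical row shape; the real entity_transitions columns are not shown in this guide.
interface TransitionRow {
  id: string;
  field: string;
  value: unknown;
  at: number; // epoch ms at which the field took this value
}

// Rows are assumed to be appended in chronological order.
function rowsFor(log: ReadonlyArray<TransitionRow>, id: string, field: string): TransitionRow[] {
  return log.filter((row) => row.id === id && row.field === field);
}

// When did the field last change? null if it never transitioned.
function inStateSince(log: ReadonlyArray<TransitionRow>, id: string, field: string): number | null {
  const rows = rowsFor(log, id, field);
  const last = rows[rows.length - 1];
  return last ? last.at : null;
}

// How long has the field held its current value? (0 when unknown is an assumption.)
function inStateForMs(log: ReadonlyArray<TransitionRow>, id: string, field: string, now: number): number {
  const since = inStateSince(log, id, field);
  return since === null ? 0 : now - since;
}

// How many transitions of the field fall inside the trailing window?
function transitionCount(
  log: ReadonlyArray<TransitionRow>,
  args: { id: string; field: string; windowMs: number },
  now: number,
): number {
  return rowsFor(log, args.id, args.field).filter((row) => row.at >= now - args.windowMs).length;
}
```

Because every kind appends to the same kind of log, these queries stay kind-agnostic: nothing in them knows whether the field belongs to an incident or a health aggregate.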
EntityMutationOpts carries the mutating actor (defaults to the system actor; travels on the change event so automations read trigger.actor) and an optional runId. When the mutation originates inside a dispatch run, pass runId so the recorded prev / next are run through the run-secret mask - secret values must never enter reactive scope or change events.
A typical mutate call wraps the plugin’s own write:
    // Inside the incident service, after computing the new state:
    await incidentEntity.mutate({
      id: incidentId,
      // `apply` does the REAL incidents/junction write in the plugin's own tx and
      // returns the resulting reactive subset. The framework snapshots `prev` via
      // `read`, diffs, records the transition, and emits the change.
      apply: async () => {
        const incident = await this.persistIncident(incidentId, input);
        return {
          status: incident.status,
          severity: incident.severity,
          systemIds: incident.systemIds,
        };
      },
    });

The plugin write is authoritative. A plugin-backed kind keeps its state in its own schema, behind its own database client - a different client from the one behind the framework’s entity_transitions table. The two cannot share one transaction, so apply runs and commits first, and the transition log is appended afterwards in the framework’s own transaction. If that append fails after a committed plugin write, the plugin state is still correct and only one history row is missing (a gap, never a corruption). This decoupling is deliberate: a plugin platform must not couple a plugin’s storage to a framework-internal table’s transaction.
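The ordering described above (snapshot prev, run apply, then diff and record only on a real change) can be modeled in a few lines. This is a toy sketch, not the framework's implementation: createToyHandle, the record callback (standing in for both the transition-log append and the change event), and the JSON-based field comparison are hypothetical, and zod validation, actors, and run-secret masking are left out.

```typescript
type State = Record<string, unknown>;

interface Change<T extends State> {
  id: string;
  prev: T | null;
  next: T;
  changedFields: string[];
}

// `read` is the plugin's accessor; `record` stands in for "append transition
// rows + emit the change event".
function createToyHandle<T extends State>(
  read: (id: string) => Promise<T | undefined>,
  record: (change: Change<T>) => void,
) {
  return {
    async mutate(input: { id: string; apply: () => Promise<T> }): Promise<T> {
      // 1. Snapshot prev BEFORE the write.
      const prev = (await read(input.id)) ?? null;
      // 2. Run the plugin's own write. If it throws, we never reach step 3,
      //    so nothing is recorded or emitted.
      const next = await input.apply();
      // 3. Field-level diff (crude JSON comparison, for illustration only).
      const keys = Array.from(new Set([...Object.keys(prev ?? {}), ...Object.keys(next)]));
      const changedFields = keys.filter(
        (key) => JSON.stringify(prev?.[key]) !== JSON.stringify(next[key]),
      );
      // 4. Record only on a real diff; an unchanged write is a no-op.
      if (changedFields.length > 0) {
        record({ id: input.id, prev, next, changedFields });
      }
      return next;
    },
  };
}

// Usage: the Map plays the role of the plugin's own storage.
type IncidentState = { status: string };
const store = new Map<string, IncidentState>();
const changes: Change<IncidentState>[] = [];
const handle = createToyHandle<IncidentState>(
  async (id) => store.get(id),
  (change) => {
    changes.push(change);
  },
);
```

Note that the plugin write in step 2 has already happened by the time step 4 runs, which mirrors the "gap, never a corruption" property: a failure while recording cannot undo or distort the plugin's state.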
Compute-on-read kinds
A kind whose reactive subset has no stored row of its own - it is derivable from the plugin’s own durable data - computes it on read instead of materializing a copy. The read accessor recomputes the view from the same tables the rest of the plugin already reads; handle.mutate snapshots prev through that same read before the write, so a real change still produces exactly one correct ENTITY_CHANGED.
The per-system health aggregate is the canonical example. It has no domain table and no framework storage; its read derives { status, healthyChecks, totalChecks } on demand from health_check_runs (via getSystemHealthStatus), gated on the system having at least one enabled check association. A system with an enabled check but no runs yet resolves to the same default-healthy baseline the executor reads, so a first-ever run that comes up unhealthy is a real healthy -> degraded diff (not a suppressed create):
    import { z } from "zod";
    import { entityExtensionPoint, type EntityRead } from "@checkstack/automation-backend";

    // HealthCheckStatusSchema is the plugin's existing status enum schema.
    const HealthEntityStateSchema = z.object({
      status: HealthCheckStatusSchema,
      healthyChecks: z.number().int().nonnegative(),
      totalChecks: z.number().int().nonnegative(),
    });
    type HealthEntityState = z.infer<typeof HealthEntityStateSchema>;

    // COMPUTE-ON-READ: no stored row. Derive the view from the durable
    // health_check_runs the rest of the plugin reads. The entity resolves iff the
    // system has at least one ENABLED check association; a run-less system with an
    // enabled check resolves to the default-`healthy` baseline (the executor's
    // pre-run state), so a first-ever unhealthy run is a real healthy -> degraded
    // diff rather than a suppressed create.
    const read: EntityRead<HealthEntityState> = async (ids) => {
      const out: Record<string, HealthEntityState> = {};
      await Promise.all(
        ids.map(async (systemId) => {
          const overview = await service.getSystemHealthStatus(systemId);
          // No enabled check associations => no health entity for this system.
          if (overview.checkStatuses.length === 0) return;
          out[systemId] = {
            status: overview.status,
            healthyChecks: overview.checkStatuses.filter((c) => c.status === "healthy").length,
            totalChecks: overview.checkStatuses.length,
          };
        }),
      );
      return out;
    };

    const entity = env.getExtensionPoint(entityExtensionPoint);
    const handle = entity.defineEntity<HealthEntityState>({
      kind: "health",
      state: HealthEntityStateSchema,
      read,
    });

    // At every evaluation-write site, `apply` does the REAL durable write (insert
    // the run + bump the aggregate) and returns the freshly-computed view. The
    // framework snapshots `prev` via `read` BEFORE the run is persisted, so the
    // diff and the emitted change are correct.
    await handle.mutate({
      id: systemId,
      apply: async () => {
        await persistRunAndAggregate(systemId, result);
        return (await read([systemId]))[systemId]!;
      },
    });

A COMPUTE-ON-READ kind whose state can be evaluated by more than one pod at once must serialize the snapshot-prev + apply + diff + emit per entity id. Without a lock, two concurrent evaluations of one id both snapshot the same prev, both write, and both diff the same transition - emitting two ENTITY_CHANGED (and two transition rows) for one logical change. The health kind wraps each system’s handle.mutate in a transaction-scoped advisory lock keyed health:<systemId> (withXactLock from @checkstack/backend-api), so concurrent per-config jobs across pods (or at-least-once redelivery) collapse to exactly one emit.
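The double-emit race and its fix can be reproduced in a single process. The sketch below is illustrative only: withKeyLock is a hypothetical in-process promise chain standing in for the transaction-scoped advisory lock (the real lock also excludes other pods, which this cannot), and evaluate is a stripped-down stand-in for snapshot-prev + apply + diff + emit.

```typescript
// In-process stand-in for a per-id lock. It serializes callers that share a
// key by chaining each one after the previous holder settles.
// (A long-lived process would also want to prune settled keys.)
const tails = new Map<string, Promise<unknown>>();

function withKeyLock<T>(key: string, fn: () => Promise<T>): Promise<T> {
  const previous = tails.get(key) ?? Promise.resolve();
  const run = previous.then(fn, fn);
  tails.set(key, run.catch(() => undefined)); // keep the chain alive on failure
  return run;
}

let storedStatus = "healthy"; // stands in for the durable source of truth
const emitted: string[] = []; // stands in for ENTITY_CHANGED emits

async function evaluate(next: string): Promise<void> {
  const prev = storedStatus; // snapshot prev
  await Promise.resolve(); // yield, as a real database round-trip would
  storedStatus = next; // apply
  if (prev !== next) emitted.push(`${prev} -> ${next}`); // diff + emit
}
```

Run unlocked, two concurrent evaluate("degraded") calls both snapshot healthy and both emit. Wrapped in withKeyLock("health:sys-1", ...), the second call starts only after the first has written, sees degraded as its prev, and becomes a no-op.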
The slo entity computes the same way: its budgetRemainingPercent is a pure function of the objective’s append-only downtime history, so the read recomputes it via the SLO engine rather than storing a second copy.
Compute-on-read status with a pod-local live socket
A kind can compute its status on read AND still keep some genuinely pod-local infrastructure - as long as that infrastructure is never the queryable source of truth. The satellite-connection entity is the canonical example. Its status is COMPUTED on read from the single durable liveness column satellites.last_heartbeat_at (via the same computeStatus(lastHeartbeatAt) the admin list uses), so it is globally consistent from any pod AND self-heals: a row left marked connected by a crashed pod reads offline once its heartbeat ages past the threshold, with no stored status copy that could get stuck online. The only extra durable column is last_connection_event (the deriver’s connected / disconnected / heartbeat_lost discriminator); lastSeenAt is derived from last_heartbeat_at:
    // Computes status from last_heartbeat_at.
    const read: EntityRead<SatelliteConnectionState> = (ids) =>
      service.getManyConnectionStates(ids);

    const handle = entity.defineEntity<SatelliteConnectionState>({
      kind: "satellite-connection",
      state: satelliteConnectionStateSchema,
      read,
    });

    // Each lifecycle edge writes the durable liveness inputs through `apply`:
    // - connect: last_heartbeat_at = now, last_connection_event = "connected"
    // - disconnect: last_heartbeat_at = null, last_connection_event = "disconnected"
    // - heartbeat-lost (the monitor, from ANY pod): flips ONLY the event to
    //   "heartbeat_lost" - idempotent, since once flipped re-runs are no-ops.
    await handle.mutate({
      id: satelliteId,
      apply: () => service.applyConnectionState({ satelliteId, lastEvent, lastHeartbeatAt }),
    });

The heartbeat monitor detects the online->offline edge from DURABLE state alone - it reads every satellite’s (last_heartbeat_at, last_connection_event), computes status, and fires heartbeat_lost for any that is offline while still marked connected. This works on whichever pod claims the check job, with NO pod-local baseline, fixing a defect where a pod with an empty in-memory map never observed the satellite online and so left the status stuck online after a crash. The in-process WebSocket registry still exists, but ONLY as the pod-local live-socket map used to route messages to a connection physically held by THIS pod; it is never read as the entity’s state. Mark it declareNonReactiveState({ reason: "bookkeeping" }). This is exactly the distinction the horizontal-scale rule draws: the durable, computed-on-read status is the entity, the socket map is pod-local bookkeeping.
Horizontal scale
The platform runs as N pods sharing one database, so a reactive entity’s current state MUST be globally readable. Scope enrichment and wait_until re-evaluation run on whichever pod claims the dispatch/wake job, so if read resolves from process-local memory, a value written on pod A is invisible to pod B - the automation then reads stale or empty state. This is a real bug even when the single-process test suite passes (a value written in one process is trivially visible to the same process).
The rule, the guard, and the test:
- Rule. State management & horizontal scale is the principle: a reactive entity’s read MUST resolve from shared, durable storage (the plugin’s own Postgres tables, or a derivation of them), never from process-local memory. In-memory is allowed only for genuinely pod-local infrastructure that is never the queryable source of truth (a live-socket registry), marked declareNonReactiveState.
- Lint. The checkstack/no-pod-local-entity-state ESLint rule is a forcing-function tripwire at the defineEntity({ read }) boundary: it flags a read that reads a Map / Set or calls .get / .has on an in-memory structure, and warns (needsAssertion) when it cannot confirm durability so the author asserts it with a reason. It is wired at warn, never escalated to an error.
- Integration test. cross-pod-read-consistency.it.test.ts is the deterministic backstop the single-process suite cannot provide. It models two pods as two registries over two connections pointed at the same database, and asserts a write driven on pod A is visible to pod B’s read for a durable kind (with a pod-local kind as a negative control that proves the test has teeth).

Working triggers do NOT prove the read path is scale-correct. Entity change events propagate cluster-wide via the event bus, so onEntityChanged consumers and trigger routing fire correctly even when the current-state read is pod-local and broken. Verify the read path separately.
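The failure mode is easy to reproduce in miniature. The sketch below is a toy model, not the integration test itself: makePod and sharedDb are hypothetical names, and a plain Map stands in for the shared Postgres database. It shows why a durable read is visible across pods while a process-local read passes on the writing pod and silently fails on every other one.

```typescript
// Stands in for the one database every pod shares.
const sharedDb = new Map<string, string>();

// Each "pod" gets its own process-local Map; only the durable variant
// reads and writes the shared store.
function makePod(useLocalMemory: boolean) {
  const localMemory = new Map<string, string>();
  const storage = useLocalMemory ? localMemory : sharedDb;
  return {
    write: (id: string, status: string): void => {
      storage.set(id, status);
    },
    read: (id: string): string | undefined => storage.get(id),
  };
}

// Durable kind: a write on pod A is visible to pod B's read.
const durableA = makePod(false);
const durableB = makePod(false);
durableA.write("sat-1", "online");
const seenByDurableB = durableB.read("sat-1"); // resolves from the shared store

// Pod-local kind (negative control): pod A sees its own write, pod B does not.
const localA = makePod(true);
const localB = makePod(true);
localA.write("sat-2", "online");
const seenByLocalA = localA.read("sat-2");
const seenByLocalB = localB.read("sat-2");
```

The negative control is the important half: a single-process suite only ever exercises the localA path, which looks healthy.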
Cross-plugin change subscriptions
To react to ANOTHER domain’s entity changes, use onEntityChanged from the same extension point. The internal ENTITY_CHANGED hook is deliberately unexported, so defineEntity stays the only typed path that emits a change and onEntityChanged is the typed, validated path that consumes one. The handler receives the validated { kind, id, prev, next, delta, changedFields, actor, occurredAt }.
    import { entityExtensionPoint } from "@checkstack/automation-backend";
    import { CATALOG_SYSTEM_ENTITY_KIND } from "@checkstack/catalog-backend";

    const entity = env.getExtensionPoint(entityExtensionPoint);

    // Clean up dependency edges when a catalog system is deleted.
    entity.onEntityChanged({
      kind: CATALOG_SYSTEM_ENTITY_KIND,
      handler: async (change) => {
        if (change.next !== null) return; // tombstones only
        await service.removeSystemDependencies(change.id);
      },
      delivery: { mode: "work-queue", workerGroup: "dependency-system-cleanup" },
    });

Delivery semantics
Pick the delivery mode by what the handler does:
- broadcast (default) - every instance’s handler runs for every change. Correct for reactors that maintain per-instance state: an in-memory cache to invalidate, a local fan-out, a websocket push.
- work-queue - exactly one instance in the cluster runs the handler per change (load-balanced, retried). Correct for side-effecting work that must happen once per change: writing a derived row, enqueuing a notification, cleaning up associations. Requires a workerGroup so distinct subscribers do not share a claim.

Subscriptions are registered eagerly during your plugin’s register() / init(); the underlying hook wiring is deferred until automation-backend reaches afterPluginsReady (the only place onHook is injected). The call returns an idempotent unsubscribe handle.
Change derivers
A change deriver maps “this entity kind changed like THIS” to the qualified trigger event id(s) Stage-1 routing should fan out to fresh automation runs. This mapping is domain knowledge - incident’s incident.created / .resolved, health’s healthcheck.system_degraded - so it cannot live in the kind-agnostic engine. Register one per kind with registerChangeDeriver. A deriver is pure and synchronous, receives the validated EntityChanged, and returns the trigger event id(s) (an empty array means “fire nothing”).
    import {
      entityExtensionPoint,
      type EntityChangeDeriver,
    } from "@checkstack/automation-backend";

    const INCIDENT_TRIGGER_EVENTS = {
      created: "incident.created",
      updated: "incident.updated",
      resolved: "incident.resolved",
    } as const;

    const deriveIncidentTriggerEvents: EntityChangeDeriver = (changed) => {
      if (changed.prev === null && changed.next !== null) {
        return [INCIDENT_TRIGGER_EVENTS.created];
      }
      if (changed.next === null) return []; // tombstone, no event
      const prev = changed.prev?.["status"];
      const next = changed.next?.["status"];
      if (next === "resolved" && prev !== "resolved") {
        return [INCIDENT_TRIGGER_EVENTS.resolved];
      }
      return [INCIDENT_TRIGGER_EVENTS.updated];
    };

    const entity = env.getExtensionPoint(entityExtensionPoint);
    entity.registerChangeDeriver({
      kind: INCIDENT_ENTITY_KIND,
      derive: deriveIncidentTriggerEvents,
      toPayload: incidentChangeToPayload, // see "Trigger payload mappers" below
    });

Multiple derivers may be registered per kind; their outputs union (de-duplicated, registration order). A deriver that throws is skipped so it cannot wedge routing for the others.
Trigger payload mappers
The generic entity-change shape ({ kind, id, prev, next, delta, changedFields, ... }) is engine-internal. Operators author filters and templates against the DOMAIN-named payload your trigger declares via payloadSchema - trigger.payload.incidentId, trigger.payload.systemId, trigger.payload.previousStatus. Supply an optional toPayload mapper alongside the deriver so the runtime trigger.payload matches that schema. Without one, Stage-2 falls back to the generic shape and those domain keys resolve to undefined.
toPayload is pure and synchronous, receives the same validated EntityChanged, and returns the domain payload. At most one mapper may be registered per kind (a second distinct mapper throws - the payload shape for a kind must be unambiguous).
    import type { EntityChangePayloadMapper } from "@checkstack/automation-backend";

    const incidentChangeToPayload: EntityChangePayloadMapper = (changed) => {
      const next = changed.next;
      const statusChanged = changed.changedFields.includes("status");
      const status = next?.["status"];
      return {
        incidentId: changed.id, // matches the trigger's payloadSchema
        systemIds: next?.["systemIds"] ?? [],
        severity: next?.["severity"],
        status,
        ...(statusChanged && typeof status === "string" ? { statusChange: status } : {}),
      };
    };

Fields a trigger’s payloadSchema declares but that are NOT derivable from the reactive entity state (for example an incident’s title / createdAt, which are not part of the { status, severity, systemIds } reactive subset) should be marked OPTIONAL on the schema - the entity-driven payload cannot carry them.
How automations consume entities
Two reactive consumers read entity state through each kind’s read accessor, both kind-agnostically:
- Scope projection. Before a run starts (and on resume, and at the trigger gate), the engine resolves the referenced entity refs through each kind’s getMany resolver (which routes to that kind’s read) and folds them into scope under state.<kind>.<id>.<field>. Conditions and templates read it as plain data: state.slo['payments-slo'].budgetRemainingPercent, state.incident['abc'].severity. This minimal state.<kind> view is the small reactive subset each kind’s defineEntity exposes. Health additionally has a SEPARATE, richer projection: scope.health.* is the full health condition snapshot (status, latency, success rate, in-maintenance, transitions-in-window, …) resolved through the healthcheck RPC, because the health aggregate is computed on read rather than stored as a framework row. The two coexist by design, not as a migration shim: state.health.* is the minimal reactive entity view, scope.health.* is the rich snapshot the structured state / numeric_state evaluators read, owned solely by the RPC path.
- Reactive wait_until. A suspended wait_until no longer polls. At suspend time the engine statically extracts the state.* refs the condition reads and inserts wake-index rows keyed by ${kind}:${id}. A relevant change wakes the wait, re-resolves scope, and re-evaluates the condition synchronously. When an id is dynamic the engine records a kind-level wildcard ${kind}:* so the wait wakes on any change of that kind and re-evaluates - a few extra wakes, never a silent stall.

    # React to a derived SLO budget threshold (no pre-baked budget hooks).
    triggers:
      - event: healthcheck.system_health_changed
    conditions:
      - numeric_state:
          value: "state.slo['payments-slo'].budgetRemainingPercent"
          below: 20
    actions:
      - action: notification.send
        config: { title: "Payments error budget under 20%" }

See the sensing layer for the condition grammar and the state.<kind> scope reference, and the reactive dispatch pipeline for how a change becomes a run.
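The wake-index keying described for reactive wait_until can be illustrated with a small extractor. This is a rough sketch, not the engine's parser: wakeKeys is a hypothetical name, it uses a regular expression rather than a real expression grammar, and it only understands the bracket form state.<kind>[...] shown in the examples above. A quoted literal id yields ${kind}:${id}; anything else inside the brackets is treated as dynamic and yields the kind-level wildcard ${kind}:*.

```typescript
// Extract wake-index keys from the state.* refs a condition reads.
function wakeKeys(condition: string): string[] {
  const keys = new Set<string>();
  const ref = /state\.(\w+)\[([^\]]+)\]/g;
  let match: RegExpExecArray | null;
  while ((match = ref.exec(condition)) !== null) {
    const kind = match[1];
    // A quoted string literal is a static id; anything else is dynamic.
    const literal = /^\s*(['"])(.*)\1\s*$/.exec(match[2] ?? "");
    keys.add(literal ? `${kind}:${literal[2]}` : `${kind}:*`);
  }
  return Array.from(keys);
}

// One static ref and one dynamic ref:
const keys = wakeKeys(
  "state.slo['payments-slo'].budgetRemainingPercent < 20 && " +
    "state.incident[trigger.payload.incidentId].status == 'open'",
);
// keys is ["slo:payments-slo", "incident:*"]
```

The wildcard is the conservative choice: it costs a few extra wakes and re-evaluations for unrelated incidents, but a wait whose id could not be resolved statically can never be missed.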