Health checks
A health check is a scheduled probe that asks “is this system OK right now?” Health checks are the source of every healthy or unhealthy reading you see in the UI, every state-change notification, and every long-term availability chart. This page covers the everyday concepts; for authoring custom checks see the Developer Guide.
The basics
A health check is made of:
- A strategy that defines how to connect to the target (HTTP, SSH, PostgreSQL, Redis, DNS, Ping, TCP, TLS, …).
- A configuration that tells the strategy where and how to connect (URL, port, query, credentials).
- An interval in seconds. Once per interval, the platform runs the check on every system it is attached to.
- A set of systems it runs against. The same check definition can be reused for many systems; in the UI you assign it to systems individually, by group, or via templates.
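As a rough illustration, the four parts can be pictured as a single record. The sketch below is a hypothetical Python model, not Checkstack's actual schema: the class name, the field names, the example URL, and the `runs_per_day` helper are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass(frozen=True)
class HealthCheckDefinition:
    """Hypothetical in-memory shape of one check definition."""

    name: str
    strategy: str                 # e.g. "http", "ssh", "postgresql"
    config: dict[str, Any]        # where and how the strategy connects
    interval_seconds: int         # how often the check runs on each system
    system_ids: frozenset[str] = field(default_factory=frozenset)

    def runs_per_day(self) -> int:
        """Total runs per day across all attached systems."""
        return (86_400 // self.interval_seconds) * len(self.system_ids)


# One definition reused for two systems.
check = HealthCheckDefinition(
    name="homepage",
    strategy="http",
    config={"url": "https://example.com/health"},
    interval_seconds=60,
    system_ids=frozenset({"web-1", "web-2"}),
)
```

With a 60-second interval and two attached systems, `check.runs_per_day()` comes to 2880 runs, which is worth keeping in mind when choosing intervals for widely shared checks.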
Strategies vs collectors
You may see two related terms in the UI: strategies and collectors.
- A strategy establishes a transport-level connection. The SSH strategy gets you a shell on a host; the PostgreSQL strategy opens a SQL session; the HTTP strategy makes an HTTP request.
- A collector runs on top of a strategy’s connection and produces a specific kind of data. An SSH strategy can host CPU, memory, and disk collectors that each run their own commands and parse the output.
For most checks you only ever see strategies. Collectors come into play when a single transport (typically SSH) is used to gather many independent metrics. You configure which collectors run on the same form as the strategy.
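The split between the two can be sketched as one connection object shared by several small functions. This is a minimal Python illustration under stated assumptions: `FakeSshConnection`, the collector functions, and the `mem-summary` command are hypothetical stand-ins, and the real plugin interfaces are described in the Developer Guide.

```python
from typing import Callable, Protocol


class Connection(Protocol):
    """What a strategy hands to collectors in this sketch (an assumption)."""

    def run(self, command: str) -> str: ...


class FakeSshConnection:
    """Stand-in for a real SSH session; returns canned command output."""

    def __init__(self, outputs: dict[str, str]) -> None:
        self._outputs = outputs

    def run(self, command: str) -> str:
        return self._outputs[command]


def cpu_collector(conn: Connection) -> dict[str, float]:
    # The first field of /proc/loadavg is the 1-minute load average.
    return {"load1": float(conn.run("cat /proc/loadavg").split()[0])}


def memory_collector(conn: Connection) -> dict[str, float]:
    # Expects "<total_kb> <available_kb>" from a hypothetical helper command.
    total, available = map(float, conn.run("mem-summary").split())
    return {"memory_used_pct": round(100 * (1 - available / total), 1)}


Collector = Callable[[Connection], dict[str, float]]


def run_collectors(conn: Connection, collectors: list[Collector]) -> dict[str, float]:
    """One connection, many collectors: each contributes its own metrics."""
    metrics: dict[str, float] = {}
    for collect in collectors:
        metrics.update(collect(conn))
    return metrics


conn = FakeSshConnection({
    "cat /proc/loadavg": "0.42 0.30 0.25 1/123 4567",
    "mem-summary": "8000000 2000000",
})
metrics = run_collectors(conn, [cpu_collector, memory_collector])
```

The connection is opened once and reused, so adding another metric means adding another collector rather than another check.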
Scheduling
The platform schedules each check independently. A check with intervalSeconds: 60 runs once per minute on every system it is attached to. No distributed cron is involved: the backend keeps an internal scheduler that enqueues a queue job each time a check comes due.
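Independent scheduling can be pictured with a small priority queue keyed by each check's next due time. The following is a simplified Python sketch, not the backend's scheduler: the `due_runs` function and the check ids are hypothetical, time is simulated in whole seconds, and intervals are assumed to be positive.

```python
import heapq


def due_runs(checks: dict[str, int], until: int) -> list[tuple[int, str]]:
    """Return (time, check_id) pairs for every run due in [0, until).

    `checks` maps a check id to its interval in seconds. Each check is
    scheduled on its own: after it fires, only its own next run is queued.
    """
    heap = [(0, check_id) for check_id in checks]
    heapq.heapify(heap)
    fired: list[tuple[int, str]] = []
    while heap and heap[0][0] < until:
        now, check_id = heapq.heappop(heap)
        fired.append((now, check_id))  # a real scheduler would enqueue a job here
        heapq.heappush(heap, (now + checks[check_id], check_id))
    return fired


runs = due_runs({"http-home": 60, "pg-main": 90}, until=180)
```

In this simulation the 60-second check fires at 0, 60, and 120 seconds while the 90-second check fires at 0 and 90, and neither schedule affects the other.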
If you need a check to run from another network, attach satellites to it. Each satellite executes the check on its side and ships results back. You can also keep running the check locally at the same time (the includeLocal toggle in the editor). See Satellites.
Results, states, and thresholds
Every run of a check produces a result. The platform reduces each result to one of three states:
- healthy: the check is operating normally.
- degraded: the check is producing partial or warning-level output.
- unhealthy: the check is failing.
A single failed run does not immediately mark a system unhealthy. Checkstack uses state thresholds to debounce noisy probes. The defaults work for most setups:
| State | Default rule |
|---|---|
| healthy | Becomes healthy after 1 consecutive success. |
| degraded | Becomes degraded after 2 consecutive failures. |
| unhealthy | Becomes unhealthy after 5 consecutive failures. |
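The default rules in the table can be expressed as a small state machine. This Python sketch is an illustration only: the class and field names are hypothetical, each run is treated as a plain success or failure, and the starting state is assumed to be healthy.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsecutiveThresholds:
    healthy_after: int = 1    # consecutive successes
    degraded_after: int = 2   # consecutive failures
    unhealthy_after: int = 5  # consecutive failures


class StateEvaluator:
    """Debounces raw run outcomes into a health state."""

    def __init__(self, thresholds: Optional[ConsecutiveThresholds] = None,
                 initial: str = "healthy") -> None:
        self.thresholds = thresholds or ConsecutiveThresholds()
        self.state = initial
        self.successes = 0
        self.failures = 0

    def record(self, success: bool) -> str:
        """Feed one run outcome and return the resulting state."""
        t = self.thresholds
        if success:
            self.successes += 1
            self.failures = 0
            if self.successes >= t.healthy_after:
                self.state = "healthy"
        else:
            self.failures += 1
            self.successes = 0
            if self.failures >= t.unhealthy_after:
                self.state = "unhealthy"
            elif self.failures >= t.degraded_after:
                self.state = "degraded"
        return self.state


evaluator = StateEvaluator()
history = [evaluator.record(ok) for ok in [False, False, False, False, False, True]]
```

Here the first failure leaves the state healthy, the second moves it to degraded, the fifth to unhealthy, and a single success restores healthy.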
You can override thresholds per (system, check) pair when a specific assignment needs to be more or less sensitive than the default. A window-based threshold mode is also available for cases where you want to react to “X failures in the last N runs” rather than “X consecutive failures”.
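The window-based idea can be sketched with a fixed-length buffer of recent outcomes. This is a hypothetical Python illustration of “X failures in the last N runs”; the class name and parameters are assumptions, not Checkstack configuration keys.

```python
from collections import deque


class WindowEvaluator:
    """Trips once `max_failures` of the last `window` runs have failed."""

    def __init__(self, window: int, max_failures: int) -> None:
        # A bounded deque drops the oldest run automatically.
        self.runs: deque[bool] = deque(maxlen=window)
        self.max_failures = max_failures

    def record(self, success: bool) -> bool:
        """Feed one run outcome; return True while the threshold is met."""
        self.runs.append(success)
        return self.runs.count(False) >= self.max_failures


flaky = WindowEvaluator(window=5, max_failures=3)
outcomes = [False, True, False, True, False, True, True, True]
tripped = [flaky.record(ok) for ok in outcomes]
```

In this sequence no two failures are consecutive, so a consecutive-failure rule of 2 would never fire, yet the window rule trips on the fifth run because three of the last five runs failed.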
Thresholds apply to state transitions, not to the underlying runs. Every run is still stored, so latency charts and detailed history are unaffected by debouncing.
Anomaly detection
For numeric metrics inside a check’s result (latency, error rate, queue depth, …), Checkstack can flag anomalies even when the overall state is still “healthy”. An anomaly is a metric reading that drifts outside its expected range based on recent history.
Anomalies generate their own notifications and show up on system detail pages, but they do not change a system’s health state. They are designed to surface “this is weird, look at it” signals before they become outright failures.
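One common way to define an “expected range” is the mean of recent readings plus or minus a few standard deviations. The sketch below uses that rule purely as an example; the page does not specify which method Checkstack applies, and the function name and the `k` parameter are assumptions.

```python
from statistics import mean, pstdev


def is_anomalous(history: list[float], reading: float, k: float = 3.0) -> bool:
    """True when `reading` is more than k standard deviations from the mean of `history`."""
    if len(history) < 2:
        return False  # too little history to define a range
    mu, sigma = mean(history), pstdev(history)
    if sigma == 0:
        return reading != mu  # flat history: any deviation stands out
    return abs(reading - mu) > k * sigma


latencies_ms = [100.0, 102.0, 98.0, 101.0, 99.0]
normal = is_anomalous(latencies_ms, 103.0)
spike = is_anomalous(latencies_ms, 140.0)
```

A reading of 103 ms stays inside the range for this history, while 140 ms is flagged even though the request itself may have succeeded.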
Data retention
Raw check results add up quickly. Checkstack aggregates them on a tiered schedule so old data still fuels long-term charts without ballooning the database:
| Tier | Default retention |
|---|---|
| Raw runs | 7 days |
| Hourly aggregates | 30 days |
| Daily aggregates | 365 days |
The retention pipeline runs in the background and is configurable per health-check assignment. See Retention and limits (full reference) and the data management developer doc for the internals.
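To make the tiers concrete, the sketch below rolls raw runs into hourly summaries and checks whether a record has outlived its tier. It is a simplified Python illustration: the retention values come from the table above, while the function names, the tuple layout of a run, and the summary fields are assumptions rather than the real pipeline.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

# Default retention per tier, taken from the table above.
RETENTION = {
    "raw": timedelta(days=7),
    "hourly": timedelta(days=30),
    "daily": timedelta(days=365),
}


def hourly_rollup(runs):
    """Group (timestamp, latency_ms, ok) runs into per-hour summaries."""
    buckets = defaultdict(list)
    for ts, latency_ms, ok in runs:
        hour = ts.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append((latency_ms, ok))
    return {
        hour: {
            "runs": len(items),
            "avg_latency_ms": sum(latency for latency, _ in items) / len(items),
            "success_ratio": sum(ok for _, ok in items) / len(items),
        }
        for hour, items in buckets.items()
    }


def is_expired(tier: str, ts: datetime, now: datetime) -> bool:
    """True when a record in `tier` is older than that tier's retention."""
    return now - ts > RETENTION[tier]


utc = timezone.utc
now = datetime(2024, 1, 31, 12, 0, tzinfo=utc)
runs = [
    (datetime(2024, 1, 31, 10, 5, tzinfo=utc), 100.0, True),
    (datetime(2024, 1, 31, 10, 35, tzinfo=utc), 200.0, False),
    (datetime(2024, 1, 31, 11, 10, tzinfo=utc), 50.0, True),
]
hourly = hourly_rollup(runs)
```

An eight-day-old record would be expired in the raw tier but still kept in the hourly tier, which is how long-term charts survive after raw runs are dropped.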
Script health checks
If none of the bundled strategies fit, the script health check lets you write the probe as a small piece of code. You provide the script, Checkstack runs it on the schedule you set, and the script returns a result the platform can grade. This is the escape hatch for one-off checks that do not warrant a full plugin.
See Script health checks for the runtime contract and security model.
How a check moves through the system
A simplified view of one run:
    [scheduler] ----> queue job ----> [executor] ----> [strategy.connect()]
                                          |                 |
                                          |                 v
                                          |         [collector(s) run]
                                          v                 |
                                record HealthCheckRun <-----+
                                          |
                                          v
                        [state evaluator] applies thresholds
                                          |
                                          v
                       state transition? -> notify subscribers
                                          |
                                          v
                       [aggregation] rolls into hourly bucket

The notification step honours Incident and Maintenance silencing for affected systems, so an already-reported outage does not flood the chat.
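The silencing rule at the notification step can be summarised as a single predicate. This is a hypothetical Python sketch: the function name and both flags are assumptions, and the real silencing logic may take more into account.

```python
def should_notify(previous: str, current: str, *,
                  incident_open: bool, in_maintenance: bool) -> bool:
    """Notify only on a real state change that is not silenced."""
    if previous == current:
        return False  # no transition, nothing to announce
    # An open incident or an active maintenance window silences the system.
    return not (incident_open or in_maintenance)


plain = should_notify("healthy", "unhealthy", incident_open=False, in_maintenance=False)
silenced = should_notify("healthy", "unhealthy", incident_open=True, in_maintenance=False)
```

The run and the state transition are still recorded in both cases; only the outgoing notification is suppressed.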
UI tour
| Where to go | What you do there |
|---|---|
| Health Checks -> Configurations | Create and edit check definitions. Pick a strategy, configure it, set the interval. |
| System detail page | Attach a check to a system, override thresholds, view latency and status charts. |
| Health Checks -> Templates | Save common configurations as templates to apply to many systems. |
Where to go next
- Hands-on. Walk through Set up your first health check.
- Custom logic. Read Script health checks for the scripted escape hatch.
- Remote execution. See Satellites when you need to monitor from elsewhere.
- Notifications. Read Notifications to understand who hears about a state change.