Observability: Correlation IDs and Structured Logging

Every request that flows through a Checkstack oRPC router is tagged with a stable correlation ID and a small set of contextual metadata (plugin ID, user ID where applicable) so that the log lines produced by a single request can be reconstructed end-to-end. This is wired up once, in @checkstack/backend-api, and applies uniformly to every plugin router that uses the standard middleware chain.

The header contract

The HTTP header x-correlation-id is the single source of truth.

If the inbound request carries x-correlation-id, the platform uses that value verbatim. Callers (the React frontend, external scripts, peer services) own the trace and can hand the same ID into their own client-side logs to correlate the round trip.
If the header is absent or empty, the platform generates a fresh UUID v4 via crypto.randomUUID(). Handlers MUST NOT mint their own IDs - the middleware is the only generation site.
The chosen ID is echoed back on the response under the same x-correlation-id header, so the caller can log it after the fact even if they did not supply one upstream.
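
The three rules above can be sketched as a small pure function. This is an illustration only, not the source of correlationMiddleware: resolveCorrelationId is a hypothetical name, and the real middleware also binds the ID to the logger and sets the response header.

```typescript
import { randomUUID } from "node:crypto";

// Hypothetical helper for illustration; the real logic lives inside
// correlationMiddleware in @checkstack/backend-api.
function resolveCorrelationId(inbound: string | null | undefined): string {
  // Absent or empty header: mint a fresh UUID v4.
  if (inbound === null || inbound === undefined || inbound === "") {
    return randomUUID();
  }
  // Present: use the caller's value verbatim (no trimming, no re-casing).
  return inbound;
}

const supplied = resolveCorrelationId("req-42"); // caller-owned ID, kept as-is
const generated = resolveCorrelationId(undefined); // platform-generated UUID v4
console.log(supplied, generated);
```

Whichever branch is taken, the returned value is what would be echoed back on the response.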

The header name is exported as a constant from @checkstack/backend-api so dev tools, integration tests, and fetch wrappers do not hard-code the string:

import { CORRELATION_ID_HEADER } from "@checkstack/backend-api";

// frontend example
const res = await fetch("/api/...", {
  headers: { [CORRELATION_ID_HEADER]: crypto.randomUUID() },
});

The middleware

correlationMiddleware is exported from @checkstack/backend-api and must be applied to every plugin router BEFORE autoAuthMiddleware. The order matters: correlation runs first so that auth failures still log with the correlation ID attached.

import {
  autoAuthMiddleware,
  correlationMiddleware,
  type RpcContext,
} from "@checkstack/backend-api";
import { implement } from "@orpc/server";

const os = implement(myContract)
  .$context<RpcContext>()
  .use(correlationMiddleware)
  .use(autoAuthMiddleware);

The scaffolding template at core/scripts/src/templates/backend/ ships this chain by default, so any plugin generated via bun run create already wires correlation in.
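
To see why the order matters, consider a toy model of a middleware chain. None of this is oRPC's API: the Ctx and Middleware types, run, and both toy middlewares are hypothetical stand-ins that only mimic "each middleware can enrich the context before calling the next one".

```typescript
// Toy model only: these types are NOT oRPC's.
type Ctx = {
  headers: Record<string, string>;
  logFields: Record<string, string>;
  logs: string[];
};
type Middleware = (ctx: Ctx, next: () => void) => void;

// Binds the correlation ID to the log fields, then continues.
const toyCorrelation: Middleware = (ctx, next) => {
  ctx.logFields.correlationId = ctx.headers["x-correlation-id"] ?? "generated-id";
  next();
};

// Always rejects, to show what an auth failure would log.
const toyAuth: Middleware = (ctx, _next) => {
  ctx.logs.push(`auth failed ${JSON.stringify(ctx.logFields)}`);
};

function run(chain: Middleware[], ctx: Ctx): Ctx {
  const dispatch = (i: number): void => {
    const mw = chain[i];
    if (mw) mw(ctx, () => dispatch(i + 1));
  };
  dispatch(0);
  return ctx;
}

const newCtx = (): Ctx => ({
  headers: { "x-correlation-id": "req-42" },
  logFields: {},
  logs: [],
});

const correlationFirst = run([toyCorrelation, toyAuth], newCtx());
const authFirst = run([toyAuth, toyCorrelation], newCtx());
console.log(correlationFirst.logs[0]); // auth failed {"correlationId":"req-42"}
console.log(authFirst.logs[0]); // auth failed {}
```

With auth first, the failure line has no ID to search for, which is exactly the case the required ordering avoids.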

Auto-injected metadata fields

When a request enters a handler, ctx.logger is a child logger with the following fields pre-bound to every log entry it produces:

Field	Source	Always present?
`correlationId`	Inbound header or UUID v4 generated by the middleware	Yes
`pluginId`	`ctx.pluginMetadata.pluginId`	Yes
`userId`	`ctx.user.id` (real users + applications)	Only when `ctx.user` has an `id`

Service users (type: "service") do not have an id and so do not contribute a userId - those calls are still logged with correlationId + pluginId.
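
A minimal sketch of how the bound fields could be assembled from the table above. The User union and MetadataSource shape are assumptions made for illustration (the real RpcContext is richer), and buildLogMetadata is a hypothetical name.

```typescript
// Hypothetical shapes; only the "service users have no id" rule comes
// from the text above.
type User =
  | { type: "user" | "application"; id: string }
  | { type: "service" };

interface MetadataSource {
  correlationId: string;
  pluginMetadata: { pluginId: string };
  user?: User;
}

function buildLogMetadata(src: MetadataSource): Record<string, string> {
  const fields: Record<string, string> = {
    correlationId: src.correlationId,
    pluginId: src.pluginMetadata.pluginId,
  };
  // userId is bound only when the user actually carries an id.
  if (src.user && "id" in src.user) {
    fields.userId = src.user.id;
  }
  return fields;
}

// A service call: logs correlationId and pluginId, and no userId.
console.log(
  buildLogMetadata({
    correlationId: "req-42",
    pluginMetadata: { pluginId: "healthcheck" },
    user: { type: "service" },
  }),
);
```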

Adding more context inside a handler

Handlers can derive a tighter-scoped logger via .child({ ... }) for sub-operations (jobs, batched work, retries). The child inherits the correlation metadata automatically:

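async function importBatch({ context, input }) {
  // `input.items` is a placeholder for whatever this procedure receives
  const { items } = input;
  // `child` is optional on Logger, so fall back to the base logger in mocks
  const log = context.logger.child?.({ batchId: "abc-123" }) ?? context.logger;
  log.info("starting import", { itemCount: items.length });
  // every line through `log` carries correlationId, pluginId, userId, AND batchId
}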

Logger.child is optional on the interface so minimal test mocks do not have to implement it; production Winston loggers always do. Handlers that depend on it should branch on presence and fall back to the base logger when the method is absent.
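
The presence check can be factored into a tiny helper. The sketch below uses a cut-down stand-in for the Logger interface (only info and the optional child) and an in-memory logger in place of Winston; scoped and memoryLogger are hypothetical names.

```typescript
// Cut-down stand-in for the Logger interface: `child` is optional and
// methods take a trailing varargs list.
interface Logger {
  info(message: string, ...args: unknown[]): void;
  child?(fields: Record<string, unknown>): Logger;
}

// Hypothetical helper: a scoped child when supported, else the base logger.
function scoped(logger: Logger, fields: Record<string, unknown>): Logger {
  return logger.child?.(fields) ?? logger;
}

// In-memory logger that supports child(), to show field inheritance.
function memoryLogger(bound: Record<string, unknown>, sink: object[]): Logger {
  return {
    info: (message, ...args) => {
      const first = args[0];
      const meta = typeof first === "object" && first !== null ? first : {};
      sink.push({ ...bound, ...meta, message });
    },
    child: (fields) => memoryLogger({ ...bound, ...fields }, sink),
  };
}

const sink: object[] = [];
const base = memoryLogger({ correlationId: "req-42", pluginId: "importer" }, sink);
scoped(base, { batchId: "abc-123" }).info("starting import", { itemCount: 3 });
console.log(sink[0]); // carries correlationId, pluginId, batchId and itemCount

// A bare test mock without child() goes through the same helper unchanged.
const mock: Logger = { info: () => {} };
console.log(scoped(mock, { batchId: "abc-123" }) === mock); // true
```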

Structured metadata vs. varargs

The Logger methods accept a trailing argument list (...args: unknown[]) so the long-standing varargs callsites - logger.error("…", err) where err is an Error, or logger.info("…", value1, value2) - keep working unchanged. Winston’s splat handling treats:

a single trailing Error instance as a special-cased error payload (with stack and message), and
a single trailing plain object as structured metadata that gets merged into the log entry.

For NEW code, prefer the structured-metadata shape:

logger.info("imported items", { count, durationMs, source });

Both shapes flow through the same vararg slot in the interface, so no overload churn is needed; the choice is purely stylistic and operational (structured metadata is far easier to grep in a log aggregator).

What this enables

Once the platform logger ingests the metadata, every entry produced while handling the request carries { correlationId, pluginId, userId? } - including framework-level lines (auth failures, validation errors, queue dispatch). Searching a log aggregator for correlationId=… reconstructs the request end-to-end across plugins and consecutive service-to-service (S2S) hops.

For correlation across HTTP boundaries (e.g. the frontend that triggered the request), the response echo lets the caller log the ID it actually got, which is then identical to the ID in the server logs.
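
On the caller side, reading the echo might look like the sketch below. callAndGetCorrelationId, FetchLike, fakeFetch, and the /api/example path are all hypothetical; the header name is hard-coded only to keep the example self-contained, and real code would import CORRELATION_ID_HEADER instead.

```typescript
// Hard-coded to stay self-contained; prefer importing CORRELATION_ID_HEADER.
const HEADER = "x-correlation-id";

// Minimal structural type so a fake can stand in for fetch().
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ headers: { get(name: string): string | null } }>;

// Optionally sends an ID, and always reports the one the server echoed.
async function callAndGetCorrelationId(
  fetchImpl: FetchLike,
  url: string,
  ownId?: string,
): Promise<string | null> {
  const headers: Record<string, string> = ownId ? { [HEADER]: ownId } : {};
  const res = await fetchImpl(url, { headers });
  // The echoed value is the ID that appears in the server logs.
  return res.headers.get(HEADER);
}

// Fake server for the demo: echoes the inbound ID or makes one up.
const fakeFetch: FetchLike = async (_url, init) => {
  const id = init.headers[HEADER] ?? "server-generated";
  return { headers: { get: (name) => (name === HEADER ? id : null) } };
};

callAndGetCorrelationId(fakeFetch, "/api/example", "req-42").then((id) =>
  console.log("log this next to the client-side error:", id),
);
```

Because the wrapper logs the echoed value rather than the one it sent, it works the same whether or not the caller supplied an ID.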

Metrics (OpenTelemetry + Prometheus)

Alongside structured logs, the backend can export OpenTelemetry metrics over a Prometheus endpoint so you can ground a performance investigation in real numbers instead of guesses. The whole layer is off by default and free when off: the instruments are OTel no-ops until a MeterProvider is registered, so the hot paths pay nothing until you opt in.

Enabling the exporter

Set CHECKSTACK_METRICS_ENABLED=1 and start the backend. The host registers a global MeterProvider and a Prometheus exporter that runs its own HTTP server, separate from the app router. The metrics endpoint therefore sits entirely outside the app's authentication; who can reach it is decided by the bind host and port.

For local development, prefix the dev cockpit; the flag propagates to the backend it spawns:

CHECKSTACK_METRICS_ENABLED=1 bun run dev
# then, from another shell, once it has cycled:
curl -s http://127.0.0.1:9464/metrics | grep '^checkstack_'

Or run just the backend package directly (e.g. in a deployed process):

CHECKSTACK_METRICS_ENABLED=1 bun run --filter '@checkstack/backend' start

Env var	Default	Meaning
`CHECKSTACK_METRICS_ENABLED`	(unset)	Any non-empty value turns the exporter on.
`CHECKSTACK_METRICS_HOST`	`127.0.0.1`	Bind host. Use `0.0.0.0` only behind a firewall.
`CHECKSTACK_METRICS_PORT`	`9464`	Port for the `/metrics` endpoint.
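
Taking the table literally, the three variables resolve as in the sketch below. resolveMetricsConfig and MetricsConfig are hypothetical names, not the host's actual code; note that under the "any non-empty value" rule, values such as 0 or false would also enable the exporter.

```typescript
interface MetricsConfig {
  enabled: boolean;
  host: string;
  port: number;
}

// Hypothetical resolver mirroring the table above.
function resolveMetricsConfig(env: Record<string, string | undefined>): MetricsConfig {
  const flag = env.CHECKSTACK_METRICS_ENABLED;
  return {
    // Any non-empty value enables the exporter, so "0" and "false" do too.
    enabled: flag !== undefined && flag !== "",
    host: env.CHECKSTACK_METRICS_HOST || "127.0.0.1",
    port: Number(env.CHECKSTACK_METRICS_PORT || "9464"),
  };
}

console.log(resolveMetricsConfig({})); // disabled, 127.0.0.1, 9464
console.log(resolveMetricsConfig({ CHECKSTACK_METRICS_ENABLED: "1" })); // enabled, same defaults
```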

What is exported

The instruments live in @checkstack/backend-api’s instrumentation module (lazy accessors that any plugin can record through) plus a few host-owned observable instruments:

Metric	Kind	Labels	What it tells you
`checkstack_db_transactions_total`	counter	`schema`	Scoped-DB transactions opened per plugin schema.
`checkstack_db_queries_total`	counter	`schema`	Standalone scoped queries (each wraps in its own tx).
`checkstack_db_query_duration`	histogram	`schema`, `operation`	Standalone scoped-query wall-clock (`BEGIN`+`SET LOCAL`+query+`COMMIT`), by operation kind.
`checkstack_db_transaction_duration`	histogram	`schema`	`withScopedTransaction` batch wall-clock = connection hold time.
`checkstack_healthcheck_execution_duration`	histogram	`status`	End-to-end run latency by outcome.
`checkstack_healthcheck_phase_duration`	histogram	`phase`	Per-phase timing (`connect`, `wait`, …) from run timings.
`checkstack_healthcheck_deferred`	counter	`reason`	Suspect env-runs skipped by the slow-check bulkhead (`lane_full`/`in_flight`).
`checkstack_queue_enqueued_total`	counter	`queue`	Jobs enqueued per queue.
`checkstack_queue_processed_total`	counter	`queue`, `status`	Jobs completed/failed per queue.
`checkstack_queue_jobs`	gauge	`state`	Queue depth: `pending` backlog + `processing`, across queues.
`checkstack_db_pool_connections`	gauge	`pool`, `state`	admin/lock pool `active`/`idle`/`waiting` counts.
`checkstack_runtime_event_loop_delay`	histogram	-	setInterval drift = how long the JS thread was blocked.

These are the direct tests for the questions a slowdown raises:

db_query_duration{operation=...} and db_transaction_duration answer “how long do queries take, and how long is a connection held”. The query histogram is labelled by schema + operation (select/insert/update/delete/execute/$count), so a slow operation kind stands out per plugin. The transaction histogram is the connection-hold time of a withScopedTransaction batch: a rising p95 here means a batch is pinning a pooled connection. That is the thing to watch after batching an N+1, and a guard against accidentally wrapping slow non-DB work in a transaction. Both labels are BOUNDED; for the per-statement drill-down (which exact SQL is hot) use the query profiler below.
db_transactions_total minus db_queries_total per schema is the number of batched transactions. Batching an N+1 read fan-out into one withScopedTransaction shows up here as transactions rising far slower than the work done - the metric that proves the batching in drizzle-schema is actually taking effect in production.
healthcheck_phase_duration{phase="connect"} vs {phase="wait"} separates “slow to establish the connection” from “slow server” - a high connect p95 with a low wait points at connection establishment (TLS/TCP), not the target being slow or the platform being CPU-bound (which event_loop_delay measures independently).
queue_jobs{state="pending"} is the scale signal. A bounded, draining backlog means throughput keeps up; a pending that climbs without draining means work arrives faster than the queue concurrency can execute it - e.g. slow/timing-out health checks each pinning a concurrency slot for the full timeout. Watch it against db_pool_connections{pool="lock",state="waiting"}: if pending grows while lock-waiting stays at 0, you are slot-bound (raise concurrency or stop slow checks starving healthy ones), not database-bound.
healthcheck_deferred{reason="lane_full"} is the slow-check bulkhead engaging. During a correlated outage the suspect lane fills and further suspect env-runs are deferred (recording nothing, freeing the slot) so healthy checks keep draining - a rising lane_full alongside a bounded queue_jobs{state="pending"} is the bulkhead working as designed, not an error. See health-check execution.
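
Two of the readings above reduce to simple arithmetic, sketched here with made-up numbers. batchedTransactions follows directly from the counter definitions; classifyBacklog is a hypothetical helper, and its "possibly database-bound" branch is an extrapolation, since the text only states the slot-bound case.

```typescript
// Counter readings for one schema, as scraped from /metrics.
interface SchemaCounters {
  transactionsTotal: number; // checkstack_db_transactions_total{schema=...}
  queriesTotal: number; // checkstack_db_queries_total{schema=...}
}

// Standalone queries each open their own transaction, so what remains
// after subtracting them is the number of withScopedTransaction batches.
function batchedTransactions(c: SchemaCounters): number {
  return c.transactionsTotal - c.queriesTotal;
}

// Hypothetical classifier for the pending vs. lock-waiting comparison.
function classifyBacklog(pendingGrowing: boolean, lockWaiting: number): string {
  if (!pendingGrowing) return "keeping up";
  return lockWaiting === 0 ? "slot-bound" : "possibly database-bound";
}

// Made-up readings: 1200 transactions, of which 1150 were standalone queries.
console.log(batchedTransactions({ transactionsTotal: 1200, queriesTotal: 1150 })); // 50
console.log(classifyBacklog(true, 0)); // slot-bound
```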

Query profiler (pg_stat_statements)

The db_query_duration histogram tells you which OPERATION KIND is slow per schema; to find WHICH exact statement is hot, the backend can additionally export Postgres’ own pg_stat_statements view as metrics. This is the per-statement drill-down: normalized statement text, cumulative call count, total and mean execution time, and rows.

It is opt-in and self-disabling. When metrics are enabled the backend probes the connected database once at startup:

If pg_stat_statements is not active (the extension is not in shared_preload_libraries, or CREATE EXTENSION pg_stat_statements was never run, or the connecting role cannot read the view) the profiler registers nothing and logs a single info line. A deployment without the extension pays zero cost and sees no error.
If it is active, the profiler registers observable instruments read on each scrape.
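
The self-disabling probe can be pictured as one guarded read at startup. This is a sketch under assumptions, not the profiler's source: isPgStatStatementsActive and QueryFn are hypothetical names, the probe SQL is a guess at a cheap read, and the wording of the "disabled" log message is a placeholder.

```typescript
// Minimal structural type so any Postgres client (or a fake) can be passed in.
type QueryFn = (sql: string) => Promise<unknown>;

// One cheap read decides whether the profiler registers anything at all.
async function isPgStatStatementsActive(
  query: QueryFn,
  logInfo: (message: string) => void,
): Promise<boolean> {
  try {
    await query("SELECT 1 FROM pg_stat_statements LIMIT 1");
    return true;
  } catch {
    // Missing preload, missing extension, and missing read permission all
    // land here. Placeholder message; the real log line may differ.
    logInfo("pg_stat_statements not active; query profiler disabled");
    return false;
  }
}

// Demo with a fake client that fails the way an unprepared database would.
const messages: string[] = [];
const failingQuery: QueryFn = async () => {
  throw new Error('relation "pg_stat_statements" does not exist');
};
isPgStatStatementsActive(failingQuery, (m) => messages.push(m)).then((active) =>
  console.log(active, messages.length),
); // false 1
```

The error is swallowed on purpose: a deployment without the extension should see one info line and nothing else.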

Setting up Postgres for advanced profiling

pg_stat_statements is a Postgres contrib module. Activating it is a two-part job: the shared library must be preloaded at server start (shared_preload_libraries, which requires a restart), AND the extension must be created in the database Checkstack connects to. Creating the extension WITHOUT the preload leaves a view that exists but fails on every read (pg_stat_statements must be loaded via shared_preload_libraries); preloading WITHOUT creating the extension leaves no view at all (relation "pg_stat_statements" does not exist). The profiler treats both cases as “not active” and no-ops.

Standalone / managed Postgres. Set the preload, restart, then create the extension:

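-- 1) Preload the library (persisted; needs a restart to take effect).
--    This REPLACES the current list: if other libraries are already
--    preloaded, include them in the same comma-separated value.
ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';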

-- 2) Restart the Postgres server (managed providers: use their restart control).

-- 3) After restart, create the extension in Checkstack's database and
--    (optionally) track statements inside functions too.
CREATE EXTENSION IF NOT EXISTS pg_stat_statements;
ALTER SYSTEM SET pg_stat_statements.track = 'all';
SELECT pg_reload_conf();

Verify the view is actually readable by the connecting role (superuser or a role granted pg_read_all_stats):

psql "$DATABASE_URL" -c 'SELECT count(*) FROM pg_stat_statements;'

If that returns a count (not an error), the profiler will enable itself on the next backend start; watch for the log line Metrics: pg_stat_statements query profiler enabled.

Local docker-compose dev. The dev Postgres has no preload by default. Enable it once against the running container, then restart it:

docker exec checkstack-postgres-1 psql -U checkstack -d checkstack \
  -c "ALTER SYSTEM SET shared_preload_libraries = 'pg_stat_statements';"
docker restart checkstack-postgres-1
docker exec checkstack-postgres-1 psql -U checkstack -d checkstack \
  -c "CREATE EXTENSION IF NOT EXISTS pg_stat_statements;"

To make it reproducible instead of a one-off, add the preload to the postgres service in docker-compose-dev.yml as a command flag (command: postgres -c shared_preload_libraries=pg_stat_statements) and still run CREATE EXTENSION once. Then start the backend with metrics on:

CHECKSTACK_METRICS_ENABLED=1 bun run dev
curl -s http://127.0.0.1:9464/metrics | grep '^checkstack_db_statements_'

Metric	Kind	Labels	What it tells you
`checkstack_db_statements_calls_total`	counter	`queryid`, `query`	Cumulative call count for a hot statement.
`checkstack_db_statements_exec_time_ms_total`	counter	`queryid`, `query`	Cumulative total execution time.
`checkstack_db_statements_rows_total`	counter	`queryid`, `query`	Cumulative rows returned/affected.
`checkstack_db_statements_mean_exec_time_ms`	gauge	`queryid`, `query`	Mean execution time per call.

queryid is Postgres’ stable statement fingerprint; query is the normalized statement text (parameters stripped by pg_stat_statements), collapsed and truncated for use as a label.

Env var	Default	Meaning
`CHECKSTACK_DB_STATEMENTS_TOP_N`	`25`	How many hottest statements (by total exec time) to export.
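
The selection and labelling described above might be sketched as follows. Both function names are hypothetical, the 120-character cap is an assumption (the real truncation length is not stated here), and the sample statements are invented.

```typescript
// Subset of pg_stat_statements columns used in this sketch.
interface StatementRow {
  queryid: string;
  query: string;
  calls: number;
  totalExecTimeMs: number;
}

// Collapse whitespace runs and cap the length so the text is usable as a
// label. The 120-character default is an assumption, not the real limit.
function toQueryLabel(query: string, maxLen = 120): string {
  const collapsed = query.replace(/\s+/g, " ").trim();
  return collapsed.length > maxLen ? collapsed.slice(0, maxLen) : collapsed;
}

// Keep the N statements with the highest cumulative execution time.
function hottest(rows: StatementRow[], topN = 25): StatementRow[] {
  return [...rows]
    .sort((a, b) => b.totalExecTimeMs - a.totalExecTimeMs)
    .slice(0, topN);
}

const rows: StatementRow[] = [
  { queryid: "1", query: "SELECT *\n  FROM runs\n WHERE id = $1", calls: 900, totalExecTimeMs: 450 },
  { queryid: "2", query: "UPDATE runs SET status = $1", calls: 10, totalExecTimeMs: 2000 },
];
for (const row of hottest(rows, 1)) {
  console.log(row.queryid, toQueryLabel(row.query)); // 2 UPDATE runs SET status = $1
}
```

Ranking by total time rather than call count is what lets a rarely called but expensive statement outrank a cheap, frequent one, as in the sample rows.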

Analyzing a snapshot

Reading a raw /metrics scrape by eye is tedious, and a bug report often includes one (or two) pasted snapshots rather than live access. The profile:analyze script turns a snapshot into a ranked report - hot query paths, slowest-by-mean, batching effectiveness, transaction hold time, the pg_stat_statements drill-down, and auto-generated flags (unbatched high-volume schemas, pool saturation, event-loop starvation):

# One snapshot -> cumulative-since-boot totals.
bun run profile:analyze snapshot.txt

# Two snapshots (a baseline + a later scrape) -> the DELTA over that window,
# which is the accurate "what is hot right now". Order-independent.
bun run profile:analyze t0.txt t1.txt --interval 300

To capture a pair of snapshots from a running backend (metrics enabled), scrape the exporter twice a few minutes apart:

curl -s http://127.0.0.1:9464/metrics > t0.txt
sleep 300
curl -s http://127.0.0.1:9464/metrics > t1.txt
bun run profile:analyze t0.txt t1.txt --interval 300

The script reads only Checkstack’s own metric families and degrades gracefully: a single snapshot, or one without pg_stat_statements, simply omits the affected section. It needs no DB or network access, so it is safe to run against an untrusted snapshot from an issue. Flags: --top <n> (rows per table), --min-calls <n> (floor for the slowest-by-mean table), --interval <seconds> (also express counts as per-second rates).
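
For intuition about the two-snapshot mode, here is a rough sketch of a counter delta. It is not the profile:analyze implementation: parseSnapshot and counterDeltas are hypothetical, the parser assumes sample lines have no trailing timestamp, and order-independence is obtained by taking the absolute difference, which only holds for counters with no process restart between the scrapes.

```typescript
// Parse "name{labels} value" sample lines from a Prometheus text scrape.
// Assumes no trailing timestamp column; "#" lines are comments or metadata.
function parseSnapshot(text: string): Map<string, number> {
  const samples = new Map<string, number>();
  for (const line of text.split("\n")) {
    const trimmed = line.trim();
    if (trimmed === "" || trimmed.startsWith("#")) continue;
    const cut = trimmed.lastIndexOf(" ");
    if (cut === -1) continue;
    const value = Number(trimmed.slice(cut + 1));
    if (!Number.isNaN(value)) samples.set(trimmed.slice(0, cut), value);
  }
  return samples;
}

// Counters only grow, so the absolute difference is the same whichever
// snapshot is passed first.
function counterDeltas(a: string, b: string, intervalSeconds?: number) {
  const first = parseSnapshot(a);
  const second = parseSnapshot(b);
  const out: { series: string; delta: number; perSecond?: number }[] = [];
  first.forEach((value, series) => {
    const other = second.get(series);
    if (other === undefined) return; // series missing from one snapshot
    const delta = Math.abs(other - value);
    out.push({
      series,
      delta,
      ...(intervalSeconds ? { perSecond: delta / intervalSeconds } : {}),
    });
  });
  return out;
}

// Invented sample data: 100 queries at t0, 400 five minutes later.
const t0 = 'checkstack_db_queries_total{schema="plugin_a"} 100\n';
const t1 =
  '# TYPE checkstack_db_queries_total counter\n' +
  'checkstack_db_queries_total{schema="plugin_a"} 400\n';
console.log(counterDeltas(t1, t0, 300)); // delta 300, perSecond 1
```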

Recording from a plugin

Plugins do not wire anything up: import the accessor and record. It is a no-op until the host enables the exporter, so it is always safe to call:

import { queueEnqueuedCounter } from "@checkstack/backend-api";

queueEnqueuedCounter().add(1, { queue: myQueue.name });

The scoped-DB proxy, the health-check queue executor, and the in-memory queue already record through these accessors, so DB, health-check, and queue metrics populate with no per-plugin work.