The health check platform handles high-volume execution data through a tiered storage model with automated aggregation. This ensures both deep diagnostic capabilities for recent data and long-term trending for historical analysis.
Data flows through three distinct tiers, each optimized for different use cases:
| Tier | Storage Table | Default Retention | Contents | Use Case |
|---|---|---|---|---|
| Raw | `health_check_runs` | 7 days | Full run data including strategy-specific result JSONB | Recent diagnostics, per-run analysis |
| Hourly | `health_check_aggregates` | 30 days | Bucketed summaries with `aggregatedResult` | Medium-term trending, detailed charts |
| Daily | `health_check_aggregates` | 365 days | Daily summaries (core metrics only) | Long-term trending, capacity planning |
The platform automatically calculates these metrics for each bucket:
- Run counts: `runCount`, `healthyCount`, `degradedCount`, `unhealthyCount`
- Success rate: `healthyCount / runCount`
- Latency: `avgLatencyMs`, `minLatencyMs`, `maxLatencyMs`, `p95LatencyMs`

Health check strategies can contribute custom aggregated data via the required `aggregatedResult` schema and `mergeResult` hook:
```typescript
interface HealthCheckStrategy<TConfig, TResult, TAggregatedResult> {
  // ... other fields

  /**
   * REQUIRED: Schema for aggregated result data.
   * Defines the shape stored in health_check_aggregates.aggregatedResult
   */
  aggregatedResult: VersionedSchema<TAggregatedResult>;

  /**
   * REQUIRED: Incrementally merge a new run into the aggregated result.
   * Called during real-time aggregation and retention processing.
   */
  mergeResult(
    existing: TAggregatedResult | undefined,
    run: { status: string; latencyMs?: number; metadata?: TResult }
  ): TAggregatedResult;
}
```
> [!NOTE]
> The incremental `mergeResult` pattern enables O(1) storage overhead by maintaining aggregation state directly, without accumulating raw runs in memory.
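To make the constant-space idea concrete, here is a minimal sketch of incremental average and rate state. The state shapes and the `mergeAverageSketch` / `mergeRateSketch` names are assumptions made for illustration; the platform's own `mergeAverage`, `mergeRate`, `averageStateSchema`, and `rateStateSchema` are not shown in this document and may differ.

```typescript
// Hypothetical state shapes (assumptions, not the platform's schemas).
interface AverageState {
  sum: number;
  count: number;
  avg: number;
}

interface RateState {
  successes: number;
  total: number;
  rate: number; // success percentage, 0-100
}

// Fold one optional sample into the running average without storing samples.
function mergeAverageSketch(
  existing: AverageState | undefined,
  value: number | undefined
): AverageState {
  const sum = (existing?.sum ?? 0) + (value ?? 0);
  const count = (existing?.count ?? 0) + (value === undefined ? 0 : 1);
  return { sum, count, avg: count === 0 ? 0 : sum / count };
}

// Fold one success/failure outcome into the running success rate.
function mergeRateSketch(
  existing: RateState | undefined,
  success: boolean
): RateState {
  const successes = (existing?.successes ?? 0) + (success ? 1 : 0);
  const total = (existing?.total ?? 0) + 1;
  return { successes, total, rate: (successes / total) * 100 };
}

// Usage: fold three latency samples; only the small state object is kept.
const sampleLatencies = [100, 200, 300];
const avgState = sampleLatencies.reduce<AverageState | undefined>(
  (state, ms) => mergeAverageSketch(state, ms),
  undefined
);
```

Each merge reads only the previous state and the new run, so the stored state stays the same size no matter how many runs a bucket has absorbed.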
Example: HTTP Check Strategy
```typescript
const httpCheckStrategy: HealthCheckStrategy<HttpConfig, HttpResult, HttpAggregatedResult> = {
  id: "http-check",
  displayName: "HTTP Health Check",

  aggregatedResult: {
    version: 1,
    schema: z.object({
      // Display fields
      statusCodeDistribution: z.record(z.string(), z.number()),
      avgResponseTimeMs: z.number(),
      errorRate: z.number(),
      // Internal state for incremental aggregation
      _responseTime: averageStateSchema,
      _errorRate: rateStateSchema,
    }),
  },

  mergeResult(existing, run) {
    const metadata = run.metadata;
    const code = String(metadata?.statusCode ?? 0);

    // Merge status code distribution
    const statusCodes = { ...(existing?.statusCodeDistribution ?? {}) };
    statusCodes[code] = (statusCodes[code] ?? 0) + 1;

    // Merge averages and rates using utilities
    const responseTime = mergeAverage(existing?._responseTime, run.latencyMs);
    const errorRate = mergeRate(existing?._errorRate, run.status === "healthy");

    return {
      statusCodeDistribution: statusCodes,
      avgResponseTimeMs: responseTime.avg,
      errorRate: 100 - errorRate.rate, // Convert success rate to error rate
      _responseTime: responseTime,
      _errorRate: errorRate,
    };
  },

  // ... createClient and other methods
};
```
A daily background job manages the data lifecycle:
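1. Raw to hourly: raw runs older than `rawRetentionDays` are grouped into hourly buckets. The job calls `strategy.mergeResult(existing, run)` to incrementally aggregate each run and stores the result in `health_check_aggregates` with `bucketSize: 'hourly'`.
2. Hourly to daily: hourly aggregates older than `hourlyRetentionDays` are rolled up into daily buckets. Averages are weighted by run count, `SUM(avg * runCount) / SUM(runCount)`, and strategy-specific `aggregatedResult` values are dropped.
3. Cleanup: deletes daily aggregates older than `dailyRetentionDays`.

> [!TIP]
> The `mergeResult` pattern enables real-time aggregation during execution, not just during retention processing.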
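The weighted average `SUM(avg * runCount) / SUM(runCount)` can be sketched as follows. The `CoreBucket` shape and the `rollupToDaily` name are assumptions that reuse the core metric names listed earlier; `p95LatencyMs` is left out because this document does not say how percentiles are combined across buckets.

```typescript
// Hypothetical bucket shape built from the core metric names in this document.
interface CoreBucket {
  runCount: number;
  healthyCount: number;
  degradedCount: number;
  unhealthyCount: number;
  avgLatencyMs: number;
  minLatencyMs: number;
  maxLatencyMs: number;
}

// Combine hourly buckets into one daily bucket (illustrative sketch only).
function rollupToDaily(hourly: CoreBucket[]): CoreBucket | undefined {
  if (hourly.length === 0) return undefined;

  const total = (pick: (b: CoreBucket) => number): number =>
    hourly.reduce((acc, b) => acc + pick(b), 0);

  const runCount = total((b) => b.runCount);
  // SUM(avg * runCount): each hour contributes in proportion to its runs.
  const weightedLatency = total((b) => b.avgLatencyMs * b.runCount);

  return {
    runCount,
    healthyCount: total((b) => b.healthyCount),
    degradedCount: total((b) => b.degradedCount),
    unhealthyCount: total((b) => b.unhealthyCount),
    avgLatencyMs: runCount === 0 ? 0 : weightedLatency / runCount,
    minLatencyMs: Math.min(...hourly.map((b) => b.minLatencyMs)),
    maxLatencyMs: Math.max(...hourly.map((b) => b.maxLatencyMs)),
  };
}

// Usage: a quiet hour (10 runs at 100 ms) and a busy hour (30 runs at 200 ms).
const dailyBucket = rollupToDaily([
  { runCount: 10, healthyCount: 10, degradedCount: 0, unhealthyCount: 0,
    avgLatencyMs: 100, minLatencyMs: 80, maxLatencyMs: 150 },
  { runCount: 30, healthyCount: 27, degradedCount: 2, unhealthyCount: 1,
    avgLatencyMs: 200, minLatencyMs: 90, maxLatencyMs: 400 },
]);
// dailyBucket.avgLatencyMs is (10 * 100 + 30 * 200) / 40 = 175
```

Averaging the two hourly averages directly would give 150 and understate the busy hour, which is why the rollup weights by `runCount`.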
Retention can be customized per system-assignment via the RetentionConfig schema:
```typescript
const RetentionConfigSchema = z.object({
  /** Days to keep raw run data (1-30, default: 7) */
  rawRetentionDays: z.number().int().min(1).max(30).default(7),
  /** Days to keep hourly aggregates (7-90, default: 30) */
  hourlyRetentionDays: z.number().int().min(7).max(90).default(30),
  /** Days to keep daily aggregates (30-1095, default: 365) */
  dailyRetentionDays: z.number().int().min(30).max(1095).default(365),
});
```
> [!IMPORTANT]
> The platform enforces a strict hierarchy: `rawRetentionDays < hourlyRetentionDays < dailyRetentionDays`. Violations result in a `BAD_REQUEST` error.
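The hierarchy rule is a cross-field check, so per-field ranges alone cannot express it. Below is a minimal sketch of such a check; `BadRequestError` and `assertRetentionHierarchy` are hypothetical names, since the platform's actual error type and validation code are not shown here.

```typescript
interface RetentionConfig {
  rawRetentionDays: number;
  hourlyRetentionDays: number;
  dailyRetentionDays: number;
}

// Hypothetical error type standing in for the platform's BAD_REQUEST error.
class BadRequestError extends Error {
  readonly code = "BAD_REQUEST";
}

// Throws unless raw < hourly < daily holds strictly.
function assertRetentionHierarchy(config: RetentionConfig): void {
  const { rawRetentionDays, hourlyRetentionDays, dailyRetentionDays } = config;
  const ordered =
    rawRetentionDays < hourlyRetentionDays &&
    hourlyRetentionDays < dailyRetentionDays;
  if (!ordered) {
    throw new BadRequestError(
      "Retention must satisfy rawRetentionDays < hourlyRetentionDays < dailyRetentionDays"
    );
  }
}

// The defaults (7, 30, 365) satisfy the hierarchy.
assertRetentionHierarchy({
  rawRetentionDays: 7,
  hourlyRetentionDays: 30,
  dailyRetentionDays: 365,
});
```

A config such as `rawRetentionDays: 30` with `hourlyRetentionDays: 30` passes each individual range check but is rejected here, because the comparison is strict.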
| Endpoint | Description |
|---|---|
| `getRetentionConfig` | Get current retention settings for an assignment |
| `updateRetentionConfig` | Update retention settings (pass `null` to reset to defaults) |
For unified chart rendering, the system uses cross-tier aggregation: it queries raw, hourly, and daily storage and merges the results by tier priority. The bucket interval is determined by `(endDate - startDate) / targetPoints`.

```typescript
// Service method signature
async getAggregatedHistory(
  props: {
    systemId: string;
    configurationId: string;
    startDate: Date;
    endDate: Date;
    targetPoints?: number; // Default: 500
  },
  options: { includeAggregatedResult: boolean }
)
```
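As a rough illustration of the bucket interval formula, the sketch below computes `(endDate - startDate) / targetPoints` in milliseconds and maps a timestamp to a bucket index. Both helper names are hypothetical, and the service's real bucketing logic may round or align buckets differently.

```typescript
// Interval in milliseconds: (endDate - startDate) / targetPoints.
function bucketIntervalMs(
  startDate: Date,
  endDate: Date,
  targetPoints: number = 500
): number {
  return (endDate.getTime() - startDate.getTime()) / targetPoints;
}

// Zero-based index of the bucket a timestamp falls into (hypothetical helper).
function bucketIndex(timestamp: Date, startDate: Date, intervalMs: number): number {
  return Math.floor((timestamp.getTime() - startDate.getTime()) / intervalMs);
}

// Usage: a 7-day window with the default 500 target points.
const windowStart = new Date("2024-01-01T00:00:00Z");
const windowEnd = new Date("2024-01-08T00:00:00Z");
const intervalMs = bucketIntervalMs(windowStart, windowEnd);
// 604,800,000 ms / 500 = 1,209,600 ms, roughly 20 minutes per bucket

const oneHourIn = new Date("2024-01-01T01:00:00Z");
const index = bucketIndex(oneHourIn, windowStart, intervalMs);
// 3,600,000 / 1,209,600 is about 2.98, so the run lands in bucket 2
```

Widening the date range with the same `targetPoints` makes each bucket longer, which keeps the number of chart points roughly constant.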
Aggregated data access follows the same tiered access model as raw data:
| Endpoint | Access | Returns |
|---|---|---|
| `getAggregatedHistory` | `healthCheckStatusRead` | Core metrics only (`AggregatedBucketBase`) |
| `getDetailedAggregatedHistory` | `healthCheckDetailsRead` | Core metrics + `aggregatedResult` |
Every strategy must provide a `mergeResult` implementation. Without it, historical views will lack strategy-specific insights.
The aggregatedResult is stored as JSONB. Design it to capture essential trends without replicating all raw data:
```typescript
// ✅ Good - summary statistics
{
  statusCodeDistribution: { "200": 95, "500": 5 },
  errorRate: 0.05,
}

// ❌ Bad - too detailed for aggregation
{
  allStatusCodes: [200, 200, 500, 200, ...],
  allErrors: [{ timestamp: ..., message: ... }, ...],
}
```
During daily rollup, aggregatedResult is dropped. Strategy diagrams should handle undefined aggregated results for very old data.
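One way a chart consumer might tolerate the missing data is to map each bucket to a nullable point, so daily buckets render as gaps instead of failing. The bucket and result shapes below are simplified assumptions for illustration, not the platform's actual types.

```typescript
// Simplified, hypothetical shapes for illustration.
interface HttpAggregatedSummary {
  statusCodeDistribution: Record<string, number>;
  errorRate: number;
}

interface ChartBucket {
  runCount: number;
  // Absent on daily buckets, where strategy-specific data has been dropped.
  aggregatedResult?: HttpAggregatedSummary;
}

// Returns null when no strategy-specific data exists for the bucket.
function errorRatePoint(bucket: ChartBucket): number | null {
  return bucket.aggregatedResult?.errorRate ?? null;
}

const hourlyPoint = errorRatePoint({
  runCount: 100,
  aggregatedResult: { statusCodeDistribution: { "200": 95, "500": 5 }, errorRate: 5 },
});
const dailyPoint = errorRatePoint({ runCount: 2400 });
// hourlyPoint is 5; dailyPoint is null, which a chart can draw as a gap
```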