Health Check Data Management

The health check platform handles high-volume execution data through a tiered storage model with automated aggregation. This ensures both deep diagnostic capabilities for recent data and long-term trending for historical analysis.

Data flows through three distinct tiers, each optimized for different use cases:

| Tier | Storage Table | Default Retention | Contents | Use Case |
|---|---|---|---|---|
| Raw | health_check_runs | 7 days | Full run data including strategy-specific result JSONB | Recent diagnostics, per-run analysis |
| Hourly | health_check_aggregates | 30 days | Bucketed summaries with aggregatedResult | Medium-term trending, detailed charts |
| Daily | health_check_aggregates | 365 days | Daily summaries (core metrics only) | Long-term trending, capacity planning |

The platform automatically calculates these metrics for each bucket:

  • Run Counts: runCount, healthyCount, degradedCount, unhealthyCount
  • Success Rate: Calculated as healthyCount / runCount
  • Latency Statistics: avgLatencyMs, minLatencyMs, maxLatencyMs, p95LatencyMs
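As a rough illustration of the metrics listed above, the following sketch computes them for one bucket of runs. The `RunSample` and `BucketMetrics` types and the `computeBucketMetrics` function are hypothetical names, and the nearest-rank method used for P95 is an assumption; the platform may compute percentiles differently.

```typescript
// Hypothetical shape of a single run; only status and latency matter here.
interface RunSample {
  status: "healthy" | "degraded" | "unhealthy";
  latencyMs?: number;
}

// Field names follow the metric names in the list above.
interface BucketMetrics {
  runCount: number;
  healthyCount: number;
  degradedCount: number;
  unhealthyCount: number;
  successRate: number; // healthyCount / runCount, in the range 0..1
  avgLatencyMs?: number;
  minLatencyMs?: number;
  maxLatencyMs?: number;
  p95LatencyMs?: number;
}

function computeBucketMetrics(runs: RunSample[]): BucketMetrics {
  const count = (status: RunSample["status"]): number =>
    runs.filter((r) => r.status === status).length;
  const healthyCount = count("healthy");
  const metrics: BucketMetrics = {
    runCount: runs.length,
    healthyCount,
    degradedCount: count("degraded"),
    unhealthyCount: count("unhealthy"),
    successRate: runs.length === 0 ? 0 : healthyCount / runs.length,
  };

  // Latency statistics only consider runs that reported a latency.
  const latencies = runs
    .map((r) => r.latencyMs)
    .filter((v): v is number => v !== undefined)
    .sort((a, b) => a - b);
  if (latencies.length > 0) {
    const sum = latencies.reduce((acc, v) => acc + v, 0);
    // Assumption: nearest-rank percentile (smallest value with >= 95% of samples at or below it).
    const p95Index = Math.ceil(0.95 * latencies.length) - 1;
    metrics.avgLatencyMs = sum / latencies.length;
    metrics.minLatencyMs = latencies[0];
    metrics.maxLatencyMs = latencies[latencies.length - 1];
    metrics.p95LatencyMs = latencies[p95Index];
  }
  return metrics;
}
```

Note that P95 needs the individual latency values, which is one plausible reason it cannot be carried forward once only summaries remain.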

Health check strategies contribute custom aggregated data through two required members: the aggregatedResult schema and the mergeResult hook:

interface HealthCheckStrategy<TConfig, TResult, TAggregatedResult> {
  // ... other fields

  /**
   * REQUIRED: Schema for aggregated result data.
   * Defines the shape stored in health_check_aggregates.aggregatedResult
   */
  aggregatedResult: VersionedSchema<TAggregatedResult>;

  /**
   * REQUIRED: Incrementally merge a new run into the aggregated result.
   * Called during real-time aggregation and retention processing.
   */
  mergeResult(
    existing: TAggregatedResult | undefined,
    run: { status: string; latencyMs?: number; metadata?: TResult }
  ): TAggregatedResult;
}

Example: HTTP Check Strategy

const httpCheckStrategy: HealthCheckStrategy<HttpConfig, HttpResult, HttpAggregatedResult> = {
  id: "http-check",
  displayName: "HTTP Health Check",

  aggregatedResult: {
    version: 1,
    schema: z.object({
      // Display fields
      statusCodeDistribution: z.record(z.string(), z.number()),
      avgResponseTimeMs: z.number(),
      errorRate: z.number(),
      // Internal state for incremental aggregation
      _responseTime: averageStateSchema,
      _errorRate: rateStateSchema,
    }),
  },

  mergeResult(existing, run) {
    const metadata = run.metadata;
    // Runs without a status code are counted under "0"
    const code = String(metadata?.statusCode ?? 0);

    // Merge status code distribution
    const statusCodes = { ...(existing?.statusCodeDistribution ?? {}) };
    statusCodes[code] = (statusCodes[code] ?? 0) + 1;

    // Merge averages and rates using utilities.
    // The rate state tracks the success rate (healthy runs); it is inverted below.
    const responseTime = mergeAverage(existing?._responseTime, run.latencyMs);
    const successRate = mergeRate(existing?._errorRate, run.status === "healthy");

    return {
      statusCodeDistribution: statusCodes,
      avgResponseTimeMs: responseTime.avg,
      errorRate: 100 - successRate.rate, // Convert success rate (percent) to error rate
      _responseTime: responseTime,
      _errorRate: successRate,
    };
  },

  // ... createClient and other methods
};
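The example relies on mergeAverage and mergeRate without showing them. The sketch below is one plausible way such incremental helpers could work, not the platform's actual implementation: the `AverageState` and `RateState` shapes are assumptions (the real averageStateSchema and rateStateSchema may differ), and the rate is expressed in percent only because the example computes `100 - rate`.

```typescript
// Assumed state shapes; the platform's real state schemas may differ.
interface AverageState {
  sum: number;
  count: number;
  avg: number;
}

interface RateState {
  successes: number;
  total: number;
  rate: number; // success rate in percent (0..100)
}

// Fold one value into a running average. Keeping sum and count alongside avg
// lets the next merge stay exact without revisiting earlier runs.
function mergeAverage(existing: AverageState | undefined, value: number | undefined): AverageState {
  const prev = existing ?? { sum: 0, count: 0, avg: 0 };
  if (value === undefined) return prev; // runs without a value leave the average untouched
  const sum = prev.sum + value;
  const count = prev.count + 1;
  return { sum, count, avg: sum / count };
}

// Fold one success/failure observation into a running rate.
function mergeRate(existing: RateState | undefined, success: boolean): RateState {
  const successes = (existing?.successes ?? 0) + (success ? 1 : 0);
  const total = (existing?.total ?? 0) + 1;
  return { successes, total, rate: (successes / total) * 100 };
}
```

Storing the internal state (the underscore-prefixed fields in the example) next to the display fields is what makes the merge incremental: each new run only needs the previous aggregate, never the full run history.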

A daily background job manages the data lifecycle in three stages:

Stage 1: Raw to hourly rollup

  1. Identifies raw runs older than rawRetentionDays
  2. For each run, calls strategy.mergeResult(existing, run) to incrementally aggregate
  3. Upserts into health_check_aggregates with bucketSize: 'hourly'
  4. Deletes processed raw runs

Stage 2: Hourly to daily rollup

  1. Identifies hourly aggregates older than hourlyRetentionDays
  2. Groups hourly buckets by day
  3. Calculates weighted-average latency: SUM(avg * runCount) / SUM(runCount)
  4. Keeps global min/max latency across all hourly buckets
  5. Inserts daily aggregate (note: P95 and aggregatedResult are dropped)
  6. Deletes processed hourly aggregates

Stage 3: Daily expiry

Deletes daily aggregates older than dailyRetentionDays.
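The hourly-to-daily rollup arithmetic can be sketched as follows. This is an in-memory illustration under stated assumptions, not the job's actual code: `CoreBucket`, `bucketStart`, and `rollUpToDaily` are hypothetical names, days are grouped by UTC calendar date, and database access is left out entirely.

```typescript
// Hypothetical core-metrics shape shared by hourly and daily buckets.
interface CoreBucket {
  bucketStart: Date;
  runCount: number;
  healthyCount: number;
  degradedCount: number;
  unhealthyCount: number;
  avgLatencyMs: number | undefined;
  minLatencyMs: number | undefined;
  maxLatencyMs: number | undefined;
}

function definedValues(values: (number | undefined)[]): number[] {
  return values.filter((v): v is number => v !== undefined);
}

function rollUpToDaily(hourly: CoreBucket[]): CoreBucket[] {
  // Group hourly buckets by UTC calendar day (assumption: days are cut in UTC).
  const byDay = new Map<string, CoreBucket[]>();
  for (const bucket of hourly) {
    const day = bucket.bucketStart.toISOString().slice(0, 10);
    const group = byDay.get(day) ?? [];
    group.push(bucket);
    byDay.set(day, group);
  }

  const daily: CoreBucket[] = [];
  byDay.forEach((buckets, day) => {
    const total = (pick: (b: CoreBucket) => number): number =>
      buckets.reduce((acc, b) => acc + pick(b), 0);

    // Weighted average: SUM(avg * runCount) / SUM(runCount), over buckets that have latency.
    const withLatency = buckets.filter((b) => b.avgLatencyMs !== undefined);
    const weight = withLatency.reduce((acc, b) => acc + b.runCount, 0);
    const weightedSum = withLatency.reduce(
      (acc, b) => acc + (b.avgLatencyMs ?? 0) * b.runCount,
      0
    );
    const mins = definedValues(buckets.map((b) => b.minLatencyMs));
    const maxs = definedValues(buckets.map((b) => b.maxLatencyMs));

    daily.push({
      bucketStart: new Date(`${day}T00:00:00.000Z`),
      runCount: total((b) => b.runCount),
      healthyCount: total((b) => b.healthyCount),
      degradedCount: total((b) => b.degradedCount),
      unhealthyCount: total((b) => b.unhealthyCount),
      avgLatencyMs: weight > 0 ? weightedSum / weight : undefined,
      // Global min/max across all hourly buckets of the day.
      minLatencyMs: mins.length > 0 ? Math.min(...mins) : undefined,
      maxLatencyMs: maxs.length > 0 ? Math.max(...maxs) : undefined,
    });
  });
  return daily;
}
```

Weighting by runCount matters: an hour with 10 runs averaging 100 ms and an hour with 30 runs averaging 200 ms roll up to 175 ms, whereas a plain mean of the two hourly averages would report 150 ms.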

Retention can be customized per system-assignment via the RetentionConfig schema:

const RetentionConfigSchema = z.object({
/** Days to keep raw run data (1-30, default: 7) */
rawRetentionDays: z.number().int().min(1).max(30).default(7),
/** Days to keep hourly aggregates (7-90, default: 30) */
hourlyRetentionDays: z.number().int().min(7).max(90).default(30),
/** Days to keep daily aggregates (30-1095, default: 365) */
dailyRetentionDays: z.number().int().min(30).max(1095).default(365),
});

The platform enforces a strict hierarchy: rawRetentionDays < hourlyRetentionDays < dailyRetentionDays. Violations result in a BAD_REQUEST error.
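The per-field ranges overlap (for example, rawRetentionDays and hourlyRetentionDays may both be 30), so the hierarchy needs its own check on top of the schema. A minimal sketch of that check follows; `validateRetentionHierarchy` is a hypothetical helper, and how the platform actually performs the check and maps it to BAD_REQUEST is not shown in this document.

```typescript
interface RetentionConfig {
  rawRetentionDays: number;
  hourlyRetentionDays: number;
  dailyRetentionDays: number;
}

// Defaults taken from the schema above.
const DEFAULT_RETENTION: RetentionConfig = {
  rawRetentionDays: 7,
  hourlyRetentionDays: 30,
  dailyRetentionDays: 365,
};

// Returns one message per violated ordering rule; an empty array means the
// hierarchy raw < hourly < daily holds. Range checks are left to the schema.
function validateRetentionHierarchy(config: RetentionConfig): string[] {
  const errors: string[] = [];
  if (config.rawRetentionDays >= config.hourlyRetentionDays) {
    errors.push("rawRetentionDays must be less than hourlyRetentionDays");
  }
  if (config.hourlyRetentionDays >= config.dailyRetentionDays) {
    errors.push("hourlyRetentionDays must be less than dailyRetentionDays");
  }
  return errors;
}
```

A caller could reject the update whenever the returned array is non-empty. Because the comparison is strict, a configuration such as raw 30 / hourly 30 / daily 365 passes every per-field range but still fails the hierarchy.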

| Endpoint | Description |
|---|---|
| getRetentionConfig | Get current retention settings for an assignment |
| updateRetentionConfig | Update retention settings (pass null to reset to defaults) |

For unified chart rendering, the service queries raw, hourly, and daily storage together (cross-tier aggregation) and merges the results by tier priority:

  1. Target Points: Frontend requests a fixed number of data points (e.g., 500)
  2. Dynamic Bucket Calculation: (endDate - startDate) / targetPoints determines bucket interval
  3. Tier Selection: Automatically queries the appropriate tier(s) based on interval
  4. Priority Merge: Raw data takes priority over hourly, which takes priority over daily
  5. Re-aggregation: Merged data is re-aggregated to match target bucket interval

// Service method signature
async getAggregatedHistory(
  props: {
    systemId: string;
    configurationId: string;
    startDate: Date;
    endDate: Date;
    targetPoints?: number; // Default: 500
  },
  options: { includeAggregatedResult: boolean }
)
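The bucket calculation, priority merge, and re-aggregation steps can be sketched in memory as shown below. Everything here is illustrative: `TierBucket`, `bucketIntervalMs`, `priorityMerge`, and `reaggregate` are hypothetical names, and the overlap rule (a lower-priority bucket is dropped when any higher-priority data starts inside its time span) is an assumption about how the priority merge could behave, not a description of the service's internals.

```typescript
type Tier = "raw" | "hourly" | "daily";

interface TierBucket {
  tier: Tier;
  bucketStart: number; // epoch milliseconds
  runCount: number;
  healthyCount: number;
}

const HOUR_MS = 60 * 60 * 1000;
const TIER_SPAN_MS: Record<Tier, number> = { raw: 0, hourly: HOUR_MS, daily: 24 * HOUR_MS };
const TIER_PRIORITY: Record<Tier, number> = { raw: 0, hourly: 1, daily: 2 }; // lower wins

// Dynamic bucket calculation: (endDate - startDate) / targetPoints.
function bucketIntervalMs(startDate: Date, endDate: Date, targetPoints = 500): number {
  return (endDate.getTime() - startDate.getTime()) / targetPoints;
}

// Priority merge (assumed rule): drop a bucket when a higher-priority tier
// has data starting inside the time span that bucket covers.
function priorityMerge(buckets: TierBucket[]): TierBucket[] {
  return buckets.filter((candidate) => {
    const end = candidate.bucketStart + TIER_SPAN_MS[candidate.tier];
    return !buckets.some(
      (other) =>
        TIER_PRIORITY[other.tier] < TIER_PRIORITY[candidate.tier] &&
        other.bucketStart >= candidate.bucketStart &&
        other.bucketStart < end
    );
  });
}

// Re-aggregation: sum the merged buckets into fixed-width target buckets.
function reaggregate(
  buckets: TierBucket[],
  startMs: number,
  intervalMs: number
): { bucketStart: number; runCount: number; healthyCount: number }[] {
  const byStart = new Map<number, { bucketStart: number; runCount: number; healthyCount: number }>();
  for (const b of buckets) {
    const index = Math.floor((b.bucketStart - startMs) / intervalMs);
    const key = startMs + index * intervalMs;
    const acc = byStart.get(key) ?? { bucketStart: key, runCount: 0, healthyCount: 0 };
    acc.runCount += b.runCount;
    acc.healthyCount += b.healthyCount;
    byStart.set(key, acc);
  }
  const result: { bucketStart: number; runCount: number; healthyCount: number }[] = [];
  byStart.forEach((value) => result.push(value));
  return result.sort((a, b) => a.bucketStart - b.bucketStart);
}
```

Under this simplified rule a whole hourly bucket is discarded as soon as one raw run falls inside it, which only gives correct counts if the raw tier fully covers that hour. A production implementation would need to settle how partially covered buckets are handled.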

Aggregated data access follows the same tiered access model as raw data:

| Endpoint | Access | Returns |
|---|---|---|
| getAggregatedHistory | healthCheckStatusRead | Core metrics only (AggregatedBucketBase) |
| getDetailedAggregatedHistory | healthCheckDetailsRead | Core metrics + aggregatedResult |

Every strategy must provide a mergeResult implementation. Without it, aggregated buckets carry no strategy-specific data, and historical views beyond the raw retention window show core metrics only.

The aggregatedResult is stored as JSONB. Design it to capture essential trends without replicating all raw data:

// ✅ Good - summary statistics
{
  statusCodeDistribution: { "200": 95, "500": 5 },
  errorRate: 5, // percent, as in the HTTP example
}

// ❌ Bad - too detailed for aggregation
{
  allStatusCodes: [200, 200, 500, 200, ...],
  allErrors: [{ timestamp: ..., message: ... }, ...],
}

During daily rollup, aggregatedResult is dropped. Strategy diagrams should handle undefined aggregated results for very old data.
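A consumer of aggregated buckets can guard against the missing field along these lines. The `HttpAggregatedView` and `HistoryBucket` types and the `formatErrorRate` helper are hypothetical, introduced only to illustrate the check; the percent formatting assumes errorRate is stored in percent as in the HTTP example.

```typescript
// Hypothetical subset of the HTTP strategy's aggregated result.
interface HttpAggregatedView {
  statusCodeDistribution: Record<string, number>;
  errorRate: number; // percent
}

// Hypothetical bucket shape: daily buckets arrive without aggregatedResult.
interface HistoryBucket {
  runCount: number;
  aggregatedResult?: HttpAggregatedView;
}

// Returns a display label, falling back to "n/a" when the bucket comes from
// the daily tier and therefore has no strategy-specific data.
function formatErrorRate(bucket: HistoryBucket): string {
  if (bucket.aggregatedResult === undefined) {
    return "n/a";
  }
  return `${bucket.aggregatedResult.errorRate.toFixed(1)}%`;
}
```

Treating the absence as "no data" rather than as zero keeps old buckets from showing a misleading 0% error rate.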