Health Check Data Management
Overview
The health check platform handles high-volume execution data through a tiered storage model with automated aggregation. This ensures both deep diagnostic capabilities for recent data and long-term trending for historical analysis.
Tiered Storage Architecture
Data flows through three distinct tiers, each optimized for different use cases:
| Tier | Storage Table | Default Retention | Contents | Use Case |
|---|---|---|---|---|
| Raw | health_check_runs | 7 days | Full run data including strategy-specific result JSONB | Recent diagnostics, per-run analysis |
| Hourly | health_check_aggregates | 30 days | Bucketed summaries with aggregatedResult | Medium-term trending, detailed charts |
| Daily | health_check_aggregates | 365 days | Daily summaries (core metrics only) | Long-term trending, capacity planning |
Aggregation Process
Automatic Metrics
The platform automatically calculates these metrics for each bucket:
- Run Counts: runCount, healthyCount, degradedCount, unhealthyCount
- Success Rate: Calculated as healthyCount / runCount
- Latency Statistics: avgLatencyMs, minLatencyMs, maxLatencyMs, p95LatencyMs
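The metrics listed above can be derived from the runs in a bucket roughly as follows. This is a minimal sketch, not the platform's implementation: the Run and BucketMetrics shapes and the computeBucketMetrics name are hypothetical, and the nearest-rank method used for P95 is an assumption, since the percentile method is not specified here.

```typescript
// Hypothetical shape of a raw run; only the fields needed for core metrics.
type RunStatus = "healthy" | "degraded" | "unhealthy";

interface Run {
  status: RunStatus;
  latencyMs?: number;
}

interface BucketMetrics {
  runCount: number;
  healthyCount: number;
  degradedCount: number;
  unhealthyCount: number;
  successRate: number; // healthyCount / runCount, 0 for an empty bucket
  avgLatencyMs?: number;
  minLatencyMs?: number;
  maxLatencyMs?: number;
  p95LatencyMs?: number;
}

function computeBucketMetrics(runs: Run[]): BucketMetrics {
  const count = (s: RunStatus) => runs.filter((r) => r.status === s).length;
  const healthyCount = count("healthy");

  // Latency is optional, so only runs that reported it contribute.
  const latencies = runs
    .map((r) => r.latencyMs)
    .filter((v): v is number => v !== undefined)
    .sort((a, b) => a - b);

  const metrics: BucketMetrics = {
    runCount: runs.length,
    healthyCount,
    degradedCount: count("degraded"),
    unhealthyCount: count("unhealthy"),
    successRate: runs.length === 0 ? 0 : healthyCount / runs.length,
  };

  if (latencies.length > 0) {
    const sum = latencies.reduce((acc, v) => acc + v, 0);
    // Assumption: nearest-rank percentile (smallest value with >= 95% of samples at or below it).
    const rank = Math.ceil(0.95 * latencies.length);
    metrics.avgLatencyMs = sum / latencies.length;
    metrics.minLatencyMs = latencies[0];
    metrics.maxLatencyMs = latencies[latencies.length - 1];
    metrics.p95LatencyMs = latencies[rank - 1];
  }
  return metrics;
}
```

Latency fields stay undefined when no run in the bucket reported a latency, which avoids reporting a misleading 0 ms.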
Strategy-Specific Aggregation
Health check strategies can contribute custom aggregated data by declaring the required aggregatedResult schema and implementing the required mergeResult hook:

    interface HealthCheckStrategy<TConfig, TResult, TAggregatedResult> {
      // ... other fields

      /**
       * REQUIRED: Schema for aggregated result data.
       * Defines the shape stored in health_check_aggregates.aggregatedResult
       */
      aggregatedResult: VersionedSchema<TAggregatedResult>;

      /**
       * REQUIRED: Incrementally merge a new run into the aggregated result.
       * Called during real-time aggregation and retention processing.
       */
      mergeResult(
        existing: TAggregatedResult | undefined,
        run: { status: string; latencyMs?: number; metadata?: TResult }
      ): TAggregatedResult;
    }

Example: HTTP Check Strategy

    const httpCheckStrategy: HealthCheckStrategy<HttpConfig, HttpResult, HttpAggregatedResult> = {
      id: "http-check",
      displayName: "HTTP Health Check",

      aggregatedResult: {
        version: 1,
        schema: z.object({
          // Display fields
          statusCodeDistribution: z.record(z.string(), z.number()),
          avgResponseTimeMs: z.number(),
          errorRate: z.number(),
          // Internal state for incremental aggregation
          _responseTime: averageStateSchema,
          _errorRate: rateStateSchema,
        }),
      },

      mergeResult(existing, run) {
        const metadata = run.metadata;
        const code = String(metadata?.statusCode ?? 0);

        // Merge status code distribution
        const statusCodes = { ...(existing?.statusCodeDistribution ?? {}) };
        statusCodes[code] = (statusCodes[code] ?? 0) + 1;

        // Merge averages and rates using utilities
        const responseTime = mergeAverage(existing?._responseTime, run.latencyMs);
        const errorRate = mergeRate(existing?._errorRate, run.status === "healthy");

        return {
          statusCodeDistribution: statusCodes,
          avgResponseTimeMs: responseTime.avg,
          errorRate: 100 - errorRate.rate, // Convert success rate to error rate
          _responseTime: responseTime,
          _errorRate: errorRate,
        };
      },

      // ... createClient and other methods
    };

Retention Job
A daily background job manages the data lifecycle:
Stage 1: Raw → Hourly
- Identifies raw runs older than rawRetentionDays
- For each run, calls strategy.mergeResult(existing, run) to incrementally aggregate
- Upserts into health_check_aggregates with bucketSize: 'hourly'
- Deletes processed raw runs
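The Stage 1 steps above can be sketched in memory as a fold over expired runs, keyed by the start of each run's hour. This is an illustration under stated assumptions, not the actual job: the RawRun shape, the rollUpRawRuns and hourBucketStart names, and the use of UTC hour boundaries are all hypothetical, and database upserts and deletes are replaced by returned values.

```typescript
// Hypothetical in-memory stand-in for a row of health_check_runs.
interface RawRun<TResult> {
  timestamp: Date;
  status: string;
  latencyMs?: number;
  metadata?: TResult;
}

// Same signature as the strategy's mergeResult hook.
type MergeFn<TResult, TAgg> = (
  existing: TAgg | undefined,
  run: { status: string; latencyMs?: number; metadata?: TResult }
) => TAgg;

const HOUR_MS = 60 * 60 * 1000;
const DAY_MS = 24 * HOUR_MS;

/** Truncate a timestamp to the start of its UTC hour (assumed bucket boundary). */
function hourBucketStart(ts: Date): Date {
  return new Date(Math.floor(ts.getTime() / HOUR_MS) * HOUR_MS);
}

function rollUpRawRuns<TResult, TAgg>(
  runs: RawRun<TResult>[],
  rawRetentionDays: number,
  now: Date,
  mergeResult: MergeFn<TResult, TAgg>
): { hourly: Record<string, TAgg>; remaining: RawRun<TResult>[] } {
  const cutoff = now.getTime() - rawRetentionDays * DAY_MS;
  const hourly: Record<string, TAgg> = {};
  const remaining: RawRun<TResult>[] = [];

  for (const run of runs) {
    if (run.timestamp.getTime() >= cutoff) {
      remaining.push(run); // still inside the raw retention window
      continue;
    }
    // Expired run: merge it into its hourly bucket (the "upsert" step).
    const key = hourBucketStart(run.timestamp).toISOString();
    hourly[key] = mergeResult(hourly[key], run);
  }

  // Returning only the unexpired runs models the final delete step.
  return { hourly, remaining };
}
```

Because mergeResult receives the existing bucket value (or undefined for a new bucket), the same hook serves both the first run in an hour and every later one.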
Stage 2: Hourly → Daily
- Identifies hourly aggregates older than hourlyRetentionDays
- Groups hourly buckets by day
- Calculates weighted-average latency: SUM(avg * runCount) / SUM(runCount)
- Keeps global min/max latency across all hourly buckets
- Inserts daily aggregate (note: P95 and aggregatedResult are dropped)
- Deletes processed hourly aggregates
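The weighting matters because hourly buckets rarely contain the same number of runs: an hour averaging 100 ms over 10 runs and an hour averaging 200 ms over 30 runs combine to 175 ms, not 150 ms. The sketch below illustrates the Stage 2 rollup under assumptions: the HourlyAggregate and DailyAggregate shapes and the rollUpToDaily name are hypothetical, days are split on UTC boundaries, and buckets without latency data are excluded from the weighted average.

```typescript
// Hypothetical row shape for an hourly aggregate (core metrics only).
interface HourlyAggregate {
  bucketStart: Date;
  runCount: number;
  avgLatencyMs?: number;
  minLatencyMs?: number;
  maxLatencyMs?: number;
}

// Daily rows keep core metrics; P95 and aggregatedResult are not carried over.
interface DailyAggregate {
  day: string; // UTC date, e.g. "2024-01-01"
  runCount: number;
  avgLatencyMs?: number;
  minLatencyMs?: number;
  maxLatencyMs?: number;
}

function definedNumbers(values: Array<number | undefined>): number[] {
  return values.filter((v): v is number => v !== undefined);
}

function rollUpToDaily(hourly: HourlyAggregate[]): DailyAggregate[] {
  // Group hourly buckets by UTC calendar day.
  const byDay: Record<string, HourlyAggregate[]> = {};
  for (const h of hourly) {
    const day = h.bucketStart.toISOString().slice(0, 10);
    const group = byDay[day] ?? [];
    group.push(h);
    byDay[day] = group;
  }

  return Object.keys(byDay).sort().map((day) => {
    const buckets = byDay[day];
    const daily: DailyAggregate = {
      day,
      runCount: buckets.reduce((acc, b) => acc + b.runCount, 0),
    };

    // Weighted average: SUM(avg * runCount) / SUM(runCount).
    let weightedSum = 0;
    let weight = 0;
    for (const b of buckets) {
      if (b.avgLatencyMs !== undefined) {
        weightedSum += b.avgLatencyMs * b.runCount;
        weight += b.runCount;
      }
    }
    if (weight > 0) daily.avgLatencyMs = weightedSum / weight;

    // Global min/max across all hourly buckets of the day.
    const mins = definedNumbers(buckets.map((b) => b.minLatencyMs));
    const maxs = definedNumbers(buckets.map((b) => b.maxLatencyMs));
    if (mins.length > 0) daily.minLatencyMs = Math.min(...mins);
    if (maxs.length > 0) daily.maxLatencyMs = Math.max(...maxs);
    return daily;
  });
}
```

P95 is omitted from the daily shape because a percentile cannot be recovered exactly from per-hour percentiles, which is consistent with it being dropped at this stage.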
Stage 3: Expired Cleanup
Deletes daily aggregates older than dailyRetentionDays.
Configurable Retention
Retention can be customized per system-assignment via the RetentionConfig schema:

    const RetentionConfigSchema = z.object({
      /** Days to keep raw run data (1-30, default: 7) */
      rawRetentionDays: z.number().int().min(1).max(30).default(7),
      /** Days to keep hourly aggregates (7-90, default: 30) */
      hourlyRetentionDays: z.number().int().min(7).max(90).default(30),
      /** Days to keep daily aggregates (30-1095, default: 365) */
      dailyRetentionDays: z.number().int().min(30).max(1095).default(365),
    });

The platform enforces a strict hierarchy: rawRetentionDays < hourlyRetentionDays < dailyRetentionDays. Violations result in a BAD_REQUEST error.
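The per-field ranges and the strict ordering are separate checks: a value can be within its own range and still violate the hierarchy (for example, rawRetentionDays of 30 with the default hourlyRetentionDays of 30). The dependency-free sketch below illustrates both checks. The resolveRetentionConfig and assertIntInRange names and the BadRequestError class are hypothetical, and applying defaults before the hierarchy check is an assumption about ordering.

```typescript
interface RetentionConfig {
  rawRetentionDays: number;
  hourlyRetentionDays: number;
  dailyRetentionDays: number;
}

const DEFAULT_RETENTION: RetentionConfig = {
  rawRetentionDays: 7,
  hourlyRetentionDays: 30,
  dailyRetentionDays: 365,
};

// Hypothetical error type standing in for the platform's BAD_REQUEST response.
class BadRequestError extends Error {
  readonly code = "BAD_REQUEST";
}

function assertIntInRange(name: string, value: number, min: number, max: number): void {
  if (Math.floor(value) !== value || value < min || value > max) {
    throw new BadRequestError(`${name} must be an integer between ${min} and ${max}`);
  }
}

/** Fill in defaults (null resets everything), then validate ranges and ordering. */
function resolveRetentionConfig(input: Partial<RetentionConfig> | null): RetentionConfig {
  const config: RetentionConfig = {
    rawRetentionDays: input?.rawRetentionDays ?? DEFAULT_RETENTION.rawRetentionDays,
    hourlyRetentionDays: input?.hourlyRetentionDays ?? DEFAULT_RETENTION.hourlyRetentionDays,
    dailyRetentionDays: input?.dailyRetentionDays ?? DEFAULT_RETENTION.dailyRetentionDays,
  };

  assertIntInRange("rawRetentionDays", config.rawRetentionDays, 1, 30);
  assertIntInRange("hourlyRetentionDays", config.hourlyRetentionDays, 7, 90);
  assertIntInRange("dailyRetentionDays", config.dailyRetentionDays, 30, 1095);

  // Strict hierarchy: raw < hourly < daily.
  const ordered =
    config.rawRetentionDays < config.hourlyRetentionDays &&
    config.hourlyRetentionDays < config.dailyRetentionDays;
  if (!ordered) {
    throw new BadRequestError(
      "rawRetentionDays < hourlyRetentionDays < dailyRetentionDays must hold"
    );
  }
  return config;
}
```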
RPC Endpoints
| Endpoint | Description |
|---|---|
| getRetentionConfig | Get current retention settings for an assignment |
| updateRetentionConfig | Update retention settings (pass null to reset to defaults) |
On-the-Fly Aggregation
For unified chart rendering, the system uses cross-tier aggregation: it queries raw, hourly, and daily storage and merges the results by priority:
- Target Points: Frontend requests a fixed number of data points (e.g., 500)
- Dynamic Bucket Calculation: (endDate - startDate) / targetPoints determines bucket interval
- Tier Selection: Automatically queries the appropriate tier(s) based on interval
- Priority Merge: Raw data takes priority over hourly, which takes priority over daily
- Re-aggregation: Merged data is re-aggregated to match target bucket interval
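The bucket calculation, priority merge, and re-aggregation steps above can be sketched as follows. This is one possible reading, not the service's actual logic: the TierBucket and ChartPoint shapes and the function names are hypothetical, and the rule that a target bucket uses only the highest-priority tier present in it is an assumption about how "priority" is applied.

```typescript
type Tier = "raw" | "hourly" | "daily";

// Hypothetical bucket shape; only two counters are carried to keep the sketch short.
interface TierBucket {
  tier: Tier;
  bucketStart: Date;
  runCount: number;
  healthyCount: number;
}

interface ChartPoint {
  bucketStart: Date;
  runCount: number;
  healthyCount: number;
}

// Lower number wins: raw over hourly over daily.
const TIER_PRIORITY: Record<Tier, number> = { raw: 0, hourly: 1, daily: 2 };

/** Bucket interval in milliseconds: (endDate - startDate) / targetPoints. */
function bucketIntervalMs(startDate: Date, endDate: Date, targetPoints = 500): number {
  return (endDate.getTime() - startDate.getTime()) / targetPoints;
}

function mergeAndReaggregate(
  buckets: TierBucket[],
  startDate: Date,
  intervalMs: number
): ChartPoint[] {
  // Group source buckets by the index of the target bucket they start in.
  const grouped: Record<string, TierBucket[]> = {};
  for (const b of buckets) {
    const index = Math.floor((b.bucketStart.getTime() - startDate.getTime()) / intervalMs);
    const group = grouped[index] ?? [];
    group.push(b);
    grouped[index] = group;
  }

  return Object.keys(grouped)
    .map(Number)
    .sort((a, b) => a - b)
    .map((index) => {
      const group = grouped[index];
      // Assumed priority rule: keep only the highest-priority tier present.
      const best = Math.min(...group.map((b) => TIER_PRIORITY[b.tier]));
      const winners = group.filter((b) => TIER_PRIORITY[b.tier] === best);
      // Re-aggregate the winning entries into one chart point.
      return {
        bucketStart: new Date(startDate.getTime() + index * intervalMs),
        runCount: winners.reduce((acc, b) => acc + b.runCount, 0),
        healthyCount: winners.reduce((acc, b) => acc + b.healthyCount, 0),
      };
    });
}
```

Dropping lower-priority entries within a target bucket avoids double counting when the same runs exist both as raw rows and inside an hourly aggregate.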

    // Service method signature
    async getAggregatedHistory(
      props: {
        systemId: string;
        configurationId: string;
        startDate: Date;
        endDate: Date;
        targetPoints?: number; // Default: 500
      },
      options: { includeAggregatedResult: boolean }
    )

Access Model
Aggregated data access follows the same tiered access model as raw data:
| Endpoint | Access | Returns |
|---|---|---|
| getAggregatedHistory | healthCheckStatusRead | Core metrics only (AggregatedBucketBase) |
| getDetailedAggregatedHistory | healthCheckDetailsRead | Core metrics + aggregatedResult |
Best Practices
1. Always Implement mergeResult
Every strategy must provide a mergeResult implementation. Without it, historical views beyond the raw retention window will lack strategy-specific insights.
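An incremental merge needs running state, not just the displayed value: an average cannot be updated from the previous average alone, so the sum and count have to be stored alongside it. The sketch below shows one way such state could look. The sketchMergeAverage and sketchMergeRate functions and their state fields are hypothetical illustrations, not the platform's mergeAverage and mergeRate utilities, whose internals are not shown here. The rate is expressed as a percentage, which is what the HTTP example's `100 - errorRate.rate` conversion implies.

```typescript
// Hypothetical running state for an incrementally merged average.
interface AverageState {
  sum: number;
  count: number;
  avg: number;
}

// Hypothetical running state for a success rate, expressed in percent (0-100).
interface RateState {
  success: number;
  total: number;
  rate: number;
}

function sketchMergeAverage(
  existing: AverageState | undefined,
  value: number | undefined
): AverageState {
  const prev = existing ?? { sum: 0, count: 0, avg: 0 };
  // Runs without a measurement leave the average unchanged.
  if (value === undefined) return prev;
  const sum = prev.sum + value;
  const count = prev.count + 1;
  return { sum, count, avg: sum / count };
}

function sketchMergeRate(existing: RateState | undefined, success: boolean): RateState {
  const prev = existing ?? { success: 0, total: 0, rate: 0 };
  const successCount = prev.success + (success ? 1 : 0);
  const total = prev.total + 1;
  return { success: successCount, total, rate: (successCount / total) * 100 };
}
```

Keeping this state in underscore-prefixed fields, as the HTTP example does with _responseTime and _errorRate, separates it from the fields intended for display.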
2. Keep Aggregated Results Compact
The aggregatedResult is stored as JSONB. Design it to capture essential trends without replicating all raw data:

    // ✅ Good - summary statistics
    {
      statusCodeDistribution: { "200": 95, "500": 5 },
      errorRate: 0.05,
    }

    // ❌ Bad - too detailed for aggregation
    {
      allStatusCodes: [200, 200, 500, 200, ...],
      allErrors: [{ timestamp: ..., message: ... }, ...],
    }

3. Handle Missing Data Gracefully
During daily rollup, aggregatedResult is dropped. Strategy diagrams should therefore handle an undefined aggregatedResult for data older than hourlyRetentionDays.
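One way to handle this is to map a missing aggregatedResult to an explicit gap instead of a zero, so that a chart does not suggest a 0% error rate for daily buckets. The following is a minimal sketch; the AggregatedBucket and ErrorRatePoint shapes and the toErrorRateSeries name are hypothetical, and how a given chart component renders a null value is outside what this document specifies.

```typescript
// Hypothetical bucket shape as a chart might receive it.
interface AggregatedBucket<TAgg> {
  bucketStart: Date;
  runCount: number;
  aggregatedResult?: TAgg; // absent for daily buckets
}

interface ErrorRatePoint {
  x: Date;
  y: number | null; // null marks "no strategy data", not "zero errors"
}

function toErrorRateSeries(
  buckets: AggregatedBucket<{ errorRate: number }>[]
): ErrorRatePoint[] {
  return buckets.map((b) => ({
    x: b.bucketStart,
    // ?? rather than || so that a genuine errorRate of 0 is preserved.
    y: b.aggregatedResult?.errorRate ?? null,
  }));
}
```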
Next Steps
- Custom Chart Components - Build strategy-specific visualizations
- Extension Points - General extension system
- Queue System - Background job infrastructure