Monitoring for Status Pages

How to set up monitoring that feeds your status page accurately, from synthetic checks to automated incident detection and escalation.

Synthetic Monitoring

Synthetic monitoring probes your service from external locations on a fixed schedule, simulating user requests. It detects outages before real users report them and provides consistent baseline measurements.

HTTP endpoint checks

Send GET requests to critical endpoints every 30-60 seconds. Check for expected status code (200), response body content (e.g., a known string), and response time threshold. Run from at least 3 geographic locations.
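A single probe of this kind can be sketched as below. The evaluator is split out from the network call so the pass/fail logic is testable on its own; the expected body text and latency budget are illustrative assumptions, not fixed values.

```python
import time
import urllib.request

def evaluate_check(status_code, body, elapsed_ms,
                   expected_status=200,
                   expected_text="ok",
                   max_latency_ms=1000):
    """Return (passed, reasons) for one synthetic probe result."""
    reasons = []
    if status_code != expected_status:
        reasons.append(f"status {status_code} != {expected_status}")
    if expected_text not in body:
        reasons.append("expected body text missing")
    if elapsed_ms > max_latency_ms:
        reasons.append(f"latency {elapsed_ms:.0f}ms > {max_latency_ms}ms")
    return (not reasons, reasons)

def run_check(url, timeout=10, **kwargs):
    """Probe `url` once, then evaluate status, body, and latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
            elapsed_ms = (time.monotonic() - start) * 1000
            return evaluate_check(resp.status, body, elapsed_ms, **kwargs)
    except Exception as exc:
        # Network errors and timeouts count as failed checks.
        return (False, [f"request failed: {exc}"])
```

A scheduler (cron, or the probe runner in your monitoring tool) would call `run_check` every 30-60 seconds from each location and report the result.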

Transaction monitoring

Script multi-step user flows (login, create resource, verify). Run every 5-15 minutes. These catch issues that simple ping-style checks miss, such as authentication failures or database connectivity problems.
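A minimal runner for such a flow can look like the sketch below: each step is a named callable that raises on failure, and the flow stops at the first failing step so later steps don't mask the root cause. The step names and stub bodies are hypothetical placeholders for real HTTP calls.

```python
def run_transaction(steps):
    """Run named steps in order; stop at the first failure.

    Returns (passed, results) where results maps step name to
    "ok" or an error description.
    """
    results = {}
    for name, step in steps:
        try:
            step()
            results[name] = "ok"
        except Exception as exc:
            results[name] = f"failed: {exc}"
            return (False, results)
    return (True, results)

# Example flow; the lambdas stand in for real API requests.
flow = [
    ("login", lambda: None),
    ("create_resource", lambda: None),
    ("verify", lambda: None),
]
```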

DNS and certificate checks

Monitor DNS resolution time and verify TLS certificate validity. Alert when certificates are within 14 days of expiration. DNS propagation issues are a common cause of partial outages that basic HTTP checks can miss.
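The 14-day expiry check can be sketched with the standard library: `ssl.getpeercert()` reports `notAfter` in the `"%b %d %H:%M:%S %Y GMT"` format that `ssl.cert_time_to_seconds` parses. The `should_alert` helper and its threshold default are assumptions mirroring the guidance above.

```python
import ssl
import time

def days_until_expiry(not_after, now=None):
    """Days until a certificate's notAfter timestamp (may be negative)."""
    expires = ssl.cert_time_to_seconds(not_after)
    now = time.time() if now is None else now
    return (expires - now) / 86400

def should_alert(not_after, threshold_days=14, now=None):
    """True when the certificate is within the alert window."""
    return days_until_expiry(not_after, now) <= threshold_days
```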

Health Check Patterns

Recommended health endpoint

GET /health

{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latencyMs": 3 },
    "cache": { "status": "healthy", "latencyMs": 1 },
    "externalApi": { "status": "degraded", "latencyMs": 450 },
    "diskSpace": { "status": "healthy", "usagePercent": 62 }
  },
  "version": "2.4.1",
  "uptime": "14d 6h 32m"
}
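The top-level "status" field is typically the worst status across the per-dependency checks, which is why the degraded external API above surfaces at the top level. A minimal aggregation sketch (the three-level severity ordering is an assumption):

```python
# Severity ordering for dependency statuses; worst wins.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(checks):
    """Return the worst status across all dependency checks."""
    worst = "healthy"
    for check in checks.values():
        if SEVERITY[check["status"]] > SEVERITY[worst]:
            worst = check["status"]
    return worst
```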

Shallow vs deep checks: A shallow check returns 200 if the process is running. A deep check verifies all dependencies (database, cache, external APIs). Use shallow checks for load balancer health and deep checks for status page reporting.

Timeouts: Set a strict timeout on deep health checks (e.g., 5 seconds). A hanging dependency should make the health check return "degraded," not hang indefinitely.
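One way to enforce that deadline is to run each dependency probe in a worker thread and bound the wait, as sketched below. The 5-second default and the probe signature are assumptions; note the hung worker thread itself lingers until the probe returns, so probes should also carry their own client-side timeouts.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def check_dependency(probe, timeout_s=5.0):
    """Run `probe()` with a hard deadline; return a status dict."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(probe)
    try:
        future.result(timeout=timeout_s)
        return {"status": "healthy"}
    except FutureTimeout:
        # Deadline exceeded: report degraded instead of hanging.
        return {"status": "degraded", "error": "timed out"}
    except Exception as exc:
        # Probe raised (connection refused, auth error, etc.).
        return {"status": "degraded", "error": str(exc)}
    finally:
        # Don't block the health endpoint waiting on a hung probe thread.
        pool.shutdown(wait=False)
```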

Alerting Thresholds

| Metric | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Error rate (5xx) | > 1% for 5 min | > 5% for 2 min | Rolling 5 min |
| p95 latency | > 500ms for 5 min | > 2000ms for 2 min | Rolling 5 min |
| Availability | < 99.9% over 1 hr | < 99% over 15 min | Rolling 1 hr / 15 min |
| Saturation (CPU/Memory) | > 80% for 10 min | > 95% for 5 min | Rolling 10 min |
| Synthetic check failures | 2 consecutive from 1 location | 2 consecutive from 2+ locations | Per check cycle |

Adjust thresholds based on your service's normal baseline. These are starting points, not universal values. Review and tune quarterly.
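Evaluating a "> X% for N min" rule amounts to checking that every sample in the rolling window exceeded the threshold. A sketch for the error-rate row, using the thresholds above; the `(timestamp, rate)` sample format is an assumption:

```python
def classify_error_rate(samples, now, warn=(0.01, 300), crit=(0.05, 120)):
    """Return "critical", "warning", or "ok" for (timestamp, rate) samples.

    A level fires only if every sample inside its window exceeds the
    threshold, i.e. the rate stayed elevated for the full duration.
    """
    for level, (threshold, window_s) in (("critical", crit), ("warning", warn)):
        recent = [rate for ts, rate in samples if now - ts <= window_s]
        if recent and all(rate > threshold for rate in recent):
            return level
    return "ok"
```

Critical is checked first so a sustained 6% error rate pages immediately rather than waiting out the longer warning window.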

Escalation Policies

| Step | Delay | Action | Status Page Update |
|---|---|---|---|
| 1 | Immediate | Page primary on-call engineer | Auto-create incident (Investigating) |
| 2 | +5 min | Page secondary on-call if no acknowledgment | No change |
| 3 | +15 min | Notify engineering manager | First public update if not already posted |
| 4 | +30 min | Escalate to VP/Director if SEV1 unresolved | Severity may be upgraded |
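The policy above reduces to a small state function: given minutes since the first page and whether anyone acknowledged, return the current escalation step. This simplified sketch assumes steps 3 and 4 fire on elapsed time alone (the table's "if SEV1 unresolved" condition is collapsed into the caller's decision to keep escalating).

```python
def escalation_step(minutes_elapsed, acknowledged):
    """Return the highest escalation step reached so far (1-4)."""
    step = 1  # page primary on-call immediately
    if minutes_elapsed >= 5 and not acknowledged:
        step = 2  # page secondary on-call
    if minutes_elapsed >= 15:
        step = 3  # notify engineering manager
    if minutes_elapsed >= 30:
        step = 4  # escalate to VP/Director
    return step
```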

Monitoring Tool Comparison

| Tool | Type | Status Page Integration | Pricing Model | Strengths |
|---|---|---|---|---|
| Datadog | Full-stack observability | Native integration with Statuspage.io; API-driven updates | Per host/month | Deep APM, log correlation, 600+ integrations |
| PagerDuty | Incident management | Built-in status page; auto-updates from incidents | Per user/month | Mature escalation policies, wide ecosystem |
| OpsGenie | Alert management | Statuspage.io integration (both Atlassian); webhook-based | Per user/month | Jira/Confluence integration, flexible routing |
| Grafana OnCall | On-call management | Webhook integration; pairs with Grafana dashboards | Free (OSS) / per user (Cloud) | Open source, Grafana ecosystem, Terraform support |