Monitoring for Status Pages
How to set up monitoring that feeds your status page accurately, from synthetic checks to automated incident detection and escalation.
Synthetic Monitoring
Synthetic monitoring probes your service from external locations on a fixed schedule, simulating user requests. It detects outages before real users report them and provides consistent baseline measurements.
HTTP endpoint checks
Send GET requests to critical endpoints every 30-60 seconds. Check for expected status code (200), response body content (e.g., a known string), and response time threshold. Run from at least 3 geographic locations.
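The pass/fail logic for a single probe can be sketched as follows. This is an illustrative sketch, not any vendor's API: `CheckResult` and `evaluate_probe` are hypothetical names, and the defaults mirror the guidance above.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """Outcome of one synthetic HTTP probe (hypothetical structure)."""
    status_code: int   # HTTP status returned by the endpoint
    body: str          # response body text
    elapsed_ms: float  # total request time in milliseconds

def evaluate_probe(result: CheckResult,
                   expected_status: int = 200,
                   expected_substring: str = "ok",
                   max_latency_ms: float = 1000.0) -> bool:
    """A probe passes only if all three conditions hold:
    expected status code, expected body content, and latency budget."""
    return (result.status_code == expected_status
            and expected_substring in result.body
            and result.elapsed_ms <= max_latency_ms)
```

Checking all three conditions matters: a 200 with an error page in the body, or a 200 that took 30 seconds, should still count as a failure.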
Transaction monitoring
Script multi-step user flows (login, create resource, verify). Run every 5-15 minutes. These catch issues that simple ping-style checks miss, such as authentication failures or database connectivity problems.
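A minimal runner for such a scripted flow might look like this: run the steps in order, stop at the first failure, and report which step broke. The function name and step shape are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

def run_transaction(steps: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run scripted steps in order. Return the name of the first
    failing step, or None if the whole flow succeeded. An exception
    in a step counts as a failure of that step."""
    for name, step in steps:
        try:
            if not step():
                return name
        except Exception:
            return name
    return None
```

Knowing *which* step failed (login vs. resource creation vs. verification) is what makes transaction checks more diagnostic than a single endpoint probe.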
DNS and certificate checks
Monitor DNS resolution time and verify TLS certificate validity. Alert when certificates are within 14 days of expiration. DNS propagation issues are a common cause of partial outages that basic HTTP checks can miss.
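The certificate-expiry rule above reduces to a simple date comparison once you have the certificate's `notAfter` timestamp (however you obtain it). A sketch, with the 14-day window from the guidance above:

```python
from datetime import datetime

CERT_WARNING_DAYS = 14  # alert window from the guidance above

def cert_expiry_alert(not_after: datetime, now: datetime) -> bool:
    """True if the certificate expires within the warning window
    (or has already expired)."""
    remaining = not_after - now
    return remaining.days < CERT_WARNING_DAYS
```

An already-expired certificate yields a negative `remaining.days`, so it also triggers the alert, which is the behavior you want.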
Health Check Patterns
Recommended health endpoint
GET /health

```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latencyMs": 3 },
    "cache": { "status": "healthy", "latencyMs": 1 },
    "externalApi": { "status": "degraded", "latencyMs": 450 },
    "diskSpace": { "status": "healthy", "usagePercent": 62 }
  },
  "version": "2.4.1",
  "uptime": "14d 6h 32m"
}
```

Shallow vs deep checks: A shallow check returns 200 if the process is running. A deep check verifies all dependencies (database, cache, external APIs). Use shallow checks for load balancer health and deep checks for status page reporting.
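Rolling the per-dependency results up to a single top-level status needs an explicit policy. One common approach, sketched below under assumed names, is "worst status wins" among dependencies you designate critical; non-critical dependencies (such as an external API) are still reported but cannot drag down the overall status.

```python
# Higher number = worse. Names and rollup policy are illustrative.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(checks: dict, critical: set) -> str:
    """Roll per-dependency results up to one status: the worst
    status among dependencies listed in `critical`."""
    worst = "healthy"
    for name, result in checks.items():
        if name in critical and SEVERITY[result["status"]] > SEVERITY[worst]:
            worst = result["status"]
    return worst
```

This distinction is why a response can report a degraded external API while still being "healthy" overall: the degradation is surfaced for operators without flipping the status page.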
Timeouts: Set a strict timeout on deep health checks (e.g., 5 seconds). A hanging dependency should cause the health check to report "degraded," not hang indefinitely.
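One way to enforce that deadline in Python is to run each dependency check in a worker thread and bound the wait on its result; this is a sketch, and the function names are assumptions, not a library API.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def check_with_timeout(check, timeout_s: float = 5.0) -> str:
    """Run a dependency check with a hard deadline. A hung dependency
    yields "degraded" instead of hanging the whole health endpoint."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return "healthy" if future.result(timeout=timeout_s) else "unhealthy"
    except FutureTimeout:
        return "degraded"
    except Exception:
        return "unhealthy"
    finally:
        # Don't block waiting for a hung check thread to finish.
        pool.shutdown(wait=False)
```

Note that the timed-out check thread keeps running in the background until its underlying call returns; the point is only that the health endpoint stops waiting for it.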
Alerting Thresholds
| Metric | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Error rate (5xx) | > 1% for 5 min | > 5% for 2 min | Rolling 5 min |
| p95 latency | > 500ms for 5 min | > 2000ms for 2 min | Rolling 5 min |
| Availability | < 99.9% over 1 hr | < 99% over 15 min | Rolling 1 hr / 15 min |
| Saturation (CPU/Memory) | > 80% for 10 min | > 95% for 5 min | Rolling 10 min |
| Synthetic check failures | 2 consecutive from 1 location | 2 consecutive from 2+ locations | Per check cycle |
Adjust thresholds based on your service's normal baseline. These are starting points, not universal values. Review and tune quarterly.
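The error-rate row of the table can be sketched as a rolling-window evaluator. This is illustrative (the class name and sample format are assumptions), and for brevity it omits the "sustained for N minutes" gating a production alerter would add on top.

```python
from collections import deque

class ErrorRateMonitor:
    """Rolling-window 5xx error-rate check. Default thresholds follow
    the table above: warning above 1%, critical above 5%."""

    def __init__(self, window_s: float, warn_pct: float = 1.0,
                 crit_pct: float = 5.0):
        self.window_s = window_s
        self.warn_pct = warn_pct
        self.crit_pct = crit_pct
        self.samples = deque()  # (timestamp_s, is_error)

    def record(self, ts: float, status_code: int) -> None:
        """Record one request; drop samples older than the window."""
        self.samples.append((ts, status_code >= 500))
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def level(self) -> str:
        if not self.samples:
            return "ok"
        errors = sum(1 for _, is_err in self.samples if is_err)
        pct = 100.0 * errors / len(self.samples)
        if pct > self.crit_pct:
            return "critical"
        if pct > self.warn_pct:
            return "warning"
        return "ok"
```

The percentage thresholds come from the table; as the text says, treat them as starting points and tune against your own baseline.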
Escalation Policies
| Step | Delay | Action | Status Page Update |
|---|---|---|---|
| 1 | Immediate | Page primary on-call engineer | Auto-create incident (Investigating) |
| 2 | +5 min | Page secondary on-call if no acknowledgment | No change |
| 3 | +15 min | Notify engineering manager | First public update if not already posted |
| 4 | +30 min | Escalate to VP/Director if SEV1 unresolved | Severity may be upgraded |
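The escalation table maps naturally to a delay-ordered policy plus a function that answers "which actions should have fired by now?" A simplified sketch under assumed names, which stops further escalation once the page is acknowledged (real tools also track resolution separately):

```python
# (delay_minutes, action) — mirrors the escalation table above.
ESCALATION_POLICY = [
    (0, "page-primary-oncall"),
    (5, "page-secondary-oncall"),
    (15, "notify-engineering-manager"),
    (30, "escalate-to-vp"),
]

def due_actions(elapsed_min: float, acknowledged: bool) -> list:
    """Actions that should have fired `elapsed_min` minutes into the
    incident. Acknowledgment halts further escalation steps."""
    if acknowledged:
        return [action for delay, action in ESCALATION_POLICY if delay == 0]
    return [action for delay, action in ESCALATION_POLICY
            if delay <= elapsed_min]
```

Keeping the policy as data rather than code makes it easy to review and to diff when on-call rotations or timings change.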
Monitoring Tool Comparison
| Tool | Type | Status Page Integration | Pricing Model | Strengths |
|---|---|---|---|---|
| Datadog | Full-stack observability | Native integration with Statuspage.io; API-driven updates | Per host/month | Deep APM, log correlation, 600+ integrations |
| PagerDuty | Incident management | Built-in status page; auto-updates from incidents | Per user/month | Mature escalation policies, wide ecosystem |
| OpsGenie | Alert management | Statuspage.io integration (both Atlassian); webhook-based | Per user/month | Jira/Confluence integration, flexible routing |
| Grafana OnCall | On-call management | Webhook integration; pairs with Grafana dashboards | Free (OSS) / per user (Cloud) | Open source, Grafana ecosystem, Terraform support |