Monitoring for Status Pages
How to set up monitoring that feeds your status page accurately, from synthetic checks to automated incident detection and escalation.
Synthetic Monitoring
Synthetic monitoring probes your service from external locations on a fixed schedule, simulating user requests. It detects outages before real users report them and provides consistent baseline measurements.
HTTP endpoint checks
Send GET requests to critical endpoints every 30-60 seconds. Check for expected status code (200), response body content (e.g., a known string), and response time threshold. Run from at least 3 geographic locations.
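The pass/fail logic for a single probe can be sketched as follows. This is an illustrative sketch, not any vendor's API: `CheckResult` and `evaluate_probe` are hypothetical names, and the defaults mirror the guidance above.

```python
from dataclasses import dataclass

@dataclass
class CheckResult:
    """Outcome of one synthetic HTTP probe (hypothetical structure)."""
    status_code: int   # HTTP status returned by the endpoint
    body: str          # response body text
    elapsed_ms: float  # total request time in milliseconds

def evaluate_probe(result: CheckResult,
                   expected_status: int = 200,
                   expected_substring: str = "ok",
                   max_latency_ms: float = 1000.0) -> bool:
    """A probe passes only if all three conditions hold:
    expected status code, expected body content, and latency budget."""
    return (result.status_code == expected_status
            and expected_substring in result.body
            and result.elapsed_ms <= max_latency_ms)
```

Checking all three conditions matters: a 200 with an error page in the body, or a 200 that took 30 seconds, should still count as a failure.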
Transaction monitoring
Script multi-step user flows (login, create resource, verify). Run every 5-15 minutes. These catch issues that simple ping-style checks miss, such as authentication failures or database connectivity problems.
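A minimal runner for such a scripted flow might look like this: run the steps in order, stop at the first failure, and report which step broke. The function name and step shape are assumptions for illustration.

```python
from typing import Callable, List, Optional, Tuple

def run_transaction(steps: List[Tuple[str, Callable[[], bool]]]) -> Optional[str]:
    """Run scripted steps in order. Return the name of the first
    failing step, or None if the whole flow succeeded. An exception
    in a step counts as a failure of that step."""
    for name, step in steps:
        try:
            if not step():
                return name
        except Exception:
            return name
    return None
```

Knowing *which* step failed (login vs. resource creation vs. verification) is what makes transaction checks more diagnostic than a single endpoint probe.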
DNS and certificate checks
Monitor DNS resolution time and verify TLS certificate validity. Alert when certificates are within 14 days of expiration. DNS propagation issues are a common cause of partial outages that basic HTTP checks can miss.
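The certificate-expiry rule above reduces to a simple date comparison once you have the certificate's `notAfter` timestamp (however you obtain it). A sketch, with the 14-day window from the guidance above:

```python
from datetime import datetime

CERT_WARNING_DAYS = 14  # alert window from the guidance above

def cert_expiry_alert(not_after: datetime, now: datetime) -> bool:
    """True if the certificate expires within the warning window
    (or has already expired)."""
    remaining = not_after - now
    return remaining.days < CERT_WARNING_DAYS
```

An already-expired certificate yields a negative `remaining.days`, so it also triggers the alert, which is the behavior you want.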
Health Check Patterns
Recommended health endpoint
GET /health

```json
{
  "status": "healthy",
  "checks": {
    "database": { "status": "healthy", "latencyMs": 3 },
    "cache": { "status": "healthy", "latencyMs": 1 },
    "externalApi": { "status": "degraded", "latencyMs": 450 },
    "diskSpace": { "status": "healthy", "usagePercent": 62 }
  },
  "version": "2.4.1",
  "uptime": "14d 6h 32m"
}
```

Shallow vs deep checks: A shallow check returns 200 if the process is running. A deep check verifies all dependencies (database, cache, external APIs). Use shallow checks for load balancer health and deep checks for status page reporting.
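Rolling the per-dependency results up to a single top-level status needs an explicit policy. One common approach, sketched below under assumed names, is "worst status wins" among dependencies you designate critical; non-critical dependencies (such as an external API) are still reported but cannot drag down the overall status.

```python
# Higher number = worse. Names and rollup policy are illustrative.
SEVERITY = {"healthy": 0, "degraded": 1, "unhealthy": 2}

def overall_status(checks: dict, critical: set) -> str:
    """Roll per-dependency results up to one status: the worst
    status among dependencies listed in `critical`."""
    worst = "healthy"
    for name, result in checks.items():
        if name in critical and SEVERITY[result["status"]] > SEVERITY[worst]:
            worst = result["status"]
    return worst
```

This distinction is why a response can report a degraded external API while still being "healthy" overall: the degradation is surfaced for operators without flipping the status page.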
Timeouts: Set a strict timeout on deep health checks (e.g., 5 seconds). A hanging dependency should cause the health check to report "degraded," not hang indefinitely.
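One way to enforce that deadline in Python is to run each dependency check in a worker thread and bound the wait on its result; this is a sketch, and the function names are assumptions, not a library API.

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout

def check_with_timeout(check, timeout_s: float = 5.0) -> str:
    """Run a dependency check with a hard deadline. A hung dependency
    yields "degraded" instead of hanging the whole health endpoint."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        return "healthy" if future.result(timeout=timeout_s) else "unhealthy"
    except FutureTimeout:
        return "degraded"
    except Exception:
        return "unhealthy"
    finally:
        # Don't block waiting for a hung check thread to finish.
        pool.shutdown(wait=False)
```

Note that the timed-out check thread keeps running in the background until its underlying call returns; the point is only that the health endpoint stops waiting for it.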
Alerting Thresholds
| Metric | Warning | Critical | Evaluation Window |
|---|---|---|---|
| Error rate (5xx) | > 1% for 5 min | > 5% for 2 min | Rolling 5 min |
| p95 latency | > 500ms for 5 min | > 2000ms for 2 min | Rolling 5 min |
| Availability | < 99.9% over 1 hr | < 99% over 15 min | Rolling 1 hr / 15 min |
| Saturation (CPU/Memory) | > 80% for 10 min | > 95% for 5 min | Rolling 10 min |
| Synthetic check failures | 2 consecutive from 1 location | 2 consecutive from 2+ locations | Per check cycle |
Adjust thresholds based on your service's normal baseline. These are starting points, not universal values. Review and tune quarterly.
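The error-rate row of the table can be sketched as a rolling-window evaluator. This is illustrative (the class name and sample format are assumptions), and for brevity it omits the "sustained for N minutes" gating a production alerter would add on top.

```python
from collections import deque

class ErrorRateMonitor:
    """Rolling-window 5xx error-rate check. Default thresholds follow
    the table above: warning above 1%, critical above 5%."""

    def __init__(self, window_s: float, warn_pct: float = 1.0,
                 crit_pct: float = 5.0):
        self.window_s = window_s
        self.warn_pct = warn_pct
        self.crit_pct = crit_pct
        self.samples = deque()  # (timestamp_s, is_error)

    def record(self, ts: float, status_code: int) -> None:
        """Record one request; drop samples older than the window."""
        self.samples.append((ts, status_code >= 500))
        while self.samples and ts - self.samples[0][0] > self.window_s:
            self.samples.popleft()

    def level(self) -> str:
        if not self.samples:
            return "ok"
        errors = sum(1 for _, is_err in self.samples if is_err)
        pct = 100.0 * errors / len(self.samples)
        if pct > self.crit_pct:
            return "critical"
        if pct > self.warn_pct:
            return "warning"
        return "ok"
```

The percentage thresholds come from the table; as the text says, treat them as starting points and tune against your own baseline.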
Escalation Policies
| Step | Delay | Action | Status Page Update |
|---|---|---|---|
| 1 | Immediate | Page primary on-call engineer | Auto-create incident (Investigating) |
| 2 | +5 min | Page secondary on-call if no acknowledgment | No change |
| 3 | +15 min | Notify engineering manager | First public update if not already posted |
| 4 | +30 min | Escalate to VP/Director if SEV1 unresolved | Severity may be upgraded |
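The escalation table maps naturally to a delay-ordered policy plus a function that answers "which actions should have fired by now?" A simplified sketch under assumed names, which stops further escalation once the page is acknowledged (real tools also track resolution separately):

```python
# (delay_minutes, action) — mirrors the escalation table above.
ESCALATION_POLICY = [
    (0, "page-primary-oncall"),
    (5, "page-secondary-oncall"),
    (15, "notify-engineering-manager"),
    (30, "escalate-to-vp"),
]

def due_actions(elapsed_min: float, acknowledged: bool) -> list:
    """Actions that should have fired `elapsed_min` minutes into the
    incident. Acknowledgment halts further escalation steps."""
    if acknowledged:
        return [action for delay, action in ESCALATION_POLICY if delay == 0]
    return [action for delay, action in ESCALATION_POLICY
            if delay <= elapsed_min]
```

Keeping the policy as data rather than code makes it easy to review and to diff when on-call rotations or timings change.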
Monitoring Tool Comparison
| Tool | Type | Status Page Integration | Pricing Model | Strengths |
|---|---|---|---|---|
| Datadog | Full-stack observability | Native integration with Statuspage.io; API-driven updates | Per host/month | Deep APM, log correlation, 600+ integrations |
| PagerDuty | Incident management | Built-in status page; auto-updates from incidents | Per user/month | Mature escalation policies, wide ecosystem |
| OpsGenie | Alert management | Statuspage.io integration (both Atlassian); webhook-based | Per user/month | Jira/Confluence integration, flexible routing |
| Grafana OnCall | On-call management | Webhook integration; pairs with Grafana dashboards | Free (OSS) / per user (Cloud) | Open source, Grafana ecosystem, Terraform support |