# Operational Metrics Reference
Key operational metrics for measuring and communicating service reliability. Each metric includes its formula, interpretation, and industry benchmarks where available.
## Core Reliability Metrics
| Metric | Full Name | Formula | Industry Benchmark |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Total downtime / Number of incidents | < 1 hour for SEV1; < 4 hours for SEV2 |
| MTBF | Mean Time Between Failures | Total uptime / Number of failures | Varies by service tier; 30+ days for mature services |
| MTTA | Mean Time to Acknowledge | Sum(alert time to ack time) / Number of alerts | < 5 minutes for SEV1; < 15 minutes for SEV2 |
| MTTD | Mean Time to Detect | Sum(incident start to first alert) / Number of incidents | < 5 minutes with synthetic monitoring |
| Error Rate | Error Rate (5xx) | (5xx responses / Total responses) * 100 | < 0.1% normal; > 1% warrants investigation |
| Apdex | Application Performance Index | (Satisfied + Tolerating*0.5) / Total samples | 0.94+ Excellent; 0.85-0.93 Good; 0.70-0.84 Fair; < 0.70 Poor |
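The formulas in the table are simple ratios. As a rough illustration (all figures below are hypothetical, not benchmarks), they can be computed directly from incident and request counts:

```python
# Illustrative calculation of the table's formulas (all figures hypothetical).

def mttr_minutes(total_downtime_min: float, incidents: int) -> float:
    """Mean Time to Recovery: total downtime / number of incidents."""
    return total_downtime_min / incidents

def mtbf_days(total_uptime_days: float, failures: int) -> float:
    """Mean Time Between Failures: total uptime / number of failures."""
    return total_uptime_days / failures

def error_rate_pct(errors_5xx: int, total_responses: int) -> float:
    """Error rate: (5xx responses / total responses) * 100."""
    return errors_5xx / total_responses * 100

# Hypothetical quarter: 90 minutes of downtime across 3 incidents,
# 3 failures over 90 days of uptime, 1,200 5xx out of 2,000,000 responses.
print(mttr_minutes(90, 3))                         # 30.0 (minutes)
print(mtbf_days(90, 3))                            # 30.0 (days)
print(round(error_rate_pct(1_200, 2_000_000), 3))  # 0.06 (%, below the 0.1% line)
```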
## Latency Percentiles
Percentile latencies are more meaningful than averages for understanding user experience. An average of 100ms can hide a p99 of 5 seconds, meaning 1% of requests take 5 seconds or longer.
| Percentile | Meaning | Typical Web App Target | API Service Target | When to Use |
|---|---|---|---|---|
| p50 | Median -- half of requests are faster than this | < 200ms | < 50ms | General health indicator; represents typical experience |
| p90 | 90th percentile -- 10% of requests are slower | < 500ms | < 150ms | Useful for capacity planning |
| p95 | 95th percentile -- 5% of requests are slower | < 1000ms | < 300ms | Most common SLO target; good balance of sensitivity |
| p99 | 99th percentile -- 1% of requests are slower | < 2000ms | < 1000ms | Tail latency; reveals worst-case user experience |
| p99.9 | 99.9th percentile -- 0.1% of requests are slower | < 5000ms | < 3000ms | For high-volume services (millions of requests/day) |
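To make the average-versus-tail point concrete, here is a small sketch with made-up latency samples: a mostly fast distribution plus a few multi-second outliers. The nearest-rank percentile helper is one simple convention among several:

```python
# Sketch: how an average can hide tail latency (sample data is made up).
import random
import statistics

random.seed(1)
# 980 fast requests (~50-150ms) plus 20 slow outliers (5000ms each).
latencies = [random.uniform(50, 150) for _ in range(980)] + [5000.0] * 20

def percentile(samples, p):
    """Nearest-rank percentile over sorted samples (p in [0, 100])."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

print(round(statistics.mean(latencies)))  # mean near ~200ms looks healthy
print(round(percentile(latencies, 50)))   # p50: the typical experience
print(round(percentile(latencies, 99)))   # p99: the 5-second tail the mean hides
```

Production systems usually compute percentiles from histograms rather than raw samples, but the ordering shown here (p50 << p99) is the pattern to watch for.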
## Apdex Score Calculation
The Apdex (Application Performance Index) score converts response time measurements into a single number between 0 and 1 that represents user satisfaction. It requires setting a threshold T (e.g., 500ms).
### Classification Rules

| Category | Condition | Example (T = 500ms) |
|---|---|---|
| Satisfied | Response time <= T | <= 500ms |
| Tolerating | Response time > T and <= 4T | 500ms-2000ms |
| Frustrated | Response time > 4T, or request failed | > 2000ms |
### Formula and Example
Apdex = (Satisfied + (Tolerating * 0.5)) / Total Samples
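A minimal sketch of the classification rules and formula, assuming the same T = 500ms threshold (the sample latencies are invented to match the worked example that follows):

```python
# Sketch: Apdex from raw response times, per the classification rules above.

def apdex(response_times_ms, failed, threshold_ms=500):
    """response_times_ms: latencies of successful requests (ms);
    failed: count of failed requests (failures always count as Frustrated)."""
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms
                     if threshold_ms < t <= 4 * threshold_ms)
    total = len(response_times_ms) + failed
    return (satisfied + 0.5 * tolerating) / total

# 850 satisfied, 100 tolerating, 40 slow + 10 failed = 50 frustrated:
samples = [100] * 850 + [1000] * 100 + [3000] * 40
print(apdex(samples, failed=10))  # 0.9
```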
Example (T = 500ms, 1000 requests):

- Satisfied (<= 500ms): 850
- Tolerating (500ms-2000ms): 100
- Frustrated (> 2000ms or error): 50

Apdex = (850 + (100 * 0.5)) / 1000 = 900 / 1000 = 0.90 (Good)

## How Metrics Relate
| If This Metric Moves... | Check These | Possible Cause |
|---|---|---|
| MTTR increasing | MTTA, MTTD, error rate | Slow detection, complex root causes, insufficient runbooks |
| MTBF decreasing | Deploy frequency, error rate, change failure rate | Insufficient testing, growing technical debt, scaling issues |
| p99 latency spiking | p50 latency, error rate, CPU/memory saturation | Resource contention, GC pauses, slow queries, noisy neighbors |
| Apdex dropping | p95 latency, error rate, throughput | Degraded performance under load, dependency slowdown, partial outage |
| Error rate climbing | Deploy log, dependency health, latency | Bad deploy, downstream failure, capacity exhaustion |
No single metric tells the full story. Monitor metrics in combination and set alerts based on correlated signals to reduce false positives.
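As one illustration of alerting on correlated signals, a page might fire only when error rate and latency degrade together. The thresholds below are arbitrary placeholders, not recommendations:

```python
# Sketch: alert on correlated signals rather than a single metric.
# Thresholds are illustrative placeholders, not recommendations.

def should_page(error_rate_pct: float, p95_ms: float, baseline_p95_ms: float) -> bool:
    """Page only when error rate AND latency degrade together,
    which filters out false positives from a single noisy signal."""
    error_bad = error_rate_pct > 1.0            # "> 1% warrants investigation"
    latency_bad = p95_ms > 2 * baseline_p95_ms  # p95 doubled vs. baseline
    return error_bad and latency_bad

print(should_page(1.5, 900, 300))  # both degraded -> True
print(should_page(1.5, 310, 300))  # errors only   -> False
```

The trade-off is slower detection of single-signal incidents, so a weaker notification (ticket rather than page) is often kept for each individual threshold.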