# Operational Metrics Reference
Key operational metrics for measuring and communicating service reliability. Each metric includes its formula, interpretation, and industry benchmarks where available.
## Core Reliability Metrics
| Metric | Full Name | Formula | Industry Benchmark |
|---|---|---|---|
| MTTR | Mean Time to Recovery | Total downtime / Number of incidents | < 1 hour for SEV1; < 4 hours for SEV2 |
| MTBF | Mean Time Between Failures | Total uptime / Number of failures | Varies by service tier; 30+ days for mature services |
| MTTA | Mean Time to Acknowledge | Sum(alert time to ack time) / Number of alerts | < 5 minutes for SEV1; < 15 minutes for SEV2 |
| MTTD | Mean Time to Detect | Sum(incident start to first alert) / Number of incidents | < 5 minutes with synthetic monitoring |
| Error Rate | Error Rate (5xx) | (5xx responses / Total responses) * 100 | < 0.1% normal; > 1% warrants investigation |
| Apdex | Application Performance Index | (Satisfied + Tolerating*0.5) / Total samples | 0.94+ Excellent; 0.85-0.93 Good; 0.70-0.84 Fair; < 0.70 Poor |
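The formulas in the table are simple ratios. As a rough illustration (all figures below are hypothetical, not benchmarks), they can be computed directly from incident and request counts:

```python
# Illustrative calculation of the table's formulas (all figures hypothetical).

def mttr_minutes(total_downtime_min: float, incidents: int) -> float:
    """Mean Time to Recovery: total downtime / number of incidents."""
    return total_downtime_min / incidents

def mtbf_days(total_uptime_days: float, failures: int) -> float:
    """Mean Time Between Failures: total uptime / number of failures."""
    return total_uptime_days / failures

def error_rate_pct(errors_5xx: int, total_responses: int) -> float:
    """Error rate: (5xx responses / total responses) * 100."""
    return errors_5xx / total_responses * 100

# Hypothetical quarter: 90 minutes of downtime across 3 incidents,
# 3 failures over 90 days of uptime, 1,200 5xx out of 2,000,000 responses.
print(mttr_minutes(90, 3))                         # 30.0 (minutes)
print(mtbf_days(90, 3))                            # 30.0 (days)
print(round(error_rate_pct(1_200, 2_000_000), 3))  # 0.06 (%, below the 0.1% line)
```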
## Latency Percentiles
Percentile latencies are more meaningful than averages for understanding user experience. An average of 100ms can hide a p99 of 5 seconds, meaning 1% of requests take 5 seconds or longer.
| Percentile | Meaning | Typical Web App Target | API Service Target | When to Use |
|---|---|---|---|---|
| p50 | Median -- half of requests are faster than this | < 200ms | < 50ms | General health indicator; represents typical experience |
| p90 | 90th percentile -- 10% of requests are slower | < 500ms | < 150ms | Useful for capacity planning |
| p95 | 95th percentile -- 5% of requests are slower | < 1000ms | < 300ms | Most common SLO target; good balance of sensitivity |
| p99 | 99th percentile -- 1% of requests are slower | < 2000ms | < 1000ms | Tail latency; reveals worst-case user experience |
| p99.9 | 99.9th percentile -- 0.1% of requests are slower | < 5000ms | < 3000ms | For high-volume services (millions of requests/day) |
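To make the average-versus-tail point concrete, here is a small sketch with made-up latency samples: a mostly fast distribution plus a few multi-second outliers. The nearest-rank percentile helper is one simple convention among several:

```python
# Sketch: how an average can hide tail latency (sample data is made up).
import random
import statistics

random.seed(1)
# 980 fast requests (~50-150ms) plus 20 slow outliers (5000ms each).
latencies = [random.uniform(50, 150) for _ in range(980)] + [5000.0] * 20

def percentile(samples, p):
    """Nearest-rank percentile over sorted samples (p in [0, 100])."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

print(round(statistics.mean(latencies)))  # mean near ~200ms looks healthy
print(round(percentile(latencies, 50)))   # p50: the typical experience
print(round(percentile(latencies, 99)))   # p99: the 5-second tail the mean hides
```

Production systems usually compute percentiles from histograms rather than raw samples, but the ordering shown here (p50 << p99) is the pattern to watch for.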
## Apdex Score Calculation
The Apdex (Application Performance Index) score converts response time measurements into a single number between 0 and 1 that represents user satisfaction. It requires setting a threshold T (e.g., 500ms).
### Classification Rules

| Category | Condition | Example (T = 500ms) |
|---|---|---|
| Satisfied | Response time <= T | <= 500ms |
| Tolerating | Response time > T and <= 4T | 500ms-2000ms |
| Frustrated | Response time > 4T, or request failed | > 2000ms |
### Formula and Example
Apdex = (Satisfied + (Tolerating * 0.5)) / Total Samples
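A minimal sketch of the classification rules and formula, assuming the same T = 500ms threshold (the sample latencies are invented to match the worked example that follows):

```python
# Sketch: Apdex from raw response times, per the classification rules above.

def apdex(response_times_ms, failed, threshold_ms=500):
    """response_times_ms: latencies of successful requests (ms);
    failed: count of failed requests (failures always count as Frustrated)."""
    satisfied = sum(1 for t in response_times_ms if t <= threshold_ms)
    tolerating = sum(1 for t in response_times_ms
                     if threshold_ms < t <= 4 * threshold_ms)
    total = len(response_times_ms) + failed
    return (satisfied + 0.5 * tolerating) / total

# 850 satisfied, 100 tolerating, 40 slow + 10 failed = 50 frustrated:
samples = [100] * 850 + [1000] * 100 + [3000] * 40
print(apdex(samples, failed=10))  # 0.9
```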
Example (T = 500ms, 1000 requests):

- Satisfied (<= 500ms): 850
- Tolerating (500ms-2000ms): 100
- Frustrated (> 2000ms or error): 50

Apdex = (850 + (100 * 0.5)) / 1000 = 900 / 1000 = 0.90 (Good)

## How Metrics Relate
| If This Metric Moves... | Check These | Possible Cause |
|---|---|---|
| MTTR increasing | MTTA, MTTD, error rate | Slow detection, complex root causes, insufficient runbooks |
| MTBF decreasing | Deploy frequency, error rate, change failure rate | Insufficient testing, growing technical debt, scaling issues |
| p99 latency spiking | p50 latency, error rate, CPU/memory saturation | Resource contention, GC pauses, slow queries, noisy neighbors |
| Apdex dropping | p95 latency, error rate, throughput | Degraded performance under load, dependency slowdown, partial outage |
| Error rate climbing | Deploy log, dependency health, latency | Bad deploy, downstream failure, capacity exhaustion |
No single metric tells the full story. Monitor metrics in combination and set alerts based on correlated signals to reduce false positives.
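As one illustration of alerting on correlated signals, a page might fire only when error rate and latency degrade together. The thresholds below are arbitrary placeholders, not recommendations:

```python
# Sketch: alert on correlated signals rather than a single metric.
# Thresholds are illustrative placeholders, not recommendations.

def should_page(error_rate_pct: float, p95_ms: float, baseline_p95_ms: float) -> bool:
    """Page only when error rate AND latency degrade together,
    which filters out false positives from a single noisy signal."""
    error_bad = error_rate_pct > 1.0            # "> 1% warrants investigation"
    latency_bad = p95_ms > 2 * baseline_p95_ms  # p95 doubled vs. baseline
    return error_bad and latency_bad

print(should_page(1.5, 900, 300))  # both degraded -> True
print(should_page(1.5, 310, 300))  # errors only   -> False
```

The trade-off is slower detection of single-signal incidents, so a weaker notification (ticket rather than page) is often kept for each individual threshold.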