Operational Metrics Reference

Key operational metrics for measuring and communicating service reliability. Each metric includes its formula, interpretation, and industry benchmarks where available.

Core Reliability Metrics

Metric | Full Name | Formula | Industry Benchmark
MTTR | Mean Time to Recovery | Total downtime / Number of incidents | < 1 hour for SEV1; < 4 hours for SEV2
MTBF | Mean Time Between Failures | Total uptime / Number of failures | Varies by service tier; 30+ days for mature services
MTTA | Mean Time to Acknowledge | Sum(alert time to ack time) / Number of alerts | < 5 minutes for SEV1; < 15 minutes for SEV2
MTTD | Mean Time to Detect | Sum(incident start to first alert) / Number of incidents | < 5 minutes with synthetic monitoring
Error Rate | Error Rate (5xx) | (5xx responses / Total responses) * 100 | < 0.1% normal; > 1% warrants investigation
Apdex | Application Performance Index | (Satisfied + Tolerating * 0.5) / Total samples | 0.94+ Excellent; 0.85-0.93 Good; 0.70-0.84 Fair; < 0.70 Poor
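
The time-based formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the `Incident` record and its field names are hypothetical, and MTBF assumes that uptime equals the observation window minus total downtime.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; the field names are assumptions."""
    started: datetime       # when the failure began
    alerted: datetime       # when the first alert fired
    acknowledged: datetime  # when on-call acknowledged the alert
    resolved: datetime      # when service was restored


def _mean(deltas):
    """Average a non-empty list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)


def mttr(incidents):
    # Total downtime / number of incidents
    return _mean([i.resolved - i.started for i in incidents])


def mttd(incidents):
    # Sum(incident start to first alert) / number of incidents
    return _mean([i.alerted - i.started for i in incidents])


def mtta(incidents):
    # Sum(alert time to ack time) / number of alerts
    # (assumes one alert per incident)
    return _mean([i.acknowledged - i.alerted for i in incidents])


def mtbf(incidents, window):
    # Total uptime / number of failures, with uptime = window - downtime
    downtime = sum((i.resolved - i.started for i in incidents), timedelta())
    return (window - downtime) / len(incidents)


def error_rate_pct(count_5xx, total_responses):
    # (5xx responses / total responses) * 100
    return count_5xx / total_responses * 100
```

Each function returns a `timedelta` (or a percentage for the error rate), so results can be compared directly against the benchmark column, for example `mttr(incidents) < timedelta(hours=1)` for SEV1.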

Latency Percentiles

Percentile latencies are more meaningful than averages for understanding user experience. An average of 100ms can hide a p99 of 5 seconds, which means 1 in every 100 requests takes 5 seconds or longer.

Percentile | Meaning | Typical Web App Target | API Service Target | When to Use
p50 | Median -- half of requests are faster than this | < 200ms | < 50ms | General health indicator; represents typical experience
p90 | 90th percentile -- 10% of requests are slower | < 500ms | < 150ms | Useful for capacity planning
p95 | 95th percentile -- 5% of requests are slower | < 1000ms | < 300ms | Most common SLO target; good balance of sensitivity
p99 | 99th percentile -- 1% of requests are slower | < 2000ms | < 1000ms | Tail latency; reveals worst-case user experience
p99.9 | 99.9th percentile -- 0.1% of requests are slower | < 5000ms | < 3000ms | For high-volume services (millions of requests/day)
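
To make the gap between averages and percentiles concrete, here is a small sketch using the nearest-rank method. The method is one of several common percentile definitions and is chosen here only for simplicity. The sample latencies are invented for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


# Invented data: 98 fast requests at 50 ms and 2 slow ones at 5000 ms.
latencies_ms = [50] * 98 + [5000] * 2

mean_ms = sum(latencies_ms) / len(latencies_ms)  # 149.0
p50_ms = percentile(latencies_ms, 50)            # 50
p99_ms = percentile(latencies_ms, 99)            # 5000
```

The mean of about 150 ms looks healthy and the median is only 50 ms, yet the p99 shows that the slowest requests take 5 seconds.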

Apdex Score Calculation

The Apdex (Application Performance Index) score converts response time measurements into a single number between 0 and 1 that represents user satisfaction. It requires setting a threshold T (e.g., 500ms).

Classification Rules

Satisfied | Response time <= T (e.g., <= 500ms)
Tolerating | Response time > T and <= 4T (e.g., 500ms-2000ms)
Frustrated | Response time > 4T or request failed (e.g., > 2000ms)

Formula and Example

Apdex = (Satisfied + (Tolerating * 0.5)) / Total Samples

Example (T = 500ms, 1000 requests):

Satisfied (<= 500ms): 850
Tolerating (> 500ms and <= 2000ms): 100
Frustrated (> 2000ms or error): 50

Apdex = (850 + (100 * 0.5)) / 1000
Apdex = 900 / 1000
Apdex = 0.90 (Good)
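
The classification rules and formula can be expressed as a short function. This is a sketch under stated assumptions: the input shape (pairs of response time and success flag) and the function names are hypothetical, and the rating cutoffs treat the benchmark bands as lower bounds (so a score between 0.93 and 0.94 is labelled Good).

```python
def apdex(samples, t_ms):
    """Compute Apdex from (response_ms, succeeded) pairs.

    The pair format is an assumption made for this example.
    """
    satisfied = tolerating = total = 0
    for response_ms, succeeded in samples:
        total += 1
        if not succeeded:
            continue  # failed requests count as frustrated
        if response_ms <= t_ms:
            satisfied += 1
        elif response_ms <= 4 * t_ms:
            tolerating += 1
        # anything slower than 4T is frustrated and adds nothing
    if total == 0:
        raise ValueError("no samples")
    return (satisfied + 0.5 * tolerating) / total


def apdex_rating(score):
    """Map a score onto the rating bands from the benchmark table."""
    if score >= 0.94:
        return "Excellent"
    if score >= 0.85:
        return "Good"
    if score >= 0.70:
        return "Fair"
    return "Poor"
```

Feeding in 850 satisfied, 100 tolerating, and 50 frustrated samples with T = 500 ms reproduces the worked example: a score of 0.90, rated Good.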

How Metrics Relate

If This Metric Moves... | Check These | Possible Cause
MTTR increasing | MTTA, MTTD, error rate | Slow detection, complex root causes, insufficient runbooks
MTBF decreasing | Deploy frequency, error rate, change failure rate | Insufficient testing, growing technical debt, scaling issues
p99 latency spiking | p50 latency, error rate, CPU/memory saturation | Resource contention, GC pauses, slow queries, noisy neighbors
Apdex dropping | p95 latency, error rate, throughput | Degraded performance under load, dependency slowdown, partial outage
Error rate climbing | Deploy log, dependency health, latency | Bad deploy, downstream failure, capacity exhaustion

No single metric tells the full story. Monitor metrics in combination and set alerts based on correlated signals to reduce false positives.
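
One possible way to alert on correlated signals is to page only when at least two metrics breach their thresholds in the same evaluation window. The sketch below is illustrative: the `Snapshot` type, the `should_page` function, and the two-of-three rule are assumptions, and the default thresholds are borrowed from the benchmark tables (1% error rate, 1000 ms API p99, 0.85 Apdex).

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Hypothetical metrics for one evaluation window."""
    error_rate_pct: float
    p99_ms: float
    apdex: float


def should_page(s, max_error_pct=1.0, max_p99_ms=1000.0, min_apdex=0.85):
    """Return True when two or more signals breach their thresholds."""
    breaches = [
        s.error_rate_pct > max_error_pct,
        s.p99_ms > max_p99_ms,
        s.apdex < min_apdex,
    ]
    return sum(breaches) >= 2
```

With this rule, a lone p99 spike (for example, a single slow batch job) does not page anyone, while a p99 spike accompanied by a rising error rate does.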