Mean time between failures (MTBF)

Why it matters

MTTR tells you how fast you recover. Change Failure Rate tells you how risky your deployments are. But neither tells you how often a service fails. A service with excellent MTTR might still be the biggest contributor to aggregate downtime if it fails constantly. MTBF fills this gap and helps distinguish genuinely stable services from those that just recover quickly.

What to track

Metric	What it tells you
MTTR	How fast you recover from failures
MTBF	How often you fail
CFR	How risky your deployments are

Track MTBF alongside MTTR and CFR to get a complete picture of service stability. A low MTBF combined with a low MTTR signals a service that fails frequently but recovers fast, which is not the same as a genuinely stable service.

How Port helps

Port calculates MTBF from the incident and deployment data already in your software catalog and context lake. By tracking the time between incident creation events for each service, Port surfaces which services are genuinely stable versus those that just have fast recovery times. This metric is calculated automatically, with no custom scripting or external tooling needed.
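To make "time between incident creation events" concrete, here is a minimal Python sketch of the underlying arithmetic. It is an illustration only, not Port's implementation: the function name `mtbf` and the sample timestamps are hypothetical, and it assumes MTBF is the plain average of the gaps between consecutive incident creation times for a single service.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Optional, Sequence


def mtbf(incident_created_at: Sequence[datetime]) -> Optional[timedelta]:
    """Average gap between consecutive incident creation times (hypothetical helper)."""
    times = sorted(incident_created_at)
    if len(times) < 2:
        return None  # fewer than two incidents: there is no interval to average
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]
    return timedelta(seconds=mean(gaps))


# Hypothetical incident creation times for one service.
incidents = [
    datetime(2024, 1, 1, 9, 0),
    datetime(2024, 1, 4, 9, 0),   # 3 days after the previous incident
    datetime(2024, 1, 6, 9, 0),   # 2 days
    datetime(2024, 1, 10, 9, 0),  # 4 days
]

print(mtbf(incidents))  # 3 days, 0:00:00
```

A service with fewer than two recorded incidents has no interval to measure, so the sketch returns `None` rather than reporting a misleading value.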

Example scenario

During a quarterly review, an engineering leader notices that the payments service has the best MTTR in the organization: 12 minutes on average. But Port's MTBF data reveals it also has the lowest MTBF: it fails every 3 days on average, contributing more total downtime than any other service. This reframes the investment priority from "improve MTTR" to "reduce failure frequency", a fundamentally different engineering effort.
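The arithmetic behind this scenario can be sketched in a few lines of Python. The 12-minute MTTR and 3-day MTBF come from the scenario. The 90-day quarter and the comparison service (60-minute MTTR, one failure every 45 days) are assumptions added purely for illustration.

```python
QUARTER_DAYS = 90  # assumption: a 90-day quarter


def quarterly_downtime_minutes(mttr_minutes: float, mtbf_days: float,
                               period_days: float = QUARTER_DAYS) -> float:
    """Estimate total downtime as (number of failures) x (average recovery time)."""
    failures = period_days / mtbf_days
    return failures * mttr_minutes


# Payments service from the scenario: fast recovery, frequent failures.
payments = quarterly_downtime_minutes(mttr_minutes=12, mtbf_days=3)

# Hypothetical comparison service: slower recovery, rare failures.
other = quarterly_downtime_minutes(mttr_minutes=60, mtbf_days=45)

print(f"payments: {payments:.0f} min, other: {other:.0f} min")
# payments: 360 min, other: 120 min
```

Under these assumptions, the service with the best MTTR still accumulates three times the downtime of the slower-recovering one, because it fails 30 times per quarter instead of twice.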

Recommended guides

Track and show MTBF for services
