When do you get to call an incident over so everyone can get back to work? The answer varies from organization to organization, and really depends on which parts of the work you want to measure and improve. When does “resolution” happen?
Herein lies some of the confusion around the many definitions of MTTR.
MTTR can variously represent, depending on the industry, context, or organization:
- Mean time to resolve - How long it takes for the service to be restored to some acceptable level. That’s the definition we usually work with at PagerDuty, and it marks the end of what we refer to as “an incident”.
- Mean time to respond - How long it takes for a responder to be notified and begin investigating.
- Mean time to repair - How long it takes for the problem to be permanently fixed. This comes from manufacturing and isn’t often used in software operations contexts, but it does pop up from time to time.
Depending on which of these most directly affects your customer experience, any of them is a valid choice for when to mark an incident as resolved.
One of the challenges that trips up organizations dealing with regular incidents is handling the post-incident work. Should the activities following an incident - post-incident review, permanent fixes - be included in how we measure our incident response?
Many organizations will gravitate to the first definition, resolving incidents when the immediate impact is fixed or mitigated in some way. The folks at Box post this definition of resolution in their incident response docs:
“A designation that the customer experience has returned to expected levels based on the results observed during the monitoring period.”
But what if your organization is really good at resolving the immediate impact but very bad about fixing things permanently? Customers might be dealing with degraded experiences for months while teams fight to get resources to implement permanent improvements. Managing incidents towards mean time to repair might make more sense for improving customer experience.
Your organizational definition of MTTR will determine when your team calls an incident done. The mechanics of your underlying tools should be flexible enough to accommodate any of these workflows, so you can train your team in how to best make use of those tools. In PagerDuty, an incident can remain in the acknowledged state for as long as you need it to, though at PagerDuty we generally work towards the first definition of MTTR and mark incidents resolved when the impacted service no longer presents a bad user experience. This can include resolving the incident using rollbacks, hiding broken features, or fixing forward, depending on the teams’ decisions.
With so many different ways to interpret MTTR, the average duration of the time-to-R will vary widely. Teams that hold incidents open until the post-incident review is completed, or until all remediation work is deployed, will have much longer incident durations than teams that resolve incidents as soon as any immediate or catastrophic user experience degradation is over.
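To make the spread concrete, here is a minimal sketch that computes all three interpretations of MTTR over the same two incidents. The `Incident` record, its field names, and the sample timestamps are hypothetical and chosen only for illustration; they are not a PagerDuty schema or real data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class Incident:
    # Hypothetical record; field names are illustrative only.
    opened: datetime      # incident was triggered
    responded: datetime   # a responder began investigating
    restored: datetime    # service returned to an acceptable level
    repaired: datetime    # permanent fix was deployed


def mttr_minutes(incidents, end_field):
    """Mean minutes from `opened` to the chosen end timestamp."""
    return mean(
        (getattr(i, end_field) - i.opened).total_seconds() / 60
        for i in incidents
    )


# Made-up sample data: two incidents opened at the same moment.
start = datetime(2024, 1, 1, 9, 0)
incidents = [
    Incident(start, start + timedelta(minutes=4),
             start + timedelta(minutes=40), start + timedelta(days=3)),
    Incident(start, start + timedelta(minutes=6),
             start + timedelta(minutes=80), start + timedelta(days=11)),
]

respond = mttr_minutes(incidents, "responded")
resolve = mttr_minutes(incidents, "restored")
repair = mttr_minutes(incidents, "repaired")

print(f"Mean time to respond: {respond:.0f} min")   # 5 min
print(f"Mean time to resolve: {resolve:.0f} min")   # 60 min
print(f"Mean time to repair:  {repair:.0f} min")    # 10080 min (7 days)
```

The same two incidents produce an MTTR of five minutes, one hour, or a full week depending on which “R” ends the clock, which is why a number quoted without its definition is hard to interpret.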
Whatever MTTR works best for your organization, it’s most important to be consistent. There’s no “best” MTTR, but if teams use drastically different definitions, comparing them and their incident response improvements becomes very difficult.
Some teams are finding that MTTR isn’t helping them improve their reliability, and are embracing Service Level Objectives (SLOs) as their guiding metrics. PagerDuty’s integrations with SLO tools like Nobl9 help teams stay on top of SLOs and use the right metrics to make decisions about service performance.
PagerDuty’s integrations with other solutions can help your teams manage the incident response workflows they need to engage with.
Customer Service teams utilizing CSOps can manage their workflows in their primary tools and track an MTTR based on tickets they’ve escalated to engineering teams when customers report issues.
Engineering teams using ticketing or kanban-style solutions can open work tasks in those systems for post-incident work, including writing the review and designing any necessary permanent fixes.
There are many ways to work through an incident, and how your team manages incidents and the work that occurs downstream from them should be flexible enough to fit whatever your team requires. PagerDuty’s integrations with more than 700 solutions mean your team can track incidents and the work they create efficiently and effectively.