MTTR Data Can Underestimate or Overestimate Incident Management Capabilities

What is MTTR in incident management?

Mean time to resolution (MTTR) is a crucial service-level metric for incident management teams. It measures the average time between when an incident is reported and when it is fully resolved. However, MTTR also has some important limitations. For example, purely time-based methods of calculating MTTR exclude the scope-based factors that affect resolution time.

Scoping is a technique used to evaluate, isolate, and assess the resource costs associated with an incident. A common scoping technique is to identify repetitive incidents with similar characteristics and estimate their cost.

It’s also useful for comparing the financial impact of resolving similar incidents. Beyond that, scope-based incident management metrics are important in and of themselves.

This is because scope and incident frequency are strongly correlated. In short, the frequency of incidents at any given site will increase as an organization’s disaster recovery plans grow more comprehensive.

For example, consider two organizations with equally complex IT frameworks. One company is primarily focused on providing 24x7x365 access to its customers, while the other company’s primary concern is ensuring that its infrastructure receives the necessary maintenance. These two companies will end up with vastly different outage frequency figures.

The fact is that official MTTR statistics are not always reflective of reality. MTTR is typically calculated by dividing the total time spent resolving incidents by the total number of incidents reported. In principle, this allows an “apples to apples” comparison between incident management teams.
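The calculation described above can be sketched in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function name `mean_time_to_resolution` and the sample timestamps are assumptions made up for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_resolution(incidents):
    """Return MTTR as a timedelta: total resolution time / incident count.

    `incidents` is a list of (reported_at, resolved_at) datetime pairs.
    """
    if not incidents:
        raise ValueError("MTTR is undefined for zero incidents")
    total = sum((resolved - reported for reported, resolved in incidents), timedelta())
    return total / len(incidents)


# Hypothetical sample data: three incidents taking 2, 1, and 6 hours.
incidents = [
    (datetime(2024, 1, 1, 9, 0), datetime(2024, 1, 1, 11, 0)),
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 15, 0)),
    (datetime(2024, 1, 3, 8, 0), datetime(2024, 1, 3, 14, 0)),
]

mttr = mean_time_to_resolution(incidents)
print(mttr)  # 3:00:00
```

Note that the result depends only on elapsed time and the incident count. Nothing about the size or difficulty of each incident enters the calculation.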

However, two incident management teams with different disaster recovery frameworks and priorities will likely have different official MTTR figures. This exposes an important gap in the data collection process and makes it harder to determine which incident management team is actually “best” at resolving incidents.

The Challenge of Comparing Incident Management Teams

Comparison is difficult because every incident management team has its own unique resources and works within the constraints of its own disaster recovery framework. As a result, time-based MTTR metrics can be misleading in certain situations. For example, consider two different IR teams that face the same 50 events for the first time. The first team records and responds to them as 50 separate incidents, while the second team manages them as a single incident.

Assuming both teams take the same total time T to resolve their workload, the second team will have an official MTTR of T (all of the time is attributed to a single incident), while the first team will have an MTTR of T/50. This means that the second team’s average time-to-resolution metric will be 50 times greater than the first team’s.

This tells a misleading story about the ability of these two teams to manage incidents, because both teams are responsible for the same underlying workload. So, which team is doing a better job of managing incidents? After all, both teams have an identical ability to resolve incidents in a timely manner.
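As a rough illustration of how the incident count alone moves the average, the sketch below assumes both teams spend the same total time on the same workload. The figure of 100 hours and the helper name `mttr_hours` are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def mttr_hours(total_hours, incident_count):
    """Average resolution time in hours: total time divided by incident count."""
    return total_hours / incident_count


TOTAL_HOURS = 100.0  # hypothetical: identical total effort for both teams

team_one = mttr_hours(TOTAL_HOURS, 50)  # workload logged as 50 incidents
team_two = mttr_hours(TOTAL_HOURS, 1)   # same workload logged as 1 incident
ratio = team_two / team_one

print(team_one)  # 2.0
print(team_two)  # 100.0
print(ratio)     # 50.0
```

The two averages differ by a factor of 50 even though the total effort is identical, which is why the headline MTTR figure cannot settle which team performs better.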

MTTR Data Can Underestimate or Overestimate Incident Management Capabilities

The truth is that it’s impossible to say from MTTR alone who is handling their incidents better. It’s possible that the second incident management team is actually handling the issues associated with 50 separate incidents in a fraction of the time that it takes the first incident management team. Yet the second team’s MTTR statistic could still be proportionally much higher than the first team’s, even though both teams are handling the same incident volume.

Fortunately, other incident management metrics such as time-to-resolution-per-incident can provide more reliable insight. However, this metric still suffers from the same limitation as time-to-completion metrics because it doesn’t include a straightforward way to measure scope.

After all, some incidents are more difficult to resolve than others, and this makes time-to-resolution measurements harder to compare. For example, consider the scenario above where the second incident management team manages just a single incident. Suppose that single incident turns out to be a more complex or higher-profile problem than the incidents addressed by the first incident management team.

Even then, the time-to-resolution figure for the second incident management team will still be out of proportion to the first team’s, because it rests on a single incident. Nothing in the figure records that the incident was more complex.
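One way to see what a single average hides is to break resolution times down by scope. The sketch below is only an illustration under stated assumptions: the scope labels ("minor", "major"), the sample hours, and the function name `mttr_by_scope` are all hypothetical and do not come from any particular incident management tool.

```python
from collections import defaultdict


def mttr_by_scope(incidents):
    """Group resolution hours by scope label and average each group.

    `incidents` is a list of (scope_label, resolution_hours) pairs.
    """
    buckets = defaultdict(list)
    for scope, hours in incidents:
        buckets[scope].append(hours)
    return {scope: sum(hours) / len(hours) for scope, hours in buckets.items()}


# Hypothetical sample data: two small incidents and one complex one.
incidents = [("minor", 1.0), ("minor", 3.0), ("major", 20.0)]

overall = sum(hours for _, hours in incidents) / len(incidents)
by_scope = mttr_by_scope(incidents)

print(overall)   # 8.0
print(by_scope)  # {'minor': 2.0, 'major': 20.0}
```

The overall figure of 8 hours describes neither the minor incidents nor the major one, while the per-scope breakdown makes the difference in complexity visible.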

Uncovering More Effective Alternatives to MTTR

The fact of the matter is that time-to-completion and time-to-resolution metrics are one-dimensional and provide little insight into how well an incident management team is handling its incidents. Consequently, it’s important to rely on other incident management metrics that provide a more comprehensive picture. For example, here at Glance Networks, we use the Operations Effectiveness metric to provide a more complete picture of incident management effectiveness.

The Operations Effectiveness metric is a measure of the efficiency of an incident management team. It provides insight into how effective the team is at resolving incidents, and it can be applied to better understand the efficiency of a specific incident management team. It’s often taken a step further to become Performance Efficiency.

The Performance Efficiency metric is a measure of how effectively an incident management team uses its resources. It compares the effectiveness of incident management activities with the resources used to perform them.

Conclusion

Together, the Operations Effectiveness metric and the Performance Efficiency metric provide a more thorough and insightful method of measuring incident management effectiveness. They also show a more accurate picture of how an incident management team performs over time.

While the Operations Effectiveness metric measures how well incident management activities are performed and the resources used to perform those activities, the Performance Efficiency metric measures the efficiency of incident management. It does so by combining the time it takes to perform incident management activities with the resources used to perform them.

An Operations Effectiveness/Performance Efficiency measurement can be used to determine how well incident management resources are utilized.
