5 Reasons Major P1 Incidents Have Terribly Long Resolution Times

One would think that with all the investments in developing cloud-native applications and microservices, all the steps IT has taken to modernize applications for the cloud, and all the monitoring tools we’ve instrumented across our hybrid clouds, IT Ops and incident managers would see fewer Priority 1 (P1) incidents and shorter resolution times.

AIOps and incident management - Isaac Sacolick

Unfortunately, that’s not the case. In StarCIO’s recently published AIOps Benchmark Report on how AIOps is the operating platform for digital transformation, 25 percent of respondents said their P1 incidents usually take over six hours to resolve.

Newsflash – Businesses Expect Higher Reliability and Performance

There’s no way IT and business leaders should accept P1 resolutions that take six hours or longer.

A six-plus-hour P1 means downtime or poor performance in a customer or employee experience. Long-running P1s can result in lost revenue, added costs, lower customer satisfaction, frustrated employees, and burned-out IT staffers. Just see how Facebook’s recent outage is estimated to have cost the company over $100 million.

Today, business leaders have little tolerance for outages lasting six hours or longer. Many IT systems run in the cloud, so there’s an expectation that the infrastructure is always available and that IT deploys robust applications that won’t go down. So one reason we have more P1s today is that businesses have raised the bar to require better performance, especially during peak business periods.

Let’s consider some business impacts and reasons to avoid incidents, poor system performance, and outages.

Retailers can’t have lengthy e-commerce system outages during holiday shopping periods
Manufacturers can’t slow down and are under pressure to deliver more products faster
Financial institutions are under significant pressure to improve customer experiences
SaaS technology companies have customers that expect near-zero downtime
Airlines and hospitality businesses can’t afford poor performance in mission-critical systems
Online gaming companies will lose loyal fans if games are slow or have repeating failures
All companies driving digital transformations can’t afford outages that disrupt innovation teams

Long P1s? AIOps Has the Answers

Respondents highlighted several factors contributing to longer resolutions and why implementing an AIOps platform is key to their strategy for reducing P1 incident mean time to resolution (MTTR). In our research, 93 percent of respondents have implemented AIOps or plan to soon, and MTTR was one of the top KPIs identified for measurement and improvement.
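For readers who want to track this KPI themselves, here is a minimal sketch of how MTTR and the share of incidents running past six hours could be computed. The incident timestamps and the function names `mttr_hours` and `share_over` are made up for illustration; they are not data or methods from the benchmark report.

```python
from datetime import datetime, timedelta

def mttr_hours(incidents):
    """Mean time to resolution, in hours, for (opened, resolved) pairs."""
    durations = [
        (resolved - opened).total_seconds() / 3600
        for opened, resolved in incidents
    ]
    return sum(durations) / len(durations)

def share_over(incidents, hours):
    """Fraction of incidents that took longer than `hours` to resolve."""
    limit = timedelta(hours=hours)
    over = sum(1 for opened, resolved in incidents if resolved - opened > limit)
    return over / len(incidents)

# Hypothetical P1 incidents: (opened, resolved)
incidents = [
    (datetime(2021, 10, 1, 9, 0), datetime(2021, 10, 1, 11, 0)),   # 2 hours
    (datetime(2021, 10, 3, 14, 0), datetime(2021, 10, 3, 21, 0)),  # 7 hours
    (datetime(2021, 10, 7, 2, 0), datetime(2021, 10, 7, 5, 0)),    # 3 hours
    (datetime(2021, 10, 9, 8, 0), datetime(2021, 10, 9, 12, 0)),   # 4 hours
]

print(f"MTTR: {mttr_hours(incidents):.1f} hours")
print(f"Share over six hours: {share_over(incidents, 6):.0%}")
```

A mean can hide the long-running outliers, which is why reporting the share of P1s over a threshold alongside MTTR is often useful.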

Respondents shared many reasons why resolving P1s is harder today.

Complexities in Supporting Hybrid Architectures require IT Ops teams to retain skills, tools, and procedures to support public cloud, data center, and edge computing infrastructures. Also, applications often include cloud-native architectures such as serverless and microservices, legacy enterprise systems, SaaS, low-code platforms, and the integrations connecting them. When there’s a P1, it usually requires diagnosing performance across multiple systems and monitoring tools, requiring more people and time to identify P1 root causes. AIOps addresses these complexities by centralizing visibility across enterprise hybrid stacks.
Fewer Skilled People to Resolve Major Incidents is the top concern of incident management and IT Ops teams reported by over 50 percent of respondents. One problem is ensuring the knowledge transfer required to support legacy systems, then finding the more advanced cloud ops and SRE skillsets to support cloud-native architectures. AIOps addresses this gap by enabling remote IT operations and consolidating IT Ops, NOC, and DevOps views needed to resolve incidents.
Increased DevOps-Driven Deployment Frequencies help deliver new capabilities and fix defects to end-customers faster, but also increase the risks of introducing performance, reliability, and security issues. Respondents are automating CI/CD and IaC to improve changes but see investments in AIOps as the guard rails to digital transformation initiatives.
Complexities in Resolving P1s with Hybrid Working Teams is a factor because of the added time needed to get everyone on bridge calls, Zooms, Microsoft Teams, or other collaboration tools. Then, outside the NOCs and war rooms, incident management teams need more time to discuss findings, agree on root causes, and define action plans. AIOps platforms that provide an open integration hub enable connecting workflows across tools and promote information sharing needed by hybrid working teams.
More Monitoring Tools and Events to Review increases the number of people involved in P1s and lengthens the time to review all the alerts. Incident management teams seek an AIOps platform with event correlation, a machine learning algorithm that connects events into manageable incidents, and a single pane of glass where everyone can review the time-sequenced monitoring and observability events.
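To make the idea of event correlation concrete, here is a simplified sketch. It is a rule-based stand-in, not the machine learning approach an AIOps platform would use: it only groups events that hit the same service within a fixed time window. The `Event` and `Incident` types, the `correlate` function, the five-minute window, and the sample events are all assumptions for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Event:
    timestamp: float  # seconds since some reference point
    source: str       # monitoring tool that raised the event (hypothetical names)
    service: str      # affected service
    message: str

@dataclass
class Incident:
    service: str
    events: list = field(default_factory=list)

def correlate(events, window_seconds=300):
    """Group events into incidents: same service, gaps no longer than the window."""
    incidents = []
    open_by_service = {}
    for event in sorted(events, key=lambda e: e.timestamp):
        incident = open_by_service.get(event.service)
        # Start a new incident if none is open for this service, or if the
        # previous event for the service is older than the window.
        if incident is None or event.timestamp - incident.events[-1].timestamp > window_seconds:
            incident = Incident(service=event.service)
            incidents.append(incident)
            open_by_service[event.service] = incident
        incident.events.append(event)
    return incidents

# Hypothetical alerts arriving from four different monitoring tools
events = [
    Event(0, "apm", "checkout", "latency high"),
    Event(60, "logs", "checkout", "5xx spike"),
    Event(90, "infra", "payments-db", "cpu 95%"),
    Event(200, "synthetic", "checkout", "probe failed"),
    Event(1000, "apm", "checkout", "latency high"),
]

incidents = correlate(events)
for incident in incidents:
    print(incident.service, [e.message for e in incident.events])
```

Here five alerts collapse into three incidents, each with its events in time order, which is the kind of reduction that lets fewer people review a P1.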

The research identifies three primary AIOps capabilities that directly address P1 incident detection, triage, and resolution time. While automation is important, machine learning capabilities in event correlation, enrichment, and triage are key for helping incident management and IT Ops teams. These capabilities help Ops prioritize incidents, simplify root cause analysis, reduce the number of people responding to P1s, and provide the tools to resolve incidents quickly and accurately. In other words, AIOps helps reduce the number and severity of P1s.

So while data processing, analytics, applications, automations, customer experiences, and employee workflows are increasingly important to every business, AIOps is the primary investment to ensure that performance and reliability don’t fall behind business needs.

Read the AIOps Benchmark Report for more details!

This post is brought to you by BigPanda.

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.

