What do three multibillion-dollar companies that have been around for over one hundred years have in common? There might be straightforward answers if they were in the same industry, but what if one is in media, another in financial services, and a third in food service distribution?
IT leaders from Wiley, OneMain Financial, and US Foods presented at BigPanda’s
recent Resolve ‘21 and Pandapalooza event about how they’re modernizing their
IT operations with AIOps. I’ve already shared insights from this event,
including 3 AIOps secrets that boost quick business impact and seven lessons
from IT leaders on operating at digital speeds with AIOps. This post explores
how companies that must continually reinvent themselves use data and machine
learning to deliver great IT service management experiences.
Keep in mind that information technology wasn’t around when these three
companies were founded, and they introduced many of the systems running their
businesses over decades. But at the event, their leaders presented how
they are leveraging machine learning and automation to improve the mean time
to recovery (MTTR) from IT incidents and increase the reliability and
performance of their systems.
I was most interested in seeing how these leaders used AIOps and leveraged
data in IT Operations.
Use DevOps to Improve Data Quality
Didier Le Tien, VP of Application Development at US Foods, explained how
having clean operational data was critical to support production
applications. He states, “Changing your process through tools gives you an
opportunity to collect the better quality data needed to prove or disprove
you are on the right track. It’s one of the key elements to be more
data-driven. This data has allowed us to think outside of the box when it
comes to our operations, for example, having the visibility to identify
production issues faster, use data to improve troubleshooting, and then
address potential bugs. Because you have the data, concepts like AIOps
became a reality for us.”
I love these comments because they illustrate:
- The importance of creating and cleansing data when instituting new processes and tools
- How having cleansed operational data helps teams think outside of the box
- Their targeted improvement metrics using AIOps and open-box machine learning capabilities
Reduce Alert Fatigue - Automation and Machine Learning
Sam Chatman, VP of IT Ops at OneMain Financial, describes the impact of
leveraging AIOps: “Being able to understand what is released, when it’s
released, and the potential impacts of that release. We are overcoming alert
fatigue, and BigPanda will be our Watson of the Enterprise Monitoring Center
(EMC) by automating alerts, opening incident tickets, and identifying those
actions to improve our mean time to recovery. This helps us keep our systems
up when our users and customers need them to be.”
For other organizations, it might help to visualize what naturally happens
to IT operations’ monitoring programs over time. Every time systems go down
and IT gets thrown under the bus for a major incident, they add new
monitoring systems and alerts to improve their response times. As new
multicloud, database, and microservice technologies emerge, they add even
more monitoring tools and increase their observability capabilities.
Having more operational data and alerts is a good first step, but then alert
fatigue kicks in when tier-one support teams respond and must make sense
of dozens to thousands of alerts. OneMain has broken that cycle by
establishing an EMC, investing in AIOps, focusing on customer experience,
and addressing alert fatigue.
OneMain Financial’s EMC is relatively new, and it has already made a significant business impact. Sam shares one best practice: overcoming alert fatigue requires not only better data but also tools for automating aspects of the response. The automation improves communications and frees up time so that IT operations can focus on troubleshooting and restoring service. As Sam points out, the shift from tasks to problem-solving helps turn everyone’s focus toward improving the customer and end-user experience.
Enable Actionable Insights - Improve Signal to Noise Ratios
If automation is part of how IT Operations improve recovery times, then
reducing noisy alerts to a correlated and manageable number of incidents is
another best practice. Kiran Venkatesan, Architect at Wiley, shares a core
practice in improving the signal-to-noise ratio in the data used by IT Ops
for incident management.
Kiran says, “If there is a lot of noise, then there is no benefit. We have
started measuring compression rates in how much noise is generated by event
monitoring tools. How many alerts are duplicated, can be aggregated, or are
correlated? How much of an actionable incident is produced based on all of
the enrichment that is going in within the context of the particular
business service?”
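To make Kiran’s questions concrete, here is a minimal Python sketch of one way a team might measure compression. It is not how Wiley or BigPanda computes the number. The alert fields, the deduplication key of host plus check, the one-incident-per-service grouping, and the formula (one minus incidents divided by raw alerts) are all assumptions chosen for illustration.

```python
from collections import defaultdict

# Hypothetical raw alerts; the field names are illustrative only.
alerts = [
    {"host": "web-1", "check": "cpu_high", "service": "checkout"},
    {"host": "web-1", "check": "cpu_high", "service": "checkout"},  # duplicate
    {"host": "web-2", "check": "latency", "service": "checkout"},
    {"host": "db-1", "check": "disk_full", "service": "search"},
    {"host": "db-1", "check": "disk_full", "service": "search"},  # duplicate
    {"host": "db-2", "check": "replica_lag", "service": "search"},
]


def deduplicate(alerts):
    """Keep one alert per (host, check) pair, preserving first-seen order."""
    seen = set()
    unique = []
    for alert in alerts:
        key = (alert["host"], alert["check"])
        if key not in seen:
            seen.add(key)
            unique.append(alert)
    return unique


def correlate(alerts):
    """Group alerts into one candidate incident per business service (assumed rule)."""
    incidents = defaultdict(list)
    for alert in alerts:
        incidents[alert["service"]].append(alert)
    return dict(incidents)


def compression_rate(alert_count, incident_count):
    """Fraction of raw alerts removed on the way to incidents (assumed definition)."""
    if alert_count == 0:
        return 0.0
    return 1 - incident_count / alert_count


unique_alerts = deduplicate(alerts)
incidents = correlate(unique_alerts)
rate = compression_rate(len(alerts), len(incidents))

print(f"{len(alerts)} alerts, {len(unique_alerts)} unique, {len(incidents)} incidents")
print(f"compression rate: {rate:.0%}")
```

Tracking a number like this over time, per monitoring tool or per business service, is one way to see whether the noise is actually going down.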
So improving IT operations takes more than cleansed and correlated data; that data must lead to actionable, accurate, and at least partially automated responses. One important step is to map incidents to the impacted business services, define service level objectives, and improve communications.
Better Data Enables Automatic Incident Triage
The next step in the journey goes beyond reducing alert noise, correlating
monitoring data, and enabling response automations. In the middle of the
incident management process are bridge calls, war rooms, and other group
efforts between subject matter experts. Their goal is to work
collaboratively with all the available data and aim to troubleshoot issues,
identify root causes, and prescribe courses of action.
Even as the operational data quality improves, the triage process can be the
longest, most painful step in the incident pipeline.
BigPanda customers talked about ways their IT operations take advantage of
automatic incident triage. Context is automatically added to each incident, including identifying
the impacted business services, the teams who must stay informed, and the
type of issues that need addressing. With this context added to the
incident, first-level teams can then route the incident to the appropriate
support teams. This should eliminate the “all hands on deck” response
prevalent in IT Ops teams that haven’t invested in AIOps. Helping
IT operations triage incidents is very promising for IT leaders looking
beyond improving MTTR. Proactive leaders also aim to reduce the number of
monthly incidents and enhance IT support personnel’s work-life balance.
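The enrichment and routing described above can be pictured with a small Python sketch. It is an illustration only, not BigPanda’s implementation. The service map, team names, keyword rules, and function names are all hypothetical, and real context would more likely come from a CMDB or service catalog than from a hard-coded dictionary.

```python
# Hypothetical service map; in practice this context might come from a CMDB
# or service catalog rather than a hard-coded dictionary.
SERVICE_MAP = {
    "web-1": {"service": "checkout", "team": "payments-oncall"},
    "db-1": {"service": "search", "team": "data-platform-oncall"},
}

# Hypothetical keyword rules for classifying the type of issue.
ISSUE_TYPES = {"disk": "storage", "cpu": "capacity", "latency": "performance"}

# Assumed fallback when no context is available for a host.
DEFAULT_TEAM = "tier-one-support"


def classify(check):
    """Return the first issue type whose keyword appears in the check name."""
    for keyword, issue_type in ISSUE_TYPES.items():
        if keyword in check:
            return issue_type
    return "unknown"


def enrich(incident):
    """Return a copy of the incident with service, team, and issue type added."""
    context = SERVICE_MAP.get(incident["host"], {})
    return {
        **incident,
        "service": context.get("service", "unmapped"),
        "team": context.get("team", DEFAULT_TEAM),
        "issue_type": classify(incident["check"]),
    }


def route(incident):
    """Pick the support team that should receive the enriched incident."""
    return incident["team"]


enriched = enrich({"host": "db-1", "check": "disk_full"})
unmapped = enrich({"host": "cache-9", "check": "evictions"})

print(route(enriched), enriched["service"], enriched["issue_type"])
print(route(unmapped), unmapped["service"], unmapped["issue_type"])
```

The fallback to a tier-one team for unmapped hosts is a design choice in this sketch: incidents without context still land with someone, rather than paging everyone.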
When you see that hundred-year-old enterprises recognize the importance of
high system reliability and enable IT operations with AIOps tools to improve
service levels, you sense how important both customer and employee
experiences are to these companies.
When you listen to their leaders, you get the sense that many IT organizations have much to gain by
improving IT operational data and investing in AIOps.
This post is brought to you by BigPanda
The views and opinions expressed herein are those of the author and do
not necessarily represent the views and opinions of BigPanda.