Incident management is not just the responsibility of IT Ops teams. DevOps teams aren’t the only ones responsible for fixing defects or reducing technical debt, SREs aren’t the only ones who pursue performance and reliability improvements, and infosec teams aren’t the only ones concerned about security vulnerabilities.
These are collective responsibilities for responding to issues and ensuring customers and end-users have secure, high-quality, effective, and extraordinary digital experiences. However, in many organizations, connecting the operational dots across teams, responsibilities, and the tech stack isn’t easy despite everyone’s best intentions.
Just consider the recent Fastly outage as an example, where “a customer pushed a valid configuration change that included the specific circumstances that triggered the bug, which caused 85% of our network to return errors.”
Ouch.
And yet, we can sympathize with the underlying complexities that caused the issue, dread the reactions from impacted customers, and contemplate their forty-nine-minute recovery time. The good news: Fastly avoided long-term brand damage by taking responsibility quickly and conducting a blameless postmortem about how to improve their end-to-end incident pipeline.
What Is an End-to-End Incident Pipeline?
Classic incident management starts with system-generated alerts plus escalations from end-users, centralized in an ITSM tool, and then managed by one or more IT Ops teams. IT Ops and incident management teams aim to detect those alerts and to respond to and resolve issues with minimal business impact, ideally without pulling in many technologists to answer questions or take an active role in recovery efforts. Recurring and significant incidents transition to problems, where SREs, DevOps, architects, and others investigate root causes and recommend longer-term remediations.
That is a simplistic view of incident management, and it is outdated for today’s business needs and operational complexities.
Today, an end-to-end incident pipeline must account for mission-critical applications, customer experiences, mobile apps, SaaS integrations, microservice architectures, multicloud infrastructures, big data and machine learning experiments, and IoT data streams, as well as growing security factors and increasing compliance regulations. It must enable teams to detect, triage, investigate, respond to, and remediate alerts in a smart, efficient, and fast process.
Most importantly, it needs to bring all of IT’s functions into one collaborative, information-sharing environment where the right people can efficiently address the different types of issues that result in poor experiences and operational risks.
Consider the practices and capabilities that enable multidisciplinary IT organizations to reduce incidents, improve incidents’ mean time to recovery, and deliver more reliable business services. An AIOps platform’s data centralization, machine learning, and automation capabilities drive the collaboration, process improvement, and culture changes.
Here are the key steps to bring DevOps, SRE, and IT Ops incident management
practices into one holistic view.
- Institute observability standards and patterns in the application development phase, especially when developing technology innovations, application modernizations, microservices, and integrations.
- Centralize alerts and data from monitoring tools with observability data into one hub, and implement event correlation with an AIOps platform. The correlated events establish a bottom-up view of what’s happening across the technology stack, and they become the top-level data used to track, investigate, and resolve incidents.
- Enrich event data with data from the CMDB and use topology meshes to map technology platforms into business services.
- Create new operational baselines based on successful and failing events, then collaborate with business service owners in defining service level objectives (SLOs).
- Track error budgets and use them to help prioritize work with DevOps teams that can improve performance and reliability.
- Establish a center of excellence to implement automations that connect workflows across tools, orchestrate responses to common issues, and address manual operations.
- Optimize by automating incident triage to help prioritize production issues based on business rules and context.
- Leverage an AIOps platform’s open box machine learning to improve event correlation and aid SREs and DevOps teams in performing root cause analysis.
- Transform operational practices, including major incident response, defect prioritization, performance testing, and security incident management, to leverage the tools, analytics, and integrated workflows modernized with the AIOps platform.
- Establish an operational release cadence and change the culture by leveraging agile methodologies, blameless postmortems, operational standards, and KPI dashboards.
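To illustrate the centralize-and-enrich steps in the list above, here is a minimal Python sketch. It assumes a tiny in-memory CMDB extract and a fixed five-minute time window; the `Alert` fields, the `CMDB` mapping, the host names, and the `correlate` function are all hypothetical. A real AIOps platform correlates with machine learning and topology data rather than a single fixed rule, so treat this only as a picture of the idea.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    timestamp: int  # seconds since epoch


# Hypothetical CMDB extract: maps each host to the business service it supports.
CMDB = {"web-01": "checkout", "db-01": "checkout", "mq-07": "notifications"}


def correlate(alerts, window_seconds=300):
    """Group alerts that hit the same business service close together in time."""
    # Enrichment: attach a business service to every alert via the CMDB lookup.
    by_service = defaultdict(list)
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        service = CMDB.get(alert.host, "unmapped")
        by_service[service].append(alert)

    # Correlation: start a new incident whenever the gap between consecutive
    # alerts for a service exceeds the window.
    incidents = []
    for service, items in by_service.items():
        current = [items[0]]
        for alert in items[1:]:
            if alert.timestamp - current[-1].timestamp <= window_seconds:
                current.append(alert)
            else:
                incidents.append((service, current))
                current = [alert]
        incidents.append((service, current))
    return incidents


sample_alerts = [
    Alert("web-01", "http_5xx", 1000),
    Alert("db-01", "slow_query", 1100),
    Alert("mq-07", "queue_depth", 1200),
    Alert("web-01", "http_5xx", 5000),
]

for service, grouped in correlate(sample_alerts):
    print(service, [f"{a.host}:{a.check}" for a in grouped])
```

In this sketch, the web and database alerts that arrive within minutes of each other roll up into one incident for the checkout service, which is the bottom-up-to-business-service view the steps describe.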
These are key steps, but organizations may choose to implement them in
different orders. For example, an organization with many legacy applications
and mission-critical service expectations might start with centralizing
alerts, implementing automations, and modernizing incident triage. On the
other hand, organizations with cloud-native architectures should institute
observability standards, enrich using topology meshes, define service levels,
and use error budgets to drive operational priorities.
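For teams new to error budgets, the arithmetic behind the SLO steps is simple. The sketch below assumes a request-based SLO measured over one reporting period; the function name, the 99.9% target, and the request counts are illustrative assumptions, not figures from any specific platform.

```python
def error_budget_remaining(slo_target: float, total_events: int, failed_events: int) -> float:
    """Return the unspent fraction of the error budget (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if total_events <= 0:
        return 1.0  # no traffic yet, so none of the budget has been spent

    # The budget is the number of failures the SLO tolerates in this period.
    allowed_failures = (1.0 - slo_target) * total_events
    return 1.0 - failed_events / allowed_failures


# A 99.9% SLO over 1,000,000 requests tolerates about 1,000 failures.
# With 250 failures so far, roughly three quarters of the budget is left.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
print(f"{remaining:.0%} of the error budget remains")
```

A team with most of its budget left can keep shipping features, while a team that has overspent (a negative result) has a data-backed reason to prioritize reliability work with its DevOps counterparts.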
Why Modernizing Incident Management Must Be an IT Leader’s Priority
You might be thinking, “with all the priorities facing IT organizations, why should implementing an AIOps platform and revamping IT’s incident management practices be high on the list?”
Here’s my take. Consider that most if not all IT organizations have an incredible 2x challenge facing them:
- The business demand for releasing new features and capabilities is often two or more times greater than what the DevOps team can deliver reliably and with minimal defects.
- IT supports twice the number of platforms because DevOps teams are developing apps faster than the business decommissions legacy platforms.
- Business leaders’ expectations for faster performance and greater reliability are at least two nines higher than the service levels most IT Ops teams can deliver.
- While security leaders implement remediations to critical risks, at least two times more threats become mainstream problems.
- The speed of technology-driven change and disruption is at least twice the velocity at which most organizations can drive digital transformation.
- Ops can’t keep up with the growing demand in mission-critical apps, and the tech must help them scale support functions.
- Modernizing applications is a multi-year journey and tools must support legacy, modernized, and multicloud applications.
- AIOps comes in two flavors: AIOps built into single-purpose tools and AIOps that scales across diverse datasets, clouds, platforms, and organizational silos. Choose wisely because AIOps platforms that scale are game-changing.
The implications are that despite all the architecture improvements, testing
efforts, cloud infrastructure reliability, and workflow tools that get
implemented, IT is still going to face incidents and reliability issues.
Only the stakes are higher because of how important technology is to the
business, and the pressure is growing as business leaders have increasing
expectations.
So when I see solutions that consolidate complexity, drive automation, and
provide machine learning capabilities in a mission-critical function, I know
adopting them should be a critically important priority for many organizations.
This post is brought to you by BigPanda
The views and opinions expressed herein are those of the author and do
not necessarily represent the views and opinions of BigPanda.