An application is waiting more than three seconds for an API’s response. The response time exceeds the performance requirements for this API, so a monitoring tool triggers an alert that automatically creates an incident ticket. By the time a service manager in IT Ops responds, the API shows acceptable response performance, and the ticket is closed without investigation.
What the service manager doesn’t see is that this is the fifth time in two weeks that this API tripped alerts, and two customer service complaint tickets are likely related to the problem. This IT group isn’t using AIOps to correlate alerts and automate integrations between tools, so recognizing this customer-impacting and recurring problem, triaging the root cause, and prioritizing its remediation is not on anyone’s radar. Instead, IT is investing time to close tickets while customers are complaining.
What Are Service Level Objectives and Error Budgets?
IT organizations must manage to higher expected service levels while
supporting a mix of cloud-native applications, microservices, and legacy
monolithic applications. In response, progressive IT organizations, including
several leaders at hundred-year-old companies, are investing in AIOps and
establishing SRE practices, changing how DevOps teams improve application
reliability, resolve incidents faster, and reduce alert fatigue.
I spoke to Jason Walker, field CTO at
BigPanda, about applying SRE
methodologies, measuring Service Level Objectives (SLOs), and managing error
budgets using AIOps capabilities.
Jason acknowledges that more people in IT Ops and their business
stakeholders must understand SRE terminologies and methodologies. He
explains, “Error budgets are a useful way to think about issues in the
context of providing a reliable service. Maybe you’ve decided, ‘my SLO is
99.9 percent,’ and the ratio of failures to attempts is going to be my
service level indicator (SLI). You can only afford one failure for
every 1,000 attempts. That’s your error budget.”
So instead of measuring failures against time, such as counting alerts per week, service level objectives are calculated differently: they capture error events as a percentage of total events.
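To make the arithmetic concrete, here is a minimal sketch of the SLO, SLI, and error budget relationship Walker describes. The function names are illustrative, not from any BigPanda API:

```python
def error_budget(slo: float, total_attempts: int) -> float:
    """Allowed failures for a given SLO, e.g. a 99.9% SLO
    permits roughly 1 failure per 1,000 attempts."""
    return (1 - slo) * total_attempts

def sli(failures: int, attempts: int) -> float:
    """Service level indicator: the fraction of attempts that succeeded."""
    return 1 - failures / attempts

# With a 99.9 percent SLO, 1,000 attempts allow about one failure.
budget = error_budget(0.999, 1000)
indicator = sli(failures=2, attempts=1000)
print(budget, indicator, indicator >= 0.999)  # two failures breach the SLO
```

Two failures in 1,000 attempts yields an SLI of 0.998, which falls below the 0.999 objective, so the budget is spent.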
How Using Error Budgets Reduces Alert Fatigue
Using SLOs can change the business and operational mindset on how to
monitor, what to measure, when to alert, and how IT Ops responds to
incidents.
SREs use burndown reports to monitor error rates in the same way
developers use them to monitor sprint, release, and epic
burndowns. Alerts are generated only when the burndown exceeds the error
budget for a designated time period. Some groups also use predictive
algorithms to consider whether error rates are trending toward exceeding the
budget.
Walker goes on to explain how measuring errors and tracking error budgets
with burndowns changes the approach. He says, “Sustained breaching over that
ratio for a given period or spiking by exceeding the ratio by a significant
amount should trigger an SLO alert so that you can take
action. You can scale it up to the business service level and
measure it down to the microservice level.”
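Walker’s sustained-breach-or-spike rule could be sketched as follows. This is a toy illustration under assumed thresholds, not BigPanda’s implementation; the class name, window size, and spike factor are all invented for the example:

```python
from collections import deque

class SloBurnAlert:
    """Alert when the failure ratio breaches the error budget in a
    sustained way, or spikes well past it in a single interval."""

    def __init__(self, budget_ratio: float, window: int = 5,
                 spike_factor: float = 10.0):
        self.budget_ratio = budget_ratio    # allowed failures/attempts, e.g. 0.001
        self.spike_factor = spike_factor    # one interval this far over budget alerts at once
        self.recent = deque(maxlen=window)  # failure ratios for the last `window` intervals

    def observe(self, failures: int, attempts: int) -> bool:
        ratio = failures / attempts
        self.recent.append(ratio)
        spike = ratio >= self.budget_ratio * self.spike_factor
        sustained = (len(self.recent) == self.recent.maxlen
                     and all(r > self.budget_ratio for r in self.recent))
        return spike or sustained

alert = SloBurnAlert(budget_ratio=0.001)
print(alert.observe(failures=15, attempts=1000))  # True: 1.5% spikes past a 0.1% budget
```

A milder breach, say 0.2 percent per interval, would only alert after five consecutive intervals over budget, which is the sustained case.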
The approach helps reduce alert fatigue, a condition that plagues IT Ops
teams whose tools automatically trigger alerts and page responders for every
issue. Business leaders can collaborate with IT Ops to define
error budgets with business context; for example, they may set
higher SLOs and lower error budgets during peak hours or to support peak
seasons.
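Adjusting the objective by business context can be as simple as a schedule lookup. The hours and SLO values below are hypothetical, chosen only to show the idea of a tighter objective during an assumed peak window:

```python
def slo_for_hour(hour: int) -> float:
    """Return a tighter SLO during assumed peak business hours (9am-6pm)."""
    peak_hours = range(9, 18)
    return 0.9995 if hour in peak_hours else 0.999

print(slo_for_hour(12), slo_for_hour(3))  # tighter SLO at noon than at 3am
```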
Managing Incidents with Error Budgets and AIOps Event Correlation
So, to go back to my example, the first API error issue probably would not
trigger an alert or record an incident if the SLO for this service was being
met and the error budget was not exceeded. But by the fifth error in
two weeks, chances are the error budget for this service is exceeded and
action is required.
IT Ops teams using AIOps
capabilities have an advantage when measuring error budgets. Let’s say the
API alert triggers other alerts from the consuming microservice and several
downstream applications.
The AIOps open box machine learning algorithms can correlate these alerts
and escalate them as one incident ticket to IT Ops. Tools then show the
time sequence of alerts, which helps IT Ops triage the issue faster, and
they can kick off automated responses that address known issues. The
combination of these capabilities allows IT Ops to improve their mean time
to resolution.
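As a deliberately naive stand-in for that correlation (the real open box machine learning is far more sophisticated; the time-window heuristic and names here are assumptions for illustration), alerts arriving close together in time can be grouped into one candidate incident:

```python
from dataclasses import dataclass

@dataclass
class Alert:
    source: str
    timestamp: float  # seconds since the first alert

def correlate(alerts: list[Alert], window: float = 120.0) -> list[list[Alert]]:
    """Group alerts within `window` seconds of the previous alert
    into a single candidate incident."""
    incidents: list[list[Alert]] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if incidents and alert.timestamp - incidents[-1][-1].timestamp <= window:
            incidents[-1].append(alert)   # same burst: fold into the open incident
        else:
            incidents.append([alert])     # gap too large: start a new incident
    return incidents

alerts = [Alert("api", 0), Alert("microservice", 30),
          Alert("downstream-app", 90), Alert("api", 4000)]
print(len(correlate(alerts)))  # 2: one correlated burst plus a later, separate alert
```

In the example scenario, the API alert, the consuming microservice’s alert, and the downstream application alerts would land in one incident ticket instead of four.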
IT Ops also benefits by using the
AIOps open integration hub
that connects to ServiceNow, Jira, and Slack. Customer service is
automatically notified of the issue and its resolution via Slack, and when
SREs determine that the root cause is a code issue, a Jira defect is
created on the appropriate team’s backlog.
How SREs Use Error Budgets to Prioritize App Improvements
Error budgets serve as a tool for IT Ops to recognize and prioritize which
alerts require incident management. But SREs also use error budgets to
prioritize the operational issues and technical debt that agile teams
should invest development time in addressing.
These SREs use error budgets and their burndowns to drive a prioritization
dialog with agile product owners. When business services, applications,
dataops services, or microservices consistently exceed their error budgets,
there is a clear rationale for investing development effort in addressing
root causes. On the other hand, if the product owner isn’t prioritizing
remediations, then IT Ops may be justified in lowering the SLOs and managing
to a larger error budget.
SREs using a topology mesh
can show product owners and application architects the dependencies and
relationships between microservices, applications, databases, and business
services. Once there is agreement on upgrades and defect fixes, these maps
help illustrate where development teams should focus their improvements.
Defining SLOs and error budgets is a key practice for IT organizations
implementing digital transformations, hybrid working, cloud migrations, and
other technology investments. Using AIOps in the implementation is a
game-changer as it correlates alerts from multiple sources, streamlines
incident reporting, supports faster issue triage, and enables workflow
integrations.
This post is brought to you by BigPanda
The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.