How AIOps Help SREs Measure Error Budgets and Fulfill SLOs

An application is waiting more than three seconds for an API’s response. The response time exceeds the performance requirements for this API, so a monitoring tool triggers an alert that automatically creates an incident ticket. By the time a service manager in IT Ops responds, the API shows acceptable response performance, and the ticket is closed without investigation.

How AIOps Help SREs Measure Error Budgets and Fulfill SLOs - Sacolick

What the service manager doesn’t see is that this is the fifth time in two weeks that this API tripped alerts, and two customer service complaint tickets are likely related to the problem. This IT group isn’t using AIOps to correlate alerts and automate integrations between tools, so recognizing this customer-impacting and recurring problem, triaging the root cause, and prioritizing its remediation are not on anyone’s radar. Instead, IT is investing time to close tickets while customers are complaining.

What Are Service Level Objectives and Error Budgets?

IT organizations must manage to higher expected service levels while supporting a mix of cloud-native applications, microservices, and legacy monolithic applications. But progressive IT organizations, including several leaders at hundred-year-old companies, are investing in AIOps, establishing SRE practices, changing how DevOps teams improve application reliability, resolving incidents faster, and reducing alert fatigue.

I spoke to Jason Walker, field CTO at BigPanda, about applying SRE methodologies, measuring Service Level Objectives (SLOs), and managing error budgets using AIOps capabilities.

Walker acknowledges that more people in IT Ops and their business stakeholders must understand SRE terminology and methodologies. He explains, “Error budgets are a useful way to think about issues in the context of providing a reliable service. Maybe you’ve decided, ‘my SLO is 99.9 percent,’ and the ratio of failures to attempts is going to be my service level indicator (SLI). You can only afford one failure for every 1,000 attempts. That’s your error budget.”

So instead of simply counting failures against time, such as the number of alerts per week, service level objectives capture error events as a percentage of total events.
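The arithmetic behind Walker’s example can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the ErrorBudget class and its method names are hypothetical, and the 5 failures in 20,000 attempts are made-up sample numbers, not data from any real service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical helper: an SLO expressed as a target success ratio."""
    slo: float  # e.g. 0.999 for "99.9 percent"

    @property
    def allowed_failure_ratio(self) -> float:
        # The error budget is whatever the SLO leaves over.
        return 1.0 - self.slo

    def allowed_failures(self, attempts: int) -> float:
        # Number of failed events the SLO tolerates for this many attempts.
        return self.allowed_failure_ratio * attempts

    def consumed(self, failures: int, attempts: int) -> float:
        # Share of the budget already spent; 1.0 means exactly spent.
        allowed = self.allowed_failures(attempts)
        if allowed == 0:
            return 0.0 if failures == 0 else float("inf")
        return failures / allowed


budget = ErrorBudget(slo=0.999)
# Failures tolerated per 1,000 attempts at a 99.9 percent SLO.
print(round(budget.allowed_failures(1_000), 6))
# Share of the budget spent by 5 failures in 20,000 attempts (sample numbers).
print(round(budget.consumed(failures=5, attempts=20_000), 3))
```

Expressing the budget as a ratio of events rather than a count per week is what lets the same check apply to a busy API and a quiet one.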

How Using Error Budgets Reduces Alert Fatigue

Using SLOs can change the business and operational mindset on how to monitor, what to measure, when to alert, and how IT Ops responds to incidents.

SREs use burndown reports to monitor error rates in the same way developers use them to monitor sprint, release, and epic burndowns. Alerts are generated only when the burndown exceeds the error budget for a designated time period. Some groups also use predictive algorithms to consider whether errors are trending in that direction.

Walker goes on to explain how measuring errors and tracking error budgets with burndowns changes the approach. He says, “Sustained breaching over that ratio for a given period or spiking by exceeding the ratio by a significant amount should trigger an SLO alert so that you can take action. You can scale it up to the business service level and measure it down to the microservice level.”
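The two triggers Walker describes, a sustained breach and a spike, might be checked along these lines. This is a minimal sketch, not any tool’s actual alerting rule: the should_alert function, the three-window threshold for “sustained,” and the 10x multiplier for “spike” are all assumptions chosen for illustration.

```python
from typing import Optional, Sequence, Tuple

# One time window of measurements: (failures, attempts).
Window = Tuple[int, int]


def burn_rate(window: Window, slo: float) -> float:
    """Failure ratio in the window divided by the ratio the SLO allows."""
    failures, attempts = window
    if attempts == 0:
        return 0.0
    return (failures / attempts) / (1.0 - slo)


def should_alert(
    windows: Sequence[Window],
    slo: float = 0.999,
    sustained_windows: int = 3,   # assumed: how many windows count as "sustained"
    spike_factor: float = 10.0,   # assumed: how far over the ratio counts as a "spike"
) -> Optional[str]:
    """Return "spike", "sustained", or None for the most recent windows."""
    if not windows:
        return None
    # Spike: the latest window exceeds the allowed ratio by a large multiple.
    if burn_rate(windows[-1], slo) >= spike_factor:
        return "spike"
    # Sustained: every one of the last N windows is over the allowed ratio.
    recent = windows[-sustained_windows:]
    if len(recent) == sustained_windows and all(
        burn_rate(w, slo) > 1.0 for w in recent
    ):
        return "sustained"
    return None


# Hypothetical hourly windows for one API: three in a row over the ratio.
print(should_alert([(0, 1000), (2, 1000), (3, 1000), (2, 1000)]))
```

A single window that slightly exceeds the ratio returns None here, which is the behavior that keeps a one-off slow response from paging anyone.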

The approach helps reduce alert fatigue, a condition that plagues IT Ops when every issue automatically triggers an alert and sets off pagers. Business leaders can collaborate with IT Ops to define error budgets with business context. For example, they may set higher SLOs and lower error budgets during peak hours or to support peak seasons.

Managing Incidents with Error Budgets and AIOps Event Correlation

So, to go back to my example, the first API error probably does not trigger an alert or record an incident if the SLO for this service is being met and the error budget is not exceeded. But by the fifth error in two weeks, chances are the error budget for this service has been exceeded and action is required.

IT Ops teams using AIOps capabilities have an advantage when measuring error budgets. Let’s say the API alert triggers other alerts from the consuming microservice and several downstream applications. The AIOps open box machine learning algorithms can correlate these alerts and escalate them as one incident ticket to IT Ops. Tools then show the time sequence of alerts, which helps IT Ops triage the issue faster, and they can kick off automated responses that address known issues. The combination of these capabilities allows IT Ops to improve their mean time to resolution.
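To make the idea of correlation concrete, here is a deliberately simple rule-based sketch that groups alerts arriving close together in time from services that depend on one another. It is not how any AIOps product’s machine learning works; the Alert and Incident classes, the correlate function, the five-minute window, and the service names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Alert:
    timestamp: float  # seconds since some reference point
    service: str
    message: str


@dataclass
class Incident:
    alerts: List[Alert] = field(default_factory=list)

    @property
    def services(self) -> Set[str]:
        return {a.service for a in self.alerts}


def related(a: str, b: str, depends_on: Dict[str, Set[str]]) -> bool:
    """True if the services are the same or directly depend on each other."""
    return a == b or b in depends_on.get(a, set()) or a in depends_on.get(b, set())


def correlate(
    alerts: List[Alert],
    depends_on: Dict[str, Set[str]],
    window_seconds: float = 300.0,  # assumed correlation window
) -> List[Incident]:
    """Group alerts into incidents by time proximity and direct dependency."""
    incidents: List[Incident] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        for incident in incidents:
            recent = alert.timestamp - incident.alerts[-1].timestamp <= window_seconds
            linked = any(related(alert.service, s, depends_on) for s in incident.services)
            if recent and linked:
                incident.alerts.append(alert)
                break
        else:
            # No open incident matches, so this alert starts a new one.
            incidents.append(Incident([alert]))
    return incidents


# Hypothetical topology: the app calls a microservice, which calls the API.
depends_on = {"checkout-app": {"orders-service"}, "orders-service": {"pricing-api"}}
alerts = [
    Alert(0, "pricing-api", "response time over 3s"),
    Alert(30, "orders-service", "upstream timeout"),
    Alert(60, "checkout-app", "slow page load"),
    Alert(90, "hr-portal", "disk usage high"),
]
incidents = correlate(alerts, depends_on)
for incident in incidents:
    # Each incident keeps its alerts in time order for triage.
    print([a.service for a in incident.alerts])
```

Because each incident stores its alerts in arrival order, the earliest alert is a natural first place to look during triage.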

IT Ops also benefits by using the AIOps open integration hub that connects to ServiceNow, Jira, and Slack. Customer service is automatically notified of the issue and resolution via Slack, and when the SREs determine that the root cause is a code issue, a Jira defect is created on the appropriate team’s backlog.

How SREs use Error Budgets to Prioritize App Improvements

Error budgets serve as a tool for IT Ops to recognize and prioritize which alerts require incident management. But SREs also use error budgets to prioritize which operational issues and technical debt agile teams should invest development time in addressing.

These SREs use error budgets and their burndowns to have a dialog with agile product owners on prioritization. When business services, applications, dataops services, or microservices consistently exceed their error budgets, there should be a rationale to invest in the development effort to address root causes. On the other hand, if the product owner isn’t prioritizing remediations, then IT Ops may be justified in reducing the SLOs and managing to a larger error budget.

SREs using a topology mesh can show the dependencies and relationships between microservices, applications, databases, and business services to the product owner and application architects. So once there is agreement on upgrades and fixing defects, these maps help illustrate where development teams should focus on improvements.
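One way to picture how a dependency map and error budgets combine is a short traversal that lists which services behind a business service have overspent their budgets. This is a sketch under assumptions: the over_budget_dependencies function, the service names, and the budget figures are hypothetical, and a value above 1.0 is taken to mean the budget is exceeded.

```python
from collections import deque
from typing import Dict, List, Set


def over_budget_dependencies(
    root: str,
    depends_on: Dict[str, Set[str]],
    budget_consumed: Dict[str, float],
) -> List[str]:
    """Walk the dependency map from root and return the services whose
    error budget is overspent, worst offender first."""
    seen = {root}
    queue = deque([root])
    hot: List[str] = []
    while queue:
        service = queue.popleft()
        if budget_consumed.get(service, 0.0) > 1.0:
            hot.append(service)
        # Visit dependencies in sorted order so results are deterministic.
        for dep in sorted(depends_on.get(service, ())):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return sorted(hot, key=lambda s: budget_consumed[s], reverse=True)


# Hypothetical map and budget figures (1.0 = budget exactly spent).
depends_on = {
    "checkout": {"orders-service", "catalog-db"},
    "orders-service": {"pricing-api"},
}
budget_consumed = {
    "checkout": 1.2,
    "orders-service": 0.4,
    "pricing-api": 2.5,
    "catalog-db": 0.9,
}
focus = over_budget_dependencies("checkout", depends_on, budget_consumed)
print(focus)
```

In this made-up example the business service is over budget mainly because of an API two hops away, which is the kind of finding a product owner can act on when prioritizing the backlog.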

Defining SLOs and error budgets is a key practice for IT organizations implementing digital transformations, hybrid working, cloud migrations, and other technology investments. Using AIOps in the implementation is a game-changer as it correlates alerts from multiple sources, streamlines incident reporting, supports faster issue triage, and enables workflow integrations.

This post is brought to you by BigPanda

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.
