We all know the saying about data: “garbage in, garbage out.” The same is true for IT teams trying to reduce alert noise by centralizing messy, heterogeneous observability data and monitoring alerts.
But maybe we can look at this differently and counter this reality about data quality with a business-impacting approach: “enriched data in, operational results out.” In other words, the more you improve the quality and richness of the underlying alert data, the more likely you are to deliver positive business outcomes through higher-quality event correlation, faster incident triage, and accurate root cause analysis.
Now let’s consider the data that NOCs and IT Ops use to monitor systems and resolve incidents, once it has been normalized and enriched within an AIOps platform. That data is the critical foundation that determines how well IT Ops reduces alert noise, triages incidents, reduces the mean time to resolve them, and improves the accuracy of identifying root cause changes. Without normalizing, deduping, and enriching alerts with topological and operational data, it’s truly “garbage in, garbage out,” which limits the ability of AIOps to eliminate IT noise, surface root cause changes, and automate incident management tasks.
The data scientists and analysts in your organization know all about data cleansing and enrichment. When they bring in multiple data sets about customers, financials, or operations, they must review and improve data quality before analyzing and promoting insights.
In IT Ops, cleansing data is about normalizing common fields from different monitoring tools and observability data sources and deduping redundant data, while enriching it is about joining other data sources such as the CMDB and ITSM tickets.
What are the Types of Data Cleansing and Enrichment?
So what kind of data issues are we concerned about in IT Ops? Here are some data cleansing examples:
- Translating differences in naming conventions across technology stack layers
- Filling in data gaps because standards and practices in how people configured alerts and created logs changed over time
- Normalizing multiple hierarchies of alert levels into a standard
- Reducing redundant data through event deduplication
- Grouping and normalizing different error conditions that point to common root causes
- Cleansing information on clouds, data centers, and environments inferred from log file names and storage paths, or how administrators configured monitoring alerts
And here are some data enrichment examples:
- Topological information around the infrastructure, networks, databases, and applications from the CMDB or asset management system
- Business service information such as their names and owners
- Contextual information on which end-users perform key transactions and when
- Service level objectives and rules for mapping transactions to them
- Workflow data from ITSM, agile, PMO, and DevOps tools tied to releases, deployments, tickets, and changes
The ultimate outcome of enriching alerts with metadata from multiple data sources is that IT teams can increase customer satisfaction and demonstrate improvements in operational KPIs through higher-quality event correlation, noise reduction, and faster MTTR.
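To make the cleansing steps concrete, here is a minimal Python sketch of normalizing field names and severity levels from different tools and then deduplicating the result. The field names, the `SEVERITY_MAP` table, and the sample alerts are all assumptions made up for illustration; they do not come from any particular monitoring tool or AIOps platform.

```python
from dataclasses import dataclass

# Hypothetical mapping from tool-specific severity labels to one standard scale.
SEVERITY_MAP = {
    "crit": "critical", "critical": "critical", "p1": "critical",
    "warn": "warning", "warning": "warning", "p2": "warning",
    "info": "info", "ok": "info",
}


@dataclass(frozen=True)
class Alert:
    host: str
    check: str
    severity: str


def normalize(raw: dict) -> Alert:
    """Map tool-specific field names and values onto one common schema."""
    host = raw.get("host") or raw.get("hostname") or raw.get("instance") or "unknown"
    check = raw.get("check") or raw.get("alert_name") or "unknown"
    level = str(raw.get("severity") or raw.get("level") or "info").lower()
    # Unrecognized severity labels fall back to "info" in this sketch.
    return Alert(host.lower(), check.lower(), SEVERITY_MAP.get(level, "info"))


def dedupe(alerts):
    """Keep the first alert seen for each (host, check) pair, preserving order."""
    seen = {}
    for alert in alerts:
        seen.setdefault((alert.host, alert.check), alert)
    return list(seen.values())


# Three raw alerts from imaginary tools; the first two describe the same problem.
raw_alerts = [
    {"hostname": "DB-01", "alert_name": "disk_full", "level": "CRIT"},
    {"host": "db-01", "check": "disk_full", "severity": "P1"},
    {"instance": "web-02", "check": "latency", "severity": "warn"},
]
clean = dedupe(normalize(r) for r in raw_alerts)
print(clean)
```

In this sketch the two disk alerts collapse into one because they agree on host and check once the naming and severity differences are normalized away.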
Let’s review three examples.
1. Quickly Determine What is Causing Alerts
When end-users or service desk teammates open incident tickets, how many fields are on the form for capturing where the end-user is reporting an issue? When it’s a system-generated issue, how much context about the various systems and their dependencies is reflected in a screen that incident managers can use to identify, route, and resolve the issue?
Monitoring systems and observability sources are unlikely to store the full underlying context. But most IT organizations do have a CMDB with some of their infrastructure assets, and advanced CMDBs also store dependency maps and configuration data. By integrating AIOps with ServiceNow or another CMDB, the configuration information can be displayed on a screen shared by all incident responders so that they have better context on the problem.
Let’s say the NOC receives alerts from four different APIs and one infrastructure service within an AIOps platform. The alerts are enriched with CMDB data that shows the infrastructure service is an API proxy service, and requests from all four APIs route through it. The team restores all the services by restarting the proxy. Then, SREs reviewing for root cause changes identify a hung API as the problem, mark the issue as a defect, and recommend other configuration changes to improve the proxy’s error handling.
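The proxy scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CMDB_DEPENDS_ON` is a made-up dependency extract, the service names are hypothetical, and a real CMDB integration would pull this data through the CMDB's own API rather than a hard-coded dictionary.

```python
from collections import Counter

# Hypothetical CMDB extract: each configuration item and what it depends on.
CMDB_DEPENDS_ON = {
    "orders-api": ["api-proxy", "orders-db"],
    "billing-api": ["api-proxy", "billing-db"],
    "search-api": ["api-proxy"],
    "profile-api": ["api-proxy", "profile-db"],
    "api-proxy": [],
}


def enrich(alert: dict) -> dict:
    """Attach the CMDB dependencies of the alerting service to the alert."""
    return {**alert, "depends_on": CMDB_DEPENDS_ON.get(alert["service"], [])}


def shared_suspects(alerts: list) -> list:
    """Return alerting services that other alerting services depend on,
    ordered from most shared to least shared."""
    alerting = {a["service"] for a in alerts}
    counts = Counter(
        dep for a in alerts for dep in a["depends_on"] if dep in alerting
    )
    return [service for service, _ in counts.most_common()]


# Five alerts arrive: four APIs plus the infrastructure service.
alerts = [enrich({"service": name}) for name in CMDB_DEPENDS_ON]
print(shared_suspects(alerts))
```

With the dependency data attached, the one alerting component that all four APIs share stands out immediately, which is the context the responders need to try restarting the proxy first.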
2. Resolve Issues Collaboratively and Efficiently
An application is on a cloud, running in a cluster, deployed to a container, and has connections to multiple persistent and serverless functions. If these are AWS services, technical tags often include the account, region, availability-zone, autoscaling-group, instance-type, and more specific names, images, and instance ids. Containers managed by Kubernetes may also have common labels such as component, version, and part-of a certain hierarchy.
This configuration information is particularly important in application performance monitoring, and many monitoring tools such as Datadog, Dynatrace, and Splunk capture cloud, systems, application, and service metadata.
But what happens when incidents span multiple services and applications with different monitoring tools? In a recent BigPanda survey on the future of monitoring and AIOps, 42 percent of respondents were using over ten monitoring tools, and 19 percent were using more than twenty-five tools.
So when there’s an application performance or security issue, how many monitoring tools are needed to identify and resolve the problems? How many engineers must join the bridge call, and how long does it take to agree on remediations?
An AIOps platform that loads data from these monitoring tools can become the single pane of glass for everyone on the bridge call. But to succeed in reducing alert noise and compressing alerts into manageable incidents, the raw alerts should be cleansed and enriched with other topological and operational data sources. For example, enriching alerts across the topology provides a common way of reviewing the impact of infrastructure on an incident, whether that infrastructure runs on AWS, Azure, or a private cloud and operates on VMs or serverless.
That can help the NOC avoid a lost-in-translation moment when there are too many tagging flavors, and it helps teams resolve issues more collaboratively and efficiently.
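One way to picture the "tagging flavors" problem is a small Python sketch that maps tool-specific tag keys onto a shared vocabulary. The `TAG_SYNONYMS` table, the three sample alerts, and the choice of application, environment, and region as the shared keys are assumptions for illustration. The `app.kubernetes.io/part-of` and `topology.kubernetes.io/region` keys follow Kubernetes label conventions; the other keys are invented.

```python
# Hypothetical synonym table: tool-specific tag keys mapped to one shared key.
TAG_SYNONYMS = {
    "region": "region",
    "aws_region": "region",
    "location": "region",  # some tools call the region a location
    "topology.kubernetes.io/region": "region",
    "app.kubernetes.io/part-of": "application",
    "app": "application",
    "service": "application",
    "env": "environment",
    "environment": "environment",
}


def normalize_tags(tags: dict) -> dict:
    """Rename known tag keys to the shared vocabulary and lowercase values.
    Keys without a known synonym are dropped in this sketch."""
    normalized = {}
    for key, value in tags.items():
        common_key = TAG_SYNONYMS.get(key.lower())
        if common_key:
            normalized[common_key] = str(value).lower()
    return normalized


def correlation_key(tags: dict) -> tuple:
    """Build a key that alerts from different tools can be grouped on."""
    n = normalize_tags(tags)
    return (n.get("application"), n.get("environment"), n.get("region"))


# The same application as described by three imaginary monitoring tools.
tool_a = {"aws_region": "us-east-1", "service": "Checkout", "env": "prod"}
tool_b = {
    "topology.kubernetes.io/region": "us-east-1",
    "app.kubernetes.io/part-of": "checkout",
    "environment": "PROD",
}
tool_c = {"location": "us-east-1", "app": "checkout", "env": "Prod", "owner": "team-x"}

for tags in (tool_a, tool_b, tool_c):
    print(correlation_key(tags))
```

All three alerts end up with the same key, so they can be grouped into one incident instead of being read as three unrelated problems.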
3. Meeting the Business Service Level Objective
It’s not uncommon for enterprise NOCs to have multiple concurrent major incidents and many P2 incidents, so a key question is how to triage their priorities and where tier two and three incidents should be routed.
Remember the old days when SLAs were labeled platinum, gold, and silver and only had uptime expectations? Today, SREs define more contextual service level objectives (SLOs) that factor in who, when, where, and other business rules that align with business expectations. For example, the SLO for an ecommerce application should be higher during holiday shopping periods than during off-peak seasons.
The NOC can also use these contexts to triage and prioritize incidents, but this requires enriching AIOps with service level contexts. Loading in this data, mapping systems to business services, and translating SLOs to priorities are all enrichment steps that NOCs need in order to use AIOps to optimize incident response and meet business expectations.
In other words, when an incident maps to a business service with a high SLO, the AIOps platform triages the incident and prioritizes it for the NOC to resolve.
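As a rough sketch of the three enrichment steps (loading SLO data, mapping systems to business services, and translating SLOs to priorities), consider the Python below. Everything in it is an assumption for illustration: the system and service names, the SLO targets, the November to December peak window, and the thresholds that turn an SLO into a priority.

```python
from datetime import date

# Hypothetical mapping of systems to the business services they support.
SYSTEM_TO_SERVICE = {
    "pay-gw-01": "checkout",
    "reco-svc-03": "recommendations",
    "wiki-01": "internal-wiki",
}

# Hypothetical availability SLO targets (percent) per service and season.
SERVICE_SLO = {
    "checkout": {"peak": 99.99, "off_peak": 99.9},
    "recommendations": {"peak": 99.5, "off_peak": 99.5},
    "internal-wiki": {"peak": 99.0, "off_peak": 99.0},
}


def season(day: date) -> str:
    """Assumed peak window for this sketch: November and December."""
    return "peak" if day.month in (11, 12) else "off_peak"


def priority(system: str, day: date) -> str:
    """Translate the SLO of the affected business service into a priority.
    Systems that are not mapped to a service get the lowest priority."""
    service = SYSTEM_TO_SERVICE.get(system)
    slo = SERVICE_SLO.get(service, {}).get(season(day), 0.0)
    if slo >= 99.95:
        return "P1"
    if slo >= 99.5:
        return "P2"
    return "P3"


# The same payment gateway incident ranks differently by season.
print(priority("pay-gw-01", date(2024, 11, 29)))
print(priority("pay-gw-01", date(2024, 6, 1)))

# Concurrent incidents can then be ordered for the NOC.
open_incidents = ["wiki-01", "reco-svc-03", "pay-gw-01"]
queue = sorted(open_incidents, key=lambda s: priority(s, date(2024, 11, 29)))
print(queue)
```

The point of the design is that priority is derived from business context rather than set by whoever configured the original alert, so the same technical failure can rank higher during a holiday shopping period.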
How does Data Enrichment Work in AIOps?
So how does enrichment help IT resolve issues faster, improve the accuracy of root cause analysis, optimally triage incidents, and reduce alert fatigue?
Think of enrichment as a set of building-block data operations that work on data pulled from multiple sources. There are several enrichment building blocks: extraction pulls patterns of information out of tags, composition combines tags, and mapping looks up values in data mapping tables. Together they create tags that are used to correlate interrelated alerts into a smaller number of context-rich incidents.
The cleansed and enriched data helps IT Ops and NOCs reduce the complexities of many architectures, monitoring tools, and service level objectives to a common taxonomy. A common AIOps vernacular allows teams to collaborate, triage, resolve, pinpoint causes, and define automation rules faster and more accurately.
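The three building blocks can be sketched as small Python functions. This is a simplified illustration, not any platform's actual enrichment API: the function names, the hostname convention, and the `APP_OWNERS` mapping table are all hypothetical.

```python
import re

# Hypothetical mapping table, such as one exported from a CMDB.
APP_OWNERS = {"billing": "payments-team", "search": "discovery-team"}


def extract(alert: dict, source: str, pattern: str, target: str) -> dict:
    """Extraction: pull a pattern out of an existing tag with a regex."""
    match = re.search(pattern, alert.get(source, ""))
    if match:
        alert[target] = match.group(1)
    return alert


def compose(alert: dict, template: str, target: str) -> dict:
    """Composition: combine existing tags into a new tag."""
    try:
        alert[target] = template.format(**alert)
    except KeyError:
        pass  # a tag named in the template is missing, so skip this step
    return alert


def map_value(alert: dict, source: str, table: dict, target: str) -> dict:
    """Mapping: look up a tag value in a mapping table."""
    value = table.get(alert.get(source))
    if value is not None:
        alert[target] = value
    return alert


# A raw alert whose hostname follows an assumed <env>-<app>-<role>-<nn> pattern.
alert = {"host": "prod-billing-db-07.example.internal"}
extract(alert, "host", r"^(\w+)-", "env")
extract(alert, "host", r"^\w+-(\w+)-", "app")
compose(alert, "{env}/{app}", "service_id")
map_value(alert, "app", APP_OWNERS, "owner")
print(alert)
```

Alerts that share the resulting `service_id` or `owner` tags can then be correlated into one incident, even when the raw alerts came from tools that had nothing in common but a hostname.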
This post is brought to you by BigPanda.
The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.