How BigPanda’s “Open Box Machine Learning” enables Autonomous Digital Operations

Are you tired of all the bridge calls, all-hands-on-deck meetings, and operational “war rooms” it takes to manage and resolve the latest outage that’s impacting customers and business users?

Are you frustrated by the endless list of tickets that different monitoring tools, such as Nagios, CloudWatch, Splunk, and Datadog, generate for the same incident?

As applications are now hyper-connected to APIs, microservices, and multiple data sources, do you need better tools to diagnose the root cause of issues faster and more easily?

These are some of the frustrations facing IT Operations and NOC teams. They must respond to incidents from a growing number of production applications newly launched through digital transformation programs, on top of the legacy applications that remain business critical. At the same time, businesses are demanding more uptime and higher performance because technology has become strategically critical to more of them.

I had the opportunity to demo BigPanda, an enterprise-grade platform designed to help digital operations teams solve the unique challenges of large-scale, multi-cloud, high-reliability operational environments. Specifically, BigPanda does the following really well:

Aggregates alerts from multiple tools into a single incident that can be tracked and managed holistically
Correlates alerts and reduces IT noise so that IT Operations and NOC teams can quickly focus on real incidents
Enables sharing and tracking of incidents with multiple teams and collaboration tools to help resolve issues faster
Centralizes key performance indicators and enables different teams to view relevant dashboards
Provides advanced query capabilities to aid in finding patterns and identifying root causes

The combination of these capabilities enables IT Operations to reduce KPIs such as mean time to repair (MTTR) while supporting a larger number of applications.

How BigPanda’s open box machine learning reduces complexity

IT Operations often has several monitoring tools, each designed to monitor a different set of systems, networks, storage devices, applications, security concerns, and end-user experience issues. When there is an issue, it’s very common to see a cascade of impacted systems, applications, and end users. Each monitoring tool may send out its own alerts, often notifying different people.

For example, a single network issue may degrade the performance of a database, which then slows down several dependent microservices and potentially dozens of applications. If these monitors feed incident management tools directly, dozens if not hundreds of tickets may be generated. Worse, if different alerts go to different people, it may take a long time for them to get on a bridge call, review all the alerts, decide which is the likely root cause, and finally take steps to remediate the issue. This is also terribly inefficient: it often requires bringing in all the experts to review the data and diagnose the issue, and those experts can’t focus on their “day jobs” while the issue is pending.

BigPanda aims to solve this with machine learning. Right after being deployed, BigPanda proposes a set of correlation patterns by analyzing alert data from the different monitoring tools, enriched with metadata, tags, and other data that may come from the enterprise CMDB. These correlation patterns are the basis for how BigPanda correlates data from multiple monitoring systems.

What makes BigPanda powerful is that the machine learning suggests the correlation patterns, provides evidence backing these suggestions, and then lets the operator adjust the patterns. This “open box machine learning” approach means that people can improve the machine learning results based on subject matter expertise and business context. For example, correlation patterns can be defined separately for development versus production, for departmental versus enterprise applications, and by geography.

When there is an issue, these correlation patterns help group multiple alerts into a single, manageable incident. Operators managing incident response can open one of these incidents and see the list of underlying alerts. BigPanda uses the aggregate of these alerts to determine the overall severity of the incident. Using a time sequence visualization, operators can often see which systems sent the initial alerts, how long each alert lasted, and how often it repeated. These factors help the operator pinpoint a likely root cause much faster. And because of this correlation, instead of ending up with tens or hundreds of tickets, IT Operations has to deal with just a single ticket for each correlated incident.
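To make the grouping idea concrete, here is a minimal sketch in Python. It is not BigPanda's algorithm or API. It assumes a correlation pattern is simply a set of tag names plus a time window, and that incident severity is the highest severity among its alerts. The `Alert`, `Incident`, and `correlate` names, the tag keys, and the sample data are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical severity scale, used only to rank alerts in this sketch.
SEVERITY_RANK = {"ok": 0, "warning": 1, "critical": 2}


@dataclass
class Alert:
    source: str      # monitoring tool that raised the alert
    host: str
    tags: dict       # enrichment data, e.g. pulled from a CMDB
    severity: str
    timestamp: int   # seconds since some reference point


@dataclass
class Incident:
    key: tuple
    alerts: list = field(default_factory=list)

    @property
    def severity(self):
        # Assumption: the incident is as severe as its worst alert.
        return max((a.severity for a in self.alerts),
                   key=SEVERITY_RANK.__getitem__)

    @property
    def first_alert(self):
        # The earliest alert is often a good hint at the root cause.
        return min(self.alerts, key=lambda a: a.timestamp)


def correlate(alerts, pattern_tags, window_seconds):
    """Group alerts that share the same values for pattern_tags and
    arrive within window_seconds of the previous alert in the group."""
    latest = {}      # tag values -> most recent incident for those values
    incidents = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        key = tuple(alert.tags.get(t) for t in pattern_tags)
        incident = latest.get(key)
        if (incident is None
                or alert.timestamp - incident.alerts[-1].timestamp > window_seconds):
            incident = Incident(key=key)
            latest[key] = incident
            incidents.append(incident)
        incident.alerts.append(alert)
    return incidents


alerts = [
    Alert("nagios", "sw-01", {"app": "checkout", "env": "prod"}, "critical", 100),
    Alert("datadog", "db-01", {"app": "checkout", "env": "prod"}, "warning", 130),
    Alert("splunk", "api-07", {"app": "checkout", "env": "prod"}, "warning", 160),
    Alert("datadog", "web-02", {"app": "search", "env": "dev"}, "warning", 170),
]
incidents = correlate(alerts, pattern_tags=("app", "env"), window_seconds=300)
for incident in incidents:
    print(incident.key, len(incident.alerts), incident.severity,
          incident.first_alert.source)
```

In this toy data, the three production checkout alerts collapse into one incident and the development alert stays separate, which mirrors the idea of keeping distinct patterns for development and production.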

Replacing war rooms with devops collaboration

BigPanda takes incident management to the next level by enabling informed collaboration and automation. Once an incident has been captured, operators can use BigPanda’s sharing features, which support two-way integration with multiple ticketing systems.

For example, let’s say IT Operations uses ServiceNow for incident management, developers use Jira for their backlog and defect tracking, and business managers view escalated issues on Slack. BigPanda can share the incident with all three tools, and two-way updates keep everyone in sync as the issue is being worked on. Instead of forcing a bunch of people onto a call or into a war room, the appropriate people are informed, and the necessary people are called to action. In effect, it supports the tools that teams and enterprises already have, like, and use, instead of forcing them into a brand new tool.

The sharing capability also supports webhooks, so in addition to keeping people informed, scripted standard operating procedures can be triggered. This means that common problems can have automated responses, reducing the number of people and the time needed to resolve these issues.
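As a rough illustration of the webhook idea, the sketch below shows how a receiving service might map an incoming incident to a scripted procedure. The payload fields (`problem_type`, `host`, `service`), the runbook names, and the `handle_webhook` function are assumptions made for this example and do not reflect BigPanda's actual webhook format. The runbooks are stubs that only return a description of what a real script would do.

```python
import json

# Registry of scripted procedures, keyed by a hypothetical problem type.
RUNBOOKS = {}


def runbook(problem_type):
    """Decorator that registers a function as the response to a problem type."""
    def register(func):
        RUNBOOKS[problem_type] = func
        return func
    return register


@runbook("disk_full")
def clean_temp_files(incident):
    # Stub: a real runbook would run the team's cleanup script here.
    return f"cleaned temp files on {incident['host']}"


@runbook("service_down")
def restart_service(incident):
    # Stub: a real runbook would call the orchestration tooling here.
    return f"restarted {incident['service']} on {incident['host']}"


def handle_webhook(body):
    """Parse a JSON webhook body and run the matching runbook, if any."""
    incident = json.loads(body)
    action = RUNBOOKS.get(incident.get("problem_type"))
    if action is None:
        # Unknown problems still go to a person.
        return {"handled": False, "note": "no runbook; escalate to a person"}
    return {"handled": True, "note": action(incident)}


print(handle_webhook('{"problem_type": "disk_full", "host": "db-01"}'))
print(handle_webhook('{"problem_type": "cert_expired", "host": "web-02"}'))
```

Keeping the dispatch table explicit makes it easy to see which common problems are automated and which still need a person.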

The benefits of an operational data lake

BigPanda stores enriched, historical IT Operations data that’s organized by the correlation patterns and other dimensions. This stored data operates as a central IT Operations data lake and enables departmental, functional, and personal dashboards showing IT Operations key performance indicators (KPIs) and trends. It also offers a powerful query engine, enabling operators to diagnose the root causes of issues and identify long-term trends.
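To show the kind of KPI such historical data makes possible, here is a small sketch that computes mean time to repair (MTTR) grouped by any stored dimension. The record layout (`pattern`, `team`, `opened_at`, `resolved_at`, with times in minutes) and the `mttr_by` helper are hypothetical and stand in for whatever the real query engine exposes.

```python
from collections import defaultdict
from statistics import mean


def mttr_by(records, dimension):
    """Average resolution time of incidents, grouped by the given dimension."""
    durations = defaultdict(list)
    for record in records:
        durations[record[dimension]].append(
            record["resolved_at"] - record["opened_at"])
    return {value: mean(times) for value, times in durations.items()}


# Made-up historical incidents; times are minutes from an arbitrary start.
records = [
    {"pattern": "checkout/prod", "team": "payments", "opened_at": 0, "resolved_at": 30},
    {"pattern": "checkout/prod", "team": "payments", "opened_at": 100, "resolved_at": 150},
    {"pattern": "search/dev", "team": "search", "opened_at": 200, "resolved_at": 210},
]

print(mttr_by(records, "pattern"))
print(mttr_by(records, "team"))
```

The same function serves a departmental view (`team`) or a per-pattern view (`pattern`), which is the practical benefit of storing incidents along several dimensions.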

The combination of all these capabilities is central to transforming an IT Operations group into a Digital Operations group. Using BigPanda, digital operations becomes more data driven by looking at monitoring tools’ data in the aggregate, more responsive because it can diagnose issues faster, and more efficient by enabling the sharing of relevant information with the right people in real time. Enterprises operating large IT environments with demanding SLAs should look at BigPanda as a platform that drives performance and efficiency.

This post is brought to you by BigPanda.io

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.io.

About Isaac Sacolick