Is AIOps the answer to DevOps teams' ops prayers?

DevOps teams fight a two-front battle to keep enterprise and customer-facing applications, databases, APIs, and data integrations stable, secure, and performing optimally.

On the one hand, there is all the proactive work that DevOps teams want to, and need to, spend their time on, including automating CI/CD pipelines, configuring new infrastructure as code, patching environments, and addressing security vulnerabilities.

On the other hand, there’s the day-to-day work of responding to outages, disruptions, and incidents, performing root cause analysis, and reviewing operational KPIs and analytics.

Can AIOps, the ability for DevOps, IT Ops and NOC teams to leverage AI and machine learning in IT operations, help DevOps teams shift more of their time from incident response to more strategic work?

How AIOps can augment DevOps teams 


Ask any DevOps team, and they will disclose their struggles to reduce time spent on operational activities so that they can automate more and take on other projects. More than just the time, it’s the distraction and the damage that incidents cause to the team’s reputation.

And when there are critical outages, disruptions, and incidents, it’s not just DevOps teams that are impacted. Customers, users, suppliers, and peers are all disrupted while various teams scramble to address issues, communicate status, and, in the process, push back strategic, time-sensitive work.

What DevOps teams want is clear visibility into the health and performance of their applications. When something breaks or degrades, they need to know right away where to focus, what needs to be done to resolve the issue, and, if necessary, which experts to escalate it to.

In other words, DevOps teams need three key capabilities in order to improve system reliability, reduce their own effort, and manage today’s highly complex multi-cloud IT environments:

  • Integration with all the existing monitoring tools that capture data on the health, performance, and availability of applications and the infrastructure layer (hybrid/multi-cloud, networks, databases, data integrations, and security).
  • Intelligence, powered by machine learning, that autonomously cross-correlates that information into a manageable number of incidents and provides actionable insight into root cause.
  • Integration with collaboration and workflow tools, so that issues can be routed to the appropriate teams and managed inside the tools those teams already use, such as Slack, Jira, and ServiceNow.

BigPanda is an AIOps tool that demonstrates these capabilities. Rather than trying to boil the ocean, BigPanda focuses on the layer that sits between the monitoring tools and the collaboration tools, which it calls autonomous operations. In other words, it doesn’t replace your existing monitoring tools. It aggregates, normalizes, enriches, and correlates the data collected from them. Instead of leaving you to manage dozens to hundreds of alerts around a single incident, it correlates and sequences the monitoring information and alerts into a single, manageable incident. This consolidation removes a lot of the noise around an incident and makes it easier to visualize the sequence of events and diagnose the root cause. It’s important to note that BigPanda is not another workflow tool. It connects to existing enterprise tools and acts as the hub for two-way communication around the incident until it is resolved.
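To make the idea of correlating many alerts into one incident more concrete, here is a minimal sketch in Python. It is not BigPanda’s algorithm, which this post does not describe. It swaps in a much simpler rule: alerts for the same service that arrive within a few minutes of each other are grouped together. The `Alert` and `Incident` classes, the `correlate` function, the five-minute window, and the sample alerts are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # monitoring tool that raised the alert (illustrative)
    service: str      # normalized name of the affected service
    timestamp: float  # seconds since some reference point
    message: str


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window_seconds=300.0):
    """Group alerts into incidents using a simple time-window rule.

    An alert joins the most recent incident for its service if it arrives
    within window_seconds of that incident's latest alert; otherwise it
    starts a new incident. This is a toy rule, not a vendor's method.
    """
    incidents = []
    latest_by_service = {}  # service name -> most recent Incident
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        current = latest_by_service.get(alert.service)
        if current and alert.timestamp - current.alerts[-1].timestamp <= window_seconds:
            current.alerts.append(alert)
        else:
            current = Incident(service=alert.service, alerts=[alert])
            incidents.append(current)
            latest_by_service[alert.service] = current
    return incidents


# Hypothetical sample data: five alerts from three tools.
sample_alerts = [
    Alert("Nagios", "checkout", 0.0, "CPU usage high"),
    Alert("New Relic", "checkout", 60.0, "Response time degraded"),
    Alert("Splunk", "checkout", 120.0, "Error rate spike in logs"),
    Alert("Nagios", "search", 130.0, "Disk almost full"),
    Alert("Nagios", "checkout", 1000.0, "CPU usage high"),
]

incidents = correlate(sample_alerts)
for incident in incidents:
    print(incident.service, len(incident.alerts))
```

With this sample data, the first three checkout alerts collapse into one incident, while the search alert and the much later checkout alert each become their own. A real platform would also normalize and enrich the alerts and correlate across services, which this sketch leaves out.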

The benefits of BigPanda Autonomous Operations Platform


I am certain you have been paged multiple times in your career into a crowded war room to address a critical issue. Representatives from the DevOps team, L3 engineers overseeing networks, systems, clouds, and the application, and others jump on a call or actually sit in a room and look at different monitors and dashboards to diagnose and resolve the issue. None of the information is correlated, so deciding whether a storage, network, user activity, or other issue is the root cause requires everyone to perform a highly coordinated, painful, manual diagnosis.

The complexity brings on many issues. 

How many minutes or hours does it take to identify a cause? How many wrong turns does the team take in an attempt to resolve the issue? How many customers and end users are impacted? What’s the total business impact on revenue and reputation? What about the operational disruption?

For DevOps teams, how many projects and how many releases get delayed by the aggregate time dedicated to resolving complex incidents?

Now picture an environment managed by BigPanda’s autonomous operations platform. Alerts and data from different monitoring tools such as Nagios, New Relic, AppDynamics, and Splunk are aggregated and correlated into a single incident. The incident is identified as an application issue based on exceptions and errors found in one of the application logs. The incident is routed through Jira to the application team responsible for the microservice that’s logging the errors. The DevOps engineer on this team recognizes that the microservice requires additional resources and resolves the issue.

The rest of the DevOps team has visibility into the issue but is only pulled off-task for issues in their own domains.
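The routing step in this scenario can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `OWNERS` map, the queue names, and the `route` and `build_ticket` functions are hypothetical, and the code builds a plain dictionary rather than calling Jira or any other real tool’s API.

```python
# Hypothetical ownership map: service name -> owning team's ticket queue.
OWNERS = {
    "checkout": "PAYMENTS",
    "search": "DISCOVERY",
}

# Hypothetical fallback queue for services nobody has claimed.
DEFAULT_QUEUE = "OPS-TRIAGE"


def route(service, owners=OWNERS, default=DEFAULT_QUEUE):
    """Return the queue of the team that owns the service, or the fallback."""
    return owners.get(service, default)


def build_ticket(service, alert_messages):
    """Build a ticket payload for one incident.

    The payload is a plain dict; sending it to a workflow tool is left out
    because that depends on the tool and how it is configured.
    """
    return {
        "queue": route(service),
        "summary": f"[{service}] {len(alert_messages)} correlated alerts",
        "description": "\n".join(alert_messages),
    }


ticket = build_ticket(
    "checkout",
    ["CPU usage high", "Response time degraded", "Error rate spike in logs"],
)
print(ticket["queue"], "-", ticket["summary"])
```

The point of the lookup is that only the owning team gets a ticket in its queue, while anything without a known owner lands in a triage queue instead of paging everyone.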

Growing IT complexity requires autonomous operations


DevOps organizations tasked with managing applications in multiple clouds, at higher service levels, with growing numbers of integration points and higher volumes of data, need a smarter, faster approach to managing incidents. Using machine learning to correlate, investigate, and route incidents can help DevOps teams resolve issues faster and free up time to work on strategic objectives.

This post is brought to you by BigPanda.io

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of BigPanda.io.
