Improving Customer Experience by Automating Incident Response

The tools and practices of IT Operations have to get better and easier.

Every IT Ops engineer has the job of responding to alerts when a website is down or unresponsive. To restore service, the engineer follows a certain procedure to restart the web server and validate that the website is operational. Maybe it happens again a few days later and another engineer repeats the procedure to restore service. If it happens yet again, a proactive engineer hopefully takes the initiative to develop a simple script that automates this procedure.
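To make that "simple script" concrete, here is a minimal sketch of what it might look like. It assumes a Linux host where the web server runs as a systemd service named nginx; the service name, the retry counts, and the function names are illustrative assumptions, not a description of any particular team's runbook.

```python
import subprocess
import time
import urllib.request


def site_is_up(url, timeout=5.0):
    """Return True if the URL answers with a 2xx or 3xx HTTP status."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 400
    except OSError:
        # URLError and HTTPError are OSError subclasses: timeouts, refused
        # connections, and 4xx/5xx responses all count as "down".
        return False


def restart_web_server(service="nginx"):
    """Restart the web server. The 'nginx' systemd unit is an assumption."""
    subprocess.run(["systemctl", "restart", service], check=True)


def restore_service(check, restart, retries=3, wait_seconds=5.0, sleep=time.sleep):
    """Restart and re-validate until the check passes or retries run out."""
    if check():
        return True  # nothing to do, the site is already healthy
    for _ in range(retries):
        restart()
        sleep(wait_seconds)  # give the server time to come back
        if check():
            return True
    return False  # still down: escalate to a human
```

An engineer could wire this up as `restore_service(lambda: site_is_up("https://www.example.com/"), restart_web_server)`, where the URL is a placeholder. Passing the check and restart steps in as functions keeps the procedure easy to test without touching a real server.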

Today, monitoring and automating responses is both more important and more complex.

Businesses expect SaaS-competitive performance levels from their applications, especially ones that are directly used by customers. Customers not only expect applications to be available, but also want fast, secure, and consistent performance. When there is an issue, customers and leaders expect IT to resolve it in seconds or minutes, not hours.

In addition, many of the underlying application architectures are more complex, with applications calling on more services, connecting to more databases, integrating with more data sources, and leveraging more third-party APIs and web components. Managing incident response on these applications is often slow and error-prone because of the number of subsystems that need to be reviewed, the number of tools being used to capture operational data, and the complexity of the procedures needed to restore service.

In larger organizations, managing multiple applications, web services, and databases in the cloud and on premises can get expensive because the volume of incidents is high (and growing every day) and the amount of manual work associated with managing them is constantly increasing. Making matters more complicated is the disparity of tools used in incident response by IT operations, and now increasingly digital operations, to monitor, respond, recover, and perform root cause analysis.

Managing the increased demand and complexity in incident response

Adding more people to the incident response team is becoming a less tenable option for many IT organizations that are being asked to implement and manage more applications as part of their digital transformation programs, but with only marginal increases in the IT operations budgets. 

CIOs need to be looking at new options to manage the growing complexity and to lead the transformation of their IT operations into digital operations.

CIOs can do more with less by looking for tools that enable digital operations by:

  • Enabling the aggregation of data and analytics from multiple IT operational tools into a single management system.
  • Leveraging open box machine learning tools to process operational data and help identify the systems that are the root cause of an application failure.
  • Automating the response to an increasing number and variety of incidents, to improve customer experience.
  • Measuring the improvements of key performance indicators such as MTTD (mean time to discovery), MTTA (mean time to acknowledge) and MTTR (mean time to repair). 
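As a rough illustration of how these indicators can be measured, the sketch below computes them from incident timestamps. Definitions vary between organizations; this version assumes MTTD runs from the start of the failure to its detection, MTTA from detection to acknowledgement, and MTTR from the start of the failure to resolution. The `Incident` fields are hypothetical names, not the schema of any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class Incident:
    started_at: datetime       # when the failure actually began
    detected_at: datetime      # when monitoring raised the first alert
    acknowledged_at: datetime  # when a responder picked it up
    resolved_at: datetime      # when service was restored


def _mean_minutes(deltas):
    """Average a list of timedeltas and express the result in minutes."""
    return mean([d.total_seconds() for d in deltas]) / 60


def incident_kpis(incidents):
    """Return MTTD, MTTA, and MTTR in minutes under the assumptions above."""
    return {
        "MTTD": _mean_minutes([i.detected_at - i.started_at for i in incidents]),
        "MTTA": _mean_minutes([i.acknowledged_at - i.detected_at for i in incidents]),
        "MTTR": _mean_minutes([i.resolved_at - i.started_at for i in incidents]),
    }
```

Tracking these numbers before and after an automation project is one way to show whether the investment is actually paying off.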

The complexity behind a single user journey

Let’s look at the diagrammed example of one user journey that goes across three different Node.js applications, leverages five different microservices deployed as AWS Lambda functions, and performs transactions with two RDS databases, all deployed to AWS. These databases are also connected by three data pipeline services that are used to send updated data from enterprise systems hosted in a data center. The Node.js applications also connect to two external APIs and embed two other JavaScript widgets.

All in, there are twenty different systems that make up this user’s journey and that need to be monitored for incidents. But that doesn’t tell the full story. As shown, there are thirty-five different connections being made, and three of them go across a VPN between the public cloud’s VPC and the data center. All of these services are being monitored by a myriad of different tools such as AWS CloudWatch and DataDog on AWS, and SiteScope and Splunk in the data center. In addition, there are two different teams with operational responsibilities: the data center team uses ServiceNow, while the cloud DevOps team uses Jira Ops.

When there is an incident, the service that sends out the alert is not always the one that’s the root cause of the problem. Let’s say Service 5 is running a slow query on the DB2 database that’s impacting the performance of a handful of queries running through Service 4. None of these queries is slow enough on its own to trip an alert, but in aggregate they slow down App 3 significantly, and it begins to send out alerts.
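The arithmetic behind this blind spot is easy to sketch. The thresholds and latencies below are made-up numbers chosen only to illustrate the point: every query stays under its own alert threshold, yet the total time the application spends waiting crosses the application-level threshold.

```python
# Hypothetical thresholds, chosen only for illustration.
PER_QUERY_ALERT_MS = 500   # a single query alerts above this latency
APP_ALERT_MS = 2000        # the app alerts when a request waits longer than this


def slow_queries(latencies_ms, threshold_ms=PER_QUERY_ALERT_MS):
    """Names of queries that would trip their own alert."""
    return [name for name, ms in latencies_ms.items() if ms > threshold_ms]


def app_is_slow(latencies_ms, threshold_ms=APP_ALERT_MS):
    """True when the combined query time pushes the app past its threshold."""
    return sum(latencies_ms.values()) > threshold_ms
```

With six queries at 400 ms each, `slow_queries` returns an empty list while `app_is_slow` returns `True`: the database monitors stay quiet and only the application raises its hand.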

Without automation, the person in IT Ops responding to this alert needs to check several monitors across CloudWatch and DataDog to investigate the slowness in App 3. She may find the slow queries but will have a hard time pinpointing which service and query started it all. 

What she can’t easily see is that an ETL from Data Pipeline 3 kicked off just before these queries began to slow down. She will miss this entirely because it’s outside her area of responsibility, and the data center team won’t notice the problem because, from their perspective, the data pipeline is running normally.

Meanwhile, customers are suffering. How long will it be until this mess is sorted out and performance is restored?

Leveraging AI and automation in incident response

Now let’s look at this same scenario when there is some automation and open box machine learning in place through an autonomous digital operations platform like BigPanda.

With such a platform, alerts from CloudWatch, DataDog, SiteScope, and Splunk are aggregated and then correlated, in real time, into discrete incidents. This means that for App 3, all the alerts from the underlying services, databases (including the ones in the data center), and data pipelines are correlated into a single incident. When alerts from App 3 are triggered, the open box machine learning algorithm determines that Service 5’s query was the first performance issue and that it has a dependent data pipeline that is running. The automation then opens tickets in ServiceNow and Jira Ops with these details to help the digital operations team coordinate, review, and resolve the issue.
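To give a feel for what correlation means here, the sketch below groups alerts by a dependency map and a time window, then treats the earliest alert as the probable origin. This is a deliberately simple heuristic for illustration, not BigPanda's algorithm; the `Alert` fields, the `DEPENDS_ON` map, and the 15-minute window are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Alert:
    source: str        # monitoring tool that raised it, e.g. "CloudWatch"
    system: str        # system the alert is about, e.g. "App 3"
    raised_at: datetime


# Hypothetical dependency map for the part of the journey discussed above.
DEPENDS_ON = {
    "App 3": ["Service 4", "Service 5"],
    "Service 4": ["DB2"],
    "Service 5": ["DB2"],
    "DB2": ["Data Pipeline 3"],
}


def dependency_closure(system, graph):
    """The system itself plus everything it depends on, directly or not."""
    seen, stack = set(), [system]
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(graph.get(current, []))
    return seen


def correlate(trigger, alerts, graph, window=timedelta(minutes=15)):
    """Collect alerts on the trigger's dependencies raised shortly before it.

    Returns the alerts sorted oldest first, so the first entry is the
    probable origin of the incident.
    """
    related = dependency_closure(trigger.system, graph)
    earliest = trigger.raised_at - window
    incident = [
        a for a in alerts
        if a.system in related and earliest <= a.raised_at <= trigger.raised_at
    ]
    return sorted(incident, key=lambda a: a.raised_at)
```

Given alerts for Service 5, Service 4, and App 3 in that order, `correlate` returns one incident with Service 5 first, and `dependency_closure("Service 5", DEPENDS_ON)` shows that Data Pipeline 3 sits upstream of it. That is the kind of detail worth attaching to the tickets opened for both teams.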

Over time, you can expect the automation and AI to improve. For example, after open box machine learning correlates alerts into problematic incidents, the automation could trigger scripts to resolve the issue. 

But as this example illustrates, improving MTTD and MTTR is not trivial when user experience is increasingly tied to many different microservices, databases, services, and integrations. The digital operations team needs to find IT Ops tools that make it easy to integrate with a diverse set of IT systems and monitoring tools, correlate data from all of these sources into actionable incidents, and automate various aspects of incident response. Such tools can help maximize the uptime and performance of customer-facing applications and services.

This post is brought to you by

The views and opinions expressed herein are those of the author and do not necessarily represent the views and opinions of
