IT leaders have been managing a tremendous amount of change over the
past several years and significantly more so over this last year because of
COVID. On the one hand are all the business-driven changes for improving
customer experiences, enabling machine learning capabilities, and improving
workflow efficiencies with automation. On the other hand, IT is trying to
accelerate its skills, processes, and culture to support cloud migrations,
DevOps automations, SRE functions, and AIOps to improve system reliability and
performance.
There was a lot to learn from BigPanda’s Resolve ‘21 and Pandapalooza event,
and my first post on it covered 3 AIOps secrets that boost quick business
impacts. This post shares where to find quick wins in automation, incident
management, and growing business stakeholder involvement in IT operations.
These leaders also shared many lessons as they’ve adjusted to digital speeds
and leveraged AIOps. Here are some takeaways.
1. Technology is Changing the IT Operating Model
Sean Mack, CIO/CISO at Wiley Publishing, kicks off his session by acknowledging
that some of the past’s tried-and-true IT operating principles require
reinvention. He states in his opening remarks, “Technology continues to
evolve, and as leaders, we must too. If we don’t continue to evolve as
leaders, we’re sure to stifle the progress of our teams and our businesses.”
He shares examples such as how the shift from unique infrastructures to
ephemeral and disposable cloud environments changes how IT manages and
monitors environments.
Most importantly, technology is now a core business capability, and Sean
states that business and technology are inseparable. Understanding how to
deliver small, incremental capability updates and become more customer-focused
are table stakes for today’s IT organizations.
2. Digital Requires Driving Fast; Technical Debt is the Friction
Nag Vaidyanathan is CTO of OneMain Financial, America’s largest personal
installment loan company, with fifteen-hundred branches, six contact centers,
and ninety-five hundred employees. Nag acknowledged that “Many people think
that because we are an old company, it is very hard for us to make
changes.”
But he goes on to share many examples of how the lender needed to adapt its
loan origination, call routing, pricing, and other practices to adjust to
customer needs during COVID. He confesses, “When I reflect back, it looks
like, how in the world did we do all these things?”
Part of their success included automating CI/CD pipelines, building
loosely-coupled business services, and migrating to cloud-native datastores.
But Nag acknowledges, “You never realize the impact technical debt can have
when you need to accelerate.”
All race cars have to pit to fill the tank, change tires, and check the engine.
Driving a fast digital engine requires IT leaders to make smart
prioritizations on what areas of technical debt create the most friction and
how IT should address them proactively.
3. Agile Practices and Culture Enable Digital’s Velocity
IT leaders recognize that agile practices and culture must extend beyond
application development teams into business functions and IT operations.
Scott Johnson, SVP of Infrastructure as a Service at Equifax, shared many
insights during his panel on becoming an analytics company. “We’ve moved to a
full agile-driven engineering and operations organization and embraced a
product mindset for our products that we deliver to the organization, such as
our certified pipeline.”
Scott’s colleague, Dan Grace, Global Technology Operations Leader at Equifax,
acknowledged the transformation’s scope. “It’s a huge culture shift going from
waterfall to an agile mindset at a one-hundred-year-old company with the
people, technology, and the partners.”
Sean Mack also disclosed how Wiley realigned to an agile organization. “We
moved from rival teams to collaboration and teams of teams. The
cross-functional delivery team includes developers and QA, but also SREs, and
database reliability engineers.”
Part of the realignment requires elevating how people in IT understand
customers and products. Sean recommends that “People can be deeply skilled,
but need a broad sense of the context of their work around the product and
customer.”
4. The Impact of Speed and Always-On IT Operations
It’s probably time to retire IT Ops terms and practices like scheduled
downtime, blackout periods, and manual failovers. Even if digital
transformations didn’t change the IT operating model, COVID has surely
underscored how important reliable, secure, and high-performance IT systems
are to business operations.
Dan Grace from Equifax shared some of the changes and impacts. They pushed the
gas pedal and decreased MTTR from the hour-long recoveries accepted a decade
ago to minutes. It required getting more people certified on their public
cloud technologies and shifting everyone’s mindset to treat environments as
always on.
Dan states a clear objective, “We have to drive automation into everything
that we do.”
5. Seek Single-Pane Tools to Tame Hybrid Cloud’s Complexities
Hybrid and multicloud may sound sexy, but they add significant complexity to
IT operations. Tools optimized for one cloud technology may offer productivity
and innovation, but the aggregate burden of supporting multiple cloud-specific
technologies can become a nightmare for IT operations.
The complexities are most pronounced with monitoring tools. It’s where AIOps
can significantly impact ITSM teams that must respond to and resolve a growing
variety of incidents faster and more reliably.
Scott Johnson of Equifax shared the realities of operating the hybrid cloud.
“Running an always-on cloud-native paradigm as well as running on-prem is an
extremely tough environment to be in. Troubleshooting, event correlation, did
a change something you did on the on-prem side blow up something in the cloud?
Being able to manage in that hybrid state is tough.”
Organizations may have different cloud strategies, but one commonality is the
growing number of monitoring tools used to capture data and alert on problems.
AIOps with open-box machine learning capabilities helps IT correlate alerts
into manageable incidents.
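To make "correlating alerts into incidents" a little more concrete, here is a minimal sketch of the general idea. It is not BigPanda's algorithm or any vendor's API; it assumes a deliberately simple rule in which alerts for the same service that arrive within a fixed time window of each other belong to one incident. The `Alert` and `Incident` types, the `correlate` function, and the 300-second window are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    source: str       # which monitoring tool raised the alert (hypothetical)
    service: str      # the affected service
    timestamp: float  # seconds since some reference point


@dataclass
class Incident:
    service: str
    alerts: list = field(default_factory=list)


def correlate(alerts, window=300.0):
    """Group alerts into incidents with a simple rule (an assumption, not a
    real product's logic): an alert joins the latest incident for its service
    if it arrives within `window` seconds of that incident's last alert;
    otherwise it opens a new incident."""
    incidents = []
    latest_by_service = {}
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        incident = latest_by_service.get(alert.service)
        if incident is None or alert.timestamp - incident.alerts[-1].timestamp > window:
            incident = Incident(alert.service)
            incidents.append(incident)
            latest_by_service[alert.service] = incident
        incident.alerts.append(alert)
    return incidents


if __name__ == "__main__":
    sample = [
        Alert("tool-a", "checkout", 0.0),
        Alert("tool-b", "checkout", 120.0),
        Alert("tool-a", "search", 130.0),
        Alert("tool-c", "checkout", 1000.0),
    ]
    for incident in correlate(sample):
        print(incident.service, len(incident.alerts))
```

Even this toy rule shows why correlation matters: alerts from several monitoring tools collapse into fewer items for a responder to triage. Real AIOps platforms use far richer signals, such as topology and change data, than a single time window.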
6. The Importance of Emphasizing a Blameless Culture
I’m going to call a spade a spade. We’ve all seen how IT operations get all
the punches thrown at them when there’s an outage, when resolutions take too
long, or when communications during a major incident’s bridge call miss
expectations.
Of all the principles tied to DevOps cultures and SRE practices, many
presenters at BigPanda’s Pandapalooza emphasized the importance of
spearheading a blameless culture. Not only does it promote more positive
behaviors in IT, but it’s doubly important to encourage this behavior with
business stakeholders and leaders.
Sean Mack of Wiley explained the importance of a blameless culture. “The
focus is on learning and less about preventing mistakes.”
Learnings help IT prioritize fast and longer-term remediations, while
behaviors aimed at preventing mistakes are overly defensive given today’s
importance of system reliability and performance.
7. Simplify to Fewer and Straightforward ITSM KPIs
I know too many CIOs who work hard defining KPIs that are meaningful for each
functional area and discipline. It’s a tall order instrumenting all the
metrics and processes, scheduling time to review them, and ensuring that
priorities target meaningful improvements.
Sometimes, less is more, and easier can be more meaningful. Nag Vaidyanathan
takes this approach and applies three straightforward ITSM KPIs to measure his
organization’s operational performance: system availability, mean time to
recovery (MTTR), and change success rate.
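For readers who want to see how little arithmetic these three KPIs need, here is a minimal sketch using their common definitions. It is not how OneMain computes them. The `Incident` type, the function names, and the sample numbers are hypothetical, and the availability calculation assumes incidents do not overlap.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime
    resolved: datetime


def total_downtime(incidents):
    # Sum of outage durations; assumes incidents do not overlap in time.
    return sum((i.resolved - i.started for i in incidents), timedelta())


def availability(period, incidents):
    """Fraction of the reporting period during which the system was up."""
    return 1 - total_downtime(incidents) / period


def mttr(incidents):
    """Mean time to recovery: average duration from start to resolution."""
    if not incidents:
        return timedelta()
    return total_downtime(incidents) / len(incidents)


def change_success_rate(total_changes, failed_changes):
    """Share of changes deployed without causing a failure or rollback."""
    return (total_changes - failed_changes) / total_changes


if __name__ == "__main__":
    start = datetime(2021, 3, 1)
    incidents = [
        Incident(start, start + timedelta(minutes=30)),
        Incident(start + timedelta(days=1), start + timedelta(days=1, minutes=90)),
    ]
    print(f"Availability: {availability(timedelta(days=30), incidents):.4%}")
    print(f"MTTR: {mttr(incidents)}")
    print(f"Change success rate: {change_success_rate(200, 10):.1%}")
```

With the sample data, two outages totaling two hours in a 30-day month work out to roughly 99.72% availability and a one-hour MTTR, and 10 failed changes out of 200 give a 95% change success rate.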
If only driving digital were like driving a car with a few simple dials and AI
to handle complexities. We’re not there yet in IT, but these progressive
leaders are heading in the right direction.
This post is brought to you by BigPanda.
The views and opinions expressed herein are those of the author and do not
necessarily represent the views and opinions of BigPanda.