Haviv Rosh for Blink Ops

Posted on Nov 1, 2022 • Edited on Dec 2, 2022 • Originally published at blinkops.com

Operational Excellence in a Cloud-Native World: What is Operational Excellence?

#cloudops #devops #secops #finops

A decade ago, many of the responsibilities that now belong to DevOps belonged to IT and security departments. Managing data storage and resources, ensuring the availability and resilience of application services, troubleshooting, incident response and remediation, security, and admin operations were all traditionally handled by IT and security teams.

When software infrastructure moved to the cloud, the responsibilities of operations teams shifted from traditional IT operations to cloud-native operations. Organizations now adopt hundreds of different cloud platforms and services. In response, formerly centralized IT teams became decentralized DevOps teams. IT shifted left to meet the needs of developers.

Today’s Chaotic DevOps Landscape

Today, cloud operators like DevOps, SecOps, and FinOps find themselves crushed under their daily work.

Organizations rely on hundreds of different cloud platforms and services. Cloud infrastructure is expensive to operate and maintain. Costly inefficiencies like unused cloud resources rack up new charges every month. There’s too much manual, repetitive work, which leads to human error, and there aren’t enough skilled cloud engineers. DevOps engineers are hard to find and retain.

Security vulnerabilities go unnoticed, dependencies go unpatched, and unmanaged scripts expose sensitive data like passwords or credentials. Demoralized by a backlog of open service tickets, cloud operations teams are burned out. DevOps today is unsustainable.

So what does operational excellence even mean in this cloud-native world?

How do cloud operations teams rise above platform overload and achieve optimum efficiency?

In this series of blog posts, we’ll explore operational excellence through the perspectives of three different cloud operations centers: DevOps, SecOps, and FinOps.

Through each perspective, we’ll try to find the patterns and inefficiencies that cause friction for cloud operations teams today. We’ll also highlight common strategies for overcoming these challenges, by identifying opportunities where automation can improve the daily experience for both developers and cloud platform operators.

Defining “Operational Excellence” in a Cloud-Native World

No two organizations’ infrastructure stacks are the same, but you’d be surprised how similar their objectives are from a DevOps and broader operational perspective. They want to keep costs low, without wasting resources on operations processes or infrastructure. They also want to invest in technologies and workflows that maximize efficiency, employee empowerment, and future profit.

So what does it mean to be operationally excellent?

One of the most ubiquitous methodologies for evaluating cloud architecture and any related operations is the AWS Well-Architected Framework. In this framework, AWS defines six pillars intended to help “cloud architects build secure, high-performing, resilient, and efficient infrastructure.”

The AWS Well-Architected pillars are:

Operational excellence
Security
Reliability
Performance efficiency
Cost optimization
Sustainability

What should stick out immediately is that AWS lists operational excellence as its own pillar. AWS scopes this pillar to include “running and monitoring systems, and continually improving processes and procedures.” For example, they include tasks such as “automating changes, responding to events, and defining standards to manage daily operations.”

But on further inspection, don’t the other pillars also have to do with operational excellence? Would an organization be considered operationally excellent if it had strong processes, but its infrastructure was insecure, unreliable, inefficient, and unsustainable? Clearly not. AWS Well-Architected is a good starting point, but we should seek out a more comprehensive definition.

Another influential framework for evaluating DevOps team effectiveness comes from the DevOps Research and Assessment (DORA) team at Google, commonly referred to as DORA metrics. The DORA framework uses five key metrics. These include “operational performance,” which was added as a new metric in an update last year.

The five DORA metrics are:

Deployment frequency
How often does your team deploy code to production or ship new software?
Mean change lead time
When you commit new code, how long does it take for the code to make it to production?
Change failure rate
When you deploy code changes or hot fixes, what percentage of those changes cause a failure in production?
Mean time to recovery (MTTR)
When failure occurs that impacts customers, how long does it take on average to restore service?
Operational performance
How reliable is your platform? How resilient is it to fluctuating demands or unexpected occurrences?
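As a rough illustration, the first four metrics can be computed from a simple log of production deployments. The sketch below is a minimal example, not an official DORA calculation: the `Deployment` record shape, its field names, and the `dora_summary` function are hypothetical, and it assumes that recovery time is measured from the failed deployment to the moment service was restored.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Deployment:
    """One production deployment (hypothetical record shape)."""
    committed_at: datetime                   # when the change was committed
    deployed_at: datetime                    # when it reached production
    failed: bool = False                     # did it cause a production failure?
    restored_at: Optional[datetime] = None   # when service was restored, if it failed


def hours_between(start: datetime, end: datetime) -> float:
    return (end - start).total_seconds() / 3600


def dora_summary(deployments: list[Deployment], period_days: int) -> dict[str, float]:
    """Summarize four DORA-style metrics over a reporting period."""
    if not deployments or period_days <= 0:
        raise ValueError("need at least one deployment and a positive period")
    total = len(deployments)
    lead_times = [hours_between(d.committed_at, d.deployed_at) for d in deployments]
    failures = [d for d in deployments if d.failed]
    # Assumption: the failure starts at deploy time and ends at restored_at.
    recoveries = [
        hours_between(d.deployed_at, d.restored_at)
        for d in failures
        if d.restored_at is not None
    ]
    return {
        "deploys_per_day": total / period_days,
        "mean_lead_time_hours": sum(lead_times) / total,
        "change_failure_rate": len(failures) / total,
        "mttr_hours": sum(recoveries) / len(recoveries) if recoveries else 0.0,
    }


# Illustrative data only: four deployments over a two-day period.
sample = [
    Deployment(datetime(2022, 11, 1, 9, 0), datetime(2022, 11, 1, 9, 30)),
    Deployment(datetime(2022, 11, 1, 10, 0), datetime(2022, 11, 1, 11, 0)),
    Deployment(
        datetime(2022, 11, 1, 13, 0),
        datetime(2022, 11, 1, 13, 30),
        failed=True,
        restored_at=datetime(2022, 11, 1, 14, 15),
    ),
    Deployment(datetime(2022, 11, 2, 9, 0), datetime(2022, 11, 2, 11, 0)),
]
summary = dora_summary(sample, period_days=2)
print(summary)
```

Operational performance is left out on purpose. It depends on reliability targets such as availability and latency, which a deployment log alone cannot capture.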

Unlike the AWS Well-Architected framework, which concentrates on how you should build your infrastructure to ensure efficient and reliable performance, DORA instead focuses on the performance of your development and operations teams and processes.

The reality is that operational excellence concerns both how you architect your infrastructure and the effectiveness of your development and infrastructure operations teams, as well as your internal operations processes.

Putting the Ops Back into DevOps

It was timely, though maybe not too surprising, that the research team at Google chose to include operational performance as a new DORA metric last year. According to this year’s report, they evaluate operational performance based “on reliability, which is how well your services meet user expectations, such as availability and performance.” This fifth DORA metric was added “so that availability, latency, performance, and scalability would be more broadly represented” alongside the other four metrics.

Today, operational performance is even more important. In addition to economic pressures, businesses face rising cloud bills and their business teams are adopting more cloud tools than ever before. This creates operational complexity and maintenance challenges, thus decreasing operational reliability. Organizations need to take a holistic approach to their cloud operations and identify solutions that bridge insights and workflows across all their different cloud tools.

Achieving Operational Excellence in DevOps

Let’s take a moment to recap what we’ve learned. Paraphrasing AWS Well-Architected, it’s important that cloud-native teams are able to monitor, secure, and reliably operate their cloud infrastructure. Furthermore, it’s important for teams to be able to do so efficiently and effectively.

DORA takes these concepts and applies specific metrics to them. An operationally excellent team should be able to deploy code frequently and make changes rapidly. Failure should occur infrequently, and when it does occur, teams should be able to respond and recover quickly. Measuring these metrics gives teams objective indicators of how they stack up.

Here are some considerations when evaluating your own operational processes:

Speed

How quickly can you deploy new code or integrate with new services?
How long does it take to implement new workflows?
How quickly can your team respond to failure conditions (MTTR)?

Scale

Can your infrastructure meet the demands of your customers?
Can existing processes efficiently support rapid increases in demand?

Reliability

How frequently does failure occur?
How resilient are you to failure conditions?

For example, elite DevOps performers should be able to deploy new code on demand (multiple times per day). Changes should take under an hour to review and merge to production, and fewer than 15% of changes should cause a failure. When failures do occur, elite teams should be able to respond and restore service within an hour, even for the most complex scenarios.
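To make those thresholds concrete, here is a minimal sketch that checks a team's numbers against the elite targets described above. The `TeamMetrics` fields and the `elite_gaps` function are hypothetical names for illustration, and the cutoffs simply restate this post's figures (more than one deploy per day, lead time under an hour, failure rate under 15%, service restored within an hour) rather than an official DORA scoring rule.

```python
from dataclasses import dataclass


@dataclass
class TeamMetrics:
    """A team's measured performance (hypothetical field names)."""
    deploys_per_day: float
    lead_time_hours: float
    change_failure_rate: float   # fraction of changes that fail, 0.0 to 1.0
    restore_time_hours: float


def elite_gaps(m: TeamMetrics) -> list[str]:
    """Return the areas where a team falls short of the elite targets."""
    gaps = []
    if m.deploys_per_day <= 1:           # elite: multiple deploys per day
        gaps.append("deployment frequency")
    if m.lead_time_hours >= 1:           # elite: under an hour to production
        gaps.append("change lead time")
    if m.change_failure_rate >= 0.15:    # elite: under 15% of changes fail
        gaps.append("change failure rate")
    if m.restore_time_hours > 1:         # elite: restored within an hour
        gaps.append("time to restore")
    return gaps


# Illustrative numbers only.
print(elite_gaps(TeamMetrics(3.0, 0.5, 0.10, 0.75)))
print(elite_gaps(TeamMetrics(0.2, 0.5, 0.30, 0.5)))
```

An empty list means the team meets every target. Otherwise, the list names the areas to focus on first.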

At the highest possible level, operational excellence means continually optimizing for the speed, scale, and reliability of your infrastructure, as well as the teams and operational processes necessary to support that infrastructure.

No-Code Automation Reduces DevOps Complexity

There’s an enormous amount of complexity and manual effort required for DevOps, SecOps, and FinOps to manage an enterprise-scale cloud application today.

Your average platform or DevOps team likely operates, at a minimum:

One or more public clouds
Code repository
Database(s)
Authentication service
Observability tool(s)
Security monitoring tool(s)
Incident management system(s)
Mobile device management platform

Every one of these tools comes with its own API, documentation, vocabulary, and required developer skills. That’s why it’s no longer sustainable for organizations to rely solely on cloud engineers to create and maintain operational workflows. The platforms are too numerous, and there are too few skilled cloud engineers to implement all the specialized workflows needed to maintain modern cloud infrastructure. Even for the most elite cloud engineering teams, countless hours are still wasted on redundant integration efforts or manually creating one-off workflows.

Even if you ignore the security and operations nightmare this creates, it still doesn’t make sense for valuable DevOps, SecOps, and FinOps teams to waste time rebuilding the same scripts used by every other organization. By adopting a no-code automation platform like Blink, cloud operations teams can take advantage of existing integrations with popular cloud tools and APIs.

No-code/low-code automation platforms give teams a unified system-of-action for all their workflows, with cloud and security best practices already built-in. This removes much of the manual effort, freeing cloud engineers to create automations that address everyday business challenges like infrastructure management, incident response, cost optimization processes and more. Furthermore, having a centralized platform for cloud operations makes it possible to expose operational workflows as self-service automations for development and business teams.

Next: Operational Excellence in DevOps, SecOps, and FinOps

In our next three posts, we’ll explore operational excellence within the context of DevOps, SecOps, and FinOps, individually. We’ll cover concrete workflows that cloud operations teams are responsible for, and discuss how no-code automation can enable unprecedented efficiencies, security control, and cost savings.

Try Blink today

Blink enables DevOps, SecOps, and FinOps to achieve operational excellence by making it easy to create automated workflows across the cloud platforms and services they use every day. The impact of adopting a no-code automation platform like Blink is happier, more productive development teams and more reliable, resilient cloud operations.

The best part? The no-code future for cloud operations is available today. Sign up to create a Blink account.

Top comments (2)

Brad Johnson • Nov 1 '22

@lnxchk I'm really hoping for your feedback on this blog post I helped write.

Has the definition of "operational excellence" changed in the last few years as teams started relying on more cloud-native tools?

What do you think @cloudyadvice? Have you noticed this shift at work?

Mandi Walls • Nov 4 '22

Hi @johnson_brad! Nice explainer.

I don't think the strategic definition has changed; the focus is still reliability and delivering the best customer experience. Users don't really care how we get there. :)

But teams definitely don't have the time to deal with every dial and setting in every tool they have to use. I open my Okta some days and I feel like I don't know what half of those apps are. There's just so much stuff. Anything that makes the job easier without sacrificing accountability and reliability is important!