The Ops Community ⚙️: Haviv Rosh

Operational Excellence in a Cloud-Native World: No-Code Automation and DevOps

Haviv Rosh — Fri, 02 Dec 2022 17:35:53 +0000

In “What is Operational Excellence?,” the first post in this series, we defined what operational excellence means in today’s cloud-native world. Consulting two popular frameworks, AWS Well-Architected and DORA’s five metrics, we determined how to measure DevOps efficiency and effectiveness. From these frameworks, we identified speed (performance), scalability, and reliability as key indicators of operational excellence.

Today, cloud engineering teams are responsible for managing hundreds of cloud tools and services across different environments. Getting all these cloud services to work together requires major configuration and maintenance effort. For most teams, that means manual integration projects, many dependencies, and writing glue code.

But it doesn’t have to stay this way. No-code automation is changing how operations teams build their cloud workflows. Platforms like Blink now come with purpose-built automations for different cloud tools and services, reducing the effort required to build new workflows. In the Blink Automation Library, there are more than 5,000 cloud automations available for teams to deploy today.

What Does DevOps Automation Mean Today?

The way cloud engineering teams think about “DevOps automation” has shifted over the last few years. Today, DevOps automation means more than just setting up CI/CD pipelines.

Widespread adoption of CI/CD tools has led to a misguided belief that DevOps teams are primarily responsible for integration and delivery workflows. But these activities all occur before code is deployed into production. DevOps teams are also responsible for the reliable operation and maintenance of in-production cloud applications, involving tasks that have their own complex workflows. Many of these workflows involve manual processes that are not easily or adequately solved by CI/CD tools.

For example, how are you supposed to use Jenkins or any other CI/CD platform to solve these kinds of problems?

AWS:

Azure:

GCP:

And that’s just considering operational tasks related to the major cloud providers. Don’t forget about identity management, security, observability, incident response, communication, and other third-party tools necessary for running business applications. The unfortunate reality is that CI/CD and IaC tools cannot run operational or business workflows because they are unable to react to events that happen in the cloud, such as new resources being created or new vulnerabilities and incidents being reported.
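To illustrate what reacting to cloud events means in practice, here is a minimal Python sketch of an event-driven dispatcher. The event type names (`resource.created`, `vulnerability.found`), the `on` decorator, and both handlers are hypothetical and do not represent Blink's API or any cloud provider's API. A real setup would receive events from a provider's event bus or from webhooks.

```python
from collections import defaultdict
from typing import Callable, Dict, List

# Hypothetical event shape: a dict with a "type" key plus arbitrary details.
Event = Dict[str, str]
Handler = Callable[[Event], str]

_handlers: Dict[str, List[Handler]] = defaultdict(list)


def on(event_type: str) -> Callable[[Handler], Handler]:
    """Register a handler for one (hypothetical) event type."""
    def register(handler: Handler) -> Handler:
        _handlers[event_type].append(handler)
        return handler
    return register


@on("resource.created")
def require_owner_tag(event: Event) -> str:
    # Illustrative policy: every new resource should carry an owner tag.
    if not event.get("owner"):
        return f"notify: {event['resource_id']} has no owner tag"
    return f"ok: {event['resource_id']} is owned by {event['owner']}"


@on("vulnerability.found")
def open_ticket(event: Event) -> str:
    # Illustrative reaction: turn a finding into a ticket description.
    return f"ticket: {event['severity']} vulnerability in {event['resource_id']}"


def dispatch(event: Event) -> List[str]:
    """Run every handler registered for the event's type, in order."""
    return [handler(event) for handler in _handlers.get(event["type"], [])]


if __name__ == "__main__":
    print(dispatch({"type": "resource.created", "resource_id": "vm-42"}))
    print(dispatch({"type": "vulnerability.found",
                    "resource_id": "vm-42", "severity": "high"}))
```

The point of the sketch is the shape of the workflow: the trigger is something that happened in the cloud, not a code commit, which is the case a build pipeline is not designed around.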

Without an automation platform dedicated to managing operational workflows and business processes, DevOps engineers are left to navigate serverless/microservices architectures themselves. When it comes to building global operational workflows, manual scripts and CI/CD hacks won’t cut it for achieving reliability objectives or meeting customer SLAs.

Breaking the Cycle of DevOps Burnout Culture

Lack of a dedicated platform for maintaining cloud-native workflows transfers operational burden to DevOps engineers who must stitch solutions together manually. Workflows are slow to build and brittle to run. Updates take significant development time and effort.

Wasn’t DevOps about improving engineering culture and efficiency?

AWS states that “DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity.”

By now, many organizations have adopted the cultural philosophies and practices of DevOps. Agile planning and tools are common, along with agile-based development. But manually written scripts are still scattered in Git repositories, APIs still get manually glued together, and plugin updates are fraught processes that risk costly downtime.

Your level of commitment to DevOps philosophies isn’t enough when your tools and practices don’t support your workflows.

DevOps culture is about more than just making a commitment to DevOps methodology. It is the real experience of being a DevOps contributor on a software development team. For most teams, that means lots of stress, too much work, and mountains of distracting service requests from developers and business teams.

The daily experience for DevOps engineers is filled with:

Context-switching: Every day, DevOps engineers get notifications from monitoring tools, incident management platforms, project management tools, and communications platforms like Slack. They continuously receive urgent inbound service requests, get assigned on-call duty, and are still accountable for finishing their scheduled work. All the while, DevOps engineers must log in and out of different cloud tools. Switching between tasks and platforms costs significant time and cognitive overhead.
Stress and burnout: DevOps engineers experience an overall lack of control about what they’re working on day-to-day. With many demands on their time and too few skilled engineers to get everything done, DevOps practitioners are especially prone to burnout and churn. According to the DORA research team, having good team communication is a major factor for DevOps success. They found that “stable teams where information flows freely have lower levels of burnout.” Meanwhile, those affected by poor organizational communication are often the most vulnerable, as “employees from underrepresented groups reported higher levels of burnout.”
Poor knowledge transfer: Many operational processes and workflows lack proper documentation. When automations exist, they’re often only usable by the DevOps engineer who built them. Sometimes, DevOps engineers are unaware a relevant workflow already exists elsewhere in their organization and end up duplicating effort. This problem is exacerbated when employees leave the organization, taking valuable institutional knowledge with them. Meanwhile, skilled DevOps engineers are more difficult than ever to hire and retain.

Breaking the cycle of DevOps burnout culture requires being realistic about the demands being placed on DevOps teams and contributors. It’s critical that leaders establish clear expectations with teams and individual contributors up front, and continuously check in with direct reports to ensure they remain aligned on the correct objectives. Including DevOps stakeholders in decision making processes early and often ensures that hands-on operational wisdom is being considered during planning processes. Lastly, it’s important to prioritize producing complete and comprehensive documentation in order to aid knowledge transfer and reduce burnout.

Platform Engineering, Internal Developer Portals (IDPs), and Self-Service Automation

This past October, Gartner published an article on platform engineering, which is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.” This effort is deeply rooted in business objectives. Gartner states that “the goal is a frictionless, self-service developer experience that offers the right capabilities to enable developers and others to produce valuable software with as little overhead as possible.”

Recently, we’ve seen growing popularity, both commercially and in the open source world, of what have been termed internal developer portals (IDPs). These are user interfaces that allow developers to request services on demand. For example, anytime a developer needs a new development environment, they can request one on demand. IDPs improve the internal developer experience for an organization, but they are limited in the types of workflows you can create.

Gartner found that “initial platform-building efforts often begin with internal developer portals, as these are most mature. IDPs provide a curated set of tools, capabilities and processes. They are selected by subject matter experts and packaged for easy consumption by development teams. The platform team, in close consultation with the developers they support, must determine which approach is best for their unique circumstances.” However, the limitations of IDPs mean only very specific, developer-focused cloud workflows are solved.

With Blink, we decided to extend the utility of an IDP to all of an organization’s operational workflows. Using the Blink Self-Service Portal, you share automations that empower users to request permissions, provision cloud environments, onboard or offboard team members, initiate password resets, automate software installations, and many more workflows common to cloud-native teams. Blink provides a single system-of-action for DevOps engineers to build all the workflows that enable business teams and their whole organization, in addition to developers.

Blink delivers a more collaborative operational model in which relevant automations are always available to internal teams on demand. The Blink Self-Service Portal makes it easy to proactively support your coworkers and speed up business processes, and it frees you up to focus on other projects.

What is Operational Excellence in DevOps?

At the end of the day, the best software engineering teams build better, faster, more reliable applications. Their internal operations workflows are a competitive advantage that helps them remain agile while scaling their business reliably.

So what do speed, scalability, and reliability truly mean from a DevOps perspective?

Speed and DevOps

Looking at outcomes, speed means being agile in response to changing customer expectations and new technologies. Businesses want to adopt and integrate new technologies quickly, because speed here is a competitive advantage. But integrating new cloud tools or services is costly in terms of the time and effort required. Integrations are typically tedious, manual processes and require onboarding time to learn the vocabulary and nuances of a different tool or platform. Businesses that adopt new cloud technologies more rapidly will deliver new features and capabilities faster, gain better insights, and outperform competitors.

Speed also means operational efficiency. Three of the five DORA metrics are directly related to speed: deployment frequency, lead time for changes, and time to restore service. Most cloud-native teams have adopted CI/CD and IaC tools in order to solve inefficiencies in these areas.

Speed also takes the form of improved SLAs for customers and a lower mean time to respond (MTTR) when troubleshooting performance or security issues. Offering faster, more reliable services is a competitive advantage. Responding to incidents faster prevents outages and costly downtime. According to the DORA research team, 28% of respondents take 1-7 days to restore service when experiencing stability issues. An additional 21% of the lowest performers take between one and six months to resolve an issue!

Scalability and DevOps

Scalability affects multiple different objectives in a DevOps context. From a DevOps perspective, it’s helpful to evaluate your ability to scale across three different axes:

Scalability of processes

Do your existing processes scale to accommodate increased demand or team growth?
When failures occur, is documentation readily available and actionable to resolve issues?
Is there an established process for new integrations or creating new workflows?

Scalability of infrastructure

Do you have established workflows for scaling infrastructure up or out?
Does your infrastructure accommodate rapid fluctuations in demand?
How do you manage cloud costs at scale?
Do you have processes in place to prevent unnecessary cloud spend?
What workflows are in place to ensure outages are avoided or quickly resolved?

Scalability of communications

How many communication channels does your organization use?
How difficult is it to coordinate across teams or channels?
How difficult is it to create actionable alerts for relevant stakeholders?
Do DevOps engineers know where to find relevant information?

No-code automation makes it easier to scale your cloud infrastructure while staying responsive to the operational challenges of managing distributed cloud applications at scale. In a world of microservices and countless cloud tools, it’s more important than ever for DevOps engineers to leverage automation to abstract away ever-increasing complexity.

Reliability and DevOps

Reliability is both an outcome and a predictor of organizational excellence. The DORA research team found that “both the practices we associate with reliability engineering and the extent to which people report meeting their reliability expectations are powerful predictors of high levels of organizational performance.” The authors recommend setting clear reliability goals and making sure those goals tie back to concrete and measurable reliability metrics.
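One common way to make a reliability goal concrete and measurable is an availability target paired with an error budget. The sketch below is illustrative arithmetic only. The function names are hypothetical, and the 99.9% target and 30-day window are assumed example values, not figures from the DORA report.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of downtime an availability target allows over a window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    total_minutes = window_days * 24 * 60
    return total_minutes * (1.0 - slo_target)


def budget_remaining(slo_target: float, window_days: int,
                     downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1.0 - downtime_minutes / budget


if __name__ == "__main__":
    # Assumed example: a 99.9% availability target over a 30-day window.
    print(round(error_budget_minutes(0.999, 30), 1))
    print(round(budget_remaining(0.999, 30, downtime_minutes=10.8), 2))
```

Under these assumptions, a 99.9% target over 30 days allows roughly 43 minutes of downtime, which gives a team a shared, numeric definition of "reliable enough" to plan against.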

Clear reliability goals help businesses create defensive value by delivering dependable services over time and establishing trust with customers. Furthermore, having clear reliability goals helps ensure better team communication and leads to less DevOps churn. Reliable operations workflows also create offensive value, by enabling businesses to achieve new, better, faster outcomes. Having clear reliability goals helps organizations reduce context-switching, leading to less burnout, happier teams, and better overall communication practices.

Additionally, by creating the processes and systems necessary to ensure reliable operations of your platform, you’re providing peace-of-mind for your DevOps and SREs that they are supporting a healthy system. While there are always bound to be outages, having clear expectations and processes for your DevOps team ensures greater reliability for your platform and applications.

Try Blink today

Blink enables DevOps, SecOps, and FinOps to achieve operational excellence by making it easy to create automated workflows across the cloud platforms and services they use every day. The impact of adopting a no-code automation platform like Blink is happier, more productive development teams and more reliable, resilient cloud operations.

The best part? The no-code future for cloud operations is available today. Sign up to create a Blink account.

Operational Excellence in a Cloud-Native World: What is Operational Excellence?

Haviv Rosh — Tue, 01 Nov 2022 19:29:27 +0000

A decade ago, many of the responsibilities that now belong to DevOps belonged to the IT and security departments. Managing data storage and resources, ensuring availability and resilience of application services, troubleshooting, incident response and remediation, security, and admin operations were all traditionally handled by IT and security teams.

When software infrastructure moved to the cloud, the responsibilities of operations teams shifted from traditional IT operations to cloud-native operations. Organizations now adopt hundreds of different cloud platforms and services. In response, formerly centralized IT teams became decentralized DevOps teams. IT shifted left to meet the needs of developers.

Today’s Chaotic DevOps Landscape

Today, cloud operators like DevOps, SecOps, and FinOps find themselves crushed under their daily work.

Organizations rely on hundreds of different cloud platforms and services. Cloud infrastructure is expensive to operate and maintain. Costly inefficiencies like unused cloud resources rack up new charges every month. Too much manual, repetitive work leads to human error, and there aren’t enough skilled cloud engineers. DevOps engineers are hard to find and retain.

Security vulnerabilities go unnoticed, dependencies go unpatched, and unmanaged scripts expose sensitive data like passwords or credentials. Demoralized by a backlog of open service tickets, cloud operations teams are burned out. DevOps today is unsustainable.

So what does operational excellence even mean in this cloud-native world?

How do cloud operations teams rise above platform overload and achieve optimum efficiency?

In this series of blog posts, we’ll explore operational excellence through the perspectives of three different cloud operations centers: DevOps, SecOps, and FinOps.

Through each perspective, we’ll try to find the patterns and inefficiencies that cause friction for cloud operations teams today. We’ll also highlight common strategies for overcoming these challenges, by identifying opportunities where automation can improve the daily experience for both developers and cloud platform operators.

Defining “Operational Excellence” in a Cloud-Native World

No two organizations’ infrastructure stacks are the same, but you’d be surprised how similar their objectives are from a DevOps and broader operational perspective. They want to keep costs low, without wasting resources on operations processes or infrastructure. Additionally, they want to invest in technologies and workflows that maximize efficiency, employee empowerment, and future profit.

So what does it mean to be operationally excellent?

One of the most ubiquitous methodologies for evaluating cloud architecture and any related operations is the AWS Well-Architected Framework. In this framework, AWS defines six pillars intended to help “cloud architects build secure, high-performing, resilient, and efficient infrastructure.”

The AWS Well-Architected pillars are:

Operational excellence
Security
Reliability
Performance efficiency
Cost optimization
Sustainability

What should stick out immediately is that AWS lists operational excellence as its own pillar. AWS scopes this pillar to include “running and monitoring systems, and continually improving processes and procedures.” For example, they include tasks such as “automating changes, responding to events, and defining standards to manage daily operations.”

But on further inspection, don’t the other pillars also have to do with operational excellence? Would an organization be considered operationally excellent if they had strong processes, but their infrastructure was unsecure, unreliable, inefficient, and unsustainable? Clearly this is not the case. AWS Well-Architected is a good starting point, but we should seek out a more comprehensive definition.

Another influential framework for evaluating DevOps team effectiveness comes from the DevOps Research and Assessment (DORA) team at Google, commonly referred to as DORA metrics. When using the DORA framework, there are five key metrics to consider. This includes a recent update last year, which added “operational performance” as a new metric.

The five DORA metrics are:

Deployment frequency: How often does your team deploy code to production or ship new software?
Mean change lead time: When you commit new code, how long does it take for the code to make it to production?
Change failure rate: When you deploy changes to code or hot fixes, what is the percentage of time that those changes cause a failure in production?
Mean time to recovery (MTTR): When failure occurs that impacts customers, how long does it take on average to restore service?
Operational performance: How reliable is your platform? How resilient is it to fluctuating demands or unexpected occurrences?
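As a rough illustration, the first four metrics above can be computed from a log of deployments. The sketch below assumes a hypothetical `Deployment` record; its field names are invented for this example and do not come from DORA or from any particular tool. Operational performance is left out because it is assessed against user expectations rather than derived from deployment timestamps.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Deployment:
    # Hypothetical record; field names are illustrative only.
    committed_at: datetime
    deployed_at: datetime
    failed: bool = False
    restored_at: Optional[datetime] = None  # set once a failed deploy is fixed


def mean(deltas: List[timedelta]) -> timedelta:
    """Average a list of durations; an empty list averages to zero."""
    return sum(deltas, timedelta()) / len(deltas) if deltas else timedelta()


def dora_summary(deploys: List[Deployment], window_days: int) -> dict:
    """Summarize four DORA metrics over a window of `window_days` days."""
    failures = [d for d in deploys if d.failed]
    recoveries = [d.restored_at - d.deployed_at
                  for d in failures if d.restored_at]
    return {
        "deployment_frequency_per_day": len(deploys) / window_days,
        "mean_change_lead_time": mean(
            [d.deployed_at - d.committed_at for d in deploys]),
        "change_failure_rate": len(failures) / len(deploys) if deploys else 0.0,
        "mean_time_to_recovery": mean(recoveries),
    }


if __name__ == "__main__":
    t0 = datetime(2022, 11, 1, 9, 0)
    sample = [
        Deployment(t0, t0 + timedelta(minutes=30)),
        Deployment(t0, t0 + timedelta(minutes=90), failed=True,
                   restored_at=t0 + timedelta(minutes=120)),
    ]
    print(dora_summary(sample, window_days=1))
```

In practice the records would come from your CI/CD and incident tooling, and how a "failure" is defined is a team decision that this sketch does not settle.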

Unlike the AWS Well-Architected framework, which concentrates on how you should build your infrastructure to ensure efficient and reliable performance, DORA instead focuses on the performance of your development and operations teams and processes.

The reality is that operational excellence concerns both how you architect your infrastructure and the effectiveness of your development and infrastructure operations teams, as well as your internal operations processes.

Putting the Ops Back into DevOps

It was timely, though maybe not too surprising, that the research team at Google chose to include operational performance as a new DORA metric last year. According to this year’s report, they evaluate operational performance based “on reliability, which is how well your services meet user expectations, such as availability and performance.” This fifth DORA metric was added “so that availability, latency, performance, and scalability would be more broadly represented” alongside the other four metrics.

Today, operational performance is even more important. In addition to economic pressures, businesses face rising cloud bills and their business teams are adopting more cloud tools than ever before. This creates operational complexity and maintenance challenges, thus decreasing operational reliability. Organizations need to take a holistic approach to their cloud operations and identify solutions that bridge insights and workflows across all their different cloud tools.

Achieving Operational Excellence in DevOps

Let’s take a moment to recap what we’ve learned. Paraphrasing AWS Well-Architected, it’s important that cloud-native teams are able to monitor, secure, and reliably operate their cloud infrastructure. Furthermore, it’s important for teams to be able to do so efficiently and effectively.

DORA takes these concepts and applies specific metrics to them. An operationally excellent team should be able to deploy code frequently and make changes rapidly. Failure should occur infrequently, and when it does occur, teams should be able to respond and recover quickly. Measuring these metrics gives teams objective indicators of how they stack up.

Here are some considerations when evaluating your own operational processes:

Speed

How quickly can you deploy new code or integrate with new services?
How long does it take to implement new workflows?
How quickly can your team respond to failure conditions (MTTR)?

Scale

Can your infrastructure meet the demands of your customers?
Can existing processes efficiently support rapid increases in demand?

Reliability

How frequently does failure occur?
How resilient are you to failure conditions?

For example, elite DevOps performers should be able to deploy new code on-demand (multiple times per day). Changes should take under an hour to review and merge to production, and failures should occur less than 15% of the time. When failures do occur, elite teams should be able to respond and restore service within an hour, even for the most complex scenarios.
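The thresholds in this paragraph can be written down as a handful of checks. This is a minimal sketch with hypothetical function names. Reading "multiple times per day" as more than one deployment per day is an assumption made here for illustration.

```python
from datetime import timedelta
from typing import Dict


def elite_checks(deploys_per_day: float, lead_time: timedelta,
                 change_failure_rate: float,
                 time_to_restore: timedelta) -> Dict[str, bool]:
    """Evaluate each threshold from the paragraph above separately."""
    return {
        # "multiple times per day" (assumed here to mean more than one)
        "deployment_frequency": deploys_per_day > 1,
        # changes reach production in under an hour
        "lead_time": lead_time < timedelta(hours=1),
        # failures occur less than 15% of the time
        "change_failure_rate": change_failure_rate < 0.15,
        # service is restored within an hour
        "time_to_restore": time_to_restore <= timedelta(hours=1),
    }


def is_elite(deploys_per_day: float, lead_time: timedelta,
             change_failure_rate: float, time_to_restore: timedelta) -> bool:
    """True only when every threshold is met."""
    return all(elite_checks(deploys_per_day, lead_time,
                            change_failure_rate, time_to_restore).values())


if __name__ == "__main__":
    print(elite_checks(4, timedelta(hours=2), 0.05, timedelta(minutes=30)))
```

Returning the individual checks, rather than only a yes or no, shows a team which metric is holding it back.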

At the highest possible level, operational excellence means continually optimizing for the speed, scale, and reliability of your infrastructure, as well as the teams and operational processes necessary to support that infrastructure.

No-Code Automation Reduces DevOps Complexity

There’s an enormous amount of complexity and manual effort required for DevOps, SecOps, and FinOps to manage an enterprise-scale cloud application today.

Your average platform or DevOps team likely operates, at a minimum:

One or more public clouds
Code repository
Database(s)
Authentication service
Observability tool(s)
Security monitoring tool(s)
Incident management system(s)
Mobile device management platform

Every one of these tools comes with its own API, documentation, vocabulary, and required developer skills. That’s why it’s no longer sustainable for organizations to rely solely on cloud engineers to create and maintain operational workflows. The platforms are too numerous, and there are too few skilled cloud engineers to implement all the specialized workflows needed to maintain modern cloud infrastructure. Even for the most elite cloud engineering teams, countless hours are still wasted on redundant integration efforts or manually creating one-off workflows.

Even if you ignore the security and operations nightmare that creates, it still doesn’t make sense for valuable DevOps, SecOps, and FinOps to waste time rebuilding the same scripts used by every other organization. By adopting a no-code automation platform like Blink, cloud operations teams can take advantage of existing integrations with popular cloud tools and APIs.

No-code/low-code automation platforms give teams a unified system-of-action for all their workflows, with cloud and security best practices already built-in. This removes much of the manual effort, freeing cloud engineers to create automations that address everyday business challenges like infrastructure management, incident response, cost optimization processes and more. Furthermore, having a centralized platform for cloud operations makes it possible to expose operational workflows as self-service automations for development and business teams.

Next: Operational Excellence in DevOps, SecOps, and FinOps

In our next three posts, we’ll explore operational excellence within the context of DevOps, SecOps, and FinOps, individually. We’ll cover concrete workflows that cloud operations teams are responsible for, and discuss how no-code automation can enable unprecedented efficiencies, security control, and cost savings.
