In “What is Operational Excellence?,” the first post in this series, we defined what operational excellence means in today’s cloud-native world. Consulting two popular frameworks, AWS Well-Architected and DORA’s five metrics, we determined how to appropriately measure DevOps efficiency and effectiveness. Drawing on both frameworks, we identified speed (performance), scalability, and reliability as the key indicators of operational excellence.
Cloud engineering teams are now responsible for managing hundreds of cloud tools and services across different environments. Getting all these cloud services to work together requires major configuration and maintenance effort. For most teams, that means manual integration projects, many dependencies, and a lot of glue code.
But it doesn’t have to stay this way. No-code automation is changing how operations teams build their cloud workflows. Platforms like Blink now come with purpose-built automations for different cloud tools and services, reducing the effort required to build new workflows. In the Blink Automation Library, there are more than 5,000 cloud automations available for teams to deploy today.
The way cloud engineering teams think about “DevOps automation” has shifted over the last few years. Today, DevOps automation means more than just setting up CI/CD pipelines.
Widespread adoption of CI/CD tools has led to a misguided belief that DevOps teams are primarily responsible for integration and delivery workflows. But these activities all occur before code is deployed into production. DevOps teams are also responsible for the reliable operation and maintenance of in-production cloud applications, involving tasks that have their own complex workflows. Many of these workflows involve manual processes that are not easily or adequately solved by CI/CD tools.
For example, how are you supposed to use Jenkins or any other CI/CD platform to solve these kinds of problems?
- Finding and Deleting Unused AWS IAM Roles
- Finding and Removing Old EBS Snapshots
- Finding and Resizing Amazon EC2 Instances with Low CPU Usage
- Finding EC2 Instances Scheduled To Retire Soon
- Enforcing Mandatory Tags Across Your AWS Resources
- Scaling Down AWS EKS Clusters at Night
- Finding and Deleting Unattached Disks
- Finding and Resizing Azure Virtual Machines with Low CPU Usage
- Finding and Disabling Non-Active Users in Azure
- Finding and Removing Unused Azure Virtual Network Gateways
- Pausing Your AKS Clusters Nightly
- Enforcing Mandatory Tags Across Your Azure Resources
- Pausing Your GKE Cluster at Night
- Finding and Resizing GCP Compute Instances with Low CPU Usage
- Identifying Long Running GCP Instances and Applying Committed Use Discounts
- Finding and Removing Unattached GCP External IP Addresses
- Finding and Deleting Unattached GCP Disks
- Enforcing Labels and Tags Across Your GCP Resources
And that’s just considering operational tasks related to the major cloud providers. Don’t forget about identity management, security, observability, incident response, communication, and other third-party tools necessary for running business applications. The unfortunate reality is that CI/CD and IaC tools are a poor fit for operational or business workflows because they are not designed to react to events that happen in the cloud, such as new resources being created or new vulnerabilities and incidents being reported.
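To make one of the tasks listed above concrete, here is a minimal sketch of enforcing mandatory tags in reaction to a "resource created" event. It is an illustration only: the event shape, the `REQUIRED_TAGS` policy, and the function names are all hypothetical, and a real workflow would receive events from your cloud provider and send the alert to a channel your team actually watches.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical policy: the tag keys every resource must carry.
REQUIRED_TAGS = frozenset({"owner", "environment", "cost-center"})


def missing_tags(
    resource_tags: Dict[str, str],
    required: Iterable[str] = REQUIRED_TAGS,
) -> List[str]:
    """Return the required tag keys that are absent or blank, sorted for stable output."""
    return sorted(
        key for key in required
        if not (resource_tags.get(key) or "").strip()
    )


def handle_resource_created(event: dict) -> Optional[str]:
    """React to a (hypothetical) 'resource created' event.

    Returns an alert message when the resource violates the tag policy,
    or None when the resource is compliant.
    """
    missing = missing_tags(event.get("tags", {}))
    if not missing:
        return None
    return f"{event['resource_id']} is missing required tags: {', '.join(missing)}"


if __name__ == "__main__":
    # Example event with one blank tag and one absent tag.
    event = {
        "resource_id": "i-0123456789abcdef0",
        "tags": {"owner": "platform-team", "environment": ""},
    }
    print(handle_resource_created(event))
```

The point of the sketch is the trigger rather than the check itself: the logic runs whenever a resource appears, not when a pipeline happens to execute.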
Without an automation platform dedicated to managing operational workflows and business processes, DevOps engineers are left to navigate serverless/microservices architectures themselves. When it comes to building global operational workflows, manual scripts and CI/CD hacks won’t cut it for achieving reliability objectives or meeting customer SLAs.
Lack of a dedicated platform for maintaining cloud-native workflows transfers operational burden to DevOps engineers who must stitch solutions together manually. Workflows are slow to build and brittle to run. Updates take significant development time and effort.
Wasn’t DevOps about improving engineering culture and efficiency?
AWS states that “DevOps is the combination of cultural philosophies, practices, and tools that increases an organization’s ability to deliver applications and services at high velocity.”
By now, many organizations have adopted the cultural philosophies and practices of DevOps. Agile planning and tools are common, along with agile-based development. But manually written scripts are still scattered in Git repositories, APIs still get manually glued together, and plugin updates are fraught processes that risk costly downtime.
Your level of commitment to DevOps philosophies isn’t enough when your tools and practices don’t support your workflows.
DevOps culture is about more than just making a commitment to DevOps methodology. It is the real experience of being a DevOps contributor on a software development team. For most teams, that means lots of stress, too much work, and mountains of distracting service requests from developers and business teams.
The daily experience for DevOps engineers is filled with:
Context-switching: Every day, DevOps engineers get notifications from monitoring tools, incident management platforms, project management tools, and communications platforms like Slack. They continuously receive urgent inbound service requests, get assigned on-call duty, and are still accountable for finishing their scheduled work. All the while, DevOps engineers must log in and out of different cloud tools, switching between tasks and platforms at a significant cost in time and cognitive overhead.
Stress and burnout: DevOps engineers experience an overall lack of control over what they’re working on day-to-day. With many demands on their time and too few skilled engineers to get everything done, DevOps practitioners are especially prone to burnout and churn. According to the DORA research team, having good team communication is a major factor for DevOps success. They found that “stable teams where information flows freely have lower levels of burnout.” Meanwhile, those affected by poor organizational communication are often the most vulnerable, as “employees from underrepresented groups reported higher levels of burnout.”
Poor knowledge transfer: Many operational processes and workflows lack proper documentation. When automations exist, they’re often only usable by the DevOps engineer who built them. Sometimes, DevOps engineers are unaware a relevant workflow already exists elsewhere in their organization and end up duplicating effort. This problem is exacerbated when employees leave the organization, taking valuable institutional knowledge with them. Meanwhile, skilled DevOps engineers are more difficult than ever to hire and retain.
Breaking the cycle of DevOps burnout culture requires being realistic about the demands being placed on DevOps teams and contributors. It’s critical that leaders establish clear expectations with teams and individual contributors up front, and continuously check in with direct reports to ensure they remain aligned on the correct objectives. Including DevOps stakeholders in decision making processes early and often ensures that hands-on operational wisdom is being considered during planning processes. Lastly, it’s important to prioritize producing complete and comprehensive documentation in order to aid knowledge transfer and reduce burnout.
This past October, Gartner published an article on platform engineering, which is an emerging trend within digital transformation efforts that “improves developer experience and productivity by providing self-service capabilities with automated infrastructure operations.” This effort is deeply rooted in business objectives. Gartner states that “the goal is a frictionless, self-service developer experience that offers the right capabilities to enable developers and others to produce valuable software with as little overhead as possible.”
Recently, we’ve seen growing popularity, both commercially and in the open source world, of what’s been termed internal developer portals (IDPs). These are user interfaces that allow developers to request services on demand. For example, anytime a developer needs a new development environment, they are able to request one on demand. IDPs improve internal developer experience for an organization, but they are limited in the types of workflows you can create.
Gartner found that “initial platform-building efforts often begin with internal developer portals, as these are most mature. IDPs provide a curated set of tools, capabilities and processes. They are selected by subject matter experts and packaged for easy consumption by development teams. The platform team, in close consultation with the developers they support, must determine which approach is best for their unique circumstances.” However, the limitations of IDPs mean only very specific, developer-focused cloud workflows are solved.
With Blink, we decided to extend the utility of an IDP to all of an organization’s operational workflows. Using the Blink Self-Service Portal, you share automations that empower users to request permissions, provision cloud environments, onboard or offboard team members, initiate password resets, automate software installations, and many more workflows common to cloud-native teams. Blink provides a single system-of-action for DevOps engineers to build all the workflows that enable business teams and their whole organization, in addition to developers.
Blink delivers a more collaborative operational model in which relevant automations are always available to internal teams on demand. The Blink Self-Service Portal makes it easy to proactively support your coworkers and speed up business processes, and it frees you up to focus on other projects.
At the end of the day, the best software engineering teams build better, faster, more reliable applications. Their internal operations workflows are a competitive advantage that helps them remain agile while scaling their business reliably.
So what do speed, scalability, and reliability truly mean from a DevOps perspective?
Looking at outcomes, speed means being agile in response to changing customer expectations and new technologies. Businesses want to adopt and integrate new technologies quickly, because doing so is a competitive advantage. But integrating new cloud tools or services is costly in time and effort. Integrations are typically tedious, manual processes, and they require onboarding time to learn the vocabulary and nuances of a different tool or platform. Businesses that adopt new cloud technologies more rapidly will deliver new features and capabilities faster, gain better insights, and outperform competitors.
Speed also means operational efficiency. Three of the five DORA metrics are directly related to speed: deployment frequency, lead time for changes, and time to restore service. Most cloud-native teams have adopted CI/CD and IaC tools in order to solve inefficiencies in these areas.
Speed also takes the form of improved SLAs for customers and a lower mean time to respond (MTTR) when troubleshooting performance or security issues. Offering faster, more reliable services is a competitive advantage. Responding to incidents faster prevents outages and costly downtime. According to the DORA research team, 28% of respondents take 1-7 days to restore service when experiencing stability issues. An additional 21% of the lowest performers take between one and six months to resolve an issue!
Scalability affects several different objectives. From a DevOps perspective, it’s helpful to evaluate your ability to scale across three different axes:
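The three speed-related DORA metrics named above can be computed from ordinary deployment and incident records. The sketch below is one hedged way to do it, not DORA's official methodology: the `Deployment` and `Incident` record types are hypothetical, and the choice of a median for lead time and a mean for time to restore is an assumption you may want to change.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean, median


@dataclass(frozen=True)
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when the change reached production


@dataclass(frozen=True)
class Incident:
    started_at: datetime    # when service degraded
    restored_at: datetime   # when service was restored


def deployment_frequency(deployments: list[Deployment], window_days: int) -> float:
    """Average number of production deployments per day over the window."""
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return len(deployments) / window_days


def lead_time_for_changes(deployments: list[Deployment]) -> timedelta:
    """Median time from commit to production deployment."""
    if not deployments:
        raise ValueError("need at least one deployment")
    seconds = median(
        (d.deployed_at - d.committed_at).total_seconds() for d in deployments
    )
    return timedelta(seconds=seconds)


def time_to_restore_service(incidents: list[Incident]) -> timedelta:
    """Mean time from the start of an incident to restored service."""
    if not incidents:
        raise ValueError("need at least one incident")
    seconds = mean(
        (i.restored_at - i.started_at).total_seconds() for i in incidents
    )
    return timedelta(seconds=seconds)
```

Tracking these numbers over a rolling window, rather than as one-off snapshots, makes it easier to see whether a new automation actually moved them.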
Scalability of processes
- Do your existing processes scale to accommodate increased demand or team growth?
- When failures occur, is documentation readily available and actionable to resolve issues?
- Is there an established process for new integrations or creating new workflows?
Scalability of infrastructure
- Do you have established workflows for scaling infrastructure up or out?
- Does your infrastructure accommodate rapid fluctuations in demand?
- How do you manage cloud costs at scale?
- Do you have processes in place to prevent unnecessary cloud spend?
- What workflows are in place to ensure outages are avoided or quickly resolved?
Scalability of communications
- How many communication channels does your organization use?
- How difficult is it to coordinate across teams or channels?
- How difficult is it to create actionable alerts for relevant stakeholders?
- Do DevOps engineers know where to find relevant information?
No-code automation makes it easier to scale your cloud infrastructure while staying agile in the face of the operational challenges of managing distributed cloud applications at scale. In a world of microservices and countless cloud tools, it’s more important than ever for DevOps engineers to leverage automation to abstract away ever-increasing complexity.
Reliability is both an outcome and a predictor of organizational excellence. The DORA research team found that “both the practices we associate with reliability engineering and the extent to which people report meeting their reliability expectations are powerful predictors of high levels of organizational performance.” The authors recommend prioritizing clear reliability goals and making sure those goals tie back to concrete and measurable reliability metrics.
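One common way to tie a reliability goal to a measurable metric is an availability target with an error budget. The sketch below assumes that approach, which is our illustration rather than something the DORA report prescribes. The function names and the 99.9% over 30 days example are hypothetical.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of downtime allowed in the window for an availability target such as 0.999."""
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1, exclusive")
    if window_days <= 0:
        raise ValueError("window_days must be positive")
    return (1 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, window_days: int, downtime_minutes: float) -> float:
    """Fraction of the error budget still unspent (negative when the budget is exceeded)."""
    budget = error_budget_minutes(slo_target, window_days)
    return 1 - downtime_minutes / budget


if __name__ == "__main__":
    # A 99.9% target over 30 days allows roughly 43.2 minutes of downtime.
    print(f"{error_budget_minutes(0.999, 30):.1f} minutes allowed")
    print(f"{budget_remaining(0.999, 30, 21.6):.0%} of the budget remaining")
```

A number like this gives the team a shared, objective signal for when to slow down changes and when there is room to move faster.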
Clear reliability goals help businesses create defensive value by delivering dependable services over time and establishing trust with customers. Furthermore, having clear reliability goals helps ensure better team communication and leads to less DevOps churn. Reliable operations workflows also create offensive value, by enabling businesses to achieve new, better, faster outcomes. Having clear reliability goals helps organizations reduce context-switching, leading to less burnout, happier teams, and better overall communication practices.
Additionally, by creating the processes and systems necessary to ensure reliable operation of your platform, you’re giving your DevOps engineers and SREs peace of mind that they are supporting a healthy system. While there are always bound to be outages, having clear expectations and processes for your DevOps team ensures greater reliability for your platform and applications.
Blink enables DevOps, SecOps, and FinOps teams to achieve operational excellence by making it easy to create automated workflows across the cloud platforms and services they use every day. The result of adopting a no-code automation platform like Blink is happier, more productive development teams and more reliable, resilient cloud operations.
The best part? The no-code future for cloud operations is available today. Sign up to create a Blink account.