The Ops Community ⚙️

Dave Hall

Posted on • Originally published at

What is Proactive Ops?

This post was originally posted on under the title What is Proactive Ops?.

At its core, traditional IT Operations is overwhelmingly reactive - humans respond to incidents and tickets. There is so much noise! Alerts, tickets, audits, chat threads and more. What if ops could catch the small things before they turn into big problems? What if your ops team built tools that allowed them to be proactive?

A Proactive Ops team is an IT Ops team engaged in software engineering. There will still be tickets and alerts for the team to deal with, but there should be fewer issues fighting for the team's attention. With most routine issues resolved by scripts, the tickets that remain are the ones that genuinely need a human.

Many organisations have embraced some version of the DevOps culture of shared ownership. Some teams took this too far, giving all developers full access to all systems, so everything is always on fire. In other organisations, DevOps became a job title for a sysadmin who builds out and deploys to cloud environments. Then there is the enterprise, where Ops is often outsourced. Rather than focusing on needs and outcomes, engineers narrowly follow documented processes.

What is Proactive Ops? It isn't yet another Dev + Ops + something = DevSomethingOps idea from a marketing team. This is a different approach to building and managing digital platforms. It is a mindset, while also being practical.

There is some overlap between Site Reliability Engineering (SRE) and Proactive Ops. While SRE is effective at managing system level issues, it often overlooks the problems caused by users. Proactive Ops adopts SRE principles, then also applies them to areas where human error must be managed.

In larger organisations the Security Orchestration, Automation and Response (SOAR) platform is a key component of a defence in depth strategy. SOAR uses data from various sources to make decisions and automatically respond to issues as they’re identified. Proactive Ops seeks to generalise SOAR and apply it to general IT operations.

Platform Engineering has already become one of the corporate buzzwords of 2023. Proactive Ops embraces many of the ideas from Platform Engineering, but takes them a step further. A digital transformation platform needs to be supported by an operations team. The less time the team spends supporting the platform, the more time there is for adding features demanded by customers.

Proactive Ops is about internal customers and their experiences with the organisation's tooling and services. The focus is on building services that delight users while protecting the organisation. The Proactive Ops team provides the platform and tooling that powers internal digital transformation.

Guard rails are a key component in protecting organisations or, as Kathy Sierra said, they “make the right things easy and the wrong things hard to do”. When users stray too far off course, tooling is there to immediately remediate the issue. Don't create a ticket or raise an alert; have a script fix the issue and send a post-action notification.
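To make the "fix first, notify afterwards" idea concrete, here is a minimal Python sketch. Everything in it is hypothetical: `Bucket`, `Notifier` and `enforce_private` are illustrative names, not part of any real cloud SDK, and a real guard rail would call the provider's API and post to chat or email instead of appending to a list.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bucket:
    """Hypothetical stand-in for a cloud storage bucket."""
    name: str
    public: bool = False


@dataclass
class Notifier:
    """Collects post-action notifications (assumption: a real one posts to chat)."""
    sent: List[str] = field(default_factory=list)

    def notify(self, message: str) -> None:
        self.sent.append(message)


def enforce_private(bucket: Bucket, notifier: Notifier) -> bool:
    """Remediate a public bucket, then report what was done.

    Returns True if a change was made, False if the bucket already complied.
    """
    if not bucket.public:
        return False  # nothing to fix, so nothing to say
    bucket.public = False  # remediate immediately, no ticket and no alert
    notifier.notify(f"Bucket {bucket.name} was public; access set to private.")
    return True


if __name__ == "__main__":
    notifier = Notifier()
    enforce_private(Bucket(name="reports", public=True), notifier)
    print(notifier.sent)
```

The handler is idempotent: running it again against a compliant bucket changes nothing and sends nothing, which matters when the same event is delivered more than once.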

This enforcement activity relies on collecting events from internal and external systems. Most SaaS and cloud platforms offer event feeds via webhooks. Obtaining events from internal systems, especially legacy ones, can be challenging. Without this data you will have significant blind spots and more reactive work for the team.
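As a rough illustration of taking events in from webhooks, the sketch below checks an HMAC-SHA256 signature on the raw request body and then wraps the payload in a common envelope. The envelope fields (`source`, `type`, `detail`), the `event` key in the payload and the hex-encoded signature are all assumptions made for this example; each real platform documents its own header names and signing scheme.

```python
import hashlib
import hmac
import json
from typing import Any, Dict


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check an HMAC-SHA256 signature over the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest avoids leaking information through timing differences
    return hmac.compare_digest(expected, signature_hex)


def normalise(source: str, body: bytes) -> Dict[str, Any]:
    """Wrap a platform-specific payload in a common (hypothetical) envelope."""
    payload = json.loads(body)
    return {
        "source": source,
        "type": payload.get("event", "unknown"),
        "detail": payload,
    }


def receive(source: str, secret: bytes, body: bytes, signature_hex: str) -> Dict[str, Any]:
    """Reject unsigned or tampered events, normalise the rest."""
    if not verify_signature(secret, body, signature_hex):
        raise ValueError("signature mismatch")
    return normalise(source, body)
```

Normalising at the edge means downstream enforcement code only has to understand one event shape, whichever system the event came from.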

Code powers a Proactive Ops organisation. This includes:

  • Business logic
  • Policy as Code (Open Policy Agent / Sentinel / Sigma)
  • Infrastructure as Code (Terraform / CloudFormation / CDK / SAM / ARM / Ansible / Pulumi)
  • Configuration as Code / GitOps (Flux / Argo / Weave)
  • Docs as Code (Git / Hugo / Jekyll)
  • Credentials in Code (no, not that one)
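To give a feel for what Policy as Code means in practice, here is a minimal sketch in plain Python. It is not Open Policy Agent, Sentinel or Sigma syntax; the policy names and the resource fields (`tags`, `public`) are invented for illustration. The point is only that rules live in version control as code and are evaluated against a description of a resource.

```python
from typing import Any, Callable, Dict, List

Resource = Dict[str, Any]
Policy = Callable[[Resource], bool]

# Each policy returns True when the resource complies (names are hypothetical).
POLICIES: Dict[str, Policy] = {
    "must-have-owner-tag": lambda r: "owner" in r.get("tags", {}),
    "no-public-access": lambda r: not r.get("public", False),
}


def violations(resource: Resource) -> List[str]:
    """Return the names of the policies this resource breaks, sorted."""
    return sorted(name for name, check in POLICIES.items() if not check(resource))


if __name__ == "__main__":
    print(violations({"tags": {"owner": "platform"}, "public": False}))
    print(violations({"public": True}))
```

Because the rules are ordinary code, they can be reviewed, tested and rolled back like any other change.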

Code alone isn't enough. Customers and other systems need to interact with interfaces. These interfaces can include:

  • self service for resource provisioning and access (Backstage / ServiceNow / Jira Service Management / custom applications)
  • observability (OpenTelemetry / X-Ray / Grafana / Elastic)
  • reporting and metrics (Athena / data mesh / Power BI / Tableau)
  • querying data (APIs)

Each of these systems also needs to communicate state changes via events.
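One way to picture systems communicating state changes is a small publish/subscribe bus. The sketch below is an in-process toy with hypothetical names (`EventBus`, the `access.granted` event type); in a real stack this role would more likely be played by a message queue or a managed event bus, but the pattern is the same.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, Dict, List

Handler = Callable[[Dict[str, Any]], None]


class EventBus:
    """Toy in-process event bus: systems publish state changes, others react."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_type: str, handler: Handler) -> None:
        self._handlers[event_type].append(handler)

    def publish(self, event_type: str, detail: Dict[str, Any]) -> int:
        """Deliver the event to every subscriber; return how many were called."""
        handlers = self._handlers.get(event_type, [])
        for handler in handlers:
            handler({"type": event_type, "detail": detail})
        return len(handlers)


if __name__ == "__main__":
    bus = EventBus()
    bus.subscribe("access.granted", print)
    bus.publish("access.granted", {"user": "alice", "system": "reporting"})
```

The publisher does not know or care who is listening, so new consumers such as audit logging or remediation scripts can be added without touching the system that emits the event.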

Each component of a Proactive Ops stack should adhere to the Unix philosophy: "Write programs that do one thing and do it well. Write programs to work together."

In Team Topologies speak, a Proactive Ops team is a platform team. The team needs to embrace the maxim of Amazon's CTO, Werner Vogels: "You build it, you run it". The team needs to have end-to-end responsibility for the platform, while also collaborating with others to build integrations.

What to Expect

Every Wednesday at 10am UTC I publish a new article on a topic related to Proactive Ops. Some of the posts will discuss a topic or a concept. Others will be more practical. From time to time we will release complete, ready-to-deploy components or solutions. The content will always be technical.

Patterns are more important than specific tools. When we present a reference implementation, we encourage our readers to understand the pattern and apply it to their toolset. 🌊

Posts are shared on other platforms after a delay. If you want the latest content straight to your inbox, subscribe!
