If you’ve ever been responsible for software running in production, you should already be well aware of failure. But it’s not only traditional “operations” roles that are affected by failure.
Consider some of the changes we’ve seen in IT since the early 2000s:
- 2003: SRE - Site Reliability Engineering, a discipline created at Google that incorporates aspects of software engineering and applies them to IT operations problems.
- 2006: AWS EC2 - The birth of the cloud. Amazon Elastic Compute Cloud (EC2) made Infrastructure-as-a-Service available to developers around the globe via APIs. It paved the way for cloud native, serverless, PaaS and the hundreds of services we have available to us today.
- 2009: DevOps - The combination of software development (Dev) with information technology operations (Ops).
It’s clear there have been changes in the people and teams who need to be involved in running software for customers. At the same time there have been significant changes to how we build, test and deploy software.
Culture is a common topic that comes up when we talk about things like DevOps, Lean Software Development, and SRE, but what do we mean when we say culture? Do we just give developers pizza and beer until they produce good software and hope they don’t leave for another company offering gourmet pizza and craft beer?
To understand what paved the way for DevOps, SRE, and the “Culture/Transformation” we’re hearing so much about in IT, we need to go back, back to 1990 and back to car manufacturing. Consider the following quote from https://en.wikipedia.org/wiki/Lean_manufacturing:
Lean manufacturing makes obvious what adds value, by reducing everything else (which is not adding value). This management philosophy is derived mostly from the Toyota Production System (TPS) and identified as “lean” only in the 1990s.
Steps to achieve lean systems
The following steps should be implemented to create the ideal lean manufacturing system:
- Design a simple manufacturing system
- Recognise that there is always room for improvement
- Continuously improve the lean manufacturing system design
Sound familiar? These themes are present in Lean, Agile and DevOps.
The andon cord — Game-changing rope
Back in the early days of the Toyota Production System (TPS), Toyota implemented many unique tools and processes. One of them was the Andon Cord, and while it was just a rope, it was the most powerful rope you’ve ever seen.
Anyone had the right to pull the cord at any time, and pulling it would instantly stop all work on the assembly line. The technology behind this wasn’t revolutionary, but the thinking, culture and trust behind this system are something we can learn from today. The “Safety Culture” around this process was reinforced when the team leader arrived at the workstation and thanked the team member who pulled the Cord.
The team member was not, and never would be, in a position of fearing retribution for stopping the production line.
Toyota has so much to teach us about working together and respecting each other. Even if the team leader knew a better answer, Toyota felt, as a matter of principle, that a solution from the team member was a better outcome.
Leaving cars behind and getting back to software, Mary and Tom Poppendieck’s work on Lean Software Development in the early 2000s paved the way for the sort of thinking coming out of Google (SRE). One great example of that thinking is postmortem culture.
Writing a postmortem is not punishment — it is a learning opportunity for the entire company.
As an IT professional with a decade of experience, I’ve seen how the culture of a company can significantly impact how people, teams and the organisation grow and improve. The great opportunity that exists when failure happens is learning. The companies and teams that truly believe this are the ones that will continuously improve both their software and their culture.
Also posted on Medium as “The cost of failure is education”.