Getting the most out of chaos engineering practices

#chaosengineering #incidentresponse #slos

Spreading Some Chaos

In 2022, I created two conference talks, one with my friend Julie Gunderson, focusing on Chaos Engineering and some of the additional benefits teams can reap from building a chaos testing practice. These talks were pretty successful! We feel like they’ve run their course as a live offering, but we want to recap them and share the recordings for folks.

If you’re totally new to chaos engineering, you can check out Julie and me talking about it on our Twitch stream, or read this blog post.

Using Chaos Testing for Learning

Getting started with chaos testing can be daunting, but it doesn’t have to be, especially if you are working with an application or service that has been running for a while. When you’re dealing with a service with a history, it’s seen some things! You can take a look at past incidents, errors, bug reports, and other issues that have cropped up in the past few months and use that set of data to start your testing plan.

When we remediate and resolve incidents, we hope to not see the exact same incident again in the future. Make chaos tests part of your plan for preventing repeats by adding tests that mimic past incidents to your regular testing plan. Did a service fail because of a resource constraint, a network outage, or a dependency failure? After you’ve done the engineering work to improve the resiliency of the application in those circumstances, verify your improvements are working as expected with a set of chaos tests.
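To make that concrete, here is a minimal sketch of a test that replays a past dependency failure. Everything in it is hypothetical: `FlakyDependency` stands in for a downstream service with a fault switch, and `load_homepage` stands in for whatever handler your team hardened after the incident. A real chaos tool would inject the fault at the network or infrastructure layer; this only shows the shape of the check.

```python
class DependencyUnavailable(Exception):
    """Raised by the stand-in dependency while the injected fault is active."""


class FlakyDependency:
    """Hypothetical stand-in for a downstream service with a fault switch."""

    def __init__(self):
        self.fail = False

    def get_recommendations(self, user_id):
        if self.fail:
            raise DependencyUnavailable("injected fault: dependency is down")
        return [f"item-for-{user_id}"]


def load_homepage(user_id, dependency):
    """Hypothetical handler: degrade to a default list instead of erroring."""
    try:
        items = dependency.get_recommendations(user_id)
        return {"status": 200, "items": items, "degraded": False}
    except DependencyUnavailable:
        # The resiliency work under test: serve a fallback, not an error.
        return {"status": 200, "items": ["popular-item"], "degraded": True}


def replay_dependency_outage():
    """Replay the past incident and return the response users would see."""
    dependency = FlakyDependency()
    dependency.fail = True  # inject the fault that caused the incident
    return load_homepage("u1", dependency)


if __name__ == "__main__":
    response = replay_dependency_outage()
    assert response["status"] == 200, "homepage should survive the outage"
    assert response["degraded"], "fallback path should have been used"
```

Keeping a check like this in the regular test plan means a later change that removes the fallback fails a test rather than causing a repeat incident.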

As your team becomes more comfortable with chaos experiments and testing, new applications will reap the benefits. Chaos testing is an investment, so you want to make sure all of your teams are benefiting from what your tests show you. Teams may find that the best way to do this is to put together guidelines for best practices or to document preferred methods of defensive coding.

What Are You Getting Out of Chaos Testing?

Thinking about why you’re doing chaos testing is an important part of your chaos plan. It’s great to look at just what the impacts are for your users when you are injecting faults into your systems. Chaos engineering is a big investment, though, so expand your plans to get the most out of the work! During chaos experiments, track how your application behaves against your current SLOs, how your incident response process is working, and how your teams troubleshoot in live environments under controlled conditions.

Verifying SLOs Make Sense

Service level objectives, or SLOs, are the goals your team establishes for your resiliency, performance, and customer experience. Teams set SLOs to ensure that their services will fall within any contractual agreements documented in SLAs, but also to ensure that their services will meet performance expectations of other services in their ecosystems. For teams that work primarily on middle tier or backend services, their performance is key to the success of the frontend, even if their services are not directly in the user’s visible path.

When we employ chaos testing, we can dig into the performance of our services in real-world conditions. We can determine if the work our teams have completed continues to meet the established SLOs for the service, even before a feature or enhancement is released to the users. For services that aren’t meeting their SLOs, chaos testing can help illuminate issues and help product and development teams decide if more resources should be committed or if the SLOs aren’t appropriate for the service.
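As a rough illustration of checking SLOs during an experiment, the sketch below scores the requests recorded while a fault was active against two targets. The targets (99% availability, 300 ms at the 95th percentile), the `RequestResult` record, and the function names are all assumptions made for the example; substitute whatever your own SLOs and metrics pipeline actually define.

```python
import math
from dataclasses import dataclass


@dataclass
class RequestResult:
    """One request observed during the experiment (hypothetical record)."""
    ok: bool
    latency_ms: float


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slo(results, availability_target=0.99, p95_target_ms=300.0):
    """Compare observed availability and p95 latency to assumed SLO targets."""
    if not results:
        raise ValueError("no requests recorded during the experiment")
    availability = sum(r.ok for r in results) / len(results)
    p95 = percentile([r.latency_ms for r in results], 95)
    return {
        "availability": availability,
        "p95_latency_ms": p95,
        "availability_met": availability >= availability_target,
        "latency_met": p95 <= p95_target_ms,
    }


if __name__ == "__main__":
    # 98 fast successes and 2 slow failures recorded while the fault was active.
    observed = [RequestResult(True, 100.0)] * 98 + [RequestResult(False, 900.0)] * 2
    print(evaluate_slo(observed))
```

A result where latency holds but availability misses its target is the kind of signal that feeds the conversation about committing more resources or revisiting the SLO.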

Verifying Incident Response Processes

A secondary benefit to running chaos experiments is verifying that all of the alerting and incident response mechanisms for a service are working as expected. When a chaos test is deployed that mimics a real-world scenario - a service fails, resources are exhausted, a dependency is unavailable - service monitors and alerts should let the team know that something is happening.

Some teams prefer to run much of their chaos testing in non-production environments, or turn off alerting while running tests, but allowing alerts and notifications to fire when a controlled experiment is running gives the whole team a chance to ensure that necessary components are configured correctly!
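One way to turn that into a repeatable check is to list the alerts you expect a given experiment to trigger and compare them with what actually fired. The sketch below assumes you can export fired alerts as `(name, timestamp)` pairs from your monitoring tool; the alert names, the five-minute window, and the `missing_alerts` helper are hypothetical.

```python
from datetime import datetime, timedelta


def missing_alerts(expected_names, fired_alerts, started_at, max_delay):
    """Return expected alert names that did not fire within max_delay of the start.

    fired_alerts is an iterable of (name, fired_at) pairs.
    """
    deadline = started_at + max_delay
    fired_in_window = {
        name
        for name, fired_at in fired_alerts
        if started_at <= fired_at <= deadline
    }
    return sorted(set(expected_names) - fired_in_window)


if __name__ == "__main__":
    start = datetime(2022, 1, 1, 12, 0)
    fired = [
        ("high_error_rate", start + timedelta(minutes=2)),
        ("dependency_down", start + timedelta(minutes=20)),  # too late
    ]
    gaps = missing_alerts(
        ["high_error_rate", "dependency_down"],
        fired,
        started_at=start,
        max_delay=timedelta(minutes=5),
    )
    print(gaps)
```

Anything left in the returned list is an alert that was missing, misrouted, or too slow, which is worth fixing before a real incident depends on it.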

Practice in Troubleshooting

In addition to the initial parts of your incident response - alerts and notifications - your team can also practice troubleshooting during chaos tests. Folks who will be expected to respond to live incidents can practice finding the correct dashboards, logging in to the correct systems, and finding the right conference call or chat channel for coordination during a low-stress exercise. Plenty of folks find responding to incidents disorienting, especially after regular work hours, so get some extra benefit out of your chaos experiments and give them a chance to practice.

More Resources

Want to watch recordings of these talks? You’re in luck!

DevOpsDays DC, Mandi Walls and Julie Gunderson - Reducing Trauma in Production with SLOs and Chaos Engineering
Conf42 Incident Management 2022, Plan for Unplanned Work: Game Days and Chaos Engineering
