Figuring out how to mobilize and organize multiple teams to respond to a major incident can take some trial and error. Unfortunately, errors can hurt your business and lead to longer resolution times.
You can mobilize responders and keep stakeholders informed using PagerDuty’s Incident Workflows, but how do you decide what actions to add to your workflow? We’ll provide some examples here based on how PagerDuty does PagerDuty for Major Incident Response.
Anyone at PagerDuty can declare a Major Incident, based on an existing incident. Often, a Major Incident is declared when a single team investigates an issue on its services and finds that the behavior affects or originates in other services, so other teams need to investigate too. Major Incidents can also be declared directly when someone finds a problem or a customer reports one.
Some incidents are more important than others. Your organization should work toward a universal understanding of what constitutes a major incident for your environment. An incident might be classified as “major” if it impacts a certain number of customers, reaches a predetermined duration, affects an important function, or meets any number of other criteria. You can use incident priorities to classify incidents within your PagerDuty account, and create different workflows based on your defined Priority or Severity categories.
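To make criteria like these concrete, it can help to write them down as a simple rule. The following is a minimal sketch only: the thresholds, the function names, and the `Incident` fields are hypothetical values chosen for illustration, not PagerDuty settings or API objects.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds; pick values that fit your own environment.
CUSTOMER_THRESHOLD = 100
DURATION_THRESHOLD_MINUTES = 30
CRITICAL_FUNCTIONS = frozenset({"login", "checkout"})


@dataclass
class Incident:
    """Illustrative incident record, not a PagerDuty API object."""
    customers_impacted: int
    duration_minutes: int
    affected_functions: set = field(default_factory=set)


def is_major(incident: Incident) -> bool:
    """Return True if any one of the example criteria is met."""
    return (
        incident.customers_impacted >= CUSTOMER_THRESHOLD
        or incident.duration_minutes >= DURATION_THRESHOLD_MINUTES
        or bool(incident.affected_functions & CRITICAL_FUNCTIONS)
    )
```

Because a single matching criterion is enough, the rule leans toward treating a borderline incident as major.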
It’s OK to be wrong about the priority of an incident while it’s happening, but it’s better to over-respond than to under-respond. We call this principle “never hesitate to escalate.”
Getting responders to the right place for a major incident is key to keeping your response and resolution times low. Responders shouldn’t be confused about where the activities are taking place or how to find the right channel in your chat application.
For major incidents, using a single, permanent channel to coordinate the response will help all of the responders in your organization know exactly where to go when they are added to an incident. Think of this channel as your predetermined meeting place for an emergency drill.
In PagerDuty, you can add a new channel to an incident using Incident Workflows. We have an Incident Workflow titled “Major Incident Workflow” for exactly this purpose. When someone determines that an incident has reached SEV-2 or SEV-1 based on our criteria, a responder can run the Incident Workflow, and several actions take place:
Additional responders are added to the incident, including our customer liaisons.
Stakeholders are added to the incident. These include folks who have opted in to being notified, as well as leadership, customer-facing teams, and others who might need to stay informed of an incident, but who won’t be actively working on the response.
A Slack channel is added. This is our major incident room, and it is the same channel for every Major Incident. Other incidents might have automatically-generated channels, but Major Incidents are always handled in this channel.
The predictability of this method helps major incident responders get to the right place fast.
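The three actions above can be sketched as a small model. This is an illustration under stated assumptions, not PagerDuty's implementation or API: the channel name, the `Incident` fields, and the function name are all hypothetical. In practice, you configure these actions in the Incident Workflows builder.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAJOR_SEVERITIES = frozenset({"SEV-1", "SEV-2"})
MAJOR_INCIDENT_CHANNEL = "#major-incident-room"  # hypothetical channel name


@dataclass
class Incident:
    """Illustrative incident record, not a PagerDuty API object."""
    title: str
    severity: str
    responders: List[str] = field(default_factory=list)
    stakeholders: List[str] = field(default_factory=list)
    channel: Optional[str] = None


def run_major_incident_workflow(incident, extra_responders, stakeholders):
    """Apply the three workflow actions to a SEV-1 or SEV-2 incident."""
    if incident.severity not in MAJOR_SEVERITIES:
        raise ValueError(f"{incident.severity} is not a major severity")
    # 1. Add additional responders, skipping anyone already on the incident.
    for person in extra_responders:
        if person not in incident.responders:
            incident.responders.append(person)
    # 2. Add stakeholders who follow along without working on the response.
    for person in stakeholders:
        if person not in incident.stakeholders:
            incident.stakeholders.append(person)
    # 3. Attach the same permanent channel used for every major incident.
    incident.channel = MAJOR_INCIDENT_CHANNEL
    return incident
```

Note that the channel is a constant rather than something derived from the incident. That is the point of the design: every major incident lands in the same place.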
Not everyone who wants to know about an incident wants to be on the always-notify stakeholder list included in the Incident Workflow above. Some folks in the organization might only be interested in an incident if it touches on their services, or a group of services. There also might just be curious folks who want to know what is going on. That’s great! But they probably aren’t going to get much helpful information hanging out in the response channel. For these folks, we have a read-only incident-updates channel.
While a Major Incident is running, any status updates added to the incident are posted to this Slack channel. When you set up the PagerDuty integration with Slack, you can connect channels like this one and decide what kind of updates each channel should receive. You can have any number of these connections for your Slack workspace.
Like the major incident room, the incident updates channel is always used for status updates to major incidents, so stakeholders can always find it.
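One way to picture these channel connections is as a routing table from update types to channels. The sketch below is a simplified model only: the update type names and the major incident room channel name are hypothetical, and the real options are whatever your Slack integration settings offer.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ChannelConnection:
    """Illustrative pairing of a channel with the updates it receives."""
    channel: str
    update_types: FrozenSet[str]


# Hypothetical connections: the response room sees everything, while the
# read-only updates channel sees only status updates and resolution.
CONNECTIONS = [
    ChannelConnection(
        "#major-incident-room",
        frozenset({"responder_added", "note", "status_update", "resolved"}),
    ),
    ChannelConnection(
        "#incident-updates",
        frozenset({"status_update", "resolved"}),
    ),
]


def channels_for(update_type: str, connections=CONNECTIONS) -> List[str]:
    """Return every channel that should receive the given update type."""
    return [c.channel for c in connections if update_type in c.update_types]
```

With this table, a status update reaches both channels, while a responder's working note stays in the response room.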
When a major incident is resolved, there are a number of activities that responders need to complete, including creating documents and setting a time for the post-incident review meeting.
To help teams coordinate, we have a primary channel, incident-followup. Responders can post the draft of the postmortem report, any work tickets, and other information here for everyone to find. The team responsible for the post-incident review can then create an incident-specific channel to discuss the incident and deliverables. That channel is archived after the incident review is completed.
The mixture of permanent and short-lived channels allows responders to mobilize effectively and collaborate for the duration of the incident and post-incident activities.
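The difference between the two kinds of channel comes down to lifecycle, which a short sketch can make explicit. Everything here is an assumption for illustration: the naming scheme for review channels, the permanent channel names other than the ones mentioned above, and the helper functions are hypothetical and do not call Slack or PagerDuty.

```python
from dataclasses import dataclass

# Hypothetical names for the permanent channels described in this post.
PERMANENT_CHANNELS = frozenset(
    {"#major-incident-room", "#incident-updates", "#incident-followup"}
)


@dataclass
class Channel:
    name: str
    archived: bool = False


def review_channel_for(incident_id: str) -> Channel:
    """Create a short-lived channel for one incident's review."""
    return Channel(name=f"#review-{incident_id.lower()}")


def complete_review(channel: Channel) -> Channel:
    """Archive an incident-specific channel once its review is done."""
    if channel.name in PERMANENT_CHANNELS:
        raise ValueError("permanent channels are never archived")
    channel.archived = True
    return channel
```

The guard in `complete_review` reflects the rule that only incident-specific channels are retired, so the well-known meeting places always remain available.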
Using static channels accelerates mobilization for your most important incidents, gives stakeholders a place to follow along for updates, and provides information for everyone in the organization.
For more on how to add Slack or Microsoft Teams to your incident response workflows, see our Knowledge Base. To learn more about our Incident Response process, check out our Ops Guide. Join our community to connect with and learn from other PagerDuty users.