Knowing what your team should do during an incident or outage is like taking a first aid course: it’s a tool to help you navigate what is often a high-stress situation. Having frameworks in place will help teams focus on the unknown parts of an incident.
What is an Incident?
Modern technical systems have all kinds of dependency relationships, software and hardware requirements, and complex architectures. When all of these components are connected up, wonderful things can happen, but so can errors. There are all sorts of faults or failures that can impact the users of your applications.
In our systems, we want to determine which errors or faults impact users, and to what extent. When users are negatively impacted by an unplanned disruption or degradation, we refer to this as an incident. There are other terms you might have heard, such as an outage, that your team might use depending on the severity of the errors.
Teams might classify incidents according to how disruptive they are to users. A minor incident might have no user impact or limited impact, or it might affect a minor feature rather than a main piece of functionality. A major incident, by contrast, might mean that most or all users are impacted, or that a primary feature is unavailable or degraded. How your team classifies incidents is up to your team and organization, and the classification might even vary by time of day or day of the week, depending on when users are most active.
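One way to make a classification like this concrete is to encode it as a simple rule set. The sketch below is purely illustrative: the `Impact` fields, thresholds, and severity labels are all assumptions to be replaced with your own organization's definitions.

```python
from dataclasses import dataclass

# Hypothetical impact signals; real inputs depend on your monitoring stack.
@dataclass
class Impact:
    users_affected_pct: float  # share of active users seeing errors (0-100)
    core_feature_down: bool    # is a primary feature unavailable?
    peak_hours: bool           # are users most active right now?

def classify(impact: Impact) -> str:
    """Map impact signals to a severity label.

    Thresholds here are illustrative, not prescriptive.
    """
    if impact.core_feature_down or impact.users_affected_pct >= 50:
        return "major"
    if impact.users_affected_pct >= 10 and impact.peak_hours:
        return "major"  # the same impact counts for more at peak times
    if impact.users_affected_pct > 0:
        return "minor"
    return "none"
```

Writing the rules down this way, even informally, makes it easier for responders to agree quickly on severity instead of debating it mid-incident.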
We also describe incidents as unplanned to differentiate them from disruptions that we might have made arrangements for, such as planned upgrades or installations that we expect to have some impact on user-facing applications. You’ve likely seen notifications alerting users that a service will be unavailable for some period of time on a specific date so they can plan accordingly.
Incidents are the things that pop up when we didn’t have those plans. A dependency becomes unreliable. The network is slow. A disk fails in the storage array. The reality is that these errors are common. Sometimes they don’t have a great impact on your services and sometimes they may cause your service to be unavailable to users. It’s a case of when and not if something will happen, so it’s important to be ready.
How your team will respond to an incident on your applications may vary from the methods other teams use. The key, though, is to plan ahead and practice. Incident Response is an organized approach to addressing and managing incidents. When our applications are failing, we want to solve the problem and get them back to normal, but we want to handle the situation in a way that minimizes the potential for more damage and reduces the recovery time.
You may have heard some terms around incidents, like mean time to acknowledge (MTTA) or mean time to resolve (MTTR). These are aggregate measures to help your team improve your response process. Your MTTA is how long it takes someone on the team to acknowledge that an error has occurred and an incident needs to be investigated. This time can vary depending on how long it takes to contact your team member, how you have your on-call schedules configured, how large your team is, and any number of other factors.
MTTR is about the next step: how long it takes to resolve the incident. Some incidents might be very easy; for example, maybe a service stops responding, and restarting it fixes the issue. Other incidents might require orchestration and communication among several teams that own different components in the ecosystem and therefore take longer to resolve. Some incidents might be caused by serious bugs in the application and not actually be resolved until new code is written and deployed.
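Because both metrics are simple averages over incident timestamps, they are easy to compute from your incident records. This sketch assumes each incident record carries detected, acknowledged, and resolved timestamps, and measures MTTR from detection to resolution; some teams measure it from acknowledgment instead, so adapt to your own convention.

```python
from datetime import datetime, timedelta

# Hypothetical incident records: (detected, acknowledged, resolved).
incidents = [
    (datetime(2023, 5, 1, 9, 0), datetime(2023, 5, 1, 9, 4),
     datetime(2023, 5, 1, 9, 30)),
    (datetime(2023, 5, 3, 22, 15), datetime(2023, 5, 3, 22, 25),
     datetime(2023, 5, 4, 1, 15)),
]

def mean(deltas):
    """Average a list of timedeltas."""
    return sum(deltas, timedelta()) / len(deltas)

# MTTA: detection -> acknowledgment; MTTR: detection -> resolution.
mtta = mean([ack - detected for detected, ack, _ in incidents])
mttr = mean([resolved - detected for detected, _, resolved in incidents])

print(f"MTTA: {mtta}")  # 7 minutes for this sample data
print(f"MTTR: {mttr}")  # 1 hour 45 minutes for this sample data
```

Tracking these averages over time, rather than fixating on any single incident, is what makes them useful for improving the response process.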
Planning for unplanned incidents? How does that make sense? For this coordination, we borrow from emergency response planners, specifically from a framework that establishes an Incident Commander to lead the response and cut through any bystander effect. We want to know in advance how our teams will coordinate. Will there be a conference call? Will there be a dedicated channel in our team chat? Who will update the customers?
During a major incident, the Incident Commander is completely in charge. They effectively outrank everyone else in the organization within the bounds of resolving the incident. An Incident Commander will bring everyone to a single communications channel, coordinate activities, delegate repair activities to appropriate personnel, and be the source of truth for information about the incident. An Incident Commander isn't there to triage the problem or resolve the incident themselves; they are the leader of the resolution efforts across all teams.
The key to this method of incident response is to plan ahead. We know incidents will happen at some point, so we can be ready. Incident Commanders are trained in advance and might practice their skills on smaller incidents so they’re ready for major incidents. We want teams to be prepared to respond and understand how the response will run, so running a full-scale response on smaller incidents helps the teams, too. Set up the conference call, join the incident channel in chat, know who the Incident Commander is and follow their instructions.
Here are some additional recommendations for designating key roles for incidents:
- The Deputy is the Incident Commander’s right hand and helps track the status of activities or requests.
- The Scribe compiles the notes from any text chat and conference calls for use in the post-incident review.
- The Subject Matter Experts are called in to investigate and fix the issue. These folks are usually members of the engineering teams who know the systems the best.
- The Customer Liaison is responsible for crafting messages for customers and posting them in the proper places. This could be an external status page or in the customer service applications for use by CS staff.
- The Internal Liaison keeps the rest of the organization informed internally. They might be on a separate conference call with executives or posting emails or chat updates that are only visible to employees.
When an incident happens, this team mobilizes to ensure that the incident gets handled appropriately, that the right people are involved, and that the right information is available.
Incident Commanders, Deputies, Scribes, and Liaisons are trained to lead the response and coordinate across all teams that might be involved. This includes communicating with your users. For all of this to run smoothly during an incident, teams create playbooks of common actions.
The most common action during a major incident is to declare the incident in the first place. You can think of it as teaching everyone how to pull the fire alarm in case of a fire. You might have a special command or channel in your chat program or a way to page an Incident Commander. Having some explicit process is important for everyone in your company to know in the event they spot something wrong with your systems.
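What "pulling the fire alarm" looks like depends entirely on your tooling, but the moving parts are usually the same: open a dedicated channel, page the Incident Commander rotation, and announce what happened. The sketch below is a hypothetical chat-command handler; the command shape, channel naming scheme, and the `"incident-commander"` rotation name are all assumptions, not a real chat or paging API.

```python
def declare_incident(reporter: str, summary: str, severity: str = "major") -> dict:
    """Return the actions a hypothetical '/incident' chat command would take.

    In a real integration, each field would drive a call to your chat
    and paging tools; here we just build the payload.
    """
    if severity not in {"minor", "major"}:
        raise ValueError(f"unknown severity: {severity}")
    # Derive a dedicated coordination channel from the summary.
    channel = "#inc-" + summary.lower().replace(" ", "-")[:30]
    return {
        "channel": channel,            # dedicated text channel for coordination
        "page": "incident-commander",  # on-call rotation to page (assumed name)
        "message": f"{reporter} declared a {severity} incident: {summary}",
    }
```

The specifics matter less than the fact that there is exactly one well-known entry point, so anyone in the company who spots a problem knows what to do.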
The next piece to coordinate is the internal communications channels. You might have the Incident Commander start a conference bridge. Companies that use Network Operations Centers (NOCs) might have a NOC technician start a call instead. It’s also a good idea to start a text channel for folks to coordinate and for the scribe to document activities into the channel history.
When an incident is declared, teams that might be needed during the investigation and resolution also need to know how to join the conference bridge and incident channels. They might have an on-call rotation, and the Incident Commander can ask for them to be paged into the incident, in which case they need to know where to join.
When the incident is resolved, have a process for explicitly declaring the incident over. The Incident Commander will assign someone on the team to manage the post-incident review, and that person will shepherd that process. The Liaisons will communicate with internal stakeholders and external users that the incident has been cleared.
Keep Users Informed
We’ve all probably experienced being on the user end of an incident for some service we use. Maybe you’ve opened an application on your mobile device and nothing loads. What do you do? How do you know if it’s just you or if the application is down?
Keeping users informed during an incident can be as simple as updating a status page with a generic message like “we are aware of an issue on Application X affecting users. Our teams are investigating.” Teams should have a centralized place to report their status updates during an incident and to broadcast the “all clear” when the incident is over. There are a number of services that function as centralized locations for these updates as well as social media accounts used for that purpose.
Your Liaisons will also keep your customer support or customer service teams informed of status updates. Your users might be more inclined to call or send an email to your support team than they are to look for a status update in other venues. Your support team can also proactively reach out to VIP customers after an incident to make sure they have seen their issues resolved.
Making Space in Team Workload
If all of this sounds like work, it definitely is! We know incidents will happen. We don’t know when they will happen, but when they do, we need to know how to handle them. We also need to make time for rest and repairs after impactful incidents.
Busy software teams want to be creating new features, fixing bugs, and shipping code as much as possible. Responding to incidents and answering pages can feel like it takes away from that work, even if the outcomes will make the applications better over time. We want to make sure that when folks are doing a lot of unplanned work, the expectations for their planned work are adjusted.
For teams with a regular on-call rotation, planning for unplanned work should include reducing the number of regular tasks the on-call team member will work on during their shift.
Teams without explicit on-call responsibilities might have to adjust workloads after an incident rather than be proactive. This isn’t unusual; common outcomes from unplanned incidents might be fixes to the application that need to be prioritized over the current work-in-progress, which is then delayed until after the fix is complete.
Technical services and applications are just getting more and more complex every day. As we build cool things that people want to use, we’ll occasionally have errors that pop up and we need to be ready to handle them. While we can’t plan for every possible error case, we can make plans for how we will respond and practice those plans when the stakes are low.
We’re all human, and we’re all learning how to manage and maintain systems in a constantly changing environment.
Article written by Mandi Walls, DevOps Advocate