On-Call Shifts

Going On-Call

“My greatest fear is getting a high priority page while on-call in a packed Broadway theater, and PagerDuty plays a progressively louder and louder quacking siren, and Alexander Hamilton himself comes up to slap me and my phone out of my hand, and walks me out of the theater” via Twitter

Going on-call for the first time can seem intimidating. But with proper preparation and practice, incident responders can have a positive experience and gain hands-on knowledge they can use to build more resilient production services. This section details how to prepare for going on-call and what to do during your shift in the on-call rotation.

Before Going On-Call

Relax. Breathe. For the duration of your on-call shift, it's your responsibility to acknowledge notifications whenever they occur. However, it's also essential to remember that life happens. If you know in advance that you may be unavailable, work with your teammates to schedule around these events and ensure coverage. If you can't respond when a notification arrives, for whatever reason, know that escalation policies exist to ensure that someone else will eventually respond.

Expectations

If you're on-call, you're expected to be available during your shift to the best of your ability. Beyond acknowledging notifications, an on-call responder is expected to have the skills necessary to triage an incident and determine an appropriate course of action. Sometimes, the appropriate course of action is determining that you're unable to resolve the incident on your own. It is also expected that responders will page additional engineers, as needed, in order to resolve an incident.

Be Prepared

Proper preparation reduces stress and anxiety when going on-call. It also helps mitigate chaos if and when an incident occurs. The first step to going on-call is to understand exactly what you will need so that you aren't fumbling around at 3 a.m. when a notification comes in. Use this checklist to help you get started:

Notification Preferences

Notification preferences determine how and when you will be notified when an incident occurs. Alert preferences can typically be set to notify you in a variety of ways including via SMS, phone call, email, or app push notifications. Be sure that any method you've chosen is one that will always be available and will get your attention throughout your on-call shift.

Customized Phone Settings

Incidents don't care about your normal work hours. You should take extra steps to ensure that incident notifications reach you at any hour.

Notification Staggering

If all of the notification methods you've chosen alert you at the same time, you have an increased risk of missing all of them, for instance when you are in bed, sound asleep. Staggering notifications can help get your attention successfully. For example, start with an SMS notification. One minute later, send an app push notification. One minute after that, call your phone. And so on.
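The staggering described above can be sketched as a small schedule. This is a minimal illustration, not the configuration format of any particular paging tool: `NotificationRule`, `staggered_rules`, and `due_channels` are hypothetical names, and the one-minute interval simply mirrors the example in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NotificationRule:
    delay_minutes: int  # minutes after the incident is assigned to you
    channel: str        # e.g. "sms", "push", "phone" (illustrative labels)


def staggered_rules(channels, interval_minutes=1):
    """Build one rule per channel, each firing interval_minutes after the previous one."""
    return [
        NotificationRule(delay_minutes=i * interval_minutes, channel=channel)
        for i, channel in enumerate(channels)
    ]


def due_channels(rules, minutes_since_assigned, acknowledged=False):
    """Return the channels that should have fired by now; nothing once acknowledged."""
    if acknowledged:
        return []
    return [r.channel for r in rules if r.delay_minutes <= minutes_since_assigned]


if __name__ == "__main__":
    # SMS first, then a push notification, then a phone call.
    for rule in staggered_rules(["sms", "push", "phone"]):
        print(f"+{rule.delay_minutes} min: {rule.channel}")
```

Ordering the channels from least to most intrusive means a quick acknowledgement spares you the phone call, while an unanswered page keeps getting louder.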

During On-Call

You've started your on-call shift. Now what? You wake up at 2 a.m. to the sound of a foghorn as your phone alerts you to an incident. Your eyes are foggy and your heart is racing. It could be a terrible situation. But it won't be because you used the advice above to prepare ahead of time. You acknowledge the notification. Now we get to work!

Triage

During the triage process, responders are expected to evaluate the situation at hand. The following non-exhaustive list may be useful as a set of basic sanity checks when you respond to an incident and begin your triage.

Taking Action

When you have a potential solution, your task is to return the service to full operational status. You have the authority to dive right in to fix what needs to be fixed, involve other teammates as needed, and escalate to an appropriate severity level if necessary. You also have the authority to delay additional work for non-time-sensitive and non-impacting issues.

While resolving a minor incident, you should keep basic notes when practical. It is vital to share information with your team, such as the symptoms that initially triggered the incident, the investigative steps you took, and the actions that successfully resolved the incident. These notes can be immensely helpful in keeping your runbooks up-to-date. You should strive to continuously refactor and improve your team's knowledge base and documentation. Add redundant links and pointers if your mental model of the docs or codebase doesn't match what you discover during the course of the incident.
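One way to keep such notes consistent is a small structured record with the three parts named above: symptoms, investigative steps, and resolving actions. The sketch below is only an illustration; `IncidentNotes` and its field names are hypothetical, and the sample entries are invented placeholders rather than a real incident.

```python
from dataclasses import dataclass, field


@dataclass
class IncidentNotes:
    title: str
    symptoms: list[str] = field(default_factory=list)       # what triggered the incident
    investigation: list[str] = field(default_factory=list)  # steps taken while triaging
    resolution: list[str] = field(default_factory=list)     # actions that fixed it

    def to_text(self) -> str:
        """Render the notes as plain text that can be pasted into a runbook or wiki."""
        sections = [
            ("Symptoms", self.symptoms),
            ("Investigative steps", self.investigation),
            ("Resolution", self.resolution),
        ]
        lines = [self.title]
        for heading, items in sections:
            lines.append("")
            lines.append(f"{heading}:")
            # Make gaps visible instead of silently omitting an empty section.
            lines.extend(f"- {item}" for item in (items or ["(none recorded)"]))
        return "\n".join(lines)


if __name__ == "__main__":
    # Placeholder content for illustration only.
    notes = IncidentNotes(
        title="Example: elevated error rate on a hypothetical service",
        symptoms=["Error-rate alert fired"],
        investigation=["Checked recent deploys", "Reviewed service logs"],
        resolution=["Rolled back the latest deploy"],
    )
    print(notes.to_text())
```

Rendering empty sections as "(none recorded)" makes it obvious during handoff which part of the story still needs to be written down.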

If resolving a minor incident is beyond what your team is responsible for or able to do, you may need to initiate a major incident.

After On-Call

At the end of an on-call shift, responders should have an On-Call Review Meeting. Practicing a proper on-call handoff gives the team an opportunity to learn directly from the responder(s) coming off the previous shift. The meeting allows your team to catch problems before they become trends or cause significant negative impact.

Are there noisy alerts that need to be cut down to reduce alert fatigue? Is this service needlessly sending notifications for non-actionable events? The primary purpose of this review is to understand the on-call load for this shift, identify any sources of pain, transfer knowledge between responders, and plan improvements for future on-call shifts.
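To ground the review in numbers, the shift's alerts can be tallied per service along with the share that required no action. This is a rough sketch under assumptions: each alert is represented as a dict with hypothetical `service` and `actionable` keys (the responder's own judgment recorded during the shift), and the 0.5 threshold is an arbitrary starting point, not a recommended value.

```python
from collections import Counter


def summarize_shift(alerts):
    """Count alerts per service and compute the fraction that were non-actionable."""
    total = Counter()
    non_actionable = Counter()
    for alert in alerts:
        total[alert["service"]] += 1
        if not alert["actionable"]:
            non_actionable[alert["service"]] += 1
    return {
        service: {
            "total": count,
            "non_actionable": non_actionable[service],
            "noise_ratio": non_actionable[service] / count,
        }
        for service, count in total.items()
    }


def noisy_services(summary, threshold=0.5):
    """Return services whose noise ratio meets or exceeds the threshold, sorted by name."""
    return sorted(s for s, stats in summary.items() if stats["noise_ratio"] >= threshold)


if __name__ == "__main__":
    # Invented sample data for illustration only.
    shift_alerts = [
        {"service": "api", "actionable": True},
        {"service": "api", "actionable": False},
        {"service": "db", "actionable": False},
    ]
    summary = summarize_shift(shift_alerts)
    print(summary)
    print("Candidates for alert tuning:", noisy_services(summary))
```

Services that show up in the noisy list are natural agenda items for the review meeting, since they point at alerts that may need tuning or removal.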

On-Call Etiquette

A bit of extra consideration can make the on-call experience better for everyone involved, so we've provided a few helpful on-call etiquette tips below: