Dev Up St. Louis – Troubleshooting Production and Incident Response

What a great group of attendees, and the organizers knocked it out of the park at this year’s dev up Conference. I’ve got the slides embedded after the jump, and below that are a few notes on things I discussed that are not on the slides. Please feel free to reach out to me on Twitter if you have any questions or just want to talk about these topics further.

As I pointed out, monitoring, observability, and other ways of finding out that something is wrong are far superior to getting an angry phone call directly from your customers. That is a topic for another day, one I may touch on in future talks.

I brought up “hero culture” pretty early in the talk, because it’s an important point. People who want to save the day and get a pat on the back while showing “how much they know” can cause burnout, knowledge hoarding, and a demoralized team. One of the key topics I kept returning to in the talk was making what we do repeatable and documenting what we learn. A learning culture leads to fewer incidents in the future.

We learned that the stress of dealing with incidents is detrimental to the communications surrounding serious problems. It’s always best to have a “communicator” who is responsible for blocking for the people doing the hands-on-keyboard work. While it may sound awesome on the phone when someone tells a client exec “Do you want me to tell you about the problem or fix the problem?”, it’s not as much fun the next day when you are getting scolded by your bosses.

I also brought up the importance of understanding what is wrong before taking action, rather than simply doing what is asked of you. When someone says “just restart the service” instead of describing the problem they are facing, you can introduce larger issues into a problem that was already serious enough to be escalated to you. So if you do not understand the environment, architecture, or tech stack, find the person who does.