Production Support turns to Site Reliability Engineering

This week I have talked about a couple of books that have impacted my work and career. Please go back and read the earlier posts about the two that have shaped me into the engineer and manager I am today.

The book that really gave a gut punch to my specific day-to-day capabilities was Site Reliability Engineering: How Google Runs Production Systems.

What became most obvious from reading this book was not that I should copy Google and implement everything they were doing, but that our organization needed to undertake intangible, cultural changes. Nobody but Google can do what Google does, so why should others try to replicate it exactly? Rather, it was more important to strive for a few key best practices and ideals that would make my workspace simpler.

We have had incidents in the past that surely led to lost revenue or service-level impacts. This book presented a few ideas for building a response and troubleshooting framework, which we have adapted to fit our business and team. Having a playbook that someone can follow before having to find “that one person who knows the system” has led to fewer instances of burnout. It also helped us identify the issues that were occurring most frequently and create a method for reducing the errors and failures around them.

The SRE book has also helped shape our internal focus on reducing the complexity of our systems and creating processes and systems that are defined by high recoverability rather than high availability. We use systems and tools that enhance our ability to respond quickly to events and to rebuild and restore if necessary, rather than spending time on the maintenance and upkeep that complex, highly available servers and applications require.

And beyond all of the direct changes we have made to systems and processes, we have fostered a culture of learning and improvement. Mistakes went from being barely tolerated to being learning opportunities that help the organization respond better each time a problem occurs. Why lose out on expensive lessons when that kind of on-the-job training is invaluable in shaping excellent engineers? Creating a space where a team can safely fail forward has led to some of the greatest improvements our team has seen.
