Thanks to the crew at Lean Agile KC for putting together an exceptional LAKC17 this year. I really enjoyed the interaction with everyone curious about operations practices at VML. It was a pleasure sharing how we fixed our DevOps culture anti-patterns by learning on our internal portal, and taking a bit of time to go over some additional learning and practices in an Open Session was awesome. Thanks to everyone who continued the discussion. See below for the SlideShare and my notes, which cover some of the things I spoke about that are not directly in the slides.
Do you have a company intranet, or is it just a homepage that you replace with Google because that’s what people do? Do you have useful, functional tools, or is it just something that wastes your time? Oftentimes, internal company portals and intranets are relegated to the trash heaps of technical debt and enforced homepages. So how did a “hand-crafted artisanal server” with code “carefully, expertly, and manually placed atop the multi-tiered system” turn into a bastion of usability, functionality, and a crowning achievement of DevOps processes within an organization? The journey into these methods shaped the architectural decisions of our internal systems, and the portal shaped the processes we used to mature our methods. Our portal has become a Passport into the company communications, where any part of the business can learn more about any other area. In our travels, successes, and failures, we educate on the following topic
We’re not IT. Our team is called TechOps, and we don’t do laptops, email, phone systems, etc. We focus specifically on client deliverables and consulting with their ops teams. We work directly with client development and operations teams to build out their enterprise CMS capabilities, web platforms, and mobile platforms. We work with our developers to build out the capabilities to create and deploy new sites, apps, and whatever else our clients say they need, from backend services to frontend beautification.
I’ve worked on PHP, .NET, and Java platforms across multiple Fortune 500 clients that VML supports, and I learned most of my enterprise capabilities on the fly and internally within our dev systems.
So what does a snowflake server look like? We built out the complex diagram on slide 4 at a local KC datacenter. This isn’t “cloud” or anything like that, just plain ol’ virtual machines that we were supporting manually by installing the operating system and the applications on top of them. Incidents in this environment could take between 8 and 12 hours to recover from, including late nights and weekends. And the big one took nearly 100 hours. Uptime was trending downward, and the outages always seemed to hit at the worst possible moments of need: after big announcements, company meetings, etc.
In slide 5, the simple solution concerned some people. But we figured out quickly that recovering from common failures (a failed database deploy, an application error, a failed deployment, issues with login) took significantly less time. For example, we gave the development team a deploy button with quick rollback functionality, so a bad deployment took less than 15 minutes to recover from. Database recovery took less than 30 minutes. Restarting an application took less than 15 minutes and could be automated, with the development team capable of recognizing when a restart was required and triggering it themselves. Recovery became so quick that the development team could recognize common failures and resolve them to prevent them from happening again.
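To make the "deploy button with quick rollback" idea concrete, here is a minimal sketch in Python. It is not our actual tooling: the `Deployer` class, its method names, and the `health_check` callback are all hypothetical, and a real pipeline would move code and restart services rather than track release names in a list.

```python
class Deployer:
    """Hypothetical sketch: remembers releases so the previous one can be restored."""

    def __init__(self):
        # Releases in the order they went live; the last entry is the live one.
        self.history = []

    @property
    def live(self):
        """Return the currently live release, or None if nothing is deployed."""
        return self.history[-1] if self.history else None

    def deploy(self, release, health_check):
        """Make `release` live; roll back automatically if the health check fails."""
        self.history.append(release)
        if not health_check(release):
            self.rollback()
            return False
        return True

    def rollback(self):
        """Drop the live release and return whatever was live before it."""
        if not self.history:
            raise RuntimeError("nothing has been deployed yet")
        self.history.pop()
        return self.live
```

The point of the design is that rollback is a single, cheap step the developers can trigger themselves, which is what keeps recovery time short.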
In 2012, our internal application development team created an amalgamation of Drupal (PHP), Node, and other custom development to make a sort of blogging platform to share information, create groups based on teams, communicate company updates, and explore lots of other potential ideas, instead of a generic static site with limited information and capabilities. They did this with limited input from the people who actually used the site or the teams that would have to run the systems to keep it online. It wasn’t until it was ready to launch that my Operations team was notified that we needed to be “prod ready” in weeks. This meant creating virtual machines and installing software that was one-off from pretty much the rest of our environments. Our initial launch was being prepared to coincide with the annual company meeting. So basically, we had to scramble to get it pushed out the door without search capabilities and with about half of the functionality that the team wanted.
But at the launch, every employee could sign in with their network login and post learnings and team updates, using a feature that was not necessarily at the top of the development team’s list. It was so successful that we spent the next two weeks restarting services and providing a lot of extra support. This model was clearly unsustainable, so we had to figure out how to get the team the information it needed to make proper development decisions, and how to get the work they were doing out to production fast.
So we sat with the team to figure out what they were actually doing in their development workflow. Whereas before we had to get a phone call, ticket, or whatever to have Ops log in and update the code, we wanted to give that team the power to do it themselves. We created a development and staging server, a fully automated deployment capability, and a new monitoring solution (New Relic) that actually let them see how the code was performing rather than just server stats. Problem solved! High fives on the way to the parking lot!
In 2013, Passport was really popular, well known and used, and had basically become one of the most critical internal tools in the organization. Teams were using it as an index or source of information for projects, to the point that they couldn’t do some of their work without Passport available. Downtime was really problematic: our team would get slammed, and planned work was interrupted because we had to take time away from client-billable work to support our internal tools. It was taking up more and more time out of our days. We had to figure out how to keep that thing running with as few interruptions as possible.
We were robbing Peter to pay Paul. Kicking the can down the road. One of our engineers had come up with an engineering solution to make the site highly available. Failover-capable load balancers. Multiple application servers. A failover/high availability shared file system through NFS. A high availability cluster of MySQL databases. And as long as we had the one person on the team who understood how all of these things worked, we were all OK. Unless he was sick. Or took a vacation. You can only take advantage of a workaholic for so long before something bad or crazy happens.
We limp along, with every incident taking up more and more of our time. We gain enough experience learning about each mistake and think we’ve got it handled. Each incident, however, takes longer to deal with, and since we’ve now got multiple people trained to sort of solve the problem, nobody gets the whole picture of what’s wrong. Developers add to the complexity of the system as more tools are integrated into it. We migrate to a single sign-on, centralized authentication server that allows more tools to be plugged into the ecosystem.
We had 3 people on the team dedicated to “internal projects”, mainly to support this application. That’s a lot of non-billable hours for an agency that focuses a lot on delivering direct client work. Unrelated, underlying infrastructure problems were completely unforeseen. Most other systems were known by lots of people on the team, with no major issues. But Passport? It was kaput. 2015, 4th of July weekend. An unrelated data integrity issue on the SAN leaves the systems in a (nearly) unrecoverable state. Nobody can follow the runbooks to re-install and reconfigure everything by hand and get the system back to a running state. So instead, we agree something must change. The whole team collaborates on a solution that is simple, has great backups, and is highly recoverable. All under the watchful eye of the entire executive team and senior leadership team at our org, since basically everything that isn’t email that they use to get stuff done is down. The big bang of bad holiday weekends. It was almost better that it happened over a 3-day holiday weekend. Everything recovered quite nicely, eventually.
Then the CEO is on the email chain and says “NICE WORK GUY WHO WORKED 30 HOURS STRAIGHT TAKE THE DAY OFF”. So we decided not to wait for our Crashed Admin to wake back up again. Some things we still had to learn along the way, but the simplicity of the system meant one huge thing: nobody had to learn new tech! The entire team knew the Apache web server (no longer an nginx and/or HAProxy combination). The entire team knew PHP as the application server tech. The entire team knew MySQL (no Percona, no weird clustering things). The entire team knew the underlying infrastructure. We learned the rest on the fly because nothing was too crazy.
We added new deploy steps and functions and more automated tasks that the developers could run using Jenkins on their own time without our input. Support for this scales down to half time for 2 different people on the TechOps team. Eventually, about a year later, it scales down to around 0.5 of a full-time employee.
Service level agreements will eventually be a thing of the past, and when it comes to internal systems rather than client-facing systems and client project systems, your Ops and IT teams should be focused on MTTR (mean time to recovery) capabilities. That means using virtualization or cloud orchestration to bring servers online in an automated, repeatable way. It also means using tools like Puppet and/or Chef to install and configure your applications without requiring (much) manual intervention.
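If you want to track MTTR rather than uptime alone, the arithmetic is simple: average the time between when each incident started and when service was restored. Here is a minimal sketch in Python; the function name and the shape of the incident data (pairs of start and recovery timestamps) are my own assumptions, not something pulled from our tooling.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (started, recovered) datetime pairs.

    `incidents` is a hypothetical data shape chosen for this sketch.
    """
    if not incidents:
        raise ValueError("need at least one incident to compute MTTR")
    # Sum the individual recovery durations, starting from a zero timedelta.
    total = sum((recovered - started for started, recovered in incidents), timedelta())
    return total / len(incidents)
```

Feeding it one 15-minute incident and one 45-minute incident gives a mean of 30 minutes. The useful part is watching that number over time as automation replaces manual recovery steps.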
The question here is, do you care for, feed, and do whatever you can to maintain the existing system, or do you have a recoverability capability that allows your team to build completely anew and pull in backups to recover quickly? Again, the focus is on recovery, even if that means manual processes that resolve things in a quick, efficient way. Even the least efficient manual way we could come up with, restoring all servers and data from offsite backup, probably only takes about 4 hours. This would completely recover the site to its state as of the previous successful midnight backup.
We moved to a centralized source of authority for the internal directory, a single SSO application. This also improved our security workflow, since disabling a user once disabled them across all systems. We also created an area/sandbox where people can dump their notes and their knowledge, a space to basically pour out everything they know about a system so that another person might be able to learn it and take on a new role.
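The "disable once, disabled everywhere" effect falls out of every application asking the same directory instead of keeping its own user list. A toy sketch in Python follows; the `Directory` and `App` classes are hypothetical stand-ins for illustration, not our SSO product or its API.

```python
class Directory:
    """Hypothetical central source of authority for user accounts."""

    def __init__(self):
        self._active = {}  # username -> True if the account is enabled

    def add_user(self, username):
        self._active[username] = True

    def disable_user(self, username):
        if username not in self._active:
            raise KeyError(username)
        self._active[username] = False

    def is_active(self, username):
        # Unknown users are treated as inactive.
        return self._active.get(username, False)


class App:
    """Hypothetical internal tool that delegates login checks to the directory."""

    def __init__(self, name, directory):
        self.name = name
        self.directory = directory

    def can_login(self, username):
        # No local user table: the directory is the only source of truth.
        return self.directory.is_active(username)
```

With two `App` instances sharing one `Directory`, a single `disable_user` call locks the account out of both.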
We figured out very early on in our efforts that automation buys back time to work on improving things. That time snowballs and creates the opportunity for experimentation and discovery. We use this to implement things like Puppet and take ideas directly from dev workflows. We have tickets that are supported by both the dev and ops teams, and we commit configuration changes that are reviewed by multiple team members, then approved, tested, and validated before the ticket is closed.