That was the Facebook status of a former co-worker about a week ago. I happened to be awake and online at the same time (he’s on the East Coast; it was only midnight here) and immediately responded, “The better question is, ‘Why?’”
Deployment. Production Push. Go Live. Rollout. Whatever you call the process of turning your development codebase into a live, production application, I sincerely hope you’re not living in the Stone Age and doing it in the middle of the night under the guise of avoiding customer impact. Unfortunately, if my past experiences, and the experiences of many I’ve spoken to, are the norm, you very likely are. If your strategy to avoid customer interruption is based solely on trying to avoid your customers, you’re setting yourself up for even more headaches and long-term failure.
The motivations for these overnight deployments are suspect at best. The claim is that by avoiding the daylight hours, fewer customers will be impacted by the rollout. Problem 1: You presume there will be problems that impact availability. Either you have no confidence in your code quality, or (or maybe and) you have no confidence in your infrastructure and deployment process. If you lack confidence that your new system is ready for production, you probably shouldn’t be pushing it to production! If you think your servers aren’t ready or that your deployment process stinks, take the extra time now to improve them. I’ve seen great companies with absolutely terrible build and deployment processes that have nightmares getting code into production. At the same time, these same companies refuse to devote more than a single person (or maybe only part of his or her time) to improving that process.
Perhaps even more suspect in the reasoning is the notion that because the process is complicated and volatile, it should be done in off-hours. Problem 2: You’ve got a complicated process and you’re sending over-tired, over-worked people to deal with it. Imagine, for a moment, that your team is rolling out an update to a service that monitors life-support systems in hospitals. Do you want tired, stressed, and unmotivated people working on the process? If deployment is one of your most complicated procedures, why are you sending your people at their worst to handle it?
Earlier, I mentioned that teams are simply attempting to avoid customers by deploying overnight. Aside from this being a futile goal for any global business, it likely suggests you’re missing two things. Problem 3: You have no means of doing a phased rollout or a quick rollback. Deployments in this world are likely one-way affairs with a lot of time devoted to pushing the new code and no clean way to revert those changes quickly if something goes south. Make no mistake, I’m not suggesting that deployments are easy (or even that they should be). Nor am I suggesting that everything should always be perfect when you deploy code. However, attempting to sneak code out in the middle of the night is hardly meeting the challenge head-on.
Compounding all of these issues is the fact that some problems only become visible once a certain scale is reached. By hiding from your customers during deployment, you may also be burying your head in the sand with regard to these potential bugs.
Plan For Success; React Quickly to Obstacles
There are several techniques I’ve seen employed that have greatly improved the deployment process, to the point that teams feel comfortable deploying while the sun is up.
Involve your QA team early so they have a full understanding of the feature and how to test it. Foster a partnership between QA engineers and developers so they work together to understand the full impact of the feature and ensure that your testing, especially regression testing, is thorough enough to develop high confidence in your quality. Always remember that the later in your process bugs are discovered and fixed, the more expensive they become (especially if they make it all the way to production). Incentivize your people around delivering quality early, not finding bugs late.
Devote time and energy to your deployment processes; don’t shunt them off onto one guy working in isolation. Establish an owner; but, make sure this person is integrated with the rest of the development team and understands their pain points and needs. Automate complicated manual processes to prevent mistakes (you know, the type of mistakes that happen when a tired engineer is sitting at his or her console at 3:00 am).
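As a sketch of that kind of automation (the step commands below are placeholders, not any particular tool), even a small script that runs the deploy checklist in order and falls back to a rollback step on the first failure removes the 3:00 am guesswork:

```python
# Minimal deploy-runner sketch. Each step either succeeds or the script
# stops and runs the rollback steps; no tired human improvises mid-deploy.
# The commands here are illustrative stand-ins for real build/push steps.
import subprocess
import sys

STEPS = [
    ["echo", "build artifacts"],
    ["echo", "upload to servers"],
    ["echo", "restart services"],
]
ROLLBACK = [
    ["echo", "restore previous release"],
]

def run(steps):
    """Run each command in order; stop and report failure on the first
    non-zero exit code."""
    for cmd in steps:
        if subprocess.run(cmd).returncode != 0:
            return False
    return True

if not run(STEPS):
    run(ROLLBACK)   # automated recovery instead of 3:00 am heroics
    sys.exit(1)
```

The point isn’t the specific commands; it’s that the checklist lives in code, so it runs the same way at noon as it would at midnight.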
Decouple various parts of your system so they can be deployed and rolled back independently. There’s no sense in having to take your checkout process offline simply because of a regression in your unrelated public API. This concept is often easier said than done; but it’s incredibly important and worth your team’s investment in time.
Use feature kill-switches aggressively; allow certain parts of your application to be turned on and off via runtime configuration. Deprecate old functionality rather than destroying it in your codebase. Allow the feature switch to revert to the old code paths without forcing a code rollback or additional push. Once you’re confident in your new functionality, the deprecated paths can be removed in your next deployment. In cases where this concept is prohibitively difficult or even impossible, modularize the code containing that feature and run both so you can quickly switch back to the old code if necessary.
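As a rough illustration of that idea (the class, config path, and flag names below are all invented for this sketch), a switch that re-reads runtime configuration on each check lets an operator flip back to the old code path without a rollback or a new push:

```python
# Hypothetical feature-switch helper; not any specific library.
import json

def legacy_checkout(cart):
    # Deprecated path, kept in the codebase until the next deployment.
    return ("v1", cart)

def new_checkout(cart):
    # New path, guarded by the switch.
    return ("v2", cart)

class FeatureSwitches:
    """Reads flag states from a runtime config file so features can be
    turned on and off without deploying code."""

    def __init__(self, config_path):
        self.config_path = config_path

    def is_enabled(self, name, default=False):
        # Re-read on every check so an operator can flip a flag live.
        try:
            with open(self.config_path) as f:
                flags = json.load(f)
            return bool(flags.get(name, default))
        except (OSError, ValueError):
            return default  # fail safe: fall back to the old path

switches = FeatureSwitches("/etc/myapp/flags.json")  # hypothetical path

def checkout(cart):
    if switches.is_enabled("new_checkout"):
        return new_checkout(cart)
    return legacy_checkout(cart)
```

Because the old path stays callable, turning the flag off is an instant revert; deleting `legacy_checkout` waits for the next scheduled deployment, once confidence in the new path is established.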
Avoid unnecessary deployments. When I talked to my friend mentioned above about his 3:00 am deployment, he told me they had to do it to end a contest their website had been running. I’m sorry; that’s just not a reason to push new code to production. Feature switches targeting alternate code paths could have solved that problem. Moreover, they could have been set on a timer such that the moment the contest ended, the entry form was disabled. Even following every suggestion here and others you can find, deployments are never going to become easy, only “easier.” Don’t saddle your team with more of them than you need.
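A timed switch along those lines might look like this (the deadline and function names are invented for illustration; in practice the deadline would live in runtime config): once the clock passes the deadline, the entry form disables itself with no deployment at all.

```python
from datetime import datetime, timezone

# Hypothetical contest deadline; ideally read from runtime config.
CONTEST_ENDS = datetime(2024, 6, 1, 0, 0, tzinfo=timezone.utc)

def contest_is_open(now=None):
    """The switch flips itself off the moment the deadline passes."""
    now = now or datetime.now(timezone.utc)
    return now < CONTEST_ENDS

def render_entry_form(now=None):
    if contest_is_open(now):
        return "entry form"          # contest still running
    return "The contest has ended."  # auto-disabled, no push required
```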
Release early, release often. I realize this mantra has been repeated over and over for years; but that’s only because it’s such important advice. By releasing new code to production often, you’re shrinking the size of each deployment. The less stuff that changes, the less that can go wrong.
Create a system for phasing your rollouts. It’s a much better way to reduce the customer impact of issues than simply hiding from your customers. Take your time between each phase to really let issues surface. An example of such a plan would start with a small number, like 5-10%. This level of exposure is still likely smaller than deploying to 100% of customers at 3:00 am; but it’s also likely large enough to alert you to any glaring issues quickly. Once you’ve cleared that hurdle, ramp up to a number that gives you some level of scale; say 50%.
This level keeps your customer impact somewhat diminished if anything goes wrong; and it will expose issues that may not appear until your app is running “at scale” (such as a new API call taking far more cycles than intended because the caching isn’t working correctly; you may not notice those extra cycles at low volume, or worse, you may write them off, assuming the service simply isn’t seeing enough traffic to really warm the cache). Once you’ve crossed that hurdle, you’re ready to ramp up to 100%. Each phase should be designed to build confidence along the way that once all customers have this new code, they’ll be getting the quality experience you intended to deliver.
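One common way to implement that kind of phasing (a sketch under my own assumptions, not a description of any particular system) is to hash each customer ID into a stable bucket and compare it against the current rollout percentage, so ramping from 5% to 50% to 100% only ever adds customers:

```python
import hashlib

def rollout_bucket(user_id: str) -> int:
    # Stable hash -> bucket 0-99; the same user always lands in the
    # same bucket across requests and across rollout phases.
    digest = hashlib.sha256(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % 100

def in_rollout(user_id: str, percent: int) -> bool:
    # `percent` would come from runtime config: ramp 5 -> 50 -> 100.
    return rollout_bucket(user_id) < percent
```

Because buckets are stable, a customer who saw the new code at 5% keeps seeing it at 50% and 100%, which keeps their experience consistent while you let at-scale issues surface between phases.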
Get Some Sleep (Or Maybe…)
Ultimately, deploying overnight is likely indicative of something (probably several somethings) being broken in your process. Luckily, by treating your deployment process as an important part of your product and devoting time and energy to it, you can turn overnight deployment into a thing of the past and reclaim those late nights for important things like sleep.
Or Adult Swim. Your choice.