The many reasons your deployment is racy
This is a guest post by Paul Biggar, Founder of CircleCI.
Remember when downtime was allowed? That was nice. Product teams used to schedule maintenance windows to get new code out. Good times, but those days have passed.
Web app deployments are now expected to happen continuously. This paradigm provides a plethora of benefits, but it has a fundamental flaw: the transition from the old code to the new code is racy. Every method of deployment, every transport, every PaaS, every code base suffers from this problem. Yes, even yours.
It’s impossible to fix it, but you can work around it. The best teams, like IMVU, Facebook and Etsy, all spend time working around it. If you’re not actively working around it, I guarantee your users are affected by it.
What happens when you deploy new code? In an ideal world, you would have a clean break between the old code and the new code, but that’s not possible. Every race discussed here has a simple common cause: the old code is able to interact with the new code.
In fact, if we go back further in the history of web development, we can see the earliest examples of the race. We were deploying CGI scripts, written in Perl and PHP. When a user made a request, the CGI script called a program, which called other Perl or PHP files. The server didn’t know anything about the code, just that there was an executable program there.
What happened to requests that came in while you uploaded a new version of code to the server? There was a period of time where half of the files had been copied up, and half hadn’t yet. A request came in, and if your app was written in an interpreted language like Perl or PHP, a combination of new and old files might be loaded by the interpreter, causing new code to call old code in an unexpected way and royally screw things up.
While the specifics may differ, every deployment suffers from problems that are equivalent to this race. No matter how far along we go, this is what we will always be working against: a user is trying to use your web service while you are trying to upgrade it. There is old code, there is new code, they are different, and requests are still coming.
Fixing the basics
As the web world evolved, developers moved to mod_perl and mod_php, running under Apache. While these were more advanced web servers, that race still existed, but in time we learned to deal with the problem. We uploaded new code to a separate directory from the old code, and atomically changed over to the new code by changing the symlink the server used as its web root. This meant that any new request would be guaranteed to hit either entirely new code or entirely old code, never a mix of the two.
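The symlink switch described above can be sketched in a few lines. This is a minimal illustration for POSIX systems, not anyone's actual deploy script: the function name switch_release, the "current" link, and the release directory names are all assumptions made up for the example.

```python
import os
import tempfile


def switch_release(current_link: str, new_release_dir: str) -> None:
    """Atomically repoint `current_link` at `new_release_dir` (POSIX only)."""
    tmp_link = current_link + ".tmp"  # hypothetical temp name next to the real link
    # Clean up a leftover temp link from an interrupted deploy.
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(new_release_dir, tmp_link)
    # Renaming over the existing link happens in one step, so a request
    # resolves either the old target or the new one, never a missing root.
    os.replace(tmp_link, current_link)


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    for name in ("release-1", "release-2"):
        os.mkdir(os.path.join(root, name))
    current = os.path.join(root, "current")
    switch_release(current, os.path.join(root, "release-1"))
    switch_release(current, os.path.join(root, "release-2"))
    print(os.path.realpath(current))
```

Creating the new link under a temporary name and renaming it over the old one is what makes the cutover a single step. Deleting the old link and then creating a new one would leave a short window with no web root at all.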
Newer web servers designed to support shared environments, like uWSGI, provide even more options for smooth code updates, as uWSGI’s creator discusses in The Art of Graceful Reloading. However, the popularity of deploying to private containers and VMs in recent years merits looking at examples of raciness in other architectures.
More modern examples
Let’s look at some examples using more recent deployment options like Heroku and AWS. You’ve got the same elements here as in the shared PHP server model: you have web servers running, you have to get your old code turned off and your new code turned on.
On Heroku, many details are handled for us now, but it’s fundamentally the same concept as the old days. You push code to Heroku, and if a user hits your app while the push is happening, the request is intelligently routed to a live dyno, right? Almost... By default, Heroku pauses the routing service while the transition happens. There is a Heroku Labs feature called pre-boot that lets you bring up new dynos with your latest code in advance of the cutover, but it’s not a silver bullet. Heroku’s docs on pre-booting warn that there can be a few minutes of overlap between versions.
If you deploy to AWS EC2 instances behind an Elastic Load Balancer (ELB), then you can perform a rolling update by gradually deregistering old instances and registering new instances until you have rolled over all of your servers. Again, there’s a minor gotcha here. If you want the rollover to be really graceful, you need to make sure that you have connection draining enabled on your ELB, another relatively new feature. This ensures that deregistering of instances waits until all connections are closed (within some timeout). Otherwise, some users will have their connections rudely interrupted.
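To make the effect of connection draining concrete, here is a toy simulation of a rolling update. It does not call any AWS API. The Instance class, the tick-based timeout, and the rule that one connection finishes per tick are all assumptions invented for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Instance:
    name: str
    version: str
    open_connections: int = 0


def drain(instance: Instance, timeout_ticks: int) -> int:
    """Wait up to `timeout_ticks` for open connections to finish.

    Returns how many connections were still open, and so cut off, at the
    timeout. Toy model: exactly one connection completes per tick.
    """
    for _ in range(timeout_ticks):
        if instance.open_connections == 0:
            break
        instance.open_connections -= 1
    cut_off = instance.open_connections
    instance.open_connections = 0
    return cut_off


def rolling_update(fleet: List[Instance], new_version: str, timeout_ticks: int) -> int:
    """Replace every instance one at a time; return total cut-off connections."""
    cut_off_total = 0
    for old in list(fleet):  # iterate over a snapshot, since we mutate `fleet`
        # Register the replacement first so capacity never dips.
        fleet.append(Instance(name=old.name + "-new", version=new_version))
        # Deregister the old instance: it receives no new traffic from here on.
        fleet.remove(old)
        cut_off_total += drain(old, timeout_ticks)
    return cut_off_total
```

Calling rolling_update with timeout_ticks=0 models draining being disabled: every connection still open on an old instance is interrupted. A generous timeout lets short requests finish, while long-lived connections may still be cut off when it expires.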
Whether you use the atomic symlink switch, Heroku pre-boot, or ELB registering/deregistering, you need to deal with multiple versions of your application being live at once. The overlap can easily last a number of minutes, depending on how long your application takes to boot up and how long it takes to close all connections (do you use any long-polling or websockets?), and your app needs to be well-behaved throughout.
In addition to making sure that multiple live versions of your application’s code interact gracefully, you need to think about the requests that hit your app during and after the transition. How does your old version differ from the new version? Did some URLs change? Your old version may have gone away, but the pages it generated, little ghosts of versions past, are distributed across the world, and still speaking the language of the old code. Users could be clicking buttons that shouldn’t exist anymore, hitting endpoints that have changed.
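One way to keep those ghost pages working is to leave the old URLs in place as aliases for a while after a rename. The sketch below is a framework-free illustration. The paths, the show_profile handler, and the LEGACY_ALIASES table are hypothetical names, not part of any real router.

```python
from typing import Callable, Dict


def show_profile(user_id: str) -> str:
    """Stand-in for a real request handler."""
    return f"profile:{user_id}"


# Routes served by the new code.
ROUTES: Dict[str, Callable[[str], str]] = {"/profiles": show_profile}

# URLs that pages rendered by the old code may still request.
LEGACY_ALIASES: Dict[str, str] = {"/users": "/profiles"}


def dispatch(path: str, arg: str) -> str:
    """Resolve legacy paths to their new names before looking up a handler."""
    canonical = LEGACY_ALIASES.get(path, path)
    handler = ROUTES.get(canonical)
    if handler is None:
        return "404"
    return handler(arg)


if __name__ == "__main__":
    print(dispatch("/users", "42"))     # request from a page rendered by old code
    print(dispatch("/profiles", "42"))  # request from a page rendered by new code
```

The alias table can be deleted in a later deploy, once you are confident that no old pages are still in circulation.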
There are a lot of ways that multiple versions of code running at the same time can wreak havoc, but my favorite is breaking database schema changes between versions, simply because it is the most horrific and painful to deal with.
You can bring your site down, run the migration, and then deploy the new code, but remember, downtime is no longer acceptable! Some developers let requests hang for 30 seconds or more while migrations run, but that is only "not downtime" in the most nominal sense. Both of these solutions hurt the user experience, and neither of them scales to bigger data sets or more deployments per day.
A much more robust solution is to write code that can support both the old and new versions while the migration runs. This can be as simple as
if (old_version) X(); else Y();
When the migration is finished, you commit again, removing the old version. RainforestQA did a great blog post on this topic which goes into a lot more detail.
Get used to this
In fact, this is going to be a standing solution to all racy-deployment problems. Getting something fully deployed will take a few steps. First, you deploy code which supports both the old and the new version; then you do your data migration or whatever else; and finally you push the cleanup change, which removes all remnants of the old code or references to the old version.
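As a rough sketch of that sequence, consider renaming a column. The example uses SQLite from the Python standard library. The users table, the name and full_name columns, and all function names are assumptions chosen for illustration. The functions are the transitional code you would ship in the first step; the backfill is the second step.

```python
import sqlite3
from typing import Optional


def read_display_name(conn: sqlite3.Connection, user_id: int) -> Optional[str]:
    """Transitional read: prefer the new column, fall back to the old one."""
    row = conn.execute(
        "SELECT COALESCE(full_name, name) FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    return row[0] if row else None


def write_display_name(conn: sqlite3.Connection, user_id: int, value: str) -> None:
    """Transitional write: keep both columns in sync while old code still runs."""
    conn.execute(
        "UPDATE users SET name = ?, full_name = ? WHERE id = ?",
        (value, value, user_id),
    )


def backfill(conn: sqlite3.Connection) -> int:
    """Step two: copy old values into the new column; returns rows touched."""
    cur = conn.execute("UPDATE users SET full_name = name WHERE full_name IS NULL")
    return cur.rowcount


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada')")  # old-code row
    conn.execute("ALTER TABLE users ADD COLUMN full_name TEXT")     # additive change
    print(read_display_name(conn, 1))   # works before the backfill
    write_display_name(conn, 1, "Ada Lovelace")
    print(backfill(conn))
```

Once the backfill is done and no old code is running, the final deploy would read and write only full_name, and a later migration could drop the old column.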
While all of these extra deployment steps for dealing with raciness may sound convoluted and difficult, you should make a point of practicing them to the point of boredom. That’s right, rolling updates and live schema changes should be boring, because of how often you go through the same sequence of incremental changes.
I’m a front-end developer, so I don’t need to worry about this, right?
Wrong! In addition to thinking about how endpoints hit from the front-end might change over time, single-page JS apps are putting more state, more logic, and more potential for disaster than ever in the browser. If users can interact with your entire application without ever leaving the page, and navigation is done without page refreshes through pushState, then no matter how fast you push new code to the server, you may have some users looking at pages that are days or weeks old!
Some low-refresh apps like Pivotal Tracker work around this problem by displaying a message prompting the user to refresh the page when the server has been updated, but it sucks to have to ask the user to upgrade like this. Another approach used in frameworks like Meteor is to do a "hot code push" from the server to the client. Of course if you take the hot-push approach, you need to be very careful about how you manage your frontend state, as you are basically recreating all of the challenges of database migration in the frontend.
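The "prompt the user to refresh" approach boils down to comparing the build the client was loaded from with the build the server is currently running. Below is a minimal sketch of that comparison. The X-App-Build header name and the client_action function are hypothetical, and a real single-page app would do this check in its own client-side code.

```python
from typing import Mapping

BUILD_HEADER = "X-App-Build"  # hypothetical header the server adds to every response


def client_action(client_build: str, response_headers: Mapping[str, str]) -> str:
    """Decide what a long-lived client should do after each API response."""
    server_build = response_headers.get(BUILD_HEADER)
    # No header, or the same build: nothing has changed underneath the client.
    if server_build is None or server_build == client_build:
        return "continue"
    # The server has moved on since this page was loaded.
    return "prompt-refresh"


if __name__ == "__main__":
    print(client_action("ba318c8", {BUILD_HEADER: "ba318c8"}))
    print(client_action("ba318c8", {BUILD_HEADER: "f86c21f"}))
```

A hot code push would replace the "prompt-refresh" branch with logic that loads the new code and migrates the in-memory state, which is where the frontend equivalent of a schema migration comes in.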
The slowest race of all
While most continuously delivered applications can be deployed in seconds or minutes, there are a lot of processes a web application might need to run that take much longer. For example, Plan Grid runs OCR and other image processing against batches of building blueprints, ZYNC runs CGI rendering jobs, and we at CircleCI run our customers’ builds.
Because deployments can’t interrupt these slow jobs, at CircleCI we have an auto-scaling group of build machines that is set up (more or less) so that when it’s time to scale down, the oldest machines die first. As a result, our newer versions naturally make their way into production over the course of a day or so. The logic is similar to the "OldestLaunchConfiguration" termination policy for AWS Auto Scaling Groups.
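The oldest-first idea can be illustrated with a small selection function. This is a simplification that orders machines by launch time rather than by launch configuration, and it is not CircleCI's actual code. The Machine class and its fields are assumptions for the example.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class Machine:
    instance_id: str
    revision: str
    launched_at: datetime


def pick_for_termination(machines: List[Machine], count: int) -> List[Machine]:
    """Choose the `count` oldest machines to shut down when scaling in."""
    return sorted(machines, key=lambda m: m.launched_at)[:count]


if __name__ == "__main__":
    fleet = [
        Machine("i-1", "ba318c8ccd", datetime(2014, 1, 1, 8, 0)),
        Machine("i-2", "f86c21f118", datetime(2014, 1, 1, 15, 0)),
        Machine("i-3", "d25a29fd73", datetime(2014, 1, 1, 16, 0)),
    ]
    for machine in pick_for_termination(fleet, 2):
        print(machine.instance_id, machine.revision)
```

Because new machines always boot the latest revision, repeatedly removing the oldest ones shifts the fleet toward newer code without ever interrupting a running job on purpose.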
Because we have so much overlap between build machine versions, we have a few HUBOT commands that help us manage things. For example, when we say
@hubot admin rev-breakdown, HUBOT responds with something like this:
Rev | Count | By | Message | Time
ba318c8ccd | 43 | gordonsyme | Merge pull request #3153 from circleci/fix-type-conditions | 13 hours ago
f86c21f118 | 15 | notnoopci | Merge pull request #3155 from circleci/io-health-checks | 6 hours ago
d25a29fd73 | 6 | dwwoelfel | Merge pull request #3151 from circleci/new-image | 5 hours ago
9eedfbeeca | 9 | notnoopci | Merge pull request #3152 from circleci/kernel-patch | 5 hours ago
b1cead9c84 | 5 | esnyder | Merge pull request #3159 from circleci/pusher-oss-auth | 4 hours ago
This lets us see at a glance how new versions are progressing into production. We can also say things like
@hubot admin shutdown-rev to gracefully shut down a particular version from production, or
@hubot admin force-shutdown to kill certain instances. Once again, the key is to live with the raciness. Let different versions of your code laugh and play together in peace and harmony!
You need to be prepared for at least two versions of your code--probably more--to be in play for anywhere from several seconds to several weeks or more. There is no way to get rid of this problem--certainly not without sacrificing availability--so do yourself a favor and learn to live with it.