Common Pitfalls of Continuous Delivery: Deployment Raciness
Continuous Delivery (CD) is a terrific tool for increasing development velocity, but like everything worthwhile in life, it isn't free. Some costs are obvious: the fancy CI tool, the Agile consultant, the DevOps-friendly PaaS. A less obvious cost is the development effort required to keep your application CD-friendly. In this series, I'll go through some pitfalls that we've encountered at Rainforest along with ways to mitigate them. The first entry will cover one of the thorniest problems of the continuous delivery world: deployment race conditions.
To truly achieve Continuous Delivery, every service in your application must be independently deployable. This is obviously an important consideration for a service-oriented application, but, counterintuitively, it's equally important for a monolithic one.
Consider a typical Rails app with a single codebase. It may appear monolithic, but if you squint there are probably at least 4 "services":
(A quick anecdote: I once wrote some code with a deployment race condition between the application code and a migration. In the brief window between the application deploy and the migration's completion, it overpaid testers by thousands of dollars. Don't be like me: think about race conditions in advance!)
Example: Adding a Field
Suppose you want to add a required preferred_pet column to the users table, which can take a value of either cat or dog. A typical Rails validation would look something like:
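In a Rails model this rule is usually a one-liner such as `validates :preferred_pet, inclusion: { in: %w[cat dog] }`. The plain-Ruby sketch below spells out the same rule without depending on Rails, so it can be run on its own. The `User` class, the `ALLOWED_PETS` constant, and the `name` attribute are illustrative stand-ins, not the actual model.

```ruby
# Plain-Ruby stand-in for a Rails model; all names here are illustrative.
class User
  ALLOWED_PETS = %w[cat dog].freeze

  attr_accessor :name, :preferred_pet
  attr_reader :errors

  def initialize(name:, preferred_pet: nil)
    @name = name
    @preferred_pet = preferred_pet
    @errors = []
  end

  # Mirrors an inclusion validation: the value must be one of the allowed
  # pets, so a missing (nil) value is rejected as well.
  def valid?
    @errors = []
    @errors << "Preferred pet is required" unless ALLOWED_PETS.include?(preferred_pet)
    @errors.empty?
  end
end
```

With this in place, `User.new(name: "Ann", preferred_pet: "dog").valid?` is true, while a user created without a preferred pet is invalid and carries the error message.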
Then you would have a migration like:
    def up
      # :string is an assumed column type
      add_column :users, :preferred_pet, :string
      execute "UPDATE users SET preferred_pet = 'dog'"
    end
(Backfilling with the superior choice, of course.) You'd also update the controllers and views accordingly.
What could go wrong if you try to deploy all these changes at once? That depends on the order in which the deployment events happen. If you're lucky, the application code will deploy completely before the migration completes. In that case, any attempt to save a user will result in a NoMethodError until the migration completes and the application restarts. That's not necessarily a huge problem if the time window is short, but it's potentially a big problem if the migration gets stuck or you have heavy write traffic.
If you're unlucky, a save request will come in after the migration has completed, but with the old version of the code. This is a common scenario if you're using a rolling release strategy such as Heroku's preboot. In that case, the save will be successful but preferred_pet will stay NULL, leading to an invalid record that could lurk for months or years before causing a problem! (This is one of several reasons that we at Rainforest tend to prefer using database-level constraints instead of application-level validations when possible.)
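The two orderings can be reproduced with a small in-memory simulation. This is only a sketch: `FakeUsersTable`, `save_user_v1`, and `save_user_v2` are hypothetical stand-ins for the database, the old application code, and the new application code, and the error raised for an unknown column is modeled on the NoMethodError described above.

```ruby
# In-memory stand-in for the users table; every name here is illustrative.
class FakeUsersTable
  attr_reader :columns, :rows

  def initialize
    @columns = [:name]
    @rows = []
  end

  # Simulates the migration: add the column, then backfill existing rows.
  def migrate!
    @columns << :preferred_pet
    @rows.each { |row| row[:preferred_pet] = "dog" }
  end

  # Columns the caller does not supply are stored as nil (NULL).
  def insert(attrs)
    unknown = attrs.keys - @columns
    raise NoMethodError, "undefined method `#{unknown.first}'" unless unknown.empty?

    row = @columns.map { |column| [column, attrs[column]] }.to_h
    @rows << row
    row
  end
end

# Old application code: knows nothing about preferred_pet.
def save_user_v1(table, params)
  table.insert({ name: params[:name] })
end

# New application code: validates and writes preferred_pet.
def save_user_v2(table, params)
  pet = params[:preferred_pet]
  raise ArgumentError, "Preferred pet is required" unless %w[cat dog].include?(pet)

  table.insert({ name: params[:name], preferred_pet: pet })
end

table = FakeUsersTable.new

# Ordering 1: the new code is live before the migration has run.
begin
  save_user_v2(table, { name: "Ann", preferred_pet: "cat" })
rescue NoMethodError => e
  puts "new code, old schema: #{e.class}"
end

# Ordering 2: the migration has run, but an old process handles the request.
table.migrate!
row = save_user_v1(table, { name: "Bob" })
puts "old code, new schema: preferred_pet=#{row[:preferred_pet].inspect}"
```

The first ordering fails loudly, while the second silently stores a row that the new validation would reject, which is why it is the more dangerous of the two.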
Things get even more interesting when you're using a single-page app with a REST API. Let's say your application code has deployed and the migration has finished, but the user hasn't reloaded the page in a while and is editing the user record with a stale version of the frontend code. When they try to save the user, they'll get a mysterious "Preferred pet is required" error message, but won't have any way to set the field!
Solution 1: Think About Backwards Compatibility
The basic way to prevent deployment races is to make every service deployment backwards-compatible with current versions of other services. That includes "services" within your seemingly monolithic application. While this rule is simple in theory, it's quite easy to miss edge cases, with potentially nasty results.
Let's consider the preferred_pet example: how could you make it safe from deployment race conditions? One option would be to have a temporary database-level default that will prevent NULL fields until the code deployment is complete. A safe upgrade would involve 3 separate deployments:
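One way to read that three-step sequence is: first a migration that adds the column with a temporary default, then the application code, and finally a migration that drops the default. The simulation below is a sketch under that assumption, with a hypothetical `FakeUsersTable` standing in for the database. In a real Rails migration, the first and third steps would typically use `add_column` with a `default:` option and `change_column_default`.

```ruby
# In-memory stand-in for the users table; every name here is illustrative.
class FakeUsersTable
  attr_reader :rows

  def initialize
    @defaults = { name: nil } # column name => column default
    @rows = []
  end

  # Like ADD COLUMN ... DEFAULT: existing rows pick up the default too.
  def add_column(column, default: nil)
    @defaults[column] = default
    @rows.each { |row| row[column] = default }
  end

  def change_default(column, default)
    @defaults[column] = default
  end

  # Columns missing from attrs fall back to the column default.
  def insert(attrs)
    row = @defaults.map { |column, default| [column, attrs.fetch(column, default)] }.to_h
    @rows << row
    row
  end
end

table = FakeUsersTable.new
table.insert({ name: "Ann" })

# Deployment 1: migration only. Old code keeps running and never sends preferred_pet.
table.add_column(:preferred_pet, default: "dog")
table.insert({ name: "Bob" })

# Deployment 2: application code that validates and writes preferred_pet.
table.insert({ name: "Cy", preferred_pet: "cat" })

# Deployment 3: once no old code is running, drop the temporary default.
table.change_default(:preferred_pet, nil)

puts table.rows.map { |row| row[:preferred_pet] }.inspect
# prints ["dog", "dog", "cat"]
```

At no point in this sequence can a write from either version of the code leave preferred_pet NULL, which is the property the temporary default is there to provide.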
This ignores the problem of frontend-backend compatibility, though. A truly safe upgrade might look something like this:
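One ingredient of such an upgrade is a transitional backend that accepts requests from both old and new frontends. The sketch below is an assumption about how that step could look rather than a description of Rainforest's implementation: `update_user`, the Hash-based record, and the error message are all hypothetical.

```ruby
ALLOWED_PETS = %w[cat dog].freeze

# Hypothetical update handler for the transition period. `user` is a plain
# Hash standing in for a record; `params` is the request payload.
def update_user(user, params)
  updated = user.merge({ name: params.fetch(:name, user[:name]) })

  if params.key?(:preferred_pet)
    # Current frontend: validate what was sent.
    unless ALLOWED_PETS.include?(params[:preferred_pet])
      return { ok: false, error: "Preferred pet must be cat or dog" }
    end
    updated[:preferred_pet] = params[:preferred_pet]
  end
  # Stale frontend: the field is absent, so the stored value is left untouched
  # instead of failing with an error the user cannot fix.

  { ok: true, user: updated }
end
```

Once stale frontends have aged out (or been forced to reload), a later deployment can make the field mandatory in the API as well.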
As you can see, things get complicated quite quickly! Unfortunately there's no way to avoid these issues in general; you either have to spend a lot of time thinking about backwards-compatibility or you have to accept some form of downtime (whether it's explicit downtime or "things not working right" for a period of time).
Solution 2: Explicitly Separate Your "Services"
Given that it's impossible to avoid thinking about backwards-compatibility with deploys, one way to prevent deployment races from slipping into your code is to explicitly separate non-atomic aspects of your codebase and test accordingly. At Rainforest, we decided that it was worth separating our backend and frontend code for exactly this reason.
To make this work, our frontend and backend code are in separate repositories with their own deployment pipelines. We even have two completely separate staging environments:
Since we test both backend and frontend code with Rainforest before deployment, bugs caused by backwards-incompatible changes are extremely rare.
Of course, this kind of setup is quite a lot of work to implement and maintain (as well as being expensive to run). You have to figure out if the tradeoff is worth it for your application. (We could implement a similar system for database migrations, for instance, but it hasn't been worth it so far.)
If you want to go further in depth on this topic, there are a couple of great posts on deployment raciness in our deployment academy: