Common Pitfalls of Continuous Delivery: Deployment Raciness
Continuous Delivery (CD) is a terrific tool for increasing development velocity, but like everything worthwhile in life, it isn't free. Some costs are obvious: the fancy CI tool, the Agile consultant, the DevOps-friendly PaaS. A less obvious cost is the development effort required to keep your application CD-friendly. In this series, I'll go through some pitfalls that we've encountered at Rainforest along with ways to mitigate them. The first entry will cover one of the thorniest problems of the continuous delivery world: deployment race conditions.
In order to truly achieve Continuous Delivery, every service in your application must be independently deployable. This is obviously an important consideration for a service-oriented application but, counterintuitively, it is just as important for a monolithic application.
Consider a typical Rails app with a single codebase. It may appear monolithic, but if you squint there are probably at least 4 "services":
- Backend web server code
- Code for background jobs (with something like Sidekiq)
- Frontend code
- The database
(A quick anecdote: I once wrote some code with a deployment race condition between the application code and a migration. In the brief window between the application deploy and the migration completing, it overpaid testers by thousands of dollars. Don't be like me: think about race conditions in advance!)
Example: Adding a Field
Let's consider a common scenario: you want to add a new required field to one of your tables. For concreteness, let's say you have a users table and are adding a preferred_pet column which can take one of two values, dog being one of them. A typical Rails validation would look something like:

    class User < ApplicationRecord
      validates :preferred_pet, presence: true
    end

Then you would have a migration like:

    def up
      add_column :users, :preferred_pet, :string
      execute "UPDATE users SET preferred_pet = 'dog'"
    end

(Backfilling with the superior choice, of course.) You'd also update the controllers and views accordingly.
What could go wrong if you try to deploy all these changes at once? That depends on the order in which the deployment events happen. If you're lucky, the application code will deploy completely before the migration completes. In that case, any attempt to save a user will result in a NoMethodError until the migration completes and the application restarts. That's not necessarily a huge problem if the time window is short, but potentially a big problem if the migration gets stuck or you have heavy write traffic.
If you're unlucky, a save request will come in after the migration has completed, but with the old version of the code. (This is a common scenario if you're using a rolling release strategy such as Heroku's preboot.) In that case, the save will be successful but preferred_pet will stay NULL, leading to an invalid record that could lurk for months or years before causing a problem! (This is one of several reasons that we at Rainforest tend to prefer using database-level constraints instead of application-level validations when possible.)
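To make the unlucky case concrete, here is a toy, in-memory sketch in plain Ruby. Nothing in it talks to a real database: ToyTable, old_code_save and new_code_save are hypothetical stand-ins for the users table and the two code versions, and the not_null option is only a crude imitation of a database-level NOT NULL constraint. It illustrates how the old code slips a NULL past the new validation, and how a constraint would turn that silent bug into a loud one.

```ruby
# Hypothetical in-memory stand-in for a database table.
class ToyTable
  class NotNullViolation < StandardError; end

  attr_reader :rows

  def initialize(columns, not_null: [])
    @columns = columns
    @not_null = not_null
    @rows = []
  end

  # Columns the caller does not mention become nil, like NULL in SQL.
  def insert(attrs)
    row = @columns.to_h { |col| [col, attrs[col]] }
    missing = @not_null.select { |col| row[col].nil? }
    raise NotNullViolation, "null value in #{missing.join(', ')}" unless missing.empty?

    @rows << row
    row
  end
end

# Old application code: written before the column existed, so it never sets
# preferred_pet and has no validation for it.
def old_code_save(table, name:)
  table.insert(name: name)
end

# New application code: validates presence before writing.
def new_code_save(table, name:, preferred_pet:)
  raise ArgumentError, "Preferred pet is required" if preferred_pet.nil?

  table.insert(name: name, preferred_pet: preferred_pet)
end

# The migration has finished, but an old process is still serving requests.
unconstrained = ToyTable.new(%i[name preferred_pet])
silent_row = old_code_save(unconstrained, name: "alice")
valid_row = new_code_save(unconstrained, name: "carol", preferred_pet: "dog")

# Same race, but with a NOT NULL constraint on the column.
constrained = ToyTable.new(%i[name preferred_pet], not_null: %i[preferred_pet])
loud_failure =
  begin
    old_code_save(constrained, name: "bob")
    nil
  rescue ToyTable::NotNullViolation => e
    e.message
  end
```

In the unconstrained table the invalid row is stored without complaint; in the constrained one the old code's write fails immediately, which is much easier to notice and fix.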
Things get even more interesting when you're using a single-page app with a REST API. Let's say your application code has deployed and the migration has finished, but the user hasn't reloaded the page in a while and is editing the user record with a stale version of the frontend code. When they try to save the user, they'll get a mysterious "Preferred pet is required" error message, but won't have any way to set the field!
Solution 1: Think About Backwards Compatibility
The basic way to prevent deployment races is to make every service deployment backwards-compatible with current versions of other services. That includes "services" within your seemingly monolithic application. While this rule is simple in theory, it's quite easy to miss edge cases, with potentially nasty results.
Let's consider the preferred_pet example: how could you make it safe from deployment race conditions? One option would be to have a temporary database-level default that will prevent NULL fields until the code deployment is complete. A safe upgrade would involve 3 separate deployments:
- Deploy just the migration, with a default value included.
- Deploy the application code, which adds the validation and lets users fill in the value.
- Deploy another migration removing the default value (to avoid future confusion).
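As a rough sketch, the two migrations in this sequence might look like the following in Rails. The class names, the [7.0] migration version and the choice of dog as the temporary default are assumptions for illustration. The null: false option is an optional extra that isn't part of the steps above, but it fits our preference for database-level constraints. These classes need a Rails app to run; they are not standalone scripts.

```ruby
# Deployment 1: add the column with a temporary default. Old code that never
# sets preferred_pet now gets 'dog' from the database instead of NULL.
class AddPreferredPetToUsers < ActiveRecord::Migration[7.0]
  def change
    add_column :users, :preferred_pet, :string, default: "dog", null: false
  end
end

# Deployment 3: drop the temporary default once every running process sets
# the field explicitly. The NOT NULL constraint stays in place.
class RemovePreferredPetDefault < ActiveRecord::Migration[7.0]
  def change
    # Passing from: and to: keeps the migration reversible.
    change_column_default :users, :preferred_pet, from: "dog", to: nil
  end
end
```

On PostgreSQL, adding a column with a default also fills in existing rows, so the separate backfill UPDATE shouldn't be needed in this version.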
This ignores the problem of frontend-backend compatibility, though. A truly safe upgrade might look something like this:
- Deploy the initial migration with a default value.
- Deploy backend code that allows you to set the value through the API, but without making it a required field.
- Deploy frontend code that sets the field.
- Deploy backend code with the presence validation.
- Deploy the cleanup migration that removes the default.
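The reasoning behind this ordering can be checked mechanically with a small model. The sketch below is plain Ruby and entirely hypothetical: save_outcome is a deliberately simplified guess at what a save request does for each combination of frontend, backend and database version. unsafe_combinations walks a deployment plan, assuming that during each step both the old and the new version of the component being deployed may be serving traffic, and that stale versions are gone before the next step starts.

```ruby
# frontend: :stale (never sends the field) or :fresh (always sends it)
# backend:  :ignores, :optional or :required
# db:       :no_column, :with_default or :no_default
def save_outcome(frontend:, backend:, db:)
  sent = frontend == :fresh ? "dog" : nil

  # Code that reads or writes the column blows up if the column is missing.
  touches_column = backend == :required || (backend == :optional && !sent.nil?)
  return :crash if db == :no_column && touches_column

  value = backend == :ignores ? nil : sent
  # The user sees "Preferred pet is required" with no way to set the field.
  return :rejected if backend == :required && value.nil?
  return :ok if db == :no_column # nothing to store yet

  value ||= "dog" if db == :with_default # the database fills the gap
  value.nil? ? :null_row : :ok
end

# Returns [step number, component, version, outcome] for every combination
# that does not end in :ok while the plan is being rolled out.
def unsafe_combinations(start, plan)
  state = start.dup
  problems = []
  plan.each_with_index do |(component, new_version), index|
    # Old and new versions of the deploying component coexist for a while.
    [state[component], new_version].each do |version|
      outcome = save_outcome(**state.merge(component => version))
      problems << [index + 1, component, version, outcome] unless outcome == :ok
    end
    state[component] = new_version
  end
  problems
end

START = { frontend: :stale, backend: :ignores, db: :no_column }.freeze

# The five-step sequence described above.
SAFE_PLAN = [
  [:db, :with_default],
  [:backend, :optional],
  [:frontend, :fresh],
  [:backend, :required],
  [:db, :no_default]
].freeze

# A naive ordering: migration without a default, then validation, then UI.
NAIVE_PLAN = [
  [:db, :no_default],
  [:backend, :required],
  [:frontend, :fresh]
].freeze

safe_problems = unsafe_combinations(START, SAFE_PLAN)
naive_problems = unsafe_combinations(START, NAIVE_PLAN)
```

Under these assumptions the five-step plan produces no unsafe combinations, while the naive ordering produces both NULL rows and rejected saves. The model is only as good as its guesses about each version's behavior, so treat it as a thinking aid and not as a proof.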
As you can see, things get complicated quite quickly! Unfortunately there's no way to avoid these issues in general; you either have to spend a lot of time thinking about backwards-compatibility or you have to accept some form of downtime (whether it's explicit downtime or "things not working right" for a period of time).
Solution 2: Explicitly Separate Your "Services"
Given that it's impossible to avoid thinking about backwards-compatibility with deploys, one way to prevent deployment races from slipping into your code is to explicitly separate non-atomic aspects of your codebase and test accordingly. At Rainforest, we decided that it was worth separating our backend and frontend code for exactly this reason.
To make this work, our frontend and backend code are in separate repositories with their own deployment pipelines. We even have two completely separate staging environments:
- The frontend staging environment uses the development version of the frontend code and the production version of the backend code
- The backend staging environment uses the development version of the backend code and the production version of the frontend code
Since we test both backend and frontend code with Rainforest before deployment, bugs caused by backwards-incompatible changes are extremely rare.
Of course, this kind of setup is quite a lot of work to implement and maintain (as well as being expensive to run). You have to figure out if the tradeoff is worth it for your application. (We could implement a similar system for database migrations, for instance, but it hasn't been worth it so far.)
If you want to go further in depth on this topic, there are a couple of great posts on deployment raciness in our deployment academy.