
Common Pitfalls of Continuous Delivery: Deployment Raciness

Emanuel Evans, Thursday May 11, 2017

Continuous Delivery (CD) is a terrific tool for increasing development velocity, but like everything worthwhile in life, it isn't free. Some costs are obvious: the fancy CI tool, the Agile consultant, the DevOps-friendly PaaS. A less obvious cost is the development effort required to keep your application CD-friendly. In this series, I'll go through some pitfalls that we've encountered at Rainforest along with ways to mitigate them. The first entry will cover one of the thorniest problems of the continuous delivery world: deployment race conditions.

The Problem

In order to truly achieve Continuous Delivery, every service in your application must be independently deployable. This is obviously an important consideration for a service-oriented application, but, counterintuitively, it's equally important for a monolithic application.

Consider a typical Rails app with a single codebase. It may appear monolithic, but if you squint there are probably at least 4 "services":

  • Backend web server code
  • Code for background jobs (with something like Sidekiq)
  • Frontend code
  • The database

Each of these "services" has some internal atomicity guarantees (you won't get an API request handled by multiple revisions of your code, for instance), but it's impossible to deploy all of them at the same time without downtime. If you don't carefully consider the consequences, you will run into nasty race conditions. Modern application architecture and DevOps patterns can actually make the problem worse: a single-page app will have much longer-lived JavaScript sessions, and deployment techniques like rolling releases mean that multiple versions of your backend code can coexist for quite some time.

(A quick anecdote: I once wrote some code with a deployment race condition between the application code and a migration. In the brief window of time between the application deploy and the migration completion, it overpaid testers by thousands of dollars. Don't be like me: think about race conditions in advance!)

Example: Adding a Field

Let's consider a common scenario: you want to add a new required field to one of your tables. For concreteness, let's say you have a users table and are adding a preferred_pet column which can take a value of either cat or dog. A typical Rails validation would look something like:

class User < ActiveRecord::Base
  validates :preferred_pet, inclusion: { in: %w(cat dog) }, presence: true
end

Then you would have a migration like:

def up
  add_column :users, :preferred_pet, :string
  execute "UPDATE users SET preferred_pet = 'dog'"
end

(Backfilling with the superior choice, of course.) You'd also update the controllers and views accordingly.

What could go wrong if you try to deploy all these changes at once? That depends on the order in which the deployment events happen. If you're lucky, the application code will deploy completely before the migration completes. In that case, any attempt to save a user will result in a NoMethodError (the validation refers to a column that doesn't exist yet) until the migration completes and the application restarts. That's not necessarily a huge problem if the time window is short, but potentially a big problem if the migration gets stuck or you have heavy write traffic.

If you're unlucky, a save request will come in after the migration has completed, but with the old version of the code. (This is a common scenario if you're using a rolling release strategy such as Heroku's preboot.) In that case, the save will be successful but preferred_pet will stay NULL, leading to an invalid record that could lurk for months or years before causing a problem! (This is one of several reasons that we at Rainforest tend to prefer using database-level constraints instead of application-level validations when possible.)
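The unlucky ordering can be reproduced with a toy in-memory model. This is only a sketch in plain Ruby, not Rails: UsersTable, old_code_save and valid_under_new_code? are hypothetical stand-ins for the database, the old application revision, and the new validation.

```ruby
# Toy stand-in for the users table: each row is a hash with one key per
# known column; a missing value is stored as nil (i.e. NULL).
class UsersTable
  attr_reader :rows

  def initialize(columns)
    @columns = columns.dup
    @rows = []
  end

  # Mimics add_column followed by a one-off backfill of existing rows.
  def add_column(name, backfill:)
    @columns << name
    @rows.each { |row| row[name] = backfill }
  end

  # Inserts a row; columns the caller did not supply end up NULL.
  def insert(attrs)
    row = @columns.map { |col| [col, attrs[col]] }.to_h
    @rows << row
    row
  end
end

# The new revision's validation: preferred_pet must be "cat" or "dog".
def valid_under_new_code?(row)
  %w[cat dog].include?(row[:preferred_pet])
end

# The old revision only knows about the name column.
def old_code_save(table, name)
  table.insert(name: name)
end

table = UsersTable.new([:name])
old_code_save(table, "alice")                      # before the migration
table.add_column(:preferred_pet, backfill: "dog")  # migration completes
old_code_save(table, "bob")                        # old code still serving

invalid_rows = table.rows.reject { |row| valid_under_new_code?(row) }
puts invalid_rows.map { |row| row[:name] }.inspect
```

In this model the row saved before the migration is backfilled, while the row saved by the old code after the migration keeps a NULL preferred_pet and fails the new validation.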

Things get even more interesting when you're using a single-page app with a REST API. Let's say your application code has deployed and the migration has finished, but the user hasn't reloaded the page in a while and is editing the user record with a stale version of the frontend code. When they try to save the user, they'll get a mysterious "Preferred pet is required" error message, but won't have any way to set the field!

Solution 1: Think About Backwards Compatibility

The basic way to prevent deployment races is to make every service deployment backwards-compatible with current versions of other services. That includes "services" within your seemingly monolithic application. While this rule is simple in theory, it's quite easy to miss edge cases, with potentially nasty results.

Let's consider the preferred_pet example: how could you make it safe from deployment race conditions? One option would be to have a temporary database-level default that will prevent NULL fields until the code deployment is complete. A safe upgrade would involve 3 separate deployments:

  1. Deploy just the migration, with a default value included.
  2. Deploy the application code with a validation that allows the value to be filled in.
  3. Deploy another migration removing the default value (to avoid future confusion).
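To see why the temporary default closes the window, here is a minimal plain-Ruby sketch of the three deployments. TableWithDefaults and its methods are hypothetical and only imitate how a database applies a column default. In a real Rails migration the first step would presumably pass a default: option to add_column and the cleanup would use something like change_column_default, but check both against your Rails version.

```ruby
# Toy table whose columns may carry a database-level default.
class TableWithDefaults
  attr_reader :rows

  def initialize
    @defaults = {} # column name => default value (nil means no default)
    @rows = []
  end

  # Adding a column with a default also fills every existing row with it.
  def add_column(name, default: nil)
    @defaults[name] = default
    @rows.each { |row| row[name] = default }
  end

  # Removes the default; existing rows keep their values.
  def drop_default(name)
    @defaults[name] = nil
  end

  # Columns missing from attrs fall back to the column default.
  def insert(attrs)
    row = @defaults.map { |col, default| [col, attrs.fetch(col, default)] }.to_h
    @rows << row
    row
  end
end

users = TableWithDefaults.new
users.add_column(:name)
users.insert(name: "alice")

# Deployment 1: the migration only, with a default value.
users.add_column(:preferred_pet, default: "dog")
from_old_code = users.insert(name: "bob") # old code never sets the field

# Deployment 2: application code that sets and validates the field.
from_new_code = users.insert(name: "carol", preferred_pet: "cat")

# Deployment 3: cleanup migration that removes the default.
users.drop_default(:preferred_pet)

puts from_old_code[:preferred_pet]
puts from_new_code[:preferred_pet]
```

While the default is in place, a save from the old code cannot produce a NULL preferred_pet, so the order in which the code and the migration land stops mattering.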

This ignores the problem of frontend-backend compatibility, though. A truly safe upgrade might look something like this:

  1. Deploy the initial migration with a default value.
  2. Deploy backend code that allows you to set the value through the API, but without making it a required field.
  3. Deploy frontend code that sets the field.
  4. Deploy backend code with the presence validation.
  5. Deploy the cleanup migration that removes the default.
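One way to reason about this ordering is to check every frontend/backend pair that can coexist during the rollout. The sketch below is a toy model in plain Ruby, not our actual code: FRONTENDS, BACKENDS and save_errors are hypothetical names, with each revision reduced to a lambda.

```ruby
PETS = %w[cat dog].freeze

# Frontend revisions, modelled as lambdas that build the payload they send.
FRONTENDS = {
  without_pet_field: ->(name, _pet) { { name: name } },
  with_pet_field:    ->(name, pet)  { { name: name, preferred_pet: pet } }
}.freeze

# Backend revisions, modelled as lambdas returning a list of error messages.
BACKENDS = {
  # Step 2: the field may be set through the API but is not required.
  optional_pet: ->(payload) {
    pet = payload[:preferred_pet]
    if pet.nil? || PETS.include?(pet)
      []
    else
      ["Preferred pet is invalid"]
    end
  },
  # Step 4: the presence validation is switched on.
  required_pet: ->(payload) {
    if PETS.include?(payload[:preferred_pet])
      []
    else
      ["Preferred pet is required"]
    end
  }
}.freeze

# Errors a user would see when the given frontend talks to the given backend.
def save_errors(frontend, backend, name: "alice", pet: "cat")
  payload = FRONTENDS.fetch(frontend).call(name, pet)
  BACKENDS.fetch(backend).call(payload)
end

FRONTENDS.each_key do |frontend|
  BACKENDS.each_key do |backend|
    errors = save_errors(frontend, backend)
    puts "#{frontend} + #{backend}: #{errors.empty? ? 'ok' : errors.join(', ')}"
  end
end
```

In this model the only failing pair is the stale frontend talking to the backend that requires the field, which suggests that step 4 is only safe once old frontend sessions have been reloaded.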

As you can see, things get complicated quite quickly! Unfortunately there's no way to avoid these issues in general; you either have to spend a lot of time thinking about backwards-compatibility or you have to accept some form of downtime (whether it's explicit downtime or "things not working right" for a period of time).

Solution 2: Explicitly Separate Your "Services"

Given that it's impossible to avoid thinking about backwards-compatibility with deploys, one way to prevent deployment races from slipping into your code is to explicitly separate non-atomic aspects of your codebase and test accordingly. At Rainforest, we decided that it was worth separating our backend and frontend code for exactly this reason.

To make this work, our frontend and backend code are in separate repositories with their own deployment pipelines. We even have two completely separate staging environments:

  • The frontend staging environment uses the development version of the frontend code and the production version of the backend code
  • The backend staging environment uses the development version of the backend code and the production version of the frontend code
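The pairing can be written down as a small sketch. The Environment struct, the staging_environments helper and the version labels are all hypothetical and are not taken from our actual pipeline configuration.

```ruby
# A staging environment is a named pairing of a frontend and a backend version.
Environment = Struct.new(:name, :frontend, :backend)

# Each staging environment pairs the development version of one side with
# the production version of the other.
def staging_environments(production:, development:)
  [
    Environment.new("frontend-staging", development[:frontend], production[:backend]),
    Environment.new("backend-staging", production[:frontend], development[:backend])
  ]
end

envs = staging_environments(
  production:  { frontend: "fe-prod", backend: "be-prod" },
  development: { frontend: "fe-dev", backend: "be-dev" }
)

envs.each do |env|
  puts "#{env.name}: frontend=#{env.frontend} backend=#{env.backend}"
end
```

Each change is therefore tested against the version of the other side that it will actually meet in production at deploy time.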

Since we test both backend and frontend code with Rainforest before deployment, bugs caused by backwards-incompatible changes are extremely rare.

Of course, this kind of setup is quite a lot of work to implement and maintain (as well as being expensive to run). You have to figure out if the tradeoff is worth it for your application. (We could implement a similar system for database migrations, for instance, but it hasn't been worth it so far.)

More Reading

If you want to go further in depth on this topic, there are a couple of great posts on deployment raciness in our deployment academy:
