Common Pitfalls of Continuous Delivery: Deployment Raciness

Continuous Delivery (CD) is a terrific tool for increasing development velocity, but like everything worthwhile in life, it isn't free. Some costs are obvious: the fancy CI tool, the Agile consultant, the DevOps-friendly PaaS. A less obvious cost is the development effort required to keep your application CD-friendly. In this series, I'll go through some pitfalls that we've encountered at Rainforest along with ways to mitigate them. The first entry will cover one of the thorniest problems of the continuous delivery world: deployment race conditions.

The Problem

To truly achieve Continuous Delivery, every service in your application must be independently deployable. This is obviously an important consideration for a service-oriented application but, counterintuitively, it's just as important for a monolithic one.

Consider a typical Rails app with a single codebase. It may appear monolithic, but if you squint there are probably at least 4 "services":


Each of these "services" has some internal atomicity guarantees (you won't get an API request handled by multiple revisions of your code, for instance), but it's impossible to deploy all of them at the same time without downtime. If you don't carefully consider the consequences, you will run into nasty race conditions. Modern application architecture and DevOps patterns can actually make the problem worse: a single-page app will have much longer-lived JavaScript sessions, and deployment techniques like rolling releases mean that multiple versions of your backend code can coexist for quite some time.

(A quick anecdote: I once wrote some code with a deployment race condition between the application code and a migration. In the brief window of time between the application deploy and the migration completion, it overpaid testers by thousands of dollars. Don't be like me: think about race conditions in advance!)

Example: Adding a Field

Let's say you want to add a required preferred_pet column to the users table, which can take a value of either cat or dog. A typical Rails validation would look something like:

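    class User < ActiveRecord::Base
      # Require the new field and restrict it to the two allowed values
      validates :preferred_pet, presence: true, inclusion: { in: %w[cat dog] }
    end

Then you would have a migration like:

    def up
      # Column type assumed to be :string
      add_column :users, :preferred_pet, :string
      # Backfill existing rows
      execute "UPDATE users SET preferred_pet = 'dog'"
    end
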
(Backfilling with the superior choice, of course.) You'd also update the controllers and views accordingly.

What could go wrong if you try to deploy all these changes at once? That depends on the order in which the deployment events happen. If you're lucky, the application code will deploy completely before the migration completes. In that case, any attempt to save a user will raise a NoMethodError until the migration completes and the application restarts. That's not necessarily a huge problem if the time window is short, but it's potentially a big one if the migration gets stuck or you have heavy write traffic.

If you're unlucky, a save request will come in after the migration has completed, but with the old version of the code. This is a common scenario if you're using a rolling release strategy such as Heroku's preboot. In that case, the save will be successful but preferred_pet will stay NULL, leading to an invalid record that could lurk for months or years before causing a problem! (This is one of several reasons that we at Rainforest tend to prefer using database-level constraints instead of application-level validations when possible.)
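
To make the unlucky ordering concrete, here is a toy in-memory model in plain Ruby. It is only a sketch: UsersTable, old_code_save, and new_code_save are hypothetical names made up for illustration, and nothing here talks to a real database or to ActiveRecord. It shows a stale release writing a row after the migration has finished, first with application-level validation only and then with a simulated database-level NOT NULL constraint.

```ruby
# Toy in-memory stand-in for the users table. Not real ActiveRecord or SQL.
class UsersTable
  class NotNullViolation < StandardError; end

  attr_reader :rows

  # not_null: columns the "database" itself refuses to store as nil.
  def initialize(not_null: [])
    @not_null = not_null
    @columns = [:name]
    @rows = []
  end

  # Simulates the migration: add the column and backfill existing rows.
  def add_column(name, backfill: nil)
    @columns << name
    @rows.each { |row| row[name] = backfill }
  end

  def insert(attrs)
    row = @columns.to_h { |col| [col, attrs[col]] }
    violated = @not_null.select { |col| @columns.include?(col) && row[col].nil? }
    raise NotNullViolation, "null value in #{violated.join(', ')}" unless violated.empty?

    @rows << row
    row
  end
end

# Old release: written before preferred_pet existed, so it never sets it.
def old_code_save(table, name)
  table.insert(name: name)
end

# New release: validates in the application before writing.
def new_code_save(table, name, preferred_pet)
  raise ArgumentError, "Preferred pet is required" unless %w[cat dog].include?(preferred_pet)

  table.insert(name: name, preferred_pet: preferred_pet)
end

# Application-level validation only: the stale release silently writes a bad row.
lenient = UsersTable.new
lenient.add_column(:preferred_pet, backfill: "dog")
old_code_save(lenient, "alice")
p lenient.rows.last

# Database-level constraint: the same write is rejected loudly instead.
strict = UsersTable.new(not_null: [:preferred_pet])
strict.add_column(:preferred_pet, backfill: "dog")
begin
  old_code_save(strict, "bob")
rescue UsersTable::NotNullViolation => e
  puts "rejected: #{e.message}"
end
```

The point of the second half is that a failed request during the deploy window is visible and recoverable, while a NULL that slips past the application is neither.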

Things get even more interesting when you're using a single-page app with a REST API. Let's say your application code has deployed and the migration has finished, but the user hasn't reloaded the page in a while and is editing the user record with a stale version of the frontend code. When they try to save the user, they'll get a mysterious "Preferred pet is required" error message, but won't have any way to set the field!

Solution 1: Think About Backwards Compatibility

The basic way to prevent deployment races is to make every service deployment backwards-compatible with current versions of other services. That includes "services" within your seemingly monolithic application. While this rule is simple in theory, it's quite easy to miss edge cases, with potentially nasty results.

Let's consider the preferred_pet example: how could you make it safe from deployment race conditions? One option would be to have a temporary database-level default that will prevent NULL fields until the code deployment is complete. A safe upgrade would involve 3 separate deployments:
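
As a rough illustration of how a staged rollout with a temporary default can keep every write valid, here is a toy in-memory model in plain Ruby. The ordering of the steps is one plausible reading, not a definitive recipe, and Table, legacy_save, and current_save are hypothetical names. In a real Rails app the first and last steps would correspond roughly to add_column with a default: option and change_column_default.

```ruby
# Toy model of a table whose columns can carry a database-level default.
class Table
  attr_reader :rows

  def initialize
    @defaults = {} # column name => default value (nil means "no default")
    @rows = []
  end

  def add_column(name, default: nil)
    @defaults[name] = default
    @rows.each { |row| row[name] = default } # backfill existing rows
  end

  def drop_default(name)
    @defaults[name] = nil
  end

  # Columns missing from attrs fall back to the column default, as in SQL.
  def insert(attrs)
    row = @defaults.merge(attrs)
    @rows << row
    row
  end
end

# Old release: does not know about preferred_pet.
def legacy_save(table, name)
  table.insert(name: name)
end

# New release: always sets preferred_pet explicitly.
def current_save(table, name, pet)
  table.insert(name: name, preferred_pet: pet)
end

users = Table.new
users.add_column(:name)
legacy_save(users, "alice")

# Deploy 1: migration only. Add the column with a temporary default.
users.add_column(:preferred_pet, default: "dog")
legacy_save(users, "bob") # old code is still running; the default fills the gap

# Deploy 2: application code that knows about preferred_pet.
current_save(users, "carol", "cat")
legacy_save(users, "dave") # a straggler on the old release is still safe

# Deploy 3: only once no old code remains, drop the temporary default.
users.drop_default(:preferred_pet)
current_save(users, "erin", "dog")

puts users.rows.all? { |row| %w[cat dog].include?(row[:preferred_pet]) }
```

Note that the third step is only safe after every instance of the old release is gone; a legacy_save after drop_default would store a nil again.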


This ignores the problem of frontend-backend compatibility, though. A truly safe upgrade might look something like this:


As you can see, things get complicated quite quickly! Unfortunately there's no way to avoid these issues in general; you either have to spend a lot of time thinking about backwards-compatibility or you have to accept some form of downtime (whether it's explicit downtime or "things not working right" for a period of time).
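
For the stale-frontend case specifically, one backwards-compatible approach is for the API to treat a missing preferred_pet as "leave it alone" rather than as an error. The following is a minimal sketch under that assumption, using plain hashes instead of a real controller; update_user, ALLOWED_PETS, and the status symbols are hypothetical names chosen for the example.

```ruby
ALLOWED_PETS = %w[cat dog].freeze

# Hypothetical update handler. `stored` is the current record and `params`
# is the parsed request body, both as plain hashes with symbol keys.
def update_user(stored, params)
  updated = stored.merge(params.slice(:name))

  if params.key?(:preferred_pet)
    # Current frontend: the field was sent, so validate it strictly.
    pet = params[:preferred_pet]
    return [:unprocessable_entity, stored] unless ALLOWED_PETS.include?(pet)

    updated[:preferred_pet] = pet
  end
  # Stale frontend: the key is absent, so keep whatever is already stored
  # instead of rejecting a request the user has no way to fix.

  [:ok, updated]
end

stored = { name: "alice", preferred_pet: "dog" }
p update_user(stored, { name: "Alice B." })           # payload from a stale frontend
p update_user(stored, { preferred_pet: "hamster" })   # payload from a current frontend
```

This only works if the stored value is already valid, which is why it pairs naturally with a backfill or a database-level default.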

Solution 2: Explicitly Separate Your "Services"

Given that it's impossible to avoid thinking about backwards-compatibility with deploys, one way to prevent deployment races from slipping into your code is to explicitly separate non-atomic aspects of your codebase and test accordingly. At Rainforest, we decided that it was worth separating our backend and frontend code for exactly this reason.

To make this work, our frontend and backend code are in separate repositories with their own deployment pipelines. We even have two completely separate staging environments:


Since we test both backend and frontend code with Rainforest before deployment, bugs caused by backwards-incompatible changes are extremely rare.

Of course, this kind of setup is quite a lot of work to implement and maintain (as well as being expensive to run). You have to figure out if the tradeoff is worth it for your application. (We could implement a similar system for database migrations, for instance, but it hasn't been worth it so far.)

More Reading

If you want to go further in depth on this topic, there are a couple of great posts on deployment raciness in our deployment academy:

