In my last post I laid out our reasoning for moving from Heroku to Google Kubernetes Engine (GKE) and other GCP services. Now I'll describe the actual migration process in detail. This isn't designed as a how-to guide for migrating from Heroku to GKE—Google has their own excellent tutorial for that—but rather a description of some of the challenges of migrating real-world production applications and how we overcame them.
Moving our applications to GKE was a complicated project with many moving parts, but there were three main tasks we needed to accomplish:
The database migration was by far the most involved and risky of these tasks—once we moved our data to GCP it would be extremely difficult to roll back, so it had to work the first time. In order to minimize the risk, we wanted to make sure that the rest of the migration process was complete before doing the final database switch.
To accomplish this, we decided to temporarily run our applications on GKE pointing to our old databases on Heroku. This could only work if the GKE apps had low-latency connections to our Heroku databases, which were running in the AWS us-east-1 region; after some experimentation, we found that GCP us-east4 was the only region with acceptable latency to AWS us-east-1. Unfortunately, us-east4 didn't have all of the services we needed (and also carried a price premium for some essential services), so we wanted GCP us-east1 to be our final destination.
With these constraints in mind, we came up with a multi-phase migration plan that was scheduled over a period of several months:
This plan had a few major advantages:
We decided early on in the migration process to use infrastructure-as-code wherever feasible. This allowed us to use our normal code review process for infrastructure changes and meant that we could track changes through version control. We chose Terraform as our infrastructure management tool, since it is more or less the industry standard and has support for a wide variety of cloud vendors.
Thanks to GKE's sane defaults and excellent Terraform integration, cluster setup was largely painless. After creating a Terraform module to ensure that our settings were consistent across our clusters, most of what we needed from Kubernetes worked essentially out-of-the-box. The only major cluster-level setup steps we couldn't accomplish through GKE settings were:
We used a simple shell script for that extra setup.
One of the nice things about GCP is that it makes it easy to set up separate projects to segment resources and permissions. We ended up with three primary projects:
The staging and production projects each had a corresponding GKE cluster, both of which had read-only access to the images project. This setup allowed us to give all of our developers full access to the staging environment while limiting production access to our Ops team and a few senior engineers. GCP makes cross-project IAM permissions very easy to handle, and all of the project configuration and permissions were set up through Terraform.
The applications we initially migrated were Rails applications that followed twelve-factor guidelines closely, so porting them to Kubernetes was fairly straightforward. For most applications, the only code change we had to make was to add trivial health check endpoints for readiness probes. Beyond that, each application needed a few tedious steps to get it Kubernetes-ready:
To speed up the process of porting applications to Kubernetes, we made a Helm starter pack with a default application configuration similar to our Heroku setup.
The only significant extra development work was adding support for autoscaling background workers based on queue size (to replace HireFire, which we had previously relied on). GKE has built-in support for autoscaling from StackDriver metrics, so the challenge was to get our custom queue size metrics into StackDriver. We accomplished this by exporting our metrics to Prometheus and then deploying the Prometheus StackDriver adapter as a sidecar container.
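To make the queue-based scaling concrete, here is a minimal sketch of the replica calculation that the Kubernetes Horizontal Pod Autoscaler documents: desired = ceil(current replicas × current metric / target metric). The function name and the example numbers are hypothetical, and the sketch ignores the autoscaler's tolerance band and min/max replica bounds.

```shell
#!/bin/sh
# desired_replicas CURRENT_REPLICAS CURRENT_METRIC TARGET_METRIC
# CURRENT_METRIC is the observed per-pod average (for example, queued jobs
# per worker); TARGET_METRIC is the per-pod value we want to hold.
# Illustrative only: tolerance and min/max replica limits are not modeled.
desired_replicas() (
    current=$1; metric=$2; target=$3
    if [ "$target" -le 0 ]; then
        echo "target must be positive" >&2
        exit 1
    fi
    # Integer ceiling division: ceil(a / b) == (a + b - 1) / b for a >= 0.
    echo $(( (current * metric + target - 1) / target ))
)
```

For example, `desired_replicas 4 250 100` prints 10: four workers averaging 250 queued jobs each, against a target of 100 per worker, would be scaled up to ten workers.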
When we switched our busiest customer-facing application over to GKE, we were met with a nasty surprise. Our backend processing time showed a significant increase, more than could be accounted for by database latency alone.
This was disappointing, but we were still within our performance targets so it seemed like something we might have to live with. But when we made our first deployment in the new environment, things got much worse. Our pods started failing liveness probes and getting killed; pod availability temporarily dropped to 0 before eventually recovering.
Luckily we hadn't yet moved our data to GCP, so we were in a position where we could quickly roll back to Heroku and investigate the issue. After some load testing, we found that the problem was that the first few requests to newly-created pods took an extremely long time to handle—on the order of 15-30 seconds. This meant that pods were getting killed by the liveness probe before they were fully spun up. The issue hadn't been uncovered by our pre-production load tests because we hadn't specifically tested deploying under load.
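A rough way to see why a 15-30 second slow start is dangerous is to estimate how long a pod can keep failing its liveness probe before it is restarted. The sketch below is a simplified model, not our actual probe configuration: it assumes every probe fails while the pod is slow, and it uses the Kubernetes default probe settings (periodSeconds=10, timeoutSeconds=1, failureThreshold=3) in the example. The function names are hypothetical.

```shell
#!/bin/sh
# probe_kill_window PERIOD_SECONDS TIMEOUT_SECONDS FAILURE_THRESHOLD
# Approximate seconds between the start of the first failing probe and the
# restart: failures are spaced PERIOD apart, and the final one only counts
# once it has timed out.
probe_kill_window() (
    period=$1; timeout=$2; threshold=$3
    echo $(( (threshold - 1) * period + timeout ))
)

# survives_slow_start SLOW_SECONDS PERIOD TIMEOUT THRESHOLD
# Prints "survives" if the slow window ends before the kill window closes,
# otherwise "restarted". Assumes every probe fails while the pod is slow.
survives_slow_start() (
    slow=$1
    window=$(probe_kill_window "$2" "$3" "$4")
    if [ "$slow" -lt "$window" ]; then
        echo survives
    else
        echo restarted
    fi
)
```

With the default settings, `probe_kill_window 10 1 3` prints 21, so under this model a pod that is slow for 15 seconds survives while one that is slow for 30 seconds is restarted.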
After trying a wide variety of tweaks and hacks, we eventually found the root cause of both the deployment and operating performance issues: Kubernetes CPU limits. When we had initially configured our web pods, we had specified both CPU requests and limits in order to get a QoS class of "Guaranteed", which the docs seemed to suggest was best practice for high availability. We assumed that pods would only get throttled if they exceeded their CPU limits (which is what the relevant design docs suggest) and that throttling would not be an issue we ran into very often.
We were wrong on both counts. Both of the performance issues we encountered were caused by CPU limits: the Rails process seems to have been throttled on startup, which is why the initial requests to the pod were taking so long, and even after startup it seems the process was being throttled despite CPU utilization being well below its specified limit. When we removed CPU limits, both issues disappeared and performance returned to Heroku-like levels; we have since made it a policy to always avoid CPU limits when configuring applications on Kubernetes, and we haven't noticed any adverse effects.
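Because throttling can happen while average CPU utilization looks healthy, it helps to read the kernel's own throttling counters rather than infer throttling from utilization graphs. The sketch below parses the `nr_periods` and `nr_throttled` counters from a cgroup `cpu.stat` file. The function name is hypothetical, and the file location is an assumption that varies by system: typically `/sys/fs/cgroup/cpu/cpu.stat` under cgroup v1 and `/sys/fs/cgroup/cpu.stat` under cgroup v2, as seen from inside the container.

```shell
#!/bin/sh
# throttle_percent CPU_STAT_FILE
# Prints the percentage of CFS enforcement periods in which the cgroup was
# throttled, based on the nr_periods and nr_throttled counters.
# Prints 0.0 if no periods have been recorded (e.g. no CPU limit is set).
throttle_percent() (
    awk '
        $1 == "nr_periods"   { periods = $2 }
        $1 == "nr_throttled" { throttled = $2 }
        END {
            if (periods > 0) printf "%.1f\n", 100 * throttled / periods
            else             print "0.0"
        }
    ' "$1"
)
```

A persistently non-zero result on a pod whose utilization sits well below its limit is the kind of signal that would have pointed us at CPU limits much sooner.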
There's a GitHub issue that suggests this is not an uncommon problem, and it's disappointing that the trade-offs of CPU limits aren't discussed more clearly in the Kubernetes documentation. Still, it was the only issue we encountered during the migration that caused unexpected downtime, which is a testament to the relative stability of Kubernetes for such a young platform.
The database migration turned out to be far more challenging than moving our applications to Kubernetes. We had three Postgres databases to move, each with several hundred gigabytes of data. The normal way to migrate Postgres databases of that size with minimal downtime is:
Unfortunately, Heroku Postgres does not provide superuser permissions, which are necessary to set up replication, so using built-in Postgres replication wasn't possible for us.
When replication isn't available, the next best option is to make a copy of the data with pg_dump and move it to the new database with pg_restore, which results in longer downtime but is a simpler procedure. When we tried that, the results were not promising: it was taking several days to complete the dump/restore cycle with the default settings, an obviously unacceptable amount of downtime.
We had to find an alternative approach.
Our initial database migration strategy was to try to use replication without superuser permissions. The most promising option was an open source library called Teleport, which uses user-level Postgres triggers to simulate streaming replication. On paper, there was a lot to like about Teleport: it used CSV for an efficient initial dump/load without downtime, it was open source, and it was designed specifically for the use case we had in mind.
Evaluating Teleport was a long and tedious process, since testing its limits involved using production-like data sets. Each dump/restore test cycle took several days, and there were quite a few bugs that needed to be fixed before we successfully completed a full cycle. Our confidence in Teleport was not terribly high, but after several weeks of testing and tweaking it seemed as if the plan would work.
But when we turned on Teleport replication for a production database, all hell broke loose. Web requests started timing out, caused by database queries that never finished; investigation pointed to database locks that were being held by the Teleport dump process. We quickly turned off Teleport in order to regroup.
At that point we were faced with a difficult choice: we could continue trying to make Teleport work in a production setting, or we could abandon our strategy entirely. Given the difficulty of testing the database migration in a production-like environment and the risk of using software we had low confidence in, we chose the latter.
Our initial attempts at using pg_dump and pg_restore were laughably unsuccessful, but after the Teleport strategy had failed, the dump/restore approach seemed like our only remaining option. One thing we hadn't tried yet was to push the limits of the dump/restore process through more hardware and settings tweaks. After some research [1], we found several ways to speed up the process:
This approach was very successful, reducing the migration process to a matter of hours rather than days. We determined that 6 hours of total downtime was acceptable to the business; to give ourselves enough time for preparation and cleanup, we aimed for a target dump/restore time of 4 hours, which we were able to achieve [2]. It was a disappointingly long downtime period (the longest scheduled downtime in Rainforest history), but it was a small price to pay for the vastly reduced risk compared to the Teleport plan.
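The dump/restore commands from footnote 2 can be wrapped in a small script along these lines. This is a sketch rather than the script we actually ran: the `run` and `dump_and_restore` helpers and the `DRY_RUN` switch are hypothetical, while the pg_dump and pg_restore flags are the ones quoted in the footnote. The directory format (`-Fd`) is what makes the parallel jobs (`-j`) possible on the dump side, since pg_dump only supports parallel dumps in that format.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default in this sketch) commands are printed instead
# of executed, which is handy when reviewing or rehearsing the procedure.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# dump_and_restore SOURCE_URL TARGET_URL DUMP_DIR [DUMP_JOBS] [RESTORE_JOBS]
# Job counts default to the values from footnote 2 (8 and 16).
dump_and_restore() (
    source_url=$1; target_url=$2; dump_dir=$3
    dump_jobs=${4:-8}; restore_jobs=${5:-16}
    # Parallel dump of the public schema in directory format.
    run pg_dump -d "$source_url" -n public -j "$dump_jobs" \
        --no-owner --no-acl -Fd -f "$dump_dir" || exit 1
    # Parallel restore from the same directory; only runs if the dump succeeded.
    run pg_restore -d "$target_url" -j "$restore_jobs" \
        --no-owner --no-acl -Fd "$dump_dir"
)
```

Running `dump_and_restore "$SOURCEURL" "$TARGETURL" "$dumpdir"` in dry-run mode prints the two commands so they can be checked before the maintenance window; setting `DRY_RUN=0` executes them.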
We knew that we only had one shot at executing the migration, so good preparation was essential. We wrote shell scripts for most steps and we made a detailed runbook with exact steps to execute, a laundry list of things to double-check along the way, and a rollback plan in case things went badly.
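One pattern that suits this kind of scripted runbook is a small step runner that logs each step, records how long it took, and stops at the first failure so that later steps never run against a bad state. The sketch below is illustrative and not taken from our scripts; the `step` helper and the step names in the comments are hypothetical, and `date +%s` is assumed to be available (it is on GNU and BSD systems).

```shell
#!/bin/sh
# step DESCRIPTION COMMAND [ARGS...]
# Runs one runbook step, logs its duration, and aborts the whole script on
# failure so that the operator can switch to the rollback plan.
step() {
    desc=$1; shift
    start=$(date +%s)
    printf '[%s] START %s\n' "$(date -u +%H:%M:%S)" "$desc"
    if "$@"; then
        printf '[%s] OK    %s (%ss)\n' "$(date -u +%H:%M:%S)" "$desc" \
            "$(( $(date +%s) - start ))"
    else
        printf '[%s] FAIL  %s: stop and follow the rollback plan\n' \
            "$(date -u +%H:%M:%S)" "$desc" >&2
        exit 1
    fi
}

# Example usage (hypothetical step names and scripts):
#   step "enable maintenance mode"  ./maintenance_on.sh
#   step "dump source database"     ./dump.sh
#   step "restore into target"      ./restore.sh
```

The timestamps from a rehearsal run give a per-step time budget to compare against during the real migration.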
The week before the migration we executed a full "rehearsal" migration following the migration runbook with a production-sized dataset; it was well worth the effort, since we found some bugs in our scripts and it gave us a very good idea of exactly how much time the migration would take. The production migration went off without a hitch, taking almost exactly 6 hours.
Platform migrations are inherently long and complex projects, and Rainforest's GCP migration was no exception: all told, it took about six months from the start of the proof-of-concept to the point where our largest applications were running completely on GCP, and we still haven't finished moving some of our auxiliary services off of Heroku.
Thinking back on the overall project, here are a few key lessons I took away, both from the successful aspects of the migration and the parts where we stumbled:
When moving to a new platform, particularly one like Kubernetes that comes with a lot of hype, there's always a temptation to move to the latest "best practices" in as many ways as possible—it's easy to start harboring thoughts like "If we're moving to Kubernetes, we might as well move to a microservice architecture as well". As tempting as these ideas are, they can easily lead to scope creep, which is a key reason that many large projects fail.
For the GCP migration, I think we did a good job limiting scope to the necessities. It helped that our source and target environments were fairly similar, but it would have been easy to be tempted by fashionable "Cloud Native" technologies like service meshes, secrets management platforms, scalable time-series metrics toolkits, or high-performance RPC frameworks.
All of these technologies have their place, but they weren't actually necessary for the success of the project, and by avoiding scope creep we were able to successfully complete the project within the expected time frame. Now that we have moved to our new platform, we can start to use these sorts of technologies when we actually need them without being restricted by a migration timeline.
Early on in migration planning, we were leaning towards a "big bang" approach of moving all of our applications and databases to GCP during a single maintenance window. This might have worked with enough advance planning and on-call diligence, but the phased approach was far better for everyone's sanity—when we ran into unexpected performance issues, we had the luxury of being able to quickly roll back to the old environment.
No amount of planning can cover all possibilities—in our case, we had tested running the application under load but not deploying under load—so the ability to roll back is the best insurance for the unexpected.
We couldn't easily roll back the database migration, however, so we took a "measure twice, cut once" approach. Being involved in a six-hour mission-critical maintenance procedure is never fun, but it's far less stressful and more likely to succeed if it has been practiced before and has an extremely detailed runbook. The "rehearsal" migration was a non-obvious and highly valuable part of the overall migration plan.
The one aspect of the project that did not go well was the original database migration plan—we spent at least a month pursuing a dead end trying to get Teleport to work for our needs. It's not a coincidence that this was the most complex part of the overall migration plan: we had to account for the possibility of data loss or production instability, as well as possibly buggy code that we didn't fully understand.
The dump/restore approach, though it had its drawbacks, was far simpler and more reliable; we probably should have pursued that approach to begin with.
We originally pursued the Teleport-based approach to minimize downtime; as engineers, it's easy to think of downtime as an inherent failure that should be avoided at all costs. It turns out that this was a mistake: when we proposed a process with more downtime, there was virtually no pushback from the rest of the business, and it seemed clear that the cost of downtime during a low-traffic period was less of a problem for our customers than the possibility of general instability or data loss.
Making the right technical decision can often depend on particular business circumstances—for some businesses, a long downtime could translate directly into lost revenue, for instance—so it's important to make these sorts of decisions with a good understanding of the overall constraints.
[1] https://gitlab.com/yanar/Tuning/wikis/improve-pg-dump&restore was particularly helpful.
[2] For the curious, the commands we used for dump/restore were pg_dump -d "$SOURCEURL" -n public -j 8 --no-owner --no-acl -Fd -f "$dumpdir" and pg_restore -d "$TARGETURL" -j 16 --no-owner --no-acl -Fd "$dumpdir"