Until late last year, Rainforest ran most of our production applications on Heroku. Heroku was a terrific platform for Rainforest in many ways: it allowed us to scale and remain agile without hiring a large Ops team, and the overall developer experience is unparalleled. But in 2018 it became clear that we were beginning to outgrow Heroku. We ended up moving to Google Cloud Platform (GCP) with most of our applications running on Google Kubernetes Engine (GKE); here’s how we made the decision and picked our new DevOps tech stack.
Rationale: The 3 Main Driving Factors Behind The Switch
We are heavy Postgres users at Rainforest, with most of our customer data in a single large database (eventually we will probably split our data into smaller independent services, but that’s an engineering effort we don’t want to dive into quite yet). In 2018 we were rapidly approaching the size limits of our Heroku database plan (1 TB at the time) and we didn’t want to risk hitting any limits while still under contract with Heroku.
We were also running into limitations with Heroku on the compute side: some of our newer automation-based features involve running a large number of short-lived batch jobs, which doesn’t work well on Heroku (due to the relatively high cost of computing resources). As a stopgap we’ve been running a few services on AWS Batch, but we’ve never been particularly happy with that solution since we have multiple compute environments that are vastly different from an operational perspective (very few engineers on the team have a deep understanding of Batch).
As Rainforest grows, application security is of tantamount importance, and the “standard” Heroku offering was becoming problematic due to the lack of flexibility around security (for instance, the inability to set up Postgres instances in private networks). At one point we attempted to migrate to Heroku Shield to address some of these issues, but we found that it wasn’t a good fit for our application.
Perhaps surprisingly, cost was not the initial driving factor in the decision to move away from Heroku. Heroku has a reputation for being extremely expensive, but that hasn’t been our experience in general: when factoring in the savings from keeping a lean Ops team Heroku was quite cost-effective when compared to the major cloud providers. This was especially true because the bulk of our hosting costs go towards databases, and managed Postgres service costs are similar across cloud providers (including Heroku).
Nevertheless, Heroku’s costs were becoming an issue for a couple of reasons:
- Heroku’s default runtime doesn’t include a number of security-related features that come “out of the box” with the major cloud providers, such as Virtual Private Cloud. Once those features become a requirement (which they were for us), Heroku becomes a much less cost-effective choice.
- GCP and AWS are both cheaper for raw computing resources than Heroku, and as mentioned earlier we haven’t been able to run all of our compute-intensive services on Heroku. When planning for future growth, we wanted a platform that could handle our web services and our more compute-intensive workloads with a common set of tooling.
Our Heroku Setup
Heroku is very opinionated in how it expects you to run applications on its environment: all applications must follow 12-Factor guidelines to run well, and applications are always run in Heroku’s dynos which are not terribly flexible. These restrictions come with significant benefits, though. 12-Factor apps are easy to scale horizontally and have very few dependencies on their environment, making them easy to run in local development environments and port to new production environments. We followed the 12-Factor guidelines very closely for our applications, and persistent data was stored exclusively in Postgres or third-party services like S3.
For autoscaling, we used HireFire for most of our Heroku applications. Web workers were generally scaled based on load, but background workers were scaled based on queue sizes of various kind. (This turned out to be a tricky feature to mimic in most other PaaS offerings.)
Given that we were moving away from Heroku, we needed a new platform that would run our applications without too much porting work. We could have skipped containerized solutions entirely and run our code directly on VMs (using tools like capistrano to perform the actual deployment), but we quickly discarded this option for a number of reasons:
- Our environment is heterogeneous: our biggest applications use Rails, but we also have smaller services written in Go, Elixir, Python, and Crystal. Maintaining separate deployment pipelines for each language would have been a major pain point.
- Setting up essential features such as autoscaling, high-availability, monitoring, and log aggregation would have involved significant development time, and it would have been virtually impossible to implement them in a vendor-agnostic way.
- Heroku behaves like a containerized environment (with similar technologies under the hood as Docker), and it was what our developers were used to. We would have had to see significant benefits to move to an alternative model.
In general, the industry is moving towards containerized deployment for precisely reasons like these, and we didn’t see any compelling reasons to go against the trend.
Evaluating Four Major Docker-based Platforms
AWS Elastic Beanstalk
AWS markets Elastic Beanstalk as their “easy-to-use” way to run containerized applications. While this theoretically seemed like an interesting option, initial experiments showed it to be far from easy to use in practice. Elastic Beanstalk has also not seen many significant updates in quite some time, so AWS’s commitment to the product is unclear. It was an easy option to say no to.
One option we considered more seriously was Convox, which bills itself as an open-source alternative to Heroku (using AWS as its underlying infrastructure provider). Given that we fit their customer profile, the migration would probably have been fairly straightforward.
After some evaluation, though, we were concerned about relying on a platform with relatively little traction in the industry compared to the major cloud providers. Convox gives its customers direct access to underlying AWS resources, which is nice, but business changes at Convox could still have left us relying on an unsupported product—not a risk we were comfortable with for such a critical vendor. Convox was also missing a few key features related to autoscaling, which was the final nail in the coffin.
AWS Elastic Container Service (ECS)
ECS is more or less a direct competitor to Kubernetes, offering a way to run containerized applications with a great deal of flexibility (at the cost of complexity). We already had some exposure to ECS through AWS Batch (which is a layer on top of ECS) and we weren’t particularly impressed with the user experience. We also weren’t keen on the amount of vendor lock-in we’d be accepting by using ECS (it would have been impossible, for instance, to set up a production-like environment on developer laptops), or happy about the amount of development work it would have taken to set up custom autoscaling and similar features.
If no better alternatives existed we might have settled on ECS, but thankfully that wasn’t the case.
Kubernetes was the clear standout among the options we considered for a number of reasons:
- Kubernetes has a huge amount of traction in the DevOps landscape, with managed implementations from all the major cloud vendors and virtually endless training materials and complementary technologies.
- Kubernetes is open source, which was a major plus: it meant that we could avoid vendor lock-in and implement local development environments that mimic production.
- Kubernetes has a large feature set that fit well with our requirements, including our more exotic necessities like autoscaling based on custom metrics.
Kubernetes’ detractors often say that its complexity is overkill for many situations. While it’s true that Kubernetes is an incredibly large and complicated piece of software, the basic abstractions are mostly intuitive and well thought-out and we’ve been able to side-step a lot of the complexity for a couple of reasons:
- Kubernetes is a natural platform for 12-Factor apps, which have no need for data persistence, statefulness, and other hairy issues.
- Using a managed Kubernetes service as a client is orders of magnitude easier than actually running a Kubernetes cluster.
Why Google Cloud Platform?
We had decided to use Kubernetes, so the question remained: which Kubernetes? Running a production-worthy Kubernetes cluster on raw VMs was not really a viable option for us (since our Ops team is still relatively small), so we evaluated managed Kubernetes services on the three most prominent cloud providers: AWS, GCP, and Azure.
Kubernetes was not our only requirement: we also needed managed Postgres and Redis services. This eliminated Azure as an option, since its managed Postgres service is relatively immature compared to AWS and GCP (with data size limits comparable to Heroku’s). That left AWS and GCP, which were equally good choices in most respects: cost projections were remarkably similar, and both platforms offer a great range of managed services.
There was, however, a huge difference between GKE, the managed Kubernetes service on GCP, and EKS, AWS’s equivalent. GKE is a far more mature product, with a number of essential features that EKS lacks:
- GKE manages the Kubernetes master and nodes, while EKS only manages the master. With EKS, we would have had to maintain the Kubernetes nodes completely, including maintaining security updates.
- GKE manages autoscaling at the cluster level and also has terrific support for horizontal pod autoscaling at the application level, including support for autoscaling on custom metrics. At the time of evaluation, EKS had no support for cluster-level autoscaling and extremely limited support for horizontal pod autoscaling of any kind.
Those differences only scratch the surface of the differences between GKE and EKS, but they were enough to eliminate EKS as a viable option.
The Tech Stack
With our big decisions made, we had to choose our new tech stack! When choosing technologies, we had a few guiding principles:
- Mostly managed: Our Ops Team is still quite small given the scope of its duties, so we wanted to minimize cases where they were responsible for running complicated software stacks. Our strong preference was for managed services where available.
- Minimize change: The migration was inevitably going to be a large change for the engineering team, but we wanted to make the transition as painless as possible. Where feasible, we wanted to keep our existing providers and practices in place.
- Boring where possible: The “Cloud Native” DevOps landscape is in an exciting and fast-moving phase, with new technologies springing up seemingly overnight. Rainforest’s hosting needs are generally quite simple, however: most of our services are “traditional” Postgres-backed web applications that communicate over REST APIs or message queues. While we appreciate the architectural flexibility that comes with Kubernetes (especially in comparison to Heroku), for the initial migration we decided not to go too far down the rabbit-hole of using “cutting-edge” auxiliary technologies that are not strictly necessary for our use-case.
With those guidelines in mind, we settled on the following technologies:
- Terraform: One of our more consequential early decisions was to move to infrastructure-as-code wherever possible. Terraform isn’t perfect, but it’s by far the most popular and complete option for managing infrastructure-as-code, especially on GCP. (We’ve used the transition as an “excuse” to bring many other aspects of our infrastructure under management by Terraform.)
- Google Kubernetes Engine: Given our decision to use Kubernetes, GKE was a no-brainer—it’s fully managed and has a very rich feature-set.
- Cloud SQL for PostgreSQL: Our Postgres databases are probably the single most critical part of our infrastructure, so it was important to find a managed Postgres service that supported the features we wanted (such as high availability, automated backups, and internal network connectivity). Cloud SQL fit the bill.
- Cloud Memorystore: We are relatively light users of Redis, but we do use it as a caching layer for some applications. Cloud Memorystore is a relatively no-frills Redis implementation but was good enough for our needs.
- Helm: Helm fills in some “missing pieces” for deploying to Kubernetes (for instance, templating and release management). We chose it over alternatives due to its large community and relative simplicity. For the actual deployment process, we use Cloud Build to build our applications’ Docker images and CircleCI to initiate releases.
- Stackdriver: Stackdriver is more or less the “default” logging and monitoring solution on GKE, and it has some integrations that were necessary for our implementation.
We were able to keep most of our other existing infrastructure tools (such as Cloudflare and Statuspage with minimal changes.
There were also a few technologies that we considered but didn’t make the cut for the initial transition:
- Istio: When we began the transition, installing and managing Istio was a manual process and seemed far too involved for our needs. GKE has since added built-in Istio support, which we may consider using in the future, but at our scale we don’t yet see the need for a service mesh.
- Vault: Vault has a number of compelling features for secrets management, but the fact that we would have to run it ourselves as a critical piece of infrastructure is a major disadvantage. We may consider adding it as part of a future infrastructure upgrade, however.
- Spinnaker, Weaveworks, and similar: Kubernetes allows for a huge amount of deployment flexibility, and there are a number of powerful CI/CD options that integrate with Kubernetes to implement things like customized deployment strategies. But we had a pre-existing CI/CD pipeline (using CircleCI) that we were quite happy with, so we decided to implement the minimal changes necessary to integrate with Kubernetes rather than try to implement something”fancier”.
Now that I’ve laid out our reasoning for moving from Heroku to Google Kubernetes Engine (GKE) and other GCP services, I’ll describe the actual migration process in detail. This isn’t designed as a how-to guide for migrating from Heroku to GKE—Google has their own excellent tutorial for that—but rather a description of some of the challenges of migrating real-world production applications and how we overcame them.
Moving our applications to GKE was a complicated project with many moving parts, but there were three main tasks we needed to accomplish:
- Setting up our Kubernetes clusters
- Porting our application to Kubernetes
- Migrating our Postgres databases from Heroku Postgres to Google Cloud SQL
The database migration was by far the most involved and risky of these tasks—once we moved our data to GCP it would be extremely difficult to roll back, so it had to work the first time. In order to minimize the risk, we wanted to make sure that the rest of the migration process was complete before doing the final database switch.
To accomplish this, we decided to temporarily run our applications on GKE pointing to our old databases on Heroku. This could only work if the GKE apps had low-latency connections to our Heroku databases, which were running in the AWS us-east-1 region; after some experimentation, we found that GCP us-east4 was the only region with acceptable latency to AWS us-east-1. Unfortunately, us-east4 didn’t have all of the services we needed (and also carried a price premium for some essential services), so we wanted GCP us-east1 to be our final destination.
With these constraints in mind, we came up with a multi-phase migration plan that was scheduled over a period of several months:
- Phase 1: Cluster setup: Set up production-ready GKE clusters in both us-east1 and us-east4.
- Phase 2: Application migration: Port applications to Kubernetes and start deploying application code to the GKE clusters in addition to Heroku, with the us-east4 versions of the applications pointing to the Heroku databases and the us-east1 versions using Cloud SQL.
- Phase 3: Application switchover: Switch production DNS to point to the us-east4 versions of the applications.
- Phase 4: Final switchover: Move data to the Cloud SQL instances and point production DNS to the us-east1 versions of the applications.
This plan had a few major advantages:
- The first three phases could be rolled out gradually without downtime.
- Before Phase 4, we could easily roll back any changes (this turned out to be a lifesaver, as we’ll see shortly).
- Phase 4 remained risky, but the risk was largely isolated to the database level—we were unlikely to run into surprises at the application level since the apps would have already been running on Kubernetes for a period of time.
We decided early on in the migration process to use infrastructure-as-code wherever feasible. This allowed us to use our normal code review process for infrastructure changes and meant that we could track changes through version control. We chose Terraform as our infrastructure management tool, since it is more or less the industry standard and has support for a wide variety of cloud vendors.
Thanks to GKE’s sane defaults and excellent Terraform integration, cluster setup was largely painless. After creating a Terraform module to ensure that our settings were consistent across our clusters, most of what we needed from Kubernetes worked essentially out-of-the-box. The only major cluster-level setup steps we couldn’t accomplish through GKE settings were:
- Installing Helm for use in our deployment pipeline.
- Setting up the custom metrics StackDriver integration so we could autoscale with custom metrics.
- Installing kube-state-metrics so we could alert on a wide variety of conditions in the cluster.
We used a simple shell script for that extra setup.
One of the nice things about GCP is that it makes it easy to set up separate projects to segment resources and permissions. We ended up with three primary projects:
- A staging project for development and QA
- A production project for our production GKE clusters and databases
- A project for building Docker images for all our applications using Cloud Build
The staging and production projects each had a corresponding GKE cluster, both of which had read-only access to the images project. This setup allowed us to give all of our developers full access to the staging environment while limiting production access to our Ops team and a few senior engineers. GCP makes cross-project IAM permissions very easy to handle, and all of the project configuration and permissions were set up through Terraform.
The Basic Approach
The applications we initially migrated were Rails applications that followed 12 factor guidelines closely, so porting them to Kubernetes was fairly straightforward. For most applications, the only code change we had to make was to add trivial health check endpoints for readiness probes. Beyond that, each application needed a few tedious steps to get it Kubernetes-ready:
- Making the Dockerfile
- Setting up a Helm chart with the appropriate Kubernetes configuration
- Adding Kubernetes deployments to CI/CD
To speed up the process of porting applications to Kubernetes, we made a Helm starter pack with a default application setup similar to the Heroku.
The only significant extra development work was adding support for autoscaling background workers based on queue size (to replace HireFire, which we had previously relied on). GKE has built-in support for autoscaling from StackDriver metrics, so the challenge was to get our custom queue size metrics into StackDriver. We accomplished this by exporting our metrics to Prometheus and then deploying the Prometheus StackDriver adapter as a sidecar container.
Hitting a Snag: Kubernetes CPU Limits
When we switched over our most busy customer-facing application to GKE, we were met with a nasty surprise. Our backend processing time showed a significant increase, more than could be accounted for by database latency alone:
This was disappointing, but we were still within our performance targets so it seemed like something we might have to live with. But when we made our first deployment in the new environment, things got much worse. Our pods started failing liveness probes and getting killed; pod availability temporarily dropped to 0 before eventually recovering.
Luckily we hadn’t yet moved our data to GCP, so we were in a position where we could quickly roll back to Heroku and investigate the issue. After some load testing, we found that the problem was that the first few requests to newly-created pods took an extremely long time to handle—on the order of 15-30 seconds. This meant that pods were getting killed by the liveness probe before they were fully spun up. The issue hadn’t been uncovered by our pre-production load tests because we hadn’t specifically tested deploying under load.
After trying a wide variety of tweaks and hacks, we eventually found the root cause of both the deployment and operating performance issues: Kubernetes CPU limits. When we had initially configured our web pods, we had specified both CPU requests and limits in order to get a QoS class of “Guaranteed”, which the docs seemed to suggest was best practice for high availability. We assumed that pods would only get throttled if they exceeded their CPU limits (which is what the relevant design docs suggest) and that throttling would not be an issue we ran into very often.
We were wrong on both counts. Both of the performance issues we encountered were caused by CPU limits: the Rails process seems to have been throttled on startup, which is why the initial requests to the pod were taking so long, and even after startup it seems the process was being throttled despite CPU utilization being well below its specified limit. When we removed CPU limits, both issues disappeared and performance returned to Heroku-like levels; we have since made it a policy to always avoid CPU limits when configuring applications on Kubernetes, and we haven’t noticed any adverse effects.
There’s a Github issue that suggests this is not an uncommon problem, and it’s disappointing that the trade-offs of CPU limits aren’t discussed more clearly in the Kubernetes documentation. Still, it was the only issue we encountered during the migration that caused unexpected downtime, which is a testament to the relative stability of Kubernetes for such a young platform.
The database migration turned out to be far more challenging than moving our applications to Kubernetes. We had three Postgres databases to move, each with several hundred gigabytes of data. The normal way to migrate Postgres databases of that size with minimal downtime is:
- Make an external replica using streaming replication or similar
- Wait for the replica to catch up to the master database
- Promote the replica to be the new master
Unfortunately, Heroku Postgres does not provide superuser permissions, which are necessary to set up replication, so using built-in Postgres replication wasn’t possible for us.
When replication isn’t available, the next best option is to make a copy of the data with pgdump and move it to the new database with pgrestore, which results in longer downtime but is a simpler procedure. When we tried that, the results were not promising: it was taking several days to complete the dump/restore cycle with the default settings—an obviously unacceptable amount of downtime.
We had to find an alternative approach.
Plan A: Hacked Up Replication
Our initial database migration strategy was to try to use replication without superuser permissions. The most promising option was an open source library called Teleport, which uses user-level Postgres triggers to simulate streaming replication. On paper, there was a lot to like about Teleport: it used CSV for an efficient initial dump/load without downtime, it was open source, and it was designed specifically for the use case we had in mind.
Evaluating Teleport was a long and tedious process, since testing its limits involved using production-like data sets. Each dump/restore test cycle took several days, and there were quite a few bugs that needed to be fixed before we successfully completed a full cycle. Our confidence in Teleport was not terribly high, but after several weeks of testing and tweaking it seemed as if the plan would work.
But when we turned on Teleport replication for a production database, all hell broke loose. Web requests started timing out, caused by database queries that never finished; investigation pointed to database locks that were being held by the Teleport dump process. We quickly turned off Teleport in order to regroup.
At that point we were faced with a difficult choice: we could continue trying to make Teleport work in a production setting, or we could abandon our strategy entirely. Given the difficulty of testing the database migration in a production-like environment and the risk of using software we had low confidence in, we chose the latter.
Plan B: Throw Hardware at the Problem
Our initial attempts at using pgdump and pgrestore were laughably unsuccessful, but after the Teleport strategy had failed, the dump/restore approach seemed like our only remaining option. One thing we hadn’t tried yet was to push the limits of the dump/restore process through more hardware and settings tweaks. After some research,1 we found several ways to speed up the process:
- Using the pg_dump directory format (-Fd)
- Turning on parallelism for both pg_dump and pg_restore (-j)
- Using a Google Compute Engine instance with local SSD scratch disks for driving the migration
- Tweaking the Cloud SQL target instances for higher performance, for instance by pre-provisioning large SSDs for improved throughput
This approach was very successful, reducing the migration process to a matter of hours rather than days. We determined that 6 hours of total downtime was acceptable to the business; to give ourselves enough time for preparation and cleanup, we aimed for a target dump/restore time of 4 hours, which we were able to achieve.2 It was a disappointingly long downtime period—the longest scheduled downtime in Rainforest history—but it was a small price to pay for the vastly reduced risk compared to the Teleport plan.
We knew that we only had one shot at executing the migration, so good preparation was essential. We wrote shell scripts for most steps and we made a detailed runbook with exact steps to execute, a laundry list of things to double-check along the way, and a rollback plan in case things went badly.
The week before the migration we executed a full “rehearsal” migration following the migration runbook with a production-sized dataset; it was well worth the effort, since we found some bugs in our scripts and it gave us a very good idea of exactly how much time the migration would take. The production migration went off without a hitch, taking almost exactly 6 hours.
Conclusion and Takeaways
Platform migrations are inherently long and complex projects, and Rainforest’s GCP migration was no exception: all-told, it took about six months from the start of the proof-of-concept to the point where our largest applications were running completely on GCP, and we still haven’t finished moving some of our auxiliary services off of Heroku.
Thinking back on the overall project, here are a few key lessons I took away, both from the successful aspects of the migration and the parts where we stumbled:
Limit Scope When Possible
When moving to a new platform, particularly one like Kubernetes that comes with a lot of hype, there’s always a temptation to move to the latest “best practices” in as many ways as possible—it’s easy to start harboring thoughts like “If we’re moving to Kubernetes, we might as well move to a microservice architecture as well”. As tempting as these ideas are, they can easily lead to scope creep, which is a key reason that many large projects fail.
For the GCP migration, I think we did a good job limiting scope to the necessities. It helped that our source and target environments were fairly similar, but it would have been easy to be tempted by fashionable “Cloud Native” technologies like service meshes, secrets management platforms, scalable time-series metrics toolkits, or high-performance RPC frameworks.
All of these technologies have their place, but they weren’t actually necessary for the success of the project, and by avoiding scope creep we were able to successfully complete the project within the expected time frame. Now that we have moved to our new platform, we can start to use these sorts of technologies when we actually need them without being restricted by a migration timeline.
Have a Rollback Plan If You Can
Early on in migration planning, we were leaning towards a “big bang” approach of moving all of our applications and databases to GCP during a single maintenance window. This might have worked with enough advance planning and on-call diligence, but the phased approach was far better for everyone’s sanity—when we ran into unexpected performance issues, we had the luxury of being able to quickly roll back to the old environment.
No amount of planning can cover all possibilities—in our case, we had tested running the application under load but not deploying under load—so the ability to roll back is the best insurance for the unexpected.
Prepare Like Crazy If You Can’t Roll Back
We couldn’t easily roll back the database migration, however, so we took a “measure twice, cut once” approach. Being involved in a six-hour mission-critical maintenance procedure is never fun, but it’s far less stressful and more likely to succeed if it has been practiced before and has an extremely detailed runbook. The “rehearsal” migration was a non-obvious and highly valuable part of the overall migration plan.
Simple Usually Beats Complicated
The one aspect of the project that did not go well was the original database migration plan—we spent at least a month pursuing a dead end trying to get Teleport to work for our needs. It’s not a coincidence that this was the most complex part of the overall migration plan: we had to account for the possibility of data loss or production instability, as well as possibly buggy code that we didn’t fully understand.
The dump/restore approach, though it had its drawbacks, was far simpler and more reliable; we probably should have pursued that approach to begin with.
Understand the Business Constraints
The reason we originally pursued the Teleport-based approach in the first place was to minimize downtime—as engineers, it’s easy to think of downtime as an inherent failure that should be avoided at all costs. It turns out that this was a mistake: when we proposed a process with more downtime, there was virtually no pushback from the rest of the business, and it seemed clear that the cost of downtime during a low-traffic period was less of a problem for our customers than the possibility of general instability or data loss.
Making the right technical decision can often depend on particular business circumstances—for some businesses, a long downtime could translate directly into lost revenue, for instance—so it’s important to make these sorts of decisions with a good understanding of the overall constraints.
1https://gitlab.com/yanar/Tuning/wikis/improve-pg-dump&restore was particularly helpful.
2 For the curious, the commands we used for dump/restore were