Why We Moved from Heroku to Google Kubernetes Engine
Until late last year, Rainforest ran most of our production applications on Heroku. Heroku was a terrific platform for Rainforest in many ways: it allowed us to scale and remain agile without hiring a large Ops team, and the overall developer experience is unparalleled. But in 2018 it became clear that we were beginning to outgrow Heroku. We ended up moving to Google Cloud Platform (GCP) with most of our applications running on Google Kubernetes Engine (GKE); here's how we made the decision and picked our new DevOps tech stack.
Rationale: The 3 Main Driving Factors Behind The Switch
We are heavy Postgres users at Rainforest, with most of our customer data in a single large database (eventually we will probably split our data into smaller independent services, but that's an engineering effort we don't want to dive into quite yet). In 2018 we were rapidly approaching the size limits of our Heroku database plan (1 TB at the time) and we didn't want to risk hitting any limits while still under contract with Heroku.
We were also running into limitations with Heroku on the compute side: some of our newer automation-based features involve running a large number of short-lived batch jobs, which doesn't work well on Heroku (due to the relatively high cost of computing resources). As a stopgap we've been running a few services on AWS Batch, but we've never been particularly happy with that solution since we have multiple compute environments that are vastly different from an operational perspective (very few engineers on the team have a deep understanding of Batch).
As Rainforest grows, application security is of paramount importance, and the "standard" Heroku offering was becoming problematic due to the lack of flexibility around security (for instance, the inability to set up Postgres instances in private networks). At one point we attempted to migrate to Heroku Shield to address some of these issues, but we found that it wasn't a good fit for our application.
Perhaps surprisingly, cost was not the initial driving factor in the decision to move away from Heroku. Heroku has a reputation for being extremely expensive, but that hasn't been our experience in general: when factoring in the savings from keeping a lean Ops team, Heroku was quite cost-effective compared to the major cloud providers. This was especially true because the bulk of our hosting costs go towards databases, and managed Postgres service costs are similar across cloud providers (including Heroku).
Nevertheless, Heroku's costs were becoming an issue for a couple of reasons:
- Heroku's default runtime doesn't include a number of security-related features that come "out of the box" with the major cloud providers, such as Virtual Private Cloud. Once those features become a requirement (which they were for us), Heroku becomes a much less cost-effective choice.
- GCP and AWS are both cheaper for raw computing resources than Heroku, and as mentioned earlier we haven't been able to run all of our compute-intensive services on Heroku. When planning for future growth, we wanted a platform that could handle our web services and our more compute-intensive workloads with a common set of tooling.
Our Heroku Setup
Heroku is very opinionated in how it expects you to run applications on its environment: all applications must follow 12-Factor guidelines to run well, and applications are always run in Heroku's dynos, which are not terribly flexible. These restrictions come with significant benefits, though. 12-Factor apps are easy to scale horizontally and have very few dependencies on their environment, making them easy to run in local development environments and port to new production environments. We followed the 12-Factor guidelines very closely for our applications, and persistent data was stored exclusively in Postgres or third-party services like S3.
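To make the "very few dependencies on their environment" point concrete, here is a minimal sketch of 12-Factor-style configuration, where everything the app needs to know about its environment arrives through environment variables. This is an illustration rather than our actual code: DATABASE_URL follows the usual Heroku convention, while S3_BUCKET, WEB_CONCURRENCY's default, and the Settings class are assumptions made up for the example.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Everything the app needs to know about its environment."""
    database_url: str
    s3_bucket: str
    web_concurrency: int


def load_settings(env=None):
    """Build settings purely from environment variables.

    `env` defaults to os.environ; passing a plain dict makes the
    function easy to exercise in tests or local development.
    """
    env = os.environ if env is None else env
    return Settings(
        # Required values: a missing key raises KeyError at startup,
        # so a misconfigured deployment fails fast.
        database_url=env["DATABASE_URL"],
        s3_bucket=env["S3_BUCKET"],  # hypothetical variable name
        # Optional value with a default (the default of 2 is an assumption).
        web_concurrency=int(env.get("WEB_CONCURRENCY", "2")),
    )
```

Because nothing here is specific to one host, the same application code can run unchanged on a laptop, on a Heroku dyno, or in a container, as long as the variables are set.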
For autoscaling, we used HireFire for most of our Heroku applications. Web workers were generally scaled based on load, but background workers were scaled based on queue sizes of various kinds. (This turned out to be a tricky feature to mimic in most other PaaS offerings.)
Given that we were moving away from Heroku, we needed a new platform that would run our applications without too much porting work. We could have skipped containerized solutions entirely and run our code directly on VMs (using tools like Capistrano to perform the actual deployment), but we quickly discarded this option for a number of reasons:
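As a rough illustration of queue-based scaling, the sketch below picks a worker count from the current queue depth. It is not HireFire's algorithm, which we don't reproduce here; the function name, the jobs_per_worker ratio, and the bounds are all assumptions chosen for the example.

```python
import math


def desired_workers(queue_size, jobs_per_worker, min_workers=1, max_workers=20):
    """Return how many background workers to run for a given queue depth.

    queue_size:      number of jobs currently waiting
    jobs_per_worker: backlog one worker is expected to absorb (assumed tunable)
    min/max_workers: hard bounds so we never scale to zero or run away on cost
    """
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    # Round up so a partial batch of jobs still gets a worker.
    needed = math.ceil(queue_size / jobs_per_worker)
    # Clamp the result to the configured bounds.
    return max(min_workers, min(max_workers, needed))
```

For example, with 25 jobs per worker a backlog of 101 jobs asks for 5 workers, an empty queue falls back to the minimum of 1, and a very large backlog is capped at the maximum.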
- Our environment is heterogeneous: our biggest applications use Rails, but we also have smaller services written in Go, Elixir, Python, and Crystal. Maintaining separate deployment pipelines for each language would have been a major pain point.
- Setting up essential features such as autoscaling, high-availability, monitoring, and log aggregation would have involved significant development time, and it would have been virtually impossible to implement them in a vendor-agnostic way.
- Heroku behaves like a containerized environment (with similar technologies under the hood as Docker), and it was what our developers were used to. We would have had to see significant benefits to move to an alternative model.
In general, the industry is moving towards containerized deployment for precisely these kinds of reasons, and we didn't see any compelling reason to go against the trend.
With that in mind, we evaluated four major Docker-based platforms:
AWS markets Elastic Beanstalk as their "easy-to-use" way to run containerized applications. While this theoretically seemed like an interesting option, initial experiments showed it to be far from easy to use in practice. Elastic Beanstalk has also not seen many significant updates in quite some time, so AWS's commitment to the product is unclear. It was an easy option to say no to.
One option we considered more seriously was Convox, which bills itself as an open-source alternative to Heroku (using AWS as its underlying infrastructure provider). Given that we fit their customer profile, the migration would probably have been fairly straightforward.
After some evaluation, though, we were concerned about relying on a platform with relatively little traction in the industry compared to the major cloud providers. Convox gives its customers direct access to underlying AWS resources, which is nice, but business changes at Convox could still have left us relying on an unsupported product—not a risk we were comfortable with for such a critical vendor. Convox was also missing a few key features related to autoscaling, which was the final nail in the coffin.
ECS is more or less a direct competitor to Kubernetes, offering a way to run containerized applications with a great deal of flexibility (at the cost of complexity). We already had some exposure to ECS through AWS Batch (which is a layer on top of ECS) and we weren't particularly impressed with the user experience. We also weren't keen on the amount of vendor lock-in we'd be accepting by using ECS (it would have been impossible, for instance, to set up a production-like environment on developer laptops), or happy about the amount of development work it would have taken to set up custom autoscaling and similar features.
If no better alternatives existed we might have settled on ECS, but thankfully that wasn't the case.
Kubernetes was the clear standout among the options we considered for a number of reasons:
- Kubernetes has a huge amount of traction in the DevOps landscape, with managed implementations from all the major cloud vendors and virtually endless training materials and complementary technologies.
- Kubernetes is open source, which was a major plus: it meant that we could avoid vendor lock-in and implement local development environments that mimic production.
- Kubernetes has a large feature set that fit well with our requirements, including our more exotic necessities like autoscaling based on custom metrics.
Kubernetes' detractors often say that its complexity is overkill for many situations. While it's true that Kubernetes is an incredibly large and complicated piece of software, the basic abstractions are mostly intuitive and well thought out, and we've been able to sidestep a lot of the complexity for a couple of reasons:
- Kubernetes is a natural platform for 12-Factor apps, which have no need for data persistence, statefulness, and other hairy issues.
- Using a managed Kubernetes service as a client is orders of magnitude easier than actually running a Kubernetes cluster.
Why Google Cloud Platform?
We had decided to use Kubernetes, so the question remained: which Kubernetes? Running a production-worthy Kubernetes cluster on raw VMs was not really a viable option for us (since our Ops team is still relatively small), so we evaluated managed Kubernetes services on the three most prominent cloud providers: AWS, GCP, and Azure.
Kubernetes was not our only requirement: we also needed managed Postgres and Redis services. This eliminated Azure as an option, since its managed Postgres service is relatively immature compared to AWS and GCP (with data size limits comparable to Heroku's). That left AWS and GCP, which were equally good choices in most respects: cost projections were remarkably similar, and both platforms offer a great range of managed services.
There was, however, a huge difference between GKE, the managed Kubernetes service on GCP, and EKS, AWS's equivalent. GKE is a far more mature product, with a number of essential features that EKS lacks:
- GKE manages the Kubernetes master and nodes, while EKS only manages the master. With EKS, we would have had to maintain the Kubernetes nodes entirely by ourselves, including applying security updates.
- GKE manages autoscaling at the cluster level and also has terrific support for horizontal pod autoscaling at the application level, including support for autoscaling on custom metrics. At the time of evaluation, EKS had no support for cluster-level autoscaling and extremely limited support for horizontal pod autoscaling of any kind.
Those points only scratch the surface of the differences between GKE and EKS, but they were enough to eliminate EKS as a viable option.
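For readers unfamiliar with horizontal pod autoscaling, the core rule documented for the Kubernetes Horizontal Pod Autoscaler is desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue), with no change when the ratio is close enough to 1. The sketch below applies that rule to a custom metric such as queue depth per pod. It is a simplification under stated assumptions: the function and parameter names are ours, the 0.1 tolerance mirrors the documented default, and the real controller also accounts for pod readiness, missing metrics, and stabilization windows.

```python
import math


def hpa_desired_replicas(current_replicas, current_value, target_value,
                         min_replicas, max_replicas, tolerance=0.1):
    """Approximate the Horizontal Pod Autoscaler's replica calculation.

    current_value: observed metric (e.g. average queue depth per pod)
    target_value:  the value the autoscaler tries to hold the metric at
    tolerance:     ratios within this distance of 1.0 cause no scaling
    """
    ratio = current_value / target_value
    if abs(ratio - 1.0) <= tolerance:
        # Close enough to target: keep the current replica count.
        desired = current_replicas
    else:
        # Scale proportionally to how far the metric is from target.
        desired = math.ceil(current_replicas * ratio)
    # Respect the configured minimum and maximum replica counts.
    return max(min_replicas, min(max_replicas, desired))
```

With 4 replicas and a target of 100, an observed value of 200 doubles the deployment to 8 replicas, a value of 50 halves it to 2, and a value of 105 stays within tolerance and leaves it at 4.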
The Tech Stack
With our big decisions made, we had to choose our new tech stack! When choosing technologies, we had a few guiding principles:
- Mostly managed: Our Ops Team is still quite small given the scope of its duties, so we wanted to minimize cases where they were responsible for running complicated software stacks. Our strong preference was for managed services where available.
- Minimize change: The migration was inevitably going to be a large change for the engineering team, but we wanted to make the transition as painless as possible. Where feasible, we wanted to keep our existing providers and practices in place.
- Boring where possible: The "Cloud Native" DevOps landscape is in an exciting and fast-moving phase, with new technologies springing up seemingly overnight. Rainforest's hosting needs are generally quite simple, however: most of our services are "traditional" Postgres-backed web applications that communicate over REST APIs or message queues. While we appreciate the architectural flexibility that comes with Kubernetes (especially in comparison to Heroku), for the initial migration we decided not to go too far down the rabbit-hole of using "cutting-edge" auxiliary technologies that are not strictly necessary for our use-case.
With those guidelines in mind, we settled on the following technologies:
- Terraform: One of our more consequential early decisions was to move to infrastructure-as-code wherever possible. Terraform isn't perfect, but it's by far the most popular and complete option for managing infrastructure-as-code, especially on GCP. (We've used the transition as an "excuse" to bring many other aspects of our infrastructure under management by Terraform.)
- Google Kubernetes Engine: Given our decision to use Kubernetes, GKE was a no-brainer—it's fully managed and has a very rich feature-set.
- Cloud SQL for PostgreSQL: Our Postgres databases are probably the single most critical part of our infrastructure, so it was important to find a managed Postgres service that supported the features we wanted (such as high availability, automated backups, and internal network connectivity). Cloud SQL fit the bill.
- Cloud Memorystore: We are relatively light users of Redis, but we do use it as a caching layer for some applications. Cloud Memorystore is a relatively no-frills Redis implementation but was good enough for our needs.
- Helm: Helm fills in some "missing pieces" for deploying to Kubernetes (for instance, templating and release management). We chose it over alternatives due to its large community and relative simplicity. For the actual deployment process, we use Cloud Build to build our applications' Docker images and CircleCI to initiate releases.
- Stackdriver: Stackdriver is more or less the "default" logging and monitoring solution on GKE, and it has some integrations that were necessary for our implementation.
There were also a few technologies that we considered but didn't make the cut for the initial transition:
- Istio: When we began the transition, installing and managing Istio was a manual process and seemed far too involved for our needs. GKE has since added built-in Istio support, which we may consider using in the future, but at our scale we don't yet see the need for a service mesh.
- Vault: Vault has a number of compelling features for secrets management, but the fact that we would have to run it ourselves as a critical piece of infrastructure is a major disadvantage. We may consider adding it as part of a future infrastructure upgrade, however.
- Spinnaker, Weaveworks, and similar: Kubernetes allows for a huge amount of deployment flexibility, and there are a number of powerful CI/CD options that integrate with Kubernetes to implement things like customized deployment strategies. But we had a pre-existing CI/CD pipeline (using CircleCI) that we were quite happy with, so we decided to implement the minimal changes necessary to integrate with Kubernetes rather than try to implement something "fancier".
In a future post, we will cover the migration process itself.