Until late last year, Rainforest ran most of our production applications on Heroku. Heroku was a terrific platform for Rainforest in many ways: it allowed us to scale and remain agile without hiring a large Ops team, and the overall developer experience is unparalleled. But in 2018 it became clear that we were beginning to outgrow Heroku. We ended up moving to Google Cloud Platform (GCP) with most of our applications running on Google Kubernetes Engine (GKE); here's how we made the decision and picked our new DevOps tech stack.
We are heavy Postgres users at Rainforest, with most of our customer data in a single large database (eventually we will probably split our data into smaller independent services, but that's an engineering effort we don't want to dive into quite yet). In 2018 we were rapidly approaching the size limits of our Heroku database plan (1 TB at the time) and we didn't want to risk hitting any limits while still under contract with Heroku.
We were also running into limitations with Heroku on the compute side: some of our newer automation-based features involve running a large number of short-lived batch jobs, which doesn't work well on Heroku because of the relatively high cost of computing resources. As a stopgap we've been running a few services on AWS Batch, but we've never been particularly happy with that solution: it leaves us with multiple compute environments that are vastly different from an operational perspective, and very few engineers on the team have a deep understanding of Batch.
As Rainforest grows, application security is of paramount importance, and the "standard" Heroku offering was becoming problematic due to the lack of flexibility around security (for instance, the inability to set up Postgres instances in private networks). At one point we attempted to migrate to Heroku Shield to address some of these issues, but we found that it wasn't a good fit for our application.
Perhaps surprisingly, cost was not the initial driving factor in the decision to move away from Heroku. Heroku has a reputation for being extremely expensive, but that hasn't been our experience in general: when factoring in the savings from keeping a lean Ops team, Heroku was quite cost-effective compared to the major cloud providers. This was especially true because the bulk of our hosting costs go towards databases, and managed Postgres service costs are similar across cloud providers (including Heroku).
Nevertheless, Heroku's costs were becoming an issue for a couple of reasons:
Heroku is very opinionated in how it expects you to run applications on its environment: all applications must follow 12-Factor guidelines to run well, and applications are always run in Heroku's dynos which are not terribly flexible. These restrictions come with significant benefits, though. 12-Factor apps are easy to scale horizontally and have very few dependencies on their environment, making them easy to run in local development environments and port to new production environments. We followed the 12-Factor guidelines very closely for our applications, and persistent data was stored exclusively in Postgres or third-party services like S3.
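As a rough illustration of why 12-Factor apps port easily, here is a minimal Python sketch of environment-based configuration. It is not taken from Rainforest's codebase: the `Settings` class, the `load_settings` function, the variable names (`DATABASE_URL`, `REDIS_URL`, `WEB_CONCURRENCY`) and the defaults are all assumptions chosen for the example.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical application settings, all sourced from the environment."""
    database_url: str
    redis_url: str
    web_concurrency: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build settings from environment variables (names are illustrative)."""
    return Settings(
        # Required: raises KeyError if the platform did not provide it.
        database_url=env["DATABASE_URL"],
        # Optional, with defaults that suit a local development machine.
        redis_url=env.get("REDIS_URL", "redis://localhost:6379/0"),
        web_concurrency=int(env.get("WEB_CONCURRENCY", "2")),
    )
```

Because the application reads everything it needs from the environment, the same build can run on a laptop, in a Heroku dyno, or in a container, with only the injected variables changing.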
For autoscaling, we used HireFire for most of our Heroku applications. Web workers were generally scaled based on load, but background workers were scaled based on queue sizes of various kinds. (This turned out to be a tricky feature to mimic in most other PaaS offerings.)
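To make queue-based scaling concrete, here is a minimal Python sketch of one possible scaling rule. It is not HireFire's actual algorithm: the `desired_workers` function, the `jobs_per_worker` ratio, and the worker bounds are assumptions made for the example.

```python
import math


def desired_workers(queue_size: int, jobs_per_worker: int,
                    min_workers: int = 1, max_workers: int = 10) -> int:
    """Return a worker count proportional to the backlog (hypothetical rule)."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    # Enough workers that each one handles at most jobs_per_worker queued jobs.
    needed = math.ceil(max(queue_size, 0) / jobs_per_worker)
    # Clamp to the configured bounds so the pool never scales to zero
    # or grows without limit.
    return max(min_workers, min(max_workers, needed))
```

With `jobs_per_worker=25`, a backlog of 60 jobs asks for 3 workers, an empty queue falls back to the minimum of 1, and a backlog of 1000 is capped at the maximum of 10. A load-based rule for web workers would use a metric such as response time in place of queue size.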
Given that we were moving away from Heroku, we needed a new platform that would run our applications without too much porting work. We could have skipped containerized solutions entirely and run our code directly on VMs (using tools like capistrano to perform the actual deployment), but we quickly discarded this option for a number of reasons:
In general, the industry is moving towards containerized deployment for precisely these kinds of reasons, and we didn't see any compelling reason to go against the trend.
AWS markets Elastic Beanstalk as their "easy-to-use" way to run containerized applications. While this theoretically seemed like an interesting option, initial experiments showed it to be far from easy to use in practice. Elastic Beanstalk has also not seen many significant updates in quite some time, so AWS's commitment to the product is unclear. It was an easy option to say no to.
One option we considered more seriously was Convox, which bills itself as an open-source alternative to Heroku (using AWS as its underlying infrastructure provider). Given that we fit their customer profile, the migration would probably have been fairly straightforward.
After some evaluation, though, we were concerned about relying on a platform with relatively little traction in the industry compared to the major cloud providers. Convox gives its customers direct access to underlying AWS resources, which is nice, but business changes at Convox could still have left us relying on an unsupported product—not a risk we were comfortable with for such a critical vendor. Convox was also missing a few key features related to autoscaling, which was the final nail in the coffin.
ECS is more or less a direct competitor to Kubernetes, offering a way to run containerized applications with a great deal of flexibility (at the cost of complexity). We already had some exposure to ECS through AWS Batch (which is a layer on top of ECS) and we weren't particularly impressed with the user experience. We also weren't keen on the amount of vendor lock-in we'd be accepting by using ECS (it would have been impossible, for instance, to set up a production-like environment on developer laptops), or happy about the amount of development work it would have taken to set up custom autoscaling and similar features.
If no better alternatives existed we might have settled on ECS, but thankfully that wasn't the case.
Kubernetes was the clear standout among the options we considered for a number of reasons:
Kubernetes' detractors often say that its complexity is overkill for many situations. While it's true that Kubernetes is an incredibly large and complicated piece of software, the basic abstractions are mostly intuitive and well thought-out and we've been able to side-step a lot of the complexity for a couple of reasons:
We had decided to use Kubernetes, so the question remained: which Kubernetes? Running a production-worthy Kubernetes cluster on raw VMs was not really a viable option for us (since our Ops team is still relatively small), so we evaluated managed Kubernetes services on the three most prominent cloud providers: AWS, GCP, and Azure.
Kubernetes was not our only requirement: we also needed managed Postgres and Redis services. This eliminated Azure as an option, since its managed Postgres service is relatively immature compared to those of AWS and GCP (with data size limits comparable to Heroku's). That left AWS and GCP, which were equally good choices in most respects: cost projections were remarkably similar, and both platforms offer a great range of managed services.
There was, however, a huge difference between GKE, the managed Kubernetes service on GCP, and EKS, AWS's equivalent. GKE is a far more mature product, with a number of essential features that EKS lacks:
Those points only scratch the surface of the differences between GKE and EKS, but they were enough to eliminate EKS as a viable option.
With our big decisions made, we had to choose our new tech stack! When choosing technologies, we had a few guiding principles:
With those guidelines in mind, we settled on the following technologies:
There were also a few technologies that we considered but didn't make the cut for the initial transition:
In a future post, I'll cover the migration process itself.