How to Make Your Python Workers Scale Dynamically
We've recently had an interesting opportunity to experiment with deploying modern Python workers at Rainforest. We used to host most of our stack on Heroku, but it was not a good fit for this particular use case. This post explains the challenge we were facing and how we solved it, while also mentioning a bunch of cool tools that make development and deployment much less painful than it was a few years ago.
Dynamic Scaling for Python Workers
So what was our task? We wanted to run Python workers, each for some time between a few minutes and a few hours. We also wanted to have the flexibility to support thousands of workers running simultaneously during high-traffic periods without paying for the infrastructure in times of low demand - basically dynamic scaling. When you're used to building web applications, where a request taking more than 1s is considered bad and your demand is a little more constant, you have to slightly change your perspective and probably the tools you're using. Heroku can work great for some things, but dynamic scaling is not really one of them (things like HireFire can help, but we found that we wanted a bit more flexibility).
Other things that we wanted to support: modern Python with whatever parts of the scientific stack we want (numpy, scipy, scikit-learn, OpenCV) and easy deployment. I don't know about you, but following best practices of modern software development gives me a weird satisfaction, so making sure we have decent test coverage, a CI/CD pipeline and an automatically enforced coding style was fun. Don't judge me.
So to re-iterate, here's what we wanted:
- Scalability to thousands of parallel workers with no cost during down-times;
- Flexibility to run each worker for anywhere between a minute and an hour;
- Easy deployment of non-trivial code (e.g. C or Go extensions called from Python);
- Decent test coverage and a CI/CD setup;
- Modern Python (we don't necessarily want to use the version shipped with Ubuntu);
- Nice things like code style checks.
Before we get to the solution, let's make this concrete and lay out where all the requirements come from. Rainforest provides a QA-as-a-Service solution, where our customers write tests in plain English and we distribute them to humans to perform. Some of these tasks are repetitive and we're working on automating as many of those as possible while leaving the ones that require human judgment ("Does this website look OK?") to humans.
Our testers perform tests by connecting to virtual machines we provision for them: each tester gets a fresh VM that is created especially for that test and destroyed afterward - which is pretty complex operationally, but a hard requirement for reproducibility. Roughly, our main application communicates with a bunch of VMs, where each VM is controlled by a human.
We want our automations to work in the same way, because there's a bunch of infrastructure we have on our VMs that we don't want to duplicate. As a consequence, our automations need to effectively pretend to be humans: control the virtual machines using mouse/keyboard events, receive screenshots and make judgments about the state of the VM and how to proceed. A test can take anywhere from a minute to an hour to complete since it runs at human speed, so we want our infrastructure to look the same as before, with an automated worker in place of each human.
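To make the "pretend to be a human" idea a bit more concrete, here is a minimal sketch of the loop such a worker runs: look at a screenshot, decide whether the VM is ready, then send input events. Everything here is hypothetical (the FakeVM class, its methods, the coordinates and the looks_ready callback are made up for illustration); our real VM interface is not shown in this post.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class FakeVM:
    """Hypothetical in-memory stand-in for a VM, so the sketch runs anywhere."""

    frames: List[bytes]
    events: List[str] = field(default_factory=list)

    def screenshot(self) -> bytes:
        # Return the next canned frame, repeating the last one when we run out.
        return self.frames.pop(0) if len(self.frames) > 1 else self.frames[0]

    def click(self, x: int, y: int) -> None:
        self.events.append(f"click {x},{y}")

    def type_text(self, text: str) -> None:
        self.events.append(f"type {text}")


def run_step(vm, looks_ready: Callable[[bytes], bool], max_polls: int = 10) -> bool:
    """Poll screenshots until the judgment function is satisfied, then act."""
    for _ in range(max_polls):
        if looks_ready(vm.screenshot()):
            # Made-up coordinates and text; a real step would derive these
            # from the test instructions and the screenshot contents.
            vm.click(100, 200)
            vm.type_text("hello")
            return True
    return False


vm = FakeVM(frames=[b"loading", b"loading", b"ready"])
print(run_step(vm, lambda frame: frame == b"ready"), vm.events)
```

A real worker would also wait between polls and look at actual image data, which is where the scientific stack comes in.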
Turns out all this is totally doable and even pretty convenient! The tools we used are: Docker, pyenv, Pipenv, pytest, CircleCI, black and AWS Batch. What follows is a brief overview of each and how they can all fit together.
This is probably not very surprising, but Docker makes it pretty easy to ship code with whatever infrastructure and libraries you want to distribute. Do you want to build OpenCV yourself and use its Python bindings? Do you want to ship a Go library with your code? Do you want to call Ruby from Python? No problem, it's all pretty easy with Docker (doesn't mean that you should really do that last one). I've put up a sample Dockerfile as a Gist so you can see how this looks in practice.
This would be a good time to talk about pyenv. It's a great tool for managing multiple Python versions on your machine. Under the hood it's "just" a bunch of shell scripts, and it makes getting a new version of Python as easy as pyenv install 3.7.1. If you want to make use of shiny new features like f-strings and data classes, this is probably your best option!
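In case you haven't played with those two features yet, here's a tiny sketch of what they look like. It needs Python 3.7 or newer, and the WorkerJob class and its fields are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class WorkerJob:
    """Hypothetical job record, used only to illustrate the syntax."""

    test_id: int
    browser: str
    timeout_minutes: int = 60


# The data class generates __init__, __repr__ and __eq__ for us.
job = WorkerJob(test_id=42, browser="chrome")

# f-strings interpolate expressions directly inside the string literal.
summary = f"test {job.test_id} on {job.browser}, timeout {job.timeout_minutes} min"

print(summary)  # test 42 on chrome, timeout 60 min
print(job)      # WorkerJob(test_id=42, browser='chrome', timeout_minutes=60)
```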
We've all been using pip for years now to manage Python packages in our projects, but Pipenv is a slightly more modern tool that has a bunch of nice advantages. I'll let you watch the video to get the details, but if you've ever used and liked requests (especially if you had to deal with raw urllib before that), you're likely going to enjoy using Pipenv too - it was also written by Kenneth Reitz.
This is also a pretty standard recommendation, but testing is much nicer if you can use pytest. Among other benefits, it gives you nice syntax for assertions, fixtures etc. and is pretty much the standard way to test Python applications nowadays.
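Here's a small sketch of what that looks like in practice. The two helpers under test (describe_duration and write_report) are hypothetical and exist only so the tests have something to check; tmp_path is one of pytest's built-in fixtures.

```python
from pathlib import Path


def describe_duration(seconds: int) -> str:
    """Hypothetical helper under test: render a duration as 'Xm Ys'."""
    minutes, rest = divmod(seconds, 60)
    return f"{minutes}m {rest}s"


def write_report(directory: Path, lines) -> Path:
    """Hypothetical helper under test: write one result per line."""
    report = directory / "report.txt"
    report.write_text("\n".join(lines))
    return report


def test_describe_duration():
    # A plain assert is enough; pytest shows both sides when it fails.
    assert describe_duration(125) == "2m 5s"


def test_write_report(tmp_path):
    # pytest passes in a fresh temporary directory for the tmp_path argument.
    report = write_report(tmp_path, ["step 1: ok", "step 2: ok"])
    assert report.read_text().splitlines() == ["step 1: ok", "step 2: ok"]
```

Save this as a file whose name starts with test_ and run pytest in the same directory; it collects the test_ functions on its own, no test classes or special assert methods needed.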
We use CircleCI to serve all our CI/CD needs, which is useful for a couple of things: besides testing, you can also configure it to build your Docker images and push them to a registry. We use AWS ECR, since we want to use the images with Batch later on.
For reasons that are probably not entirely rational, of all the tools here, I enjoy black the most. It's an auto-formatter for Python code that has basically no configuration, meaning that if you use it you have to accept its opinionated approach. And that's a great thing! Suddenly code style becomes something you don't even have to think about - you can just set up your code editor to auto-format your code on save and you can stop caring about which quotes to use, how to break up your lines etc. You press "save" and all your code magically becomes beautifully PEP 8-compliant.
You can also go one step further and add a test to your test suite that fails whenever newly committed code does not follow the style.
Finally, a critical part of our infrastructure is AWS Batch - which basically handles launching however many workers we need and automatically scales them up and down. Setting it all up correctly does take a bit of expertise and you'll have to write some CloudFormation YAML files, so you might want to ask a friendly ops person for some help.
That's basically it - at the cost of some setup you can have a pretty nice and flexible way to ship Python code for longish-lived workers. Hope that helps!
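Once the queue and job definition exist, kicking off a worker from Python is fairly short. The sketch below uses boto3's Batch client and its submit_job call; the queue name, job definition name, TEST_ID environment variable and the one-hour timeout are all made-up placeholders, not our actual configuration.

```python
def build_job_request(test_id, queue, definition, timeout_seconds=3600):
    """Build keyword arguments for AWS Batch's SubmitJob call."""
    return {
        "jobName": f"worker-test-{test_id}",
        "jobQueue": queue,
        "jobDefinition": definition,
        # Pass per-job parameters to the container as environment variables.
        "containerOverrides": {
            "environment": [{"name": "TEST_ID", "value": str(test_id)}],
        },
        # Ask Batch to terminate an attempt that runs longer than this.
        "timeout": {"attemptDurationSeconds": timeout_seconds},
    }


def submit_worker(test_id, queue="workers-queue", definition="python-worker"):
    """Submit one worker job; the queue and definition names are placeholders."""
    import boto3  # imported here so build_job_request works without boto3

    batch = boto3.client("batch")
    response = batch.submit_job(**build_job_request(test_id, queue, definition))
    return response["jobId"]
```

Keeping the request-building part separate from the actual AWS call makes it easy to unit test without credentials.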
We no longer host on Heroku - we've recently migrated to Kubernetes on GCP. This also has implications for this project, as we'll probably end up migrating this infrastructure to Kubernetes as well. Stay tuned for a future post about this!