AWS re:Invent Recap: Measuring Software Quality with Mechanical Turk
Did you make it to AWS re:Invent in 2015? We did, and we’ve got the video to prove it! Rainforest QA co-founders Fred Stevens-Smith and Russell Smith gave a talk at our breakout session, “Continuous QA: Measuring Software Quality with Rainforest and Mechanical Turk.”
To kick the tension up a notch, they started the session with a live demo of the Rainforest QA platform, using the website of one of the audience members to run a simple test. While the test ran, Fred and Russ discussed the need for faster, more reliable QA testing that can keep pace with the CI and CD practices of fast-moving teams. They explained the role of Mechanical Turk in building a reliable, on-demand network of QA testers that helps give Rainforest QA customers the testing bandwidth they need to move fast and build great software.
Check out the video of the session below (or scroll down to the transcript) to hear what Fred and Russell had to say about leveraging Mechanical Turk to power Rainforest’s network of 50,000 testers, and to see the results of their live demo.
We look forward to seeing you in Vegas next year!
Transcript of AWS Breakout Session:
[beginning of transcription]:
Fred: Hello everybody! I'm Fred. I'm the CEO of Rainforest. This is my co-founder Russell, and we're going to talk to you a bit about how we do continuous QA and maybe give some ideas for those of you who are thinking of doing something similar. We're also going to talk quite a lot about how we use Amazon's humans-through-an-API service, Mechanical Turk, to power that. Just because we want to double down on the anxiety in the room, we thought we might do a live demo, unprepared, against one of your websites. Does anybody have a web application that they would like to volunteer to be tested? I'm seeing lots of head shaking ... there's a hand.
Audience Member: Graphiq.com, with a q.
Fred: Graphiq.com with a q, like this?
Audience Member: Yes
Fred: Do you have ... do we have some sort of functionality like a signup or something ...
Audience Member: Scroll down.
Fred: Vertical ... let’s try this guy, and then a search. This is perfect. What we'll do is we will create a couple of simple tests against it, and then we'll run them, and then while we're going through our slides they'll start accruing some results from the thousands of Mechanical Turk testers who are online right now. Let’s whiz through this. We're going to ... I'm going to show you Rainforest directly ... can everyone see?
Russell: Yeah maybe a little bigger.
Fred: Basically, this is Rainforest itself, the actual application. This is where customers write and maintain their test suite. If we go in here, we will add Graphiq with a q to our sites list. It's clearly production. Is this live on AxleGeeks?
Audience Member: Yeah
Fred: We can use this. Will AxleGeeks be angry with us?
Audience Member: No
Fred: If we find any bugs they might be. Okay, let's create a test. We are just going to test the search. It looks like AxleGeeks is a site for cars. Basically, the core of testing in Rainforest is what's called steps, and steps are basically how you specify a test. A step is an action and a question in natural language. There are two fundamental ideas behind Rainforest. The first is that each test should be relatively tight in scope: test one piece of functionality on its own. But the steps themselves can be very high level. To show you what I mean here, given that we want to test the search ... what should we test? Search for a car, and then what are we expecting? Let’s try my trusty Honda Accord ... we'll get a list of search results, hopefully. Nice! Okay.
Do you see a list of cars ... a relevant list of cars? There we go. That's our test, and it's simple. Let's add on some filtering stuff. Filter by vehicle type ... let's add something like ... filter by vehicle type, and then: does the page update correctly? You can be super high level like we are with these tests, or you can use Rainforest more like automation, where you get down and dirty and tell them: click on this particular input, enter this particular string. And the beauty of our approach, or how we think of it ... the beauty of our approach is that you can use the brains of the human testers who back our service to get away with these kinds of high-level instructions.
We have a test, so let's run it in all the browsers, why not? ... there we go. We have an ETA of 12:04 p.m. Perfect. This is now running, and we will check in with it periodically throughout the demo, okay? Let’s talk a bit about what Rainforest is ... if this loads. Can you see? Let's try like that. We'll just go through like that. What Rainforest is ... well, first of all, we're a company. We're based in San Francisco; right now we have about half of the team in San Francisco and half of the team all over the world. Very small, tiny little company, only 23 people right now. We have about 50,000 monthly active testers.
The vast majority of all these testers ... you can think of us as basically having a giant 50,000-person QA team, and we manage this QA team programmatically. The QA team is sourced from Mechanical Turk. What is QA? I mean ... basically, QA is about finding bugs in as efficient a way as possible; that's how we think about it. Everyone knows what QA is if you came to this session. More importantly, the key idea behind Rainforest was making this notion of QA compatible with the notion of continuous delivery. Russell and I have both, many times, built out a deployment and delivery process at various different places. Every time, the bottleneck was the QA process.
I'm sure that's very familiar to people here: either you invest a ton of time and engineering energy in automation so that you can have QA that's fast enough to do continuous delivery, or you're bottlenecked ... you're held back from continuous delivery by the QA cycle. That's the fundamental problem that we set out to solve with Rainforest. When we have conversations with people, VPs and CTOs, this is what we hear all the time. Basically: I'm moving fast, and I'm breaking things, and I'm really effing terrified about what's in production that I don't know about. Or: I'm not able ... we're not able as an organization to embrace continuous delivery because we don't have a fast enough QA process, and we need to do this set of regressions every time we release because we're in enterprise software or finance or whatever it is.
We think of this thing called continuous QA as basically being the bridge between these two worlds. The bridge between CI ... you're running all of your unit tests and stuff in development continuously ... and continuous delivery. We think of it as the bridge between those two. Why continuous QA? I mean, this is also like asking the question: why continuous delivery? I probably don't have to sell anybody here on this, I guess, but just to whiz through it: is Bank of America going to be shipping to production 20 times a day next year? Almost certainly not, right?
Will Bank of America's engineering organization have internalized the notion that shipping as often as possible, in the smallest chunks of code possible, is the least risky, most efficient way to do things? Probably yes. We think the whole world is moving towards this continuous delivery mode. What Rainforest does, or what continuous QA does, is basically make manual-powered QA regression testing a seamless part of that. Then there's a bunch of stuff about what continuous QA looks like ... I think the important part is that it has to be available 24/7, right? Not like your in-house QA team, or the one you have in India; those guys have a time zone, they have working hours.
Ultimately your developers might want to push to production at 3 a.m., and one of our customers does that. 24/7 availability is important. It has to be super fast because, as we all know, as soon as you start sitting there waiting for your stuff to hit production, the longer it takes, the more distracted you are, and the less valuable that whole continuous delivery idea is. It has to be cross-browser ... I think most of us today probably don't test comprehensively across all the browsers our customers use or that are important for the business, just because it takes too much time. I think the most important thing for us was ... it has to have a really great API, so that you can run tests and retrieve your results all programmatically. The core objective is ultimately that the human work you and your team do is creating and maintaining the test cases, but the runs and the results are all kicked off as part of a continuous delivery push managed by CircleCI or Bamboo or whatever you're using.
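The CI integration Fred describes — kick off a run programmatically, then block the deploy on the result — can be sketched roughly as follows. The endpoint URL, payload fields, and token scheme below are illustrative assumptions for the sketch, not Rainforest's actual API:

```python
"""Sketch of starting a QA run from a CI step and gating the deploy on it.

The endpoint, payload fields, and auth header are hypothetical; only the
shape (POST to start a run, poll, fail the build on a failed run) reflects
the workflow described in the talk.
"""

API_BASE = "https://example.invalid/api/v1"  # placeholder, not a real host

def build_run_request(tags, browsers, token):
    """Build the URL, JSON payload, and headers a CI step would POST."""
    url = f"{API_BASE}/runs"
    payload = {
        "tags": tags,          # which subset of the suite to run
        "browsers": browsers,  # e.g. legacy IE alongside Chrome
    }
    headers = {"Authorization": f"Token {token}"}
    return url, payload, headers

def run_is_blocking_failure(run_state):
    """CI fails the build only once the run completes with failures."""
    return run_state["state"] == "complete" and run_state["result"] == "failed"

url, payload, headers = build_run_request(["smoke"], ["ie8", "chrome"], "SECRET")
# A real CI step would POST this with an HTTP client, then poll the run id
# until it completes, failing the pipeline if run_is_blocking_failure(...).
```

The polling-until-complete shape is what makes the service usable as a blocking step before an automatic push to production.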
A bit about Rainforest, or a bit more about Rainforest, I suppose. On average, we have results in under 30 minutes; we aim for 20 minutes, which we found to be the sweet spot for continuous delivery. By results, we mean that for an average test suite size of 50 tests running in an average of 5 browsers, the average time for all of those tests to come back to you is about 20 minutes. It's incredibly fast, and how we do that, as you will learn in a second, is through massively parallelized execution: by sending hundreds of humans to your staging environment all at once. As you can see, that's 50,000 testers; that's the monthly active number. Daily actives hover between 5,000 and 10,000. 24/7 availability and fully cross-browser.
You may be thinking, well, this all sounds fine, but where does this fit ... why do I need this? We think that there are two main ways to do QA today. This is a slide from my fundraising days; my apologies for the legibility of it. There are two basic places to do QA today, right? You can hire humans, and that's a spectrum all the way from having your poor product managers or your support people do it in house, all the way to having a 5,000-person outsourced organization. Or you can do automation. I think the fundamental thesis of Rainforest is that automation isn't the answer for the majority of web and mobile application testing.
I think there are probably lots of diverse perspectives in the room about how effective automation can be. Ultimately what we've seen is that having a test code base living and piggybacking on top of your actual application code base eventually slows you down, or requires you to invest huge amounts into quality engineering. We think that automation can co-exist with human-powered QA if that human-powered QA is fast enough. What most of our customers do is see us as the first QA hire, right? Zenefits is a great example. We have a Zenefits developer in the audience, and he's looking worried. How Zenefits started with us ... they started using Rainforest when they were 18 people. They were looking and thinking: should we hire a QA person, should we invest in quality engineering ... what approach should we take here? Basically, what happened was they chose Rainforest.
Now they have 150 people in the engineering organization and 1,800 people in the total company. We are the kind of human verification layer on top of their automation. Basically, as organizations evolve and as your testing needs evolve, Rainforest can play a different role depending on what you're looking for. We are going to get past the whole pitchy stuff very quickly, don't worry. Handing over to my brilliant co-founder Russell to talk about Mechanical Turk.
Russell: Has anybody in the room used Mechanical Turk? Cool, I'd love to talk to you afterwards. For those of you who don't know, MTurk is a little-loved service from Amazon, but it's cool. It's definitely underused. It's an API for doing micro-tasking, and it's powered by humans. They have humans all over the world who sign up to get paid to do tiny tasks. You can add tasks to it using the interface or using an API, which is what most of the bigger requesters tend to do. The coolest thing is that it's elastic. Unlike contractors, where you would have to hire a thousand people to be around all the time, you can put work on and it'll get done. How quickly it gets done is not directly configurable, but you can play with that based on how much money you're paying and how you treat your workers.
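Listing a task on MTurk via the API looks roughly like this. The parameter names are the real fields that boto3's MTurk `create_hit` call accepts; the ExternalQuestion URL is a placeholder, and the actual AWS call is left commented out since it requires requester credentials:

```python
"""Sketch of listing a micro-task (a HIT) on Mechanical Turk.

Builds the parameters for boto3's mturk create_hit call; the
ExternalURL below is a placeholder for the requester's own tester UI.
"""

EXTERNAL_QUESTION = """\
<ExternalQuestion xmlns="http://mechanicalturk.amazonaws.com/AWSMechanicalTurkDataSchemas/2006-07-14/ExternalQuestion.xsd">
  <ExternalURL>https://example.invalid/tester-ui</ExternalURL>
  <FrameHeight>600</FrameHeight>
</ExternalQuestion>
"""

def build_hit(reward_usd: str, max_assignments: int) -> dict:
    """Parameters for mturk.create_hit(); Reward is a string in dollars."""
    return {
        "Title": "Run a short website test",
        "Description": "Follow plain-English steps and answer yes/no questions.",
        "Keywords": "qa, testing, website",
        "Reward": reward_usd,               # the requester sets the price
        "MaxAssignments": max_assignments,  # how many workers take the task
        "LifetimeInSeconds": 3600,          # how long the task stays listed
        "AssignmentDurationInSeconds": 900, # time a worker has to finish
        "Question": EXTERNAL_QUESTION,      # the iframe workers interact with
    }

hit = build_hit("0.25", 3)
# import boto3
# mturk = boto3.client("mturk", region_name="us-east-1")
# response = mturk.create_hit(**hit)
```

The elasticity Russell mentions falls out of `Reward` and `MaxAssignments`: pay more and list more assignments, and work gets picked up faster.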
It's possible to do low latency on MTurk, as you will see from our demo ... you can't see this now, but when you start a Rainforest run, within 20 to 60 seconds we've got a few hundred workers starting the test. It's very cool, and the price is configurable by you as the end user of MTurk. You can pay them anything from a cent up. You can see it's failing on the right-hand side; if you scroll down there's a list ... a live list of people coming into Rainforest and doing tests. If we go back to the slides ... that's a very brief bit about MTurk. Rainforest and MTurk ... we've basically built a lot of custom stuff on top of MTurk to make it do what we want. We take workers who don't necessarily know anything about QA ... they may, but they may not, and we automatically train them.
There's a whole system that we've built that takes a fresh worker who knows how to use a computer but may not know anything about QA, and we teach them our style of QA testing automatically. Behind the scenes as well, there's a lot of technology to make sure we have good results: the workers who aren't doing so well are retrained, and if they are still not doing well after that, they are removed from our system and go do other work on MTurk. I think the key part of Rainforest, and why it's useful, is that we've given the testers the correct tooling. After we've trained them ... I'll show you this in another demo we're going to do ... we give them the stuff they need to do the tests. Depending on what you're doing, that's maybe a virtual machine with Windows, or it may be a phone; at some point soon we'll try to get Macs as well. Today it's Windows and phones, Android phones to be specific.
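The train/rank/remove loop Russell describes can be sketched as a simple reputation score. The scoring scheme below (an exponential moving average of agreement with the consensus result, with fixed thresholds) is an assumption for illustration, not Rainforest's actual algorithm:

```python
"""Toy sketch of automatically training, ranking, and removing workers.

The EMA scoring and the threshold values are illustrative assumptions;
only the lifecycle (retrain underperformers, remove persistent ones)
comes from the talk.
"""

TRAIN_BELOW = 0.7   # send the worker back to training below this score
REMOVE_BELOW = 0.5  # remove the worker from the pool below this score
ALPHA = 0.3         # weight given to the most recent result

def update_score(score: float, result_was_correct: bool) -> float:
    """Exponential moving average of agreement with the consensus result."""
    return (1 - ALPHA) * score + ALPHA * (1.0 if result_was_correct else 0.0)

def next_action(score: float, already_retrained: bool) -> str:
    """Decide the worker's fate after their score is updated."""
    if score < REMOVE_BELOW and already_retrained:
        return "remove"   # back to other work on MTurk
    if score < TRAIN_BELOW:
        return "retrain"
    return "keep"

score = 0.9
for correct in [False, False, False]:  # a run of results against consensus
    score = update_score(score, correct)
# After three disagreements, a previously retrained worker is removed.
```

The point of making this programmatic is scale: no human manager could apply even a simple policy like this across tens of thousands of workers.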
We connect those directly into their browser, so they don't have to have any tooling at all, and they don't have to download anything, which is pretty cool. Rainforest is ... you can think of it as a two-sided marketplace for testing. Customers pay us, we put work out, and workers get paid to do it. We manage the entire life cycle of testers as well: from hiring them, to training them, to ranking them up, and also firing them. The tooling that we have today ... we run on custom infrastructure. It's dedicated servers; unfortunately we can't yet use EC2, but we're working on that. That's basically what powers the virtual machines that we give them.
We have a cluster of servers; testers come in, make a request, and get given the correct machine for them. Whether that's a Windows machine or something with some custom software, it's put directly in their browser, which is pretty cool. Because we do this, it means we can do lots of crazy stuff. We can record all the network traffic in and out of the box, for you the end user to replay, and also for fraud detection and quality. We can tell what the worker is doing down to an extremely granular level, which is pretty cool.
We support all the desktop stuff, so we're using KVM to run a lot of virtual machines. We're using Android as well. We're working on using Device Cloud ... Device Farm, I'm getting that wrong in a crazy way ... to let us use all the phones that Amazon now kindly supports. Soon we will support those phones, which is cool. The traditional workflow for developers is an interesting one to think about. Most orgs that have any kind of QA, or even if they don't have QA, end up doing something like this. They'll write code as an individual developer, and then they'll write tests, and then they'll push it to CI and ask for it to be reviewed, and then maybe it'll get QA'd sometime later.
Basically, it's normally requested that the QA department QAs it, if there is a separate department. If you look at the timings ... these are rough timings ... the problems come when you fail. When something goes wrong, it normally gets pushed back nearly to the first step, maybe to the tests if a test was badly written, but you end up wasting a ton of time every time the cycle breaks, plus the cycle is slow. This is basically the only way you can do QA with a traditional QA org, but with Rainforest the workflow is slightly different. You can write your code, and the developer or a PM can write the test ... we say PM as well because there is no code involved in writing our tests. Literal, plain English, not even Cucumber-style English.
It's literal English; it's like you’re talking to a human, thanks to MTurk. Then we can run your QA in a CI environment as well. We have technology ... some of it like ngrok, if you have ever used ngrok ... that lets us tunnel into a build server and send the humans from MTurk directly into that. For some people, that's working very well. Zenefits is a good example of using that. Every time a build happens on a feature branch, they can also fully QA that end to end, which is kind of crazy. Then, because it's all API-driven, we have a command line interface that fits into your CI. If you're running RSpec or JUnit or whatever, you just add another line, which is Rainforest, and then your tests execute. This is a few of the people that use us ... some of them you may have heard of. I guess the bigger ones are Cloudant, which is IBM, and Zenefits and BetterWorks, pretty big startups. There are a ton of people using this; if anybody ever wants to talk to them, let us know. Fred, this one's for you.
Fred: This is probably not the ideal audience to be talking about ROI, because it's a bit business-y. We did a very simplified case study with Zenefits, and I think the key things to think about are not just the cost of hiring the humans to execute and/or maintain your test cases, but also the throughput that's required, right? I think the fundamental thing that's probably different about Rainforest is that we are totally on demand. Because we have this on-demand crowd backing us, we are able to deliver QA execution, AKA results, knowing what works and what doesn't work, on demand. I think that's important because, as all of us know, when we're building software the demand for testing, the demand to understand whether stuff is working or not, is very spiky, right? I want to release; I want all my test cases done now. Then I don't want anything for another 6 hours until I need them done again.
That slide gives some rough ROI calculations about what Zenefits would have to do today if they didn't have Rainforest. Let's get back to the demo, which is much more interesting. Let’s have a look ... looks like we've got a failure here. As Russell was saying ... we'll walk you through the interface after this, but we thought this might be fun ... fun for us at least. As Russell was saying, we immediately start sending ... am I ... have I bounced off the Wi-Fi here?
Russell: No you're good.
Fred: Looks like I'm slightly ... there we go. As Russell was saying, you can see the humans that we're sending down here on the right. If I just zoom into this a little bit ... each one of these little faces ... and this is streaming, so it'll all jump across the screen annoyingly, but each one of these little faces is an individual tester. If you scroll down here, you can see, for example, a search test changed from a previous submission of incomplete. This tester, whose name was Parker Little, changed this result from no result to other, meaning he came in, he executed the test, and being unable to pass or fail the test for some reason, he clicked other. The other one worked.
Let's show some steps here, and yeah, we'll work through that. We have failures in all browsers in the first step, which is probably not expected, so let's click in and find out why. Just to show you what we're doing here: if we expand this to see the steps, you can see the individual steps that we wrote just a little bit earlier, and then you can see each of the browsers that we ran the test against. You just have this kind of matrix of where did it pass and fail, and in which browser, right? You can very quickly zero in. Typically we wouldn't expect to see all of these failing, and I thought this might be because I forgot to change the site. Let's have a look ... oh, so it looks like maybe we took AxleGeeks down, or it went down for some reason.
Audience Member: Looks like it's 500.
Fred: Looks like it's a 500. Well, that's too bad. Sorry to you, Mr. AxleGeeks; I apologize if we did something bad there. What's interesting here is that you can see we'll send 3 testers by default. We'll send 3 individual humans who will be given the same set of instructions against the same environment, and they do the testing together. If I log out of our account ... don't worry, I'll show you the screen again. If I log out of this demo account and log in to our own account, it starts to get a bit [inaudible 00:24:31] because we test Rainforest on Rainforest. I'll jump into the results before we go and show you some of the stuff.
Let's find one with a failure. We do continuous delivery; we have a subset of 44 tests that gets run every single time we deploy. We use CircleCI and a standard Git branching workflow: basically, on every pull request, Rainforest runs on every new commit. We are a blocking step in our push to production, which happens automatically once code hits develop. If we go to find a failure in one of these runs ... here we go. This was yesterday, or is that yesterday ... 2 days ago, at 9:19 a.m., and it took 40 minutes to do 44 tests. You can see the speed fluctuates, right? Because it's powered by humans. So you have up here the same run taking 21 minutes; here it's 25; this one takes 40. Let's drill into it.
We can see our failed tests and our passing tests. We can expand those and see that all of these tests passed. If we drill into one in particular, we can see that view that you saw before, but with some nice green things. If we click into any of these, you can see the actual testers who took this particular test in this particular browser. If I click on that, I now see our three testers: we have Fanny, Kenny, and Maria. Basically, what you see here is each one of these rows is an individual step, and then each of the columns is one of the testers. Because it's all happening on our virtual machines, like Russell said, we're able to record everything that is happening, parse out the screenshots, and show that to you as the customer.
Obviously useful if you're like: why did that fail? I've never seen that fail before. You can go in and see the verification yourself. That's one useful thing here, and I think what's interesting ... we didn't really plan this, but I think you'll probably be interested in it ... is to look at the admin-level view of that same job, right? You can kind of get an idea of what is going on behind the scenes. If we find our check tags test ... here we go ... and then look at Firefox, this is where you can see what we see on our end. If I enlarge this a little bit ... I'm not sure if you can see that, but ... yeah, you can. You can see that we are tracking everything the tester is doing in terms of where they are moving their mouse, where they are clicking, all of that kind of stuff.
I'm not sure if you can read that, but it says clicks, keystrokes, pastes, and scrolls. We're tracking all of those individual kinds of events that the tester is doing, and on top of that we're tracking the VMs over time. They have instructions, and then they have a terminal with the VM within it. What we do is we use deep learning, which we learned about a couple of weeks ago at our onsite ... we're using deep learning, which I don't understand, so don't ask me any questions about it ... to basically figure out if one of these testers is behaving fraudulently or executing the test incorrectly. If they got lost, or they started messing up, or, as is sometimes a problem when you're paying people to do jobs at arm’s length, they're just clicking yes because they want to get paid and want to leave.
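To make the signal concrete: Rainforest's real system uses deep learning over those event streams, but even a crude rule-based stand-in (entirely an illustration, not their method) shows how the "just click yes to get paid" pattern stands out from the tracked clicks, keystrokes, and timings:

```python
"""Rule-based sketch of flagging a suspicious tester session.

A simplified stand-in for the deep-learning approach described in the
talk: flag sessions that submit all-yes answers with almost no
interaction with the VM, or implausibly fast.
"""

def looks_fraudulent(session: dict) -> bool:
    """Heuristic: all-yes answers plus near-zero interaction or speed."""
    few_events = (session["clicks"] + session["keystrokes"]
                  + session["scrolls"]) < 5
    too_fast = session["seconds_on_task"] < 15
    all_yes = bool(session["answers"]) and all(
        a == "yes" for a in session["answers"])
    return all_yes and (few_events or too_fast)

# An engaged tester: lots of interaction, mixed answers, reasonable time.
honest = {"clicks": 40, "keystrokes": 120, "scrolls": 8,
          "seconds_on_task": 300, "answers": ["yes", "no"]}
# A click-through tester: barely touched the VM, answered yes instantly.
clicker = {"clicks": 2, "keystrokes": 0, "scrolls": 0,
           "seconds_on_task": 9, "answers": ["yes", "yes"]}
```

Real fraud patterns are subtler than this, which is presumably why a learned model over the full event stream is worth the effort.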
Going back to the original point, which was that we basically have a 50,000-person QA team that is managed programmatically, right? If you think about how you manage other humans, you have to do a combination of things. It's about having a carrot and having a stick. For us, it's exactly the same, but it's all programmatic. This just gives you a sense of the level of data collection that we're doing right now, and this is what enables us to give good results to the end customer. Because if you were to just say, hey, I'm going to spin up 50,000 people, I'm going to get them all to execute tests for me, and I'm going to trust them, it's not going to work. As everyone knows from managing humans.
If you look at our failed tests, you can see that we failed on Firefox for whatever reason, so let's drill a little bit deeper into this so you can understand how it was processed. The result is this row, so other is the middle result between yes and no, and then you see that these two guys were approved by Rainforest. This means that algorithmically we basically decided these guys' reputation and their agreement on the result were sufficient for us to return that to the customer as an actual failure. The guy on the right was rejected because he passed the test in disagreement with these guys. What happened here: wrong email or password, the login didn't work. We can drill in and verify we got that error message ... the page doesn't load. And then this guy, and this is the interesting thing: this guy passed the test.
This test says, can you log in, and it gives you an email and password; you say yes and it passes the test, or you say no and it fails the test. This tester very clearly sees wrong email and password. He should have answered the question did you get logged in with no, but he didn't. He answered it with a yes, and that's why he got rejected by Rainforest. Like with most of this human stuff, that's incredibly obvious to everyone in this room. Nobody is going to pass that test or think that that test should have passed. But if you think about trying to decide that programmatically, that's where the complexity behind Rainforest comes in. The vast majority of our time is spent on this kind of stuff: basically, on figuring out how we can, as fast as possible, figure out which tests are good, which tests are bad, which results are good, which results are bad, so that you as the customer never have to worry about any of that management.
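The accept/reject decision Fred just walked through — three testers answer, agreement plus reputation decide the customer-facing result, and the disagreeing tester is rejected — can be sketched as a weighted vote. The reputation weighting is an illustrative assumption; only the input/output shape mirrors the talk:

```python
"""Sketch of turning three testers' answers into one customer-facing result.

Reputation-weighted majority voting is an assumed, simplified scheme;
the talk only states that reputation and agreement decide the outcome.
"""
from collections import defaultdict

def aggregate(answers):
    """answers: list of (worker_id, answer, reputation in [0, 1]).

    Returns the consensus answer plus which workers were approved
    (agreed with consensus) and rejected (disagreed).
    """
    weight = defaultdict(float)
    for _, answer, reputation in answers:
        weight[answer] += reputation
    consensus = max(weight, key=weight.get)
    accepted = [w for w, a, _ in answers if a == consensus]
    rejected = [w for w, a, _ in answers if a != consensus]
    return consensus, accepted, rejected

# Two reputable testers say the login failed; one says it passed.
result, accepted, rejected = aggregate([
    ("fanny", "no", 0.9),
    ("kenny", "no", 0.8),
    ("maria", "yes", 0.7),
])
```

As Russell notes next, the hard part is that — unlike classic MTurk tasks — there is no known correct answer to compare against, so agreement and reputation are all you have.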
Russell: Unlike a lot of uses of MTurk, the more traditional uses ... there's not a correct ... there is a correct answer, but there is no known correct answer. We could do a test on a website and then do the test 10 minutes later and get a totally different result, and that would be valid. In a more traditional use ... take stuff off of a receipt, or scan a business card, or search for something on Google, and maybe that's a bad example ... most of those have a fixed answer. Ours does not, which makes it much more complicated to get good results out of.
Fred: Just to show you a few other things around the interface. Basically, this is the list of tests. It's simple. You organize tests with tags; you define which tests you want to run against which browser. You can override all these things at run time. Typically the test suite will be managed by product managers. I think something that we didn't focus on in this talk is the workflow changes that most of our customers go through. I think it's quite important to understand, and the reason is that, until now, at least in our experience, the product managers are the ones whose neck is on the line if they ship a bug. They are the ones who sign off on the release. They are the ones who say, right, this is good to go live.
Today, with the current QA tooling and processes, the product managers are mostly not able to own QA directly. They mostly have to be at arm’s length from a QA team, or from the engineering team, or something similar. For 90% of Rainforest customers, the product manager is the key person who owns the test suite. If we drill into the check tags test, the one we looked at the results for, you can see a bit more complexity than the example we looked at earlier. Rainforest has a bunch of helpful stuff. You can embed tests; you can reuse tests arbitrarily. There are various pieces of logic in there to enable you to share data with your testers. Like you see in the curly braces up there, it's a very simple syntax that enables you to basically pipe rows from a spreadsheet in front of your tester.
The standard example there would be: we have a list of a thousand test logins, and we want each tester to be logged into a unique account so they don't get into collision issues, and that's how that works. That's what a basic test looks like, and I think the other important thing to show is what the testers see when they execute one of these tests. If we just choose one of the browsers and click Open Preview, this is going to show us the actual tester interface that testers use to get paid to do jobs in Rainforest, when it loads.
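The curly-brace data piping Fred shows on screen can be sketched as simple template substitution. The double-brace syntax matches what's visible in the demo; the row-per-tester assignment below is a simplified illustration of how collisions are avoided, not Rainforest's internals:

```python
"""Sketch of piping spreadsheet rows into test steps via {{variable}} syntax.

Each tester running the same test gets a unique row, so three testers
hitting the same environment at once never share a login.
"""
import re

LOGINS = [  # e.g. loaded from a CSV of a thousand test accounts
    {"email": "qa+1@example.invalid", "password": "hunter2"},
    {"email": "qa+2@example.invalid", "password": "hunter3"},
    {"email": "qa+3@example.invalid", "password": "hunter4"},
]

STEP = "Log in with {{email}} and {{password}}. Did you get logged in?"

def render_step(step: str, row: dict) -> str:
    """Substitute each {{name}} placeholder with that tester's row value."""
    return re.sub(r"\{\{(\w+)\}\}", lambda m: row[m.group(1)], step)

# Each of the three testers assigned to this test gets a different row.
steps = [render_step(STEP, LOGINS[i]) for i in range(3)]
```

Because the substitution happens per tester at run time, the same plain-English test works whether one tester or a thousand run it concurrently.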
Russell: This is embedded directly in MTurk. When we list tasks in MTurk, they allow you to put in an iframe, and this is what we put in the iframe. We can totally control the testers' experience, which is pretty cool.
Fred: Nice, hopefully we'll see something. Well, maybe not. We'll give this a few more seconds before we abort. Okay, it's not going to happen. What you would have seen ... you would have seen the tester instructions and then the terminal with the virtual machine underneath. That's one of the key things that we found out pretty early on: you have to decouple the browser that the tester has from the browser that you're executing the test in, because of course the vast majority of the testers are using Chrome and bleeding-edge setups, whereas most of our customers want legacy Internet Explorer platforms tested, stuff like that. Looks like that is still not loading. Is there anything else that we want to show you here? I don't think so. We will be happy to take questions from anybody who's interested after the talk, and we have some of the MTurk guys here if you want to chat with them. Thanks very much for your time, guys.
[end of transcription]