Beyond Big Data: Why AI Requires Getting Small Data Right

The allure of capturing as much data as possible is strong. And, now that more businesses are experimenting with machine learning and AI, it’s growing stronger. When you aren’t sure what you may eventually need, might as well capture everything, right?

But having more data isn’t always better (just ask Equifax). More data is also harder to manage, harder to mine for valuable insights, and harder to turn into workable data sets that accomplish specific tasks and achieve the desired outcomes. Discussing big data in the context of AI raises some serious questions about the future of big data. As a data scientist, I wonder whether we need big data as much as some think. In my view, in many cases the answer is “no,” and instead of going big, what we really need to do is think smaller. Here’s why.

The case for small data

Just like you can’t build a skyscraper without the proper foundation, you can’t really do big data right until you master the art of harnessing small data first.

What is small data? Think of it as any business data set that can fit on a single machine. Small data is much more manageable and avoids the high costs (not to mention compliance and regulatory risks) of big data, which can require a massive amount of work to manage, maintain and keep clean. Small data, even if it comes in unstructured form, can also be labeled fairly easily. Sure, a company with as many resources as Google or Facebook might be able to label its unstructured big data perfectly too, but the reality is that most companies don’t have that luxury.

That’s not to say big data doesn’t have a place. Rather, if you aren’t able to find ways to manage and leverage your small data, your efforts to “go big” will most likely end in disappointment. I would argue this is why we are still in the fairly early stages of true enterprise AI: companies are still figuring out what to even use AI for and how to ask questions of it, much less getting a firm handle on the data they need (and don’t need) to get the answers they want.

So, in what cases is small data really better than big data? Here’s one from my own experience: At Rainforest, an on-demand QA solution, we’ve eschewed big data in favor of small data for some of the most critical problems we’re solving through machine learning. One example is our software tester vetting process. As background, Rainforest offers human testers via an API, bringing their talent in at the right time during testing. We wanted to know which testers we could trust and which ones to de-emphasize to our customers. So, we gathered a few thousand samples that signaled whether or not a tester had followed best practices. Our Rainforest experts (including some engineers and product managers) then labeled those examples. This didn’t take much of anyone’s time, and those few thousand data points turned out to be enough to train our machine learning algorithm in the way we needed.

More importantly, this forced us as an organization to develop and solidify our best practices for using machine learning in production. That has paved the way for us to work with larger datasets more effectively throughout our business; small data gave us a much simpler entry point.
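To make the scale of that tester-vetting example concrete, here is a minimal sketch of training a classifier on a couple of thousand labeled samples using nothing but the Python standard library. It is not Rainforest's actual pipeline: the two features (`steps_followed` and `time_on_task`), the synthetic data generator, and the choice of logistic regression are all assumptions made purely for illustration.

```python
import math
import random


def make_synthetic_samples(n, seed=0):
    """Generate hypothetical labeled samples as (features, label) pairs.

    Both features are invented for illustration and scaled to [0, 1]:
    steps_followed (share of test steps followed) and time_on_task.
    Label 1 stands for "followed best practices".
    """
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        steps_followed = rng.random()
        time_on_task = rng.random()
        # Assumed ground truth: careful testers follow steps and do not rush.
        score = 4.0 * steps_followed + 2.0 * time_on_task - 3.0
        label = 1 if score + rng.gauss(0, 0.5) > 0 else 0
        samples.append(([steps_followed, time_on_task], label))
    return samples


def predict_probability(weights, bias, features):
    """Logistic model: probability that the label is 1."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(samples, epochs=30, learning_rate=0.1, seed=0):
    """Fit weights and bias with plain stochastic gradient descent."""
    rng = random.Random(seed)
    weights = [0.0] * len(samples[0][0])
    bias = 0.0
    order = list(samples)
    for _ in range(epochs):
        rng.shuffle(order)
        for features, label in order:
            error = predict_probability(weights, bias, features) - label
            weights = [w - learning_rate * error * x
                       for w, x in zip(weights, features)]
            bias -= learning_rate * error
    return weights, bias


def accuracy(samples, weights, bias):
    """Share of samples whose predicted class matches the label."""
    hits = sum(
        (predict_probability(weights, bias, features) >= 0.5) == bool(label)
        for features, label in samples
    )
    return hits / len(samples)


data = make_synthetic_samples(2000)
train, held_out = data[:1600], data[1600:]
weights, bias = train_logistic_regression(train)
print(f"held-out accuracy: {accuracy(held_out, weights, bias):.2f}")
```

A data set of this size trains in about a second on a laptop, which is part of the appeal: the expensive step is the expert labeling, not the infrastructure.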

Big Data or Small Data? 

Here are some quick questions to ask yourself to determine whether big or small data is the best tool for your next machine learning or AI job:

  • Do you already have the data you need, and is it labeled? If you have multiple terabytes of data but it’s not labeled, it’s going to be pretty tricky to use. (If your big data set is labeled, you may indeed want to use it for your next project, but this is an ideal, and rare, scenario.) However, if all you have is an idea, look around before trying to gather big data. Maybe you already have a usable small dataset that is better suited to the job at hand. Even if it’s not labeled yet, you might be able to fix that by investing a little time, in exchange for a more agile solution that will get you far enough.
  • What’s your use case, and what is the minimum data needed to address it? A word vector model trained on a massive Google News dataset might work, but a simpler approach built on basic linear algebra might give you comparable performance on many real-world tasks. In the technology world, we talk a lot about having a “minimum viable product,” and the same thinking applies when it comes to data. To maximize efficiency and reduce costs, you want to use the minimum amount of data required to get the job done.
  • How advanced is your organization (really) when it comes to AI/ML? It’s important to build up your organization’s capabilities step by step, rather than going straight to the most difficult problems (even if they are the most exciting). If your organization is new to experimenting with machine learning, solving some basic problems with small data is likely the best place to start. Once you get some wins under your belt, you can scale from there.
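One way to put a number on the “minimum data” question above is a learning curve: train the same simple model on progressively larger slices of your labeled data and note where held-out accuracy stops improving. The sketch below is only an illustration under stated assumptions. The data is synthetic, the model is a nearest-centroid classifier (chosen because it needs little more than averages and distances), and the helper names and the 0.01 tolerance are hypothetical choices, not a standard.

```python
import random


def make_samples(n, seed=0):
    """Synthetic two-class data; features and classes are invented."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        features = [rng.gauss(center, 1.0), rng.gauss(center, 1.0)]
        samples.append((features, label))
    return samples


def fit_centroids(samples):
    """Return the mean feature vector for each class label."""
    sums, counts = {}, {}
    for features, label in samples:
        total = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            total[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in sums[label]] for label in sums}


def predict(centroids, features):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    def squared_distance(centroid):
        return sum((x - c) ** 2 for x, c in zip(features, centroid))
    return min(centroids, key=lambda label: squared_distance(centroids[label]))


def accuracy(centroids, samples):
    """Share of samples assigned to their true label."""
    hits = sum(predict(centroids, f) == label for f, label in samples)
    return hits / len(samples)


def learning_curve(train, held_out, sizes):
    """Held-out accuracy after training on the first `size` samples."""
    return [(size, accuracy(fit_centroids(train[:size]), held_out))
            for size in sizes]


def smallest_sufficient_size(curve, tolerance=0.01):
    """Smallest training size within `tolerance` of the best accuracy seen."""
    best = max(acc for _, acc in curve)
    for size, acc in curve:
        if acc >= best - tolerance:
            return size


data = make_samples(6000)
train, held_out = data[:5000], data[5000:]
curve = learning_curve(train, held_out, sizes=[50, 200, 1000, 5000])
for size, acc in curve:
    print(f"{size:>5} samples -> held-out accuracy {acc:.3f}")
print("smallest sufficient size:", smallest_sufficient_size(curve))
```

If the curve flattens early on your own data, that is a signal that gathering or labeling more of the same kind of data is unlikely to pay for itself.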

The Case for Using Small Data

Big data isn’t going anywhere, but it isn’t the right path for every machine learning problem. Just like elegant software, a great AI or machine learning algorithm should be about doing the most with the least.

As it turns out, there is nothing bad about thinking small.


Originally published on insideBIGDATA. Reprinted with permission. ©insideBIGDATA, 2018. All rights reserved.
