The allure of capturing as much data as possible is strong. And, now that more businesses are experimenting with machine learning and AI, it’s growing stronger. When you aren’t sure what you may eventually need, might as well capture everything, right?
But having more data isn’t always better; just ask Equifax. More data is also harder to manage, harder to mine for valuable insights, and harder to turn into workable data sets that accomplish specific tasks and achieve the desired outcomes. Discussing big data in the context of AI raises some serious questions about the future of big data. As a data scientist, I wonder whether we need big data as much as some think. In my view, in many cases the answer is “no,” and instead of going big, what we really need to do is think smaller. Here’s why.
Just like you can’t build a skyscraper without the proper foundation, you can’t really do big data right until you master the art of harnessing small data first.
What is small data? Think of it as any business data set that can sit on a single machine. Small data is much more manageable and free of the high costs (not to mention the compliance and regulatory risks) of big data, which can require a massive amount of work to manage, maintain and keep clean. Small data, even if it comes in unstructured form, can also be labeled fairly easily. Sure, a company with as many resources as Google or Facebook might be able to label its unstructured big data perfectly too, but the reality is that most companies don’t have that luxury.
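To make the "fits on a single machine" rule of thumb concrete, here is a minimal back-of-the-envelope sketch. The function name, the row-size estimate and the 50% headroom factor are all assumptions chosen for illustration, not figures from any particular system.

```python
def fits_on_single_machine(n_rows, bytes_per_row, ram_bytes, headroom=0.5):
    """Rough check: does the estimated data set fit in a fraction of RAM?

    headroom is the share of memory we allow the raw data to occupy; the
    rest is left for the OS, copies made during processing, and the model.
    The default of 0.5 is an arbitrary, hypothetical choice.
    """
    if n_rows < 0 or bytes_per_row < 0:
        raise ValueError("n_rows and bytes_per_row must be non-negative")
    if ram_bytes <= 0:
        raise ValueError("ram_bytes must be positive")
    if not 0 < headroom <= 1:
        raise ValueError("headroom must be in (0, 1]")

    estimated_bytes = n_rows * bytes_per_row
    return estimated_bytes <= ram_bytes * headroom


if __name__ == "__main__":
    GIB = 2 ** 30
    # A few thousand labeled examples of about 2 KiB each on a 16 GiB machine.
    print(fits_on_single_machine(5_000, 2_048, 16 * GIB))
    # Five billion rows of about 1 KiB each on the same machine.
    print(fits_on_single_machine(5_000_000_000, 1_024, 16 * GIB))
```

The estimate ignores in-memory overhead from whatever library loads the data, so treat a borderline result as "probably not small."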
That’s not to say big data doesn’t have a place. Rather, if you can’t find ways to manage and leverage your small data, your efforts to “go big” will most likely disappoint. I would argue this is why we are still in the fairly early stages of true enterprise AI: companies are still figuring out what to even use AI for and how to ask questions of it, much less which data they need (and don’t need) to get the answers they want.
So, in what cases is small data really better than big data? Here’s a personal one: At Rainforest, an on-demand QA solution, we’ve eschewed big data in favor of small data for some of the most critical problems we’re solving through machine learning. One example is our software tester vetting process. As background, Rainforest offers human testers via an API, bringing their talent in at the right time during testing. We wanted to know which testers we could trust and which ones to de-emphasize to our customers. So, we gathered a few thousand samples that showed whether or not a tester had followed best practices. Our Rainforest experts (including some engineers and product managers) then labeled those examples. This didn’t take much of anyone’s time, and those couple of thousand data points turned out to be enough to train our machine learning algorithm in the way we needed.
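For readers who want a feel for how little machinery a few thousand labeled examples require, here is a minimal sketch using only the Python standard library. It is not Rainforest’s model or data: the two numeric features, the synthetic generator and the choice of logistic regression are all assumptions made purely for illustration.

```python
import math
import random


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def make_synthetic_examples(n, seed=0):
    """Generate n hypothetical labeled examples with two numeric features.

    Label 1 stands in for "followed best practices"; the features are
    made-up signals drawn from two overlapping Gaussian clusters.
    """
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        center = 1.0 if label == 1 else -1.0
        X.append([rng.gauss(center, 1.0), rng.gauss(center, 1.0)])
        y.append(label)
    return X, y


def train_logistic(X, y, lr=0.1, epochs=200):
    """Fit logistic regression with plain batch gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # gradient of the log loss w.r.t. the logit
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


def accuracy(w, b, X, y):
    hits = sum(1 for xi, yi in zip(X, y) if predict(w, b, xi) == yi)
    return hits / len(X)


if __name__ == "__main__":
    X, y = make_synthetic_examples(2000, seed=0)
    # Hold out the last 500 examples to check the model on unseen data.
    w, b = train_logistic(X[:1500], y[:1500])
    print("held-out accuracy:", accuracy(w, b, X[1500:], y[1500:]))
```

The whole data set fits in a Python list and trains in seconds on a laptop, which is the practical appeal of starting small: labeling, training and evaluation can all be iterated on quickly.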
More importantly, as an organization this forced us to develop and solidify our best practices for using machine learning in production. This has paved the way for us to work with larger datasets more effectively throughout our business; small data gave us a much simpler entry point.
Here are some quick questions to ask yourself to determine whether big or small data is the best tool for your next machine learning or AI job:
Big data isn’t going anywhere, but it isn’t the right path for every machine learning problem. Just like elegant software, a great AI or machine learning algorithm should be about doing more with less.
As it turns out, there is nothing bad about thinking small.
Originally published on insideBIGDATA. Reprinted with permission. ©insideBIGDATA, 2018. All rights reserved. https://insidebigdata.com/2018/01/08/beyond-big-data-ai-requires-getting-small-data-right/