Artificial Intelligence 2021 • By Yves Mulkers

Better Machine Learning Demands Better Data Labeling



Curated from datanami.com

Money can’t buy you happiness (although you can reportedly lease it for a while). It definitely cannot buy you love. And the rumor is money also cannot buy you large troves of labeled data that are ready to be plugged into your particular AI use case, much to the chagrin of former Apple product manager Ivan Lee.

“I spent hundreds of millions of dollars at Apple gathering labeled data,” Lee said. “And even with its resources, we were still using spreadsheets.”

It wasn’t much different at Yahoo. There, Lee helped the company develop the sorts of AI applications that one might expect of a Web giant. But getting the data labeled in the manner required to train the AI was, again, not a pretty sight.

“I’ve been a product manager for AI for the past decade,” the Stanford graduate told Datanami in a recent interview. “What I recognized across all these companies was AI is very powerful. But in order to make it happen, behind the scenes, how the sausage was made was we had to get a lot of training data.”

Armed with this insight, Lee founded Datasaur to develop software to automate the data labeling process. Of course, data labeling is an inherently human endeavor, at least at the beginning of an AI project. Toward the middle or the end of a project, machine learning itself can be used to label data automatically, and synthetic data can also be generated.

Lee’s main goal with the Datasaur software was to streamline the work of human data labelers and to guide them through the process of creating the highest quality training data at the lowest cost. Because the software targets power users who label data all day, Datasaur built in function keys that accelerate the process, among other capabilities befitting a dedicated data labeling system.

But along the way, several other goals popped up for Datasaur, including the need to remove bias. Getting multiple eyeballs on a given piece of text (for NLP use cases) or an image (for computer vision use cases) helps to alleviate bias. The software also provides project management capabilities that spell out labeling guidelines clearly, so that labeling standards continue to be met over time.
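To make the "multiple eyeballs" idea concrete, here is a minimal Python sketch of one common way to combine several labelers' answers: take the majority label and flag items where agreement is low so a reviewer can look again. This is an illustration only, not a description of how Datasaur works. The `consolidate` function, the `min_agreement` threshold, the item IDs, and the label names are all hypothetical.

```python
from collections import Counter


def consolidate(labels, min_agreement=0.66):
    """Combine one item's labels from several annotators.

    Returns (majority_label, agreement, needs_review), where agreement is
    the share of annotators who chose the majority label. The 0.66
    threshold is an assumed value, not an industry standard.
    """
    if not labels:
        raise ValueError("at least one label is required")
    counts = Counter(labels)
    top_label, top_count = counts.most_common(1)[0]
    agreement = top_count / len(labels)
    return top_label, agreement, agreement < min_agreement


# Hypothetical annotations: three media items, each labeled by 2-3 people.
annotations = {
    "clip-1": ["family", "family", "family"],
    "clip-2": ["family", "not_family", "not_family"],
    "clip-3": ["family", "not_family"],
}

results = {item: consolidate(votes) for item, votes in annotations.items()}

for item, (label, agreement, needs_review) in results.items():
    status = "REVIEW" if needs_review else "ok"
    print(f"{item}: {label} (agreement {agreement:.2f}) {status}")
```

Items where labelers split, like the two-person tie on `clip-3`, are the gray-area cases that written guidelines and a second look are meant to settle.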

The subjective nature of data labeling is one of the things that makes the discipline so fraught with pitfalls. For example, when Lee was at Apple, he was asked to come up with a way to automatically label a piece of media as family appropriate or not.

“I thought, ‘Oh this is easy. I’m just going to rip off like whatever we have for movies, like PG, PG13, R,’” he said. “I thought it would be a really simple task. And then it turns out what Apple determines is appropriate is very different from what the movie industry determines is appropriate. And then there are a lot of gray area use cases. Singapore will have very different societal views on what is and is not appropriate.”

There are no shortcuts for working through those types of questions. But there are ways to help automate some of the business processes that help companies answer them, including providing a lineage of the decisions that have gone into answering those data-labeling questions. It can be done with spreadsheets, but it’s not ideal. This is what drove Lee to create Datasaur’s software.

“You wouldn’t ask your team to build out Photoshop for your designers. You just buy Photoshop off the shelf. It’s a no-brainer,” Lee said. “That’s where we want Datasaur to be. You can use any tech stack you want. You can be on Amazon or Google or what have you. But if you just need to do the data labeling, we just want to be that company.”

In the beginning, computer vision was the hottest AI technique for Datasaur’s customers. But lately, NLP use cases have been hot, particularly those that rely on large transformer models, like BERT and GPT-3. The company is now starting to get traction with its offering, which is being used to label a million pieces of data per week, and is used by companies like Netflix, Zoom, and Heroku.

Yves Mulkers

Yves Mulkers is the founder of 7wData and a widely followed voice in the data and AI community. He curates the 7wData and AI Beat newsletters, reaching hundreds of thousands of data and AI professionals, and writes on data strategy, analytics, AI, and the evolving data ecosystem.
