Better Machine Learning Demands Better Data Labeling

Money can’t buy you happiness (although you can reportedly lease it for a while). It definitely cannot buy you love. And the rumor is money also cannot buy you large troves of labeled data that are ready to be plugged into your particular AI use case, much to the chagrin of former Apple product manager Ivan Lee.

“I spent hundreds of millions of dollars at Apple gathering labeled data,” Lee said. “And even with its resources, we were still using spreadsheets.”

It wasn’t much different at Yahoo. There, Lee helped the company develop the sorts of AI applications that one might expect of a Web giant. But getting the data labeled in the manner required to train the AI was, again, not a pretty sight.

“I’ve been a product manager for AI for the past decade,” the Stanford graduate told Datanami in a recent interview. “What I recognized across all these companies was AI is very powerful. But in order to make it happen, behind the scenes, how the sausage was made was we had to get a lot of training data.”

Armed with this insight, Lee founded Datasaur to develop software to automate the data labeling process. Of course, data labeling is an inherently human endeavor, at least at the beginning of an AI project. Toward the middle or the end of a project, machine learning itself can be used to label data automatically, and synthetic data can also be generated.

Lee’s main goal with the Datasaur software was to streamline the work of human data labelers and to guide them through the process of creating the highest quality training data at the lowest cost. Because the software targets power users who label data all day, the company added function keys that accelerate the process, among other capabilities befitting a dedicated data labeling system.

But along the way, several other goals popped up for Datasaur, including the need to remove bias. Getting multiple eyeballs on a given piece of text (for NLP use cases) or an image (for computer vision use cases) helps to alleviate that. The software also provides project management capabilities that spell out labeling guidelines clearly, so that labeling standards continue to be met over time.
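To make the "multiple eyeballs" idea concrete, here is a minimal sketch of one common way to combine several labelers' answers: a majority vote with an agreement score, where low-agreement items are routed back for review. This is an illustration only, not Datasaur's actual method; the function name, the item IDs, the label names, and the 0.75 review threshold are all assumptions made for the example.

```python
from collections import Counter


def aggregate_labels(votes):
    """Return (majority_label, agreement) for one item's votes.

    The label is None when the top two labels are tied.
    agreement is the share of labelers who chose the top label.
    """
    if not votes:
        raise ValueError("need at least one label")
    ranked = Counter(votes).most_common()
    top_label, top_count = ranked[0]
    tied = len(ranked) > 1 and ranked[1][1] == top_count
    agreement = top_count / len(votes)
    return (None if tied else top_label), agreement


# Hypothetical votes from several labelers on three media clips.
labels = {
    "clip_1": ["family", "family", "family"],
    "clip_2": ["family", "not_family", "family"],
    "clip_3": ["family", "not_family"],
}

results = {item: aggregate_labels(votes) for item, votes in labels.items()}

# Assumed policy: ties and items below 75% agreement go back to a reviewer.
REVIEW_THRESHOLD = 0.75
needs_review = [
    item
    for item, (label, agreement) in results.items()
    if label is None or agreement < REVIEW_THRESHOLD
]
```

In this sketch, disagreement is treated as a signal in its own right: items where labelers split are the ones most likely to expose an unclear guideline or an individual labeler's bias.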

The subjective nature of data labeling is one of the things that makes the discipline so fraught with pitfalls. For example, when Lee was at Apple, he was asked to come up with a way to automatically label a piece of media as family appropriate or not.

“I thought, ‘Oh this is easy. I’m just going to rip off like whatever we have for movies, like PG, PG13, R,’” he said. “I thought it would be a really simple task. And then it turns out what Apple determines is appropriate is very different from what the movie industry determines is appropriate. And then there are a lot of gray area use cases. Singapore will have very different societal views on what is and is not appropriate.”

There are no shortcuts for working through those types of questions. But there are ways to help automate some of the business processes that help companies answer them, including providing a lineage of the decisions that have gone into answering those data-labeling questions. It can be done with spreadsheets, but it’s not ideal. This is what drove Lee to create Datasaur’s software.
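As a rough picture of what a decision lineage might look like in code rather than in a spreadsheet, the sketch below keeps an append-only log of labeling decisions, each tied to the guideline version in force at the time. It is an assumption-laden illustration, not a description of Datasaur's product; every class, field, and value here is hypothetical.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class LabelDecision:
    """One immutable entry in an item's labeling history (hypothetical schema)."""
    item_id: str
    label: str
    labeler: str
    guideline_version: str
    reason: str
    decided_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


class LabelHistory:
    """Append-only log: decisions are added, never edited or deleted."""

    def __init__(self):
        self._log = []

    def record(self, decision):
        self._log.append(decision)

    def lineage(self, item_id):
        """Every decision made about an item, oldest first."""
        return [d for d in self._log if d.item_id == item_id]

    def current_label(self, item_id):
        """The most recent label, or None if the item was never labeled."""
        trail = self.lineage(item_id)
        return trail[-1].label if trail else None


history = LabelHistory()
history.record(LabelDecision("clip_7", "family", "alice", "v1", "matches PG rating"))
history.record(LabelDecision("clip_7", "not_family", "bob", "v2", "regional guideline update"))
```

Because earlier entries are never overwritten, a reviewer can see not only the current label for `clip_7` but also who changed it, under which guideline version, and why.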

“You wouldn’t ask your team to build out Photoshop for your designers. You just buy Photoshop off the shelf. It’s a no-brainer,” Lee said. “That’s where we want Datasaur to be. You can use any tech stack you want. You can be on Amazon or Google or what have you. But if you just need to do the data labeling, we just want to be that company.”

In the beginning, computer vision was the hottest AI technique for Datasaur’s customers. But lately, NLP use cases have been hot, particularly those that rely on large transformer models, like BERT and GPT-3. The company is now starting to get traction with its offering, which is being used to label a million pieces of data per week, and is used by companies like Netflix, Zoom, and Heroku.
