Fake Data Could Help Solve Machine Learning’s Bias Problem—if We Let It

Data is the lifeblood of artificial intelligence, and despite estimates that the world will generate more data over the next three years than it has in the previous 30, there still isn’t enough of it to supply the booming A.I. industry.

Amazon can predict your buying habits because its algorithms are trained on the data collected from its 112 million Prime subscribers in the U.S. and the tens of millions of other people around the world who visit the site and use its other products on a regular basis. Google’s advertising business depends on predictive models fueled by the billions of internet searches it processes each day and data from the 2.5 billion devices running the Android operating system. The tech giants have carved out these massive data monopolies, and that gives them near-impenetrable advantages in the field of A.I.

So how is a small A.I. startup to train its models to compete? Data collection is a time-consuming and expensive process. What about a hospital chain that wants to harness A.I. to better diagnose diseases but can’t use its own patient data due to federal privacy laws and cybersecurity concerns? Or a credit scoring agency seeking to model risky behavior that doesn’t want to use sensitive consumer information?

The answer, increasingly, is to use synthetic data—created by A.I., for A.I. In many cases, it’s a cheaper and faster option, but it carries a risk: The techniques used to generate realistic-looking data can also exacerbate harmful biases in that data.

Synthetic data comes in many forms, from images of fake faces that are indistinguishable from real ones to statistically realistic purchasing patterns for thousands of fictional customers. Executives at multiple synthetic data companies—including established firms like GenRocket and startups such as Mostly AI, Hazy, and AI Reverie—said they’ve seen a huge growth in demand for boutique data sets over just the past two years. Companies can also turn to open-source tools like Synthea, which researchers at institutions including the U.S. Department of Veterans Affairs use to create realistic medical histories for thousands of fake patients in order to study disease patterns and treatment paths.

Mitre Corp., which created Synthea, has seen the same explosion of interest over the past several years. With that growth, though, comes potential peril for algorithms that are increasingly used to make life-changing decisions and that have increasingly been shown to amplify racism, sexism, and other harmful biases in high-impact areas like facial recognition, criminality prediction, and health care decision-making. Researchers say that in many cases, training an algorithm on algorithmically generated data increases the risk that an artificial intelligence system will perpetuate harmful discrimination.

“That process of creating a synthetic data set, depending on what you’re extrapolating from and how you’re doing that, can actually exacerbate the biases,” says Deb Raji, a technology fellow at the AI Now Institute. “Synthetic data can be useful for assessment and evaluation [of algorithms], but dangerous and ultimately misleading when it comes to training [them].”
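One narrow way to see the kind of shift Raji describes is to compare how often each group appears in the real data and in the synthetic data generated from it. The sketch below is only an illustration: the function names `group_shares` and `share_drift` and the group labels "A" and "B" are hypothetical, the counts are made up, and a matching share is one signal rather than a full bias audit.

```python
from collections import Counter

def group_shares(labels):
    """Return the fraction of records that fall in each group."""
    if not labels:
        raise ValueError("labels must not be empty")
    counts = Counter(labels)
    total = len(labels)
    return {group: n / total for group, n in counts.items()}

def share_drift(real_labels, synthetic_labels):
    """Synthetic share minus real share, per group.

    A negative value means the group is less represented in the
    synthetic data than in the real data it was derived from.
    """
    real = group_shares(real_labels)
    synth = group_shares(synthetic_labels)
    groups = sorted(set(real) | set(synth))
    return {g: synth.get(g, 0.0) - real.get(g, 0.0) for g in groups}

# Made-up example: group "B" is 20% of the real records
# but only 10% of the synthetic ones.
real_labels = ["A"] * 80 + ["B"] * 20
synthetic_labels = ["A"] * 90 + ["B"] * 10
drift = share_drift(real_labels, synthetic_labels)
print(drift)
```

A group whose share shrinks in the synthetic set would be even scarcer for any model trained on it, which is one way an existing imbalance can get worse.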

One of the most common ways to create synthetic data is with a generative adversarial network, or GAN, a method introduced in 2014 in which two neural networks are pitted against each other. The first network, the generator, attempts to synthesize data realistic enough to fool the second network, the discriminator, into believing the synthesized data came from the same source as the real training data. The discriminator, which is shown both real and synthesized samples, learns to tell them apart, and its verdicts are fed back to improve the generator. The more the two networks compete in this feedback loop, the better each gets at its task, resulting in a synthetic data set that can be, statistically and to the naked eye, nearly indistinguishable from the real thing.
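The adversarial loop can be sketched in a few lines of plain Python. This is a toy, not the method used by any company named in this article: the "real" data is a made-up bell curve of numbers, the generator is a single linear function, and the discriminator is a single logistic unit. All names and parameter values (`train_toy_gan`, `sample_generator`, the learning rate, the batch size) are assumptions chosen for illustration.

```python
import math
import random

def sigmoid(t):
    """Numerically stable logistic function."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)

def train_toy_gan(real_mean=4.0, real_std=0.5, steps=2000,
                  batch=32, lr=0.05, seed=0):
    """Alternate discriminator and generator updates on 1-D data."""
    rng = random.Random(seed)
    a, b = 1.0, 0.0  # generator: G(z) = a * z + b
    w, c = 0.0, 0.0  # discriminator: D(x) = sigmoid(w * x + c)
    for _ in range(steps):
        real = [rng.gauss(real_mean, real_std) for _ in range(batch)]
        noise = [rng.gauss(0.0, 1.0) for _ in range(batch)]
        fake = [a * z + b for z in noise]

        # Discriminator step: gradient ascent on
        # log D(real) + log(1 - D(fake)).
        gw = gc = 0.0
        for x in real:
            d = sigmoid(w * x + c)
            gw += (1.0 - d) * x
            gc += (1.0 - d)
        for x in fake:
            d = sigmoid(w * x + c)
            gw -= d * x
            gc -= d
        w += lr * gw / batch
        c += lr * gc / batch

        # Generator step: gradient ascent on log D(G(z)),
        # i.e. try to make the discriminator call fakes real.
        ga = gb = 0.0
        for z in noise:
            d = sigmoid(w * (a * z + b) + c)
            ga += (1.0 - d) * w * z
            gb += (1.0 - d) * w
        a += lr * ga / batch
        b += lr * gb / batch
    return a, b, w, c

def sample_generator(a, b, n=5, seed=1):
    """Draw n synthetic values from the trained generator."""
    rng = random.Random(seed)
    return [a * rng.gauss(0.0, 1.0) + b for _ in range(n)]

a, b, w, c = train_toy_gan()
print(sample_generator(a, b))
```

Because the generator only ever learns from the discriminator's feedback on the training data, it has no way to produce patterns the real data lacks, and training of this kind can oscillate rather than settle. A discriminator this simple can also only judge samples by where they sit on the number line, so the sketch should not be expected to reproduce the real data's spread.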
