Fake Data Could Help Solve Machine Learning’s Bias Problem—if We Let It
- by 7wData
Data is the lifeblood of artificial intelligence, and despite estimates that the world will generate more data over the next three years than it has in the previous 30, there still isn’t enough of it to supply the booming A.I. industry.
Amazon can predict your buying habits because its algorithms are trained on the data collected from its 112 million Prime subscribers in the U.S. and the tens of millions of other people around the world who visit the site and use its other products on a regular basis. Google’s advertising business depends on predictive models fueled by the billions of internet searches it processes each day and data from the 2.5 billion devices running the Android operating system. The tech giants have carved out these massive data monopolies, and that gives them near-impenetrable advantages in the field of A.I.
So how is a small A.I. startup to train its models to compete? Data collection is a time-consuming and expensive process. What about a hospital chain that wants to harness A.I. to better diagnose diseases but can’t use its own patient data due to federal privacy laws and cybersecurity concerns? Or a credit scoring agency seeking to model risky behavior that doesn’t want to use sensitive consumer information?
The answer, increasingly, is to use synthetic data—created by A.I., for A.I. In many cases, it’s a cheaper and faster option, but it carries a risk: The techniques used to generate realistic-looking data can also exacerbate harmful biases in that data.
Synthetic data comes in many forms, from images of fake faces that are indistinguishable from real ones to statistically realistic purchasing patterns for thousands of fictional customers. Executives at multiple synthetic data companies—including established firms like GenRocket and startups such as Mostly AI, Hazy, and AI Reverie—said they have seen huge growth in demand for boutique data sets over just the past two years. Companies can also turn to open-source tools like Synthea, which researchers at institutions including the U.S. Department of Veterans Affairs use to create realistic medical histories for thousands of fake patients in order to study disease patterns and treatment paths.
Mitre Corp., the nonprofit that created Synthea, has likewise seen an explosion of interest in its tool over the past several years. With that growth, though, comes potential peril: the algorithms trained on such data are increasingly used to make life-changing decisions, and have repeatedly been shown to amplify racism, sexism, and other harmful biases in high-impact areas like facial recognition, criminality prediction, and health care decision-making. Researchers say that in many cases, training an algorithm on algorithmically generated data increases the risk that an artificial intelligence system will perpetuate harmful discrimination.
“That process of creating a synthetic data set, depending on what you’re extrapolating from and how you’re doing that, can actually exacerbate the biases,” says Deb Raji, a technology fellow at the AI Now Institute. “Synthetic data can be useful for assessment and evaluation [of algorithms], but dangerous and ultimately misleading when it comes to training [them].”
One of the most common ways to create synthetic data is with a generative adversarial network, or GAN, a method developed in 2014 that pits two neural networks against each other. The first network, the generator, turns random noise into candidate samples and tries to synthesize data realistic enough to fool the second network, the discriminator, into believing the samples came from the same source as the real training data. The more the two networks compete in this adversarial loop, the better each gets at its task, yielding a synthetic data set that can be, statistically and to the naked eye, nearly indistinguishable from the real thing.
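The adversarial loop described above can be sketched in a few dozen lines. The example below is a deliberately minimal, illustrative toy (not from the article, and far simpler than any production GAN): a one-parameter-pair "generator" tries to mimic samples drawn from a normal distribution with mean 4, while a logistic-regression "discriminator" learns to tell real samples from fakes. All names, distributions, and learning rates here are assumptions chosen for the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Generator G(z) = a*z + b maps noise z ~ N(0, 1) to a fake sample.
a, b = 1.0, 0.0
# Discriminator D(x) = sigmoid(w*x + c) outputs P(x is real).
w, c = 0.1, 0.0

lr, batch = 0.01, 64
for step in range(4000):
    # --- Discriminator update: push D(real) toward 1, D(fake) toward 0 ---
    real = rng.normal(4.0, 1.25, batch)          # "real" data source
    fake = a * rng.normal(size=batch) + b
    d_real = sigmoid(w * real + c)
    d_fake = sigmoid(w * fake + c)
    # gradient descent on -log D(real) - log(1 - D(fake))
    w += lr * np.mean((1 - d_real) * real - d_fake * fake)
    c += lr * np.mean((1 - d_real) - d_fake)

    # --- Generator update: push D(fake) toward 1 (fool the discriminator) ---
    z = rng.normal(size=batch)
    fake = a * z + b
    d_fake = sigmoid(w * fake + c)
    grad = (1 - d_fake) * w                      # d(-log D(fake))/d(fake), negated
    a += lr * np.mean(grad * z)
    b += lr * np.mean(grad)

# Sample from the trained generator; its output distribution should have
# drifted toward the real data's mean of 4 as the two models competed.
samples = a * rng.normal(size=1000) + b
print(round(samples.mean(), 2))
```

Real GANs replace the two linear models with deep networks and train by backpropagation, but the structure is the same: alternating updates in which each model's improvement creates a harder problem for the other. The sketch also hints at why bias propagates: the generator's only objective is to match whatever the "real" data looks like, skews included.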