Artificial Intelligence 2018 • By Yves Mulkers

Ethical Data Science Is Good Data Science

Curated from datanami.com →

There’s no doubt about it: the future will be machine driven, and central to that future are advanced algorithms fueled by the data they’re trained on. Every ad you see, every self-driving car, and every machine-generated medical diagnosis will be based on your data, and lots of it.

Without your data, we inherit a world without machine learning, and most would argue that companies without machine learning will fail. At least that’s where we’re heading; it sounds like a big problem, and it is.

Concepts around “big data” are completely incompatible with how people expect their data to be protected and how laws are shaping those protections. In fact, the GDPR, a data privacy regulation enacted in the EU, treats your data as if it’s an extension of your body. And more regulations like GDPR are coming.

One of the key tenets of the GDPR, and of the new wave of data regulations, is limiting data usage to specific purposes. The GDPR does not simply restrict how or what data is collected; it also restricts how the collected data is used.

The GDPR requirements show us a pathway that can translate into better data privacy protections and, ultimately, better data science. Put simply, the GDPR is a manifestation of the data governance initiatives all organizations should have been pursuing all along. Building data governance across machine learning activities will accelerate innovation, not stifle it.

So where does this leave organizations building products based on user data, and organizations running their businesses with algorithms powered by user data? Is the Facebook fiasco the beginning of the end of data-driven initiatives? What lessons can we learn from it, and what steps can we take in response?

The utility of the show on the Hill is that it could start a real conversation on how to protect both the innovations driven by algorithms and consumers’ privacy when it comes to their data. Here are three steps that businesses, and the technology companies supporting them, need to take:

Just because you’re able to collect massive amounts of data does not mean that every user in an organization should be able to use and touch all aspects of that data. GDPR terms this “privacy by design”, but I term it common sense.

Should all your data scientists see all your data subjects’ Social Security numbers when they’re building a fraud analytic? No.

When you work with third parties, where your data is “better together,” should you share it all? No.

This means enforcing fine-grained controls on your data: not just coarse-grained role-based access control (RBAC), but controls down to the column and row level, based on user attributes and purpose (more on that below). Employ techniques such as column masking, row redaction, limiting access to an appropriate percentage of the data, and, better still, differential privacy to help anonymize the data.

In almost all cases, your data scientists will thank you for it.
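To make the techniques named above concrete, here is a minimal Python sketch of purpose-based column masking, row redaction, sampling, and a differentially private count. It is an illustration under stated assumptions, not the author's or any product's implementation: the record fields, the `POLICIES` table, the purpose names, and the helpers `apply_policy` and `dp_count` are all hypothetical. The noisy count uses the standard Laplace mechanism for a counting query (sensitivity 1).

```python
import random

# Hypothetical records; the field names are illustrative only.
RECORDS = [
    {"id": 1, "ssn": "123-45-6789", "country": "US", "amount": 120.0},
    {"id": 2, "ssn": "987-65-4321", "country": "DE", "amount": 75.5},
    {"id": 3, "ssn": "555-12-3456", "country": "US", "amount": 9800.0},
    {"id": 4, "ssn": "222-33-4444", "country": "FR", "amount": 42.0},
]

# Hypothetical policy table keyed by purpose: which columns to mask,
# which rows stay visible, and what fraction of rows may be released.
POLICIES = {
    "fraud_model": {
        "mask": {"ssn"},
        "countries": {"US", "DE", "FR"},
        "sample": 1.0,
    },
    "third_party_share": {
        "mask": {"ssn", "id"},
        "countries": {"US"},
        "sample": 0.5,
    },
}

MASK = "***MASKED***"


def apply_policy(records, purpose, rng):
    """Return a view of the records restricted to the stated purpose."""
    policy = POLICIES[purpose]  # an unknown purpose raises KeyError: no access
    visible = []
    for row in records:
        if row["country"] not in policy["countries"]:
            continue  # row redaction
        if rng.random() >= policy["sample"]:
            continue  # limit to a fraction of the permitted rows
        # Column masking: replace sensitive values, keep the rest as-is.
        visible.append(
            {k: (MASK if k in policy["mask"] else v) for k, v in row.items()}
        )
    return visible


def laplace_noise(scale, rng):
    """Laplace(0, scale) noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def dp_count(records, predicate, epsilon, rng):
    """Noisy count: a counting query has sensitivity 1, so scale = 1/epsilon."""
    true_count = sum(1 for row in records if predicate(row))
    return true_count + laplace_noise(1.0 / epsilon, rng)


rng = random.Random(42)
print(apply_policy(RECORDS, "fraud_model", rng))
print(dp_count(RECORDS, lambda r: r["country"] == "US", epsilon=1.0, rng=rng))
```

In this sketch the data scientist building the fraud model still sees every transaction amount but never a raw Social Security number, while the third-party view is narrower still. A smaller `epsilon` adds more noise to the count, trading accuracy for stronger privacy.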
