Data is inherently messy. Is that really such a bad thing?

by 7wData
August 11, 2018

A data quality expert once told me that vendors providing data quality software solutions should always ensure 100 percent quality data, and if they didn’t, they should be liable for any ensuing issues. I disagreed with that harsh assessment then—and still do. The truth is, sometimes 100 percent data quality isn’t necessary and could even hinder an organization’s ultimate business goals.

As much as you would like your data to be perfect and pristine, and to conform to your established dimensions of data quality, it isn’t. While there’s been renewed focus in recent years on the importance of data quality for achieving higher-value data and improving machine learning, data quality is not a new problem. Tools to address data quality have existed since at least the early 1990s, and MIT held its first International Conference on Information Quality back in 1996.

After 20 to 25 years, you might expect that we would have mastered data quality! So why is 100 percent complete, clean, consistent, and accurate data still so difficult to achieve?

The answer lies in changing your mindset: Data quality is contextual, not universal. It’s time for us to accept and expect that data is messy (incomplete, nonstandard, inconsistent, inaccurate, and out of date), but that’s not necessarily a bad thing. By understanding the contexts that make data messy, you can focus your efforts on addressing data quality issues where they are most critical and tolerate the rest where other factors are more important. In other words, put data quality in the right place at the right time.

Not all data is created equal. We all have names, the identifiers by which we are recognized. In seminars I’ve given, I’ve asked the question: “Is ‘John Doe’ good data?” Almost unanimously, the answer is no because it is considered fictitious and often used as test data. Yet “John Doe” is common and valid in health care and police investigations as the name for an unknown male (someone who does or did actually exist), and in legal cases. It also appeared in the Twitter handles of more than 100 people when I last checked, and there are real people with that name. The name John Doe is complete, consistent, and can be accurate. But you need to understand the context before you can say whether it is good, bad, or simply needs additional processing logic.

Numeric values and dates can be equally challenging. Just think about a rating scale from 1 to 5. Is 1 the best rating, or is 5? Or a value of 100—is that a perfect grade, a high Fahrenheit temperature, an age, or an invalid credit rating? You need context (supplied via documentation, help, policies, metadata, etc.) to understand the data correctly, and to implement the right data quality checks and rules. You must then determine whether there is a data quality issue at all, and if so, whether it’s one around which you need data quality measurements and processes.
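To make the point about the value 100 concrete, here is a minimal sketch of a plausibility check driven by metadata. The field names, the `FieldContext` class, and every range below are illustrative assumptions, not taken from any standard or tool; the point is only that the same value passes or fails depending on the context you attach to it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldContext:
    """Hypothetical metadata describing the plausible range of a numeric field."""
    minimum: float
    maximum: float


# Illustrative contexts only; the ranges are assumptions chosen for this sketch.
CONTEXTS = {
    "exam_grade_percent": FieldContext(0, 100),
    "temperature_fahrenheit": FieldContext(-130, 140),
    "age_years": FieldContext(0, 125),
    "credit_score": FieldContext(300, 850),
}


def is_plausible(value: float, context_key: str) -> bool:
    """Return True if the value falls inside the range declared for its context."""
    ctx = CONTEXTS[context_key]
    return ctx.minimum <= value <= ctx.maximum


# The same raw value is judged differently in each context.
results = {key: is_plausible(100, key) for key in CONTEXTS}
```

In this sketch, 100 is accepted as a grade, a temperature, and an age, but rejected as a credit score. Without the context key, there is no way to decide which rule applies.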

How you incorporate data into your operations and systems is another factor impacting your consideration of data quality. Building custom applications for every organizational function is expensive. Over time, you’ve replaced many of these with software packages and even suites of systems such as enterprise resource planning (ERP) products. Each of these products, as well as your homegrown applications, has its own systemic requirements. Enforcing a single, consistent organization-wide standard, whether for dates (annual calendar vs. timestamp vs. Julian date), Boolean values (T/F vs. Y/N vs. 1/0), or other codes, would be quixotic at best and otherwise resource- and revenue-consuming. The same is true for third-party data, including the increasing variety of open data available.
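One way to live with differing conventions, rather than forcing every system onto one standard, is to translate values at the point where data is combined. The sketch below does this for the Boolean encodings mentioned above. The system names (`erp`, `crm`, `legacy`) and the `to_bool` helper are hypothetical, chosen only for illustration.

```python
from typing import Optional

# Hypothetical per-system encodings; each real system would document its own.
BOOLEAN_ENCODINGS = {
    "erp": {"T": True, "F": False},
    "crm": {"Y": True, "N": False},
    "legacy": {"1": True, "0": False},
}


def to_bool(raw: object, system: str) -> Optional[bool]:
    """Translate a system-specific flag into a Python bool.

    Returns None when the value is not a recognized flag for that system,
    so the caller can decide whether that is a data quality issue.
    """
    mapping = BOOLEAN_ENCODINGS[system]
    key = str(raw).strip().upper()
    return mapping.get(key)


crm_flag = to_bool("y", "crm")        # True: "Y" means yes in the CRM encoding
legacy_flag = to_bool(0, "legacy")    # False: 0 means no in the legacy encoding
mismatch = to_bool("Y", "erp")        # None: "Y" is not a flag the ERP uses
```

Each source keeps its own convention, and the mapping records the context needed to interpret it. An unrecognized value is surfaced as `None` instead of being silently coerced.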

The definitions and semantics of data impact consistency of data as well. The definition of “customer,” for example, may vary depending on whether you are in marketing, order fulfillment, or finance.
