Data is inherently messy. Is that really such a bad thing?

A data quality expert once told me that vendors providing data quality software solutions should always ensure 100 percent quality data, and if they didn’t, they should be liable for any ensuing issues. I disagreed with that harsh assessment then—and still do. The truth is, sometimes 100 percent data quality isn’t necessary and could even hinder an organization’s ultimate business goals.

As much as you would like your data to be perfect and pristine, to conform to your established dimensions of data quality, it isn’t. While there’s been renewed focus in recent years on the importance of data quality for achieving higher-value data and improving machine learning, data quality is not a new problem. Tools to address data quality have existed since at least the early 1990s, and MIT held its first International Conference on Information Quality back in 1996.

After 20 to 25 years, you might expect that we would have mastered data quality! So why is 100 percent complete, clean, consistent, and accurate data still so difficult to achieve?  

The answer lies in changing your mindset: Data quality is contextual, not universal. It’s time for us to accept and expect that data is messy (incomplete, nonstandard, inconsistent, inaccurate, and out of date), and that this is not necessarily a bad thing. By understanding the contexts that make data messy, you can focus your efforts on the data quality issues that are most critical and tolerate the rest where other factors matter more. In other words, put data quality in the right place at the right time.

Not all data is created equal. We all have names, the identifiers by which we are recognized. In seminars I’ve given, I’ve asked the question: “Is ‘John Doe’ good data?” Almost unanimously, the answer is no, because the name is considered fictitious and is often used as test data. Yet “John Doe” is common and valid in health care and police investigations as the name for an unknown male (someone who does or did actually exist), and it appears in legal cases as well. It was also part of the Twitter handle of more than 100 people the last time I checked, and there are real people with that name. The name John Doe is complete, consistent, and can be accurate. But you need to understand the context before you can say whether it is good, bad, or simply needs additional processing logic.
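To make the “John Doe” point concrete, here is a minimal sketch of a name check that takes context into account. The placeholder list, the context labels, and the function name `assess_name` are all hypothetical and chosen only for illustration; a real rule set would come from your own policies and metadata.

```python
# Hypothetical placeholder names and context labels, for illustration only.
PLACEHOLDER_NAMES = {"john doe", "jane doe"}
CONTEXTS_WHERE_PLACEHOLDER_IS_VALID = {
    "hospital_intake",
    "police_report",
    "legal_filing",
}


def assess_name(name, context):
    """Return 'ok' or 'review' for a name, depending on where it appears."""
    # Collapse case and repeated whitespace before comparing.
    normalized = " ".join(name.lower().split())
    if normalized not in PLACEHOLDER_NAMES:
        return "ok"
    if context in CONTEXTS_WHERE_PLACEHOLDER_IS_VALID:
        # An accepted convention for an unidentified person.
        return "ok"
    # Elsewhere the name may be a real person or test data: flag it, don't reject it.
    return "review"


print(assess_name("John Doe", "police_report"))
print(assess_name("John Doe", "customer_signup"))
```

The sketch deliberately flags the name for review instead of rejecting it, since a real John Doe may well sign up as a customer.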

Numeric values and dates can be equally challenging. Just think about a rating scale from 1 to 5. Is 1 the best rating, or is 5? Or a value of 100: is that a perfect grade, a high Fahrenheit temperature, an age, or an invalid credit rating? You need context (supplied via documentation, help, policies, metadata, etc.) to understand the data correctly and to implement the right data quality checks and rules. You must then determine whether there is a data quality issue at all, and if so, whether it’s one that warrants data quality measurements and processes.
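One way to picture this is a validity check that cannot run without metadata. The sketch below is illustrative only: the field names and the numeric ranges in `CONTEXTS` are assumptions made up for the example, not standards to copy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldContext:
    description: str
    minimum: float
    maximum: float


# Hypothetical metadata; the ranges are illustrative assumptions.
CONTEXTS = {
    "exam_score": FieldContext("percentage grade", 0, 100),
    "temperature_f": FieldContext("outdoor temperature in Fahrenheit", -80, 135),
    "age_years": FieldContext("age of a person in years", 0, 120),
    "credit_score": FieldContext("credit score on an assumed 300-850 scale", 300, 850),
}


def is_valid(value, field):
    """Check a value against the range recorded for its field."""
    ctx = CONTEXTS[field]
    return ctx.minimum <= value <= ctx.maximum


# The same value, 100, passes or fails depending on what it is meant to be.
for field in CONTEXTS:
    print(field, is_valid(100, field))
```

Under these assumed ranges, 100 is acceptable as a grade, a temperature, or an age, but not as a credit score. The value never changed; only the context did.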

How you incorporate data into your operations and systems is another factor impacting your consideration of data quality. Building custom applications for every organizational function is expensive. Over time, you’ve replaced many of these with software packages and even suites of systems such as enterprise resource planning (ERP) products. Each of these products, as well as each of your homegrown applications, has its own system requirements. Enforcing a single, consistent organization-wide standard, whether for dates (annual calendar vs. timestamp vs. Julian date), Boolean values (T/F vs. Y/N vs. 1/0), or other codes, would be quixotic at best and a drain on resources and revenue at worst. The same is true for third-party data, including the increasing variety of open data available.
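Rather than forcing every system onto one encoding, a common alternative is to let each system keep its own convention and translate at the point of use. The following sketch assumes three hypothetical source systems (`erp`, `crm`, `billing`) and a made-up helper named `to_bool`; it is one possible approach, not a prescribed one.

```python
# Hypothetical per-system Boolean encodings, for illustration only.
BOOLEAN_ENCODINGS = {
    "erp": {"Y": True, "N": False},
    "crm": {"T": True, "F": False},
    "billing": {"1": True, "0": False},
}


def to_bool(raw, source):
    """Translate a system-specific flag into a Python bool."""
    mapping = BOOLEAN_ENCODINGS[source]
    key = str(raw).strip().upper()
    if key not in mapping:
        # A value that is valid in one system may be an error in another.
        raise ValueError(f"{raw!r} is not a valid Boolean for source {source!r}")
    return mapping[key]


print(to_bool("y", "erp"))
print(to_bool(0, "billing"))
```

Note that `to_bool("Y", "crm")` raises an error here: “Y” is perfectly good data in the assumed ERP system and bad data in the assumed CRM, which is the contextual point again.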

The definitions and semantics of data impact consistency of data as well. The definition of “customer,” for example, may vary depending on whether you are in marketing, order fulfillment, or finance.
