Must-Know: What are common data quality issues for Big Data and how to handle them?

The most common data quality issues observed in Big Data are best understood in terms of its key characteristics: Volume, Velocity, Variety, Veracity, and Value.

In a traditional data warehouse environment, comprehensive data quality assessment and reporting was at least possible, if not always ideal. In Big Data projects, the sheer scale of the data makes that impractical, so data quality measurements can at best be approximations, expressed as estimates with probabilities and confidence intervals rather than as absolute values. Most data quality metrics also need to be re-defined around the specific characteristics of the project, so that each metric has a clear meaning, can be measured (to a good approximation), and can be used to evaluate alternative strategies for improving data quality.
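To illustrate what reporting a metric as an approximation can look like, here is a minimal sketch in plain Python. The field names and the "null" convention are hypothetical; the point is that the metric is estimated from a random sample and reported with a confidence interval instead of as an absolute value.

```python
import math
import random

def estimate_null_rate(records, field, sample_size=10_000, z=1.96, seed=42):
    """Estimate the fraction of records missing `field` from a random sample,
    returned together with an approximate 95% confidence interval."""
    random.seed(seed)
    sample = random.sample(records, min(sample_size, len(records)))
    n = len(sample)
    nulls = sum(1 for r in sample if r.get(field) in (None, "", "NA"))
    p = nulls / n
    # Normal-approximation interval; reasonable when n*p and n*(1-p) are large.
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)

# Hypothetical usage: report "estimate plus uncertainty", not a single number.
records = [{"customer_id": i, "email": None if i % 7 == 0 else "x@y.com"}
           for i in range(1_000_000)]
rate, lo, hi = estimate_null_rate(records, "email")
print(f"Estimated null rate for 'email': {rate:.3%} (95% CI {lo:.3%} to {hi:.3%})")
```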

Despite the enormous volume of underlying data, it is not uncommon to find that some desired data was never captured or is otherwise unavailable (because of high cost, delays in obtaining it, and so on). It is ironic but true that data availability remains a prominent data quality concern in the Big Data era.

The tremendous pace of data generation and collection makes it very hard to monitor data quality with a reasonable overhead in time and resources (storage, compute, human effort, etc.). By the time a data quality assessment completes, its output may already be outdated and of little use, particularly if the Big Data project serves real-time or near real-time business needs. In such scenarios, data quality metrics need to be re-defined so that they are both relevant and feasible in a real-time context.
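One way to make a metric relevant in a near real-time context is to compute it over a sliding time window, so the result describes recent data rather than a snapshot that has already gone stale. The sketch below is illustrative only; the required fields, the window length, and the notion of "complete" are assumptions.

```python
import time
from collections import deque

class WindowedCompleteness:
    """Track the share of complete records over the last `window_seconds`.

    Minimal sketch: 'complete' simply means all required fields are present;
    a real pipeline would plug in its own predicate and alerting thresholds.
    """
    def __init__(self, required_fields, window_seconds=300):
        self.required = required_fields
        self.window = window_seconds
        self.events = deque()              # (timestamp, is_complete) pairs

    def observe(self, record, now=None):
        now = now if now is not None else time.time()
        ok = all(record.get(f) not in (None, "") for f in self.required)
        self.events.append((now, ok))
        self._evict(now)
        return ok

    def completeness(self, now=None):
        self._evict(now if now is not None else time.time())
        if not self.events:
            return None                    # nothing recent enough to judge
        return sum(ok for _, ok in self.events) / len(self.events)

    def _evict(self, now):
        while self.events and now - self.events[0][0] > self.window:
            self.events.popleft()

# Hypothetical usage: observe() in the stream handler, poll completeness() for alerts.
wc = WindowedCompleteness(["user_id", "event_time"], window_seconds=60)
wc.observe({"user_id": "u1", "event_time": 1714500000})
print(wc.completeness())
```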

Sampling can speed up the data quality effort, but the speed comes at a cost: a sample is rarely a perfect representation of the entire dataset, so any estimate based on it carries uncertainty, and a non-representative sample introduces bias that makes the end result less useful. Smaller samples are faster to assess, but they leave wider uncertainty around the estimate.
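The trade-off can be made concrete: under simple random sampling, the uncertainty of a proportion-type metric shrinks roughly with the square root of the sample size, so quadrupling the sample only halves the interval width. A quick back-of-the-envelope sketch (the 5% error rate is a made-up example):

```python
import math

def ci_half_width(p, n, z=1.96):
    """Approximate 95% confidence-interval half-width for a proportion p
    estimated from a simple random sample of size n."""
    return z * math.sqrt(p * (1 - p) / n)

# Example: how tightly a ~5% error rate is pinned down at different sample sizes.
for n in (1_000, 10_000, 100_000, 1_000_000):
    print(f"n={n:>9,}: 5.0% +/- {ci_half_width(0.05, n):.2%}")
# Smaller samples are faster to check but leave a wider interval around the
# estimate; a non-random sample adds bias on top of that uncertainty.
```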

Another impact of velocity is that data quality assessment may have to happen on the fly, plugged directly into the data collection, transfer, or storage processes; the tight time constraints do not allow the luxury of copying a selected subset of the data, storing it elsewhere, and running quality assessments on it separately.
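A minimal sketch of that idea, with hypothetical field names: quality counters are updated inline as each record passes through the ingestion function, so no separate copy of the data is stored for a later assessment pass.

```python
from collections import Counter

def ingest(stream, sink, stats: Counter):
    """Pass records straight through to `sink`, updating quality counters
    inline instead of copying data aside for a later audit."""
    for record in stream:
        stats["seen"] += 1
        if record.get("event_time") is None:
            stats["missing_event_time"] += 1
        if not record.get("user_id"):
            stats["missing_user_id"] += 1
        sink(record)                       # data keeps flowing regardless

# Hypothetical usage: stats can be flushed to a metrics system on a timer.
stats = Counter()
ingest(iter([{"user_id": "u1", "event_time": 1714500000},
             {"user_id": "", "event_time": None}]),
       sink=lambda r: None, stats=stats)
print(dict(stats))
```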

One of the biggest data quality issues in Big Data is variety: the data spans several types (structured, semi-structured, and unstructured) arriving from different sources. A single data quality metric therefore rarely applies to the entire dataset, and metrics usually have to be defined separately for each data type. Moreover, assessing and improving the quality of semi-structured or unstructured data is far trickier and more complex than for structured data. For example, when mining physician notes from medical records across the world (related to a particular medical condition), even if the language and grammar are the same, the meaning can differ greatly because of local dialects and slang. This leads to low data interpretability, another data quality measure.
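In practice this often means routing each record to a check suited to its type. The sketch below is not a complete profiler; the schemas, required keys, and the text heuristic are all assumptions. It applies schema-style checks to structured rows, key-presence checks to semi-structured JSON, and only a coarse heuristic to free text, where quality is much harder to pin down.

```python
import json

def check_structured(row, schema):
    """Structured data: required columns exist and have the expected types."""
    return all(isinstance(row.get(col), typ) for col, typ in schema.items())

def check_semi_structured(payload, required_keys):
    """Semi-structured data: the JSON parses and carries the required keys."""
    try:
        doc = json.loads(payload)
    except (TypeError, ValueError):
        return False
    return all(k in doc for k in required_keys)

def check_unstructured(text, min_length=20):
    """Unstructured text: only a coarse proxy (non-trivial length) is cheap;
    interpretability issues such as dialect and slang need NLP downstream."""
    return isinstance(text, str) and len(text.strip()) >= min_length

# Each data type gets its own metric; one rule rarely fits all of them.
print(check_structured({"id": 1, "amount": 9.5}, {"id": int, "amount": float}))
print(check_semi_structured('{"id": 1, "tags": []}', ["id", "tags"]))
print(check_unstructured("Pt reports mild dyspnoea on exertion."))
```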

Data from different sources often carries serious semantic differences. For example, “profit” can have widely varied definitions across the business units of an organization or across external agencies, so fields with identical names may not mean the same thing. The problem is made worse by a lack of adequate, consistent metadata from each data source. To make sense of data, you need reliable metadata: to interpret the sales numbers from a store, for instance, you also need the date and time, the items purchased, the coupons used, and so on. Because many of these sources sit outside the organization, it is very hard to ensure good metadata for such data.
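A common mitigation is to keep an explicit, per-source catalog that records what each field actually means and how it converts to a canonical definition. The sketch below is illustrative only; the source names, field derivations, and canonical measures are made up.

```python
# Hypothetical source catalog: the same label ("profit") maps to different
# canonical meanings and derivations depending on where the data came from.
FIELD_CATALOG = {
    "erp_system":  {"profit": {"canonical": "operating_profit",
                               "derive": lambda r: r["revenue"] - r["opex"]}},
    "retail_feed": {"profit": {"canonical": "gross_margin",
                               "derive": lambda r: r["sales"] - r["cogs"]}},
}

def to_canonical(source, record):
    """Translate a source record's 'profit' into its canonical measure,
    refusing to guess when the source has no registered metadata."""
    entry = FIELD_CATALOG.get(source, {}).get("profit")
    if entry is None:
        raise ValueError(f"No metadata registered for source '{source}'")
    return entry["canonical"], entry["derive"](record)

print(to_canonical("erp_system", {"revenue": 100.0, "opex": 60.0}))
print(to_canonical("retail_feed", {"sales": 100.0, "cogs": 70.0}))
```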

Another common issue is syntactic inconsistency: the same attribute is encoded differently across sources, for example dates written as 03/04/2024 in one feed and 2024-04-03 in another, amounts recorded in different units or currencies, or country names spelled out in one source and abbreviated to codes in another. Such values have to be normalized to a single convention before the data can be reliably combined or compared.
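A small, hypothetical normalization helper illustrates the idea for dates; real pipelines need similar rules for units, currencies, encodings, and code lists, and must also decide how to handle genuinely ambiguous layouts.

```python
from datetime import datetime

# Hypothetical list of date layouts seen across sources, normalized to ISO 8601
# so that downstream joins and comparisons behave consistently.
KNOWN_DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%m-%d-%Y", "%d %b %Y")

def normalize_date(value: str) -> str:
    for fmt in KNOWN_DATE_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"Unrecognised date format: {value!r}")

print(normalize_date("2024-04-03"))
print(normalize_date("03/04/2024"))
print(normalize_date("3 Apr 2024"))
```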
