5 ways real-time will kill data quality


Your business happens in real time, and your data needs to follow suit. If you want real-time insights, you need to access information while it still matters.

But this move to real time is treacherous. If you are not careful, you will destroy the quality of your data in less time than it takes to start Windows. Here are five ways that can happen.

Running data quality controls such as data deduplication or lookups can be time-consuming and resource-intensive. It is tempting to skip such "cumbersome" controls in order to accelerate loading the data into the target (data mart, operational BI application, etc.).

This will inevitably result not only in incomplete records with missing elements, but also in corrupted information. It may be possible to repair incomplete records later (though at a higher cost), but duplicated records, which will by then have been modified or referenced separately, might be beyond repair.

How to avoid it: Don't compromise data integrity for speed. Ask yourself whether the few seconds saved in loading time are worth the damage to your data (they aren't).
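By way of illustration, here is a minimal sketch of a deduplication check run before loading; the business key (a customer email field) and the hold-back rule are assumptions chosen purely for the example, not a prescribed design:

```python
# Minimal sketch: deduplicate incoming records on a business key before loading.
# Field names ("customer_id", "email") and the normalization rule are illustrative.

def dedupe_before_load(records, key="email"):
    """Keep the first record seen for each key; hold the rest back for review."""
    seen = {}
    held_back = []
    for record in records:
        k = (record.get(key) or "").strip().lower()
        if not k:
            held_back.append(record)   # missing key: route to review, don't load blindly
        elif k in seen:
            held_back.append(record)   # duplicate: hold back instead of creating a second row
        else:
            seen[k] = record
    return list(seen.values()), held_back

clean, suspect = dedupe_before_load([
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": "Ada@example.com "},   # same person, different casing
    {"customer_id": 3, "email": "bob@example.com"},
])
print(len(clean), "records to load,", len(suspect), "held for review")
```

The few extra milliseconds this costs per batch are exactly the trade-off discussed above: cheap at load time, very expensive to repair later.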

Waiting for someone to review doubtful records, inspect duplicates, and select surviving records is probably the most time-consuming aspect of data quality, so it is tempting to bypass it entirely in a real-time flow. Yet there is a reason data stewardship processes were put in place and data stewards appointed.

How to avoid it: Same as above -- don't compromise data integrity for speed. Remember how much it cost last time you had to do a full quality assessment and repair of your database.
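To illustrate, here is a minimal sketch of that kind of triage, routing doubtful matches into a steward's review queue instead of auto-selecting a survivor; the similarity measure and thresholds are illustrative assumptions, not a recommended rule:

```python
# Minimal sketch: score candidate duplicate pairs and only auto-merge the obvious
# ones; anything doubtful goes to a data steward rather than being merged in-flight.
# Thresholds, field names, and the similarity measure are illustrative assumptions.

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage_pair(record_a, record_b, auto_merge_at=0.95, review_at=0.75):
    score = similarity(record_a["name"], record_b["name"])
    if score >= auto_merge_at:
        return "auto_merge"        # safe enough to merge without a human
    if score >= review_at:
        return "steward_review"    # doubtful: park it for a data steward
    return "keep_both"             # probably distinct records

# Likely lands in the steward's queue rather than being merged automatically.
print(triage_pair({"name": "Jon Smith"}, {"name": "John Smith"}))
```

The design choice is simply that real time applies to the obvious cases, while the doubtful ones keep going through stewardship, however slow that feels.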

Collecting transactional records too soon after a transaction happens means you will capture unfinished transactions. For example, you may load an order from the CRM, but because it takes several minutes for that order to propagate to the ERP and be processed there, you won't get the matching manifest, which creates a discrepancy in your reports.

How to avoid it: If you need these frequent refreshes, acknowledge the fact that data integrity will sometimes be broken. Build your reporting and analytics to account for these discrepancies.
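One possible sketch of "accounting for these discrepancies", assuming hypothetical CRM order and ERP manifest feeds, is to label unmatched orders as pending rather than reporting them as final:

```python
# Minimal sketch: reconcile fast-arriving CRM orders against slower ERP manifests
# and label unmatched orders as "pending" instead of treating them as confirmed.
# Table shapes and field names are illustrative assumptions.

crm_orders = [
    {"order_id": "O-100", "amount": 250.0},
    {"order_id": "O-101", "amount": 90.0},
]
erp_manifests = {"O-100"}   # O-101 has not propagated to the ERP yet

report_rows = []
for order in crm_orders:
    status = "confirmed" if order["order_id"] in erp_manifests else "pending_erp"
    report_rows.append({**order, "status": status})

# Downstream reports can then exclude or call out pending rows explicitly.
confirmed_total = sum(r["amount"] for r in report_rows if r["status"] == "confirmed")
print(report_rows)
print("Confirmed revenue:", confirmed_total)
```

The point is not the specific join, but that in-flight transactions stay visible as "pending" instead of silently inflating or deflating your totals.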

In a typical real-time scenario, not all sources will be refreshed at the same frequency. This can be for technical reasons such as data volumes or available bandwidth, or for practical reasons -- for example, customer shipping addresses change far less often than package tracking statuses. But these differences in velocity create inconsistencies in your target systems, which are harder to spot than a data point that is simply missing, as in the previous case.

How to avoid it: Treat real-time reports as questionable. If you spot outliers or odd results, keep in mind that differences in data velocity may be playing tricks on you.
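For example, a minimal sketch of a freshness check, with the source names, timestamps, and skew threshold all assumed for illustration, could annotate a report whenever the sources it joins are out of sync:

```python
# Minimal sketch: record each source's last refresh time and warn when sources
# joined in a report are too far apart in freshness.
# Sources, timestamps, and the threshold are illustrative assumptions.

from datetime import datetime, timedelta

last_refresh = {
    "package_tracking": datetime(2024, 5, 1, 12, 5),   # refreshed every few minutes
    "customer_addresses": datetime(2024, 5, 1, 3, 0),  # refreshed nightly
}

def freshness_warning(sources, max_skew=timedelta(hours=1)):
    times = [last_refresh[s] for s in sources]
    skew = max(times) - min(times)
    if skew > max_skew:
        return f"Warning: sources out of sync by {skew}; treat joined results as questionable."
    return None

print(freshness_warning(["package_tracking", "customer_addresses"]))
```

Surfacing the skew alongside the numbers at least tells the reader how questionable "questionable" is.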

One theory at work in the world of data lakes is to throw every record you can find into the data lake and worry about it later. "Later" then means implementing data quality workflows inside the data lake, cleansing "dirty" records by copying them (after enrichment and deduplication) into a "clean" state.

The concern is that more and more users are gaining access to the data lake, which typically suffers badly from a lack of metadata and documentation. That makes it very difficult for an uninitiated user to tell which state a given record is in (dirty or clean).

How to avoid it: If you absolutely need to create a data swamp full of dirty data, keep it to yourself. Don't throw your dirty data into the shared data lake. Only share data that is in a reasonable state of cleanliness with your unsuspecting colleagues.
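If dirty data does end up in a shared lake, one possible sketch (with the paths and field names assumed purely for illustration) is to tag every record with an explicit quality state and keep the zones physically separate, so nobody has to rely on tribal knowledge:

```python
# Minimal sketch: tag each record with its quality state and write it to a
# matching zone of the lake, so "dirty" and "clean" are visible without asking.
# Directory layout and field names are illustrative assumptions.

import json
from datetime import datetime, timezone
from pathlib import Path

def write_to_lake(record, state, base=Path("lake")):
    assert state in {"raw", "clean"}, "unknown quality state"
    tagged = {
        **record,
        "_quality_state": state,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    zone = base / state
    zone.mkdir(parents=True, exist_ok=True)
    path = zone / f"{record['id']}.json"
    path.write_text(json.dumps(tagged))
    return path

print(write_to_lake({"id": "cust-42", "email": "ada@example.com"}, state="raw"))
```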

