5 ways real-time will kill data quality


Your business happens in real time, and your data needs to follow suit. If you want real-time insights, you need to access information while it still matters.

But this move to real time is treacherous. If you are not careful, you will destroy the quality of your data in less time than it takes to start Windows. Here are five ways that can happen.

Running data quality controls such as data deduplication or lookups can be time-consuming and resource-intensive. It is tempting to skip such "cumbersome" controls in order to accelerate loading the data into the target (data mart, operational BI application, etc.).

This will inevitably result not only in incomplete records with missing elements, but also in corrupted information. It may be possible to repair incomplete records later (though at a higher cost), but duplicated records, which will by then have been modified or referenced separately, might be beyond repair.

How to avoid it: Don't compromise data integrity for speed. Ask yourself whether the few seconds saved in loading time are worth the damage to your data (they aren't).
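By way of illustration, here is a minimal sketch of a deduplication check run before loading; the business key (a customer email field) and the hold-back rule are assumptions chosen purely for the example, not a prescribed design:

```python
# Minimal sketch: deduplicate incoming records on a business key before loading.
# Field names ("customer_id", "email") and the normalization rule are illustrative.

def dedupe_before_load(records, key="email"):
    """Keep the first record seen for each key; hold the rest back for review."""
    seen = {}
    held_back = []
    for record in records:
        k = (record.get(key) or "").strip().lower()
        if not k:
            held_back.append(record)   # missing key: route to review, don't load blindly
        elif k in seen:
            held_back.append(record)   # duplicate: hold back instead of creating a second row
        else:
            seen[k] = record
    return list(seen.values()), held_back

clean, suspect = dedupe_before_load([
    {"customer_id": 1, "email": "ada@example.com"},
    {"customer_id": 2, "email": "Ada@example.com "},   # same person, different casing
    {"customer_id": 3, "email": "bob@example.com"},
])
print(len(clean), "records to load,", len(suspect), "held for review")
```

The few extra milliseconds this costs per batch are exactly the trade-off discussed above: cheap at load time, very expensive to repair later.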

Waiting for someone to review doubtful records, inspect duplicates, and select surviving records is probably the most time-consuming aspect of data quality, so it is tempting to bypass it entirely in a real-time flow. Yet there is a reason data stewardship processes were put in place and data stewards appointed.

How to avoid it: Same as above -- don't compromise data integrity for speed. Remember how much it cost last time you had to do a full quality assessment and repair of your database.
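To illustrate, here is a minimal sketch of that kind of triage, routing doubtful matches into a steward's review queue instead of auto-selecting a survivor; the similarity measure and thresholds are illustrative assumptions, not a recommended rule:

```python
# Minimal sketch: score candidate duplicate pairs and only auto-merge the obvious
# ones; anything doubtful goes to a data steward rather than being merged in-flight.
# Thresholds, field names, and the similarity measure are illustrative assumptions.

from difflib import SequenceMatcher

def similarity(a, b):
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def triage_pair(record_a, record_b, auto_merge_at=0.95, review_at=0.75):
    score = similarity(record_a["name"], record_b["name"])
    if score >= auto_merge_at:
        return "auto_merge"        # safe enough to merge without a human
    if score >= review_at:
        return "steward_review"    # doubtful: park it for a data steward
    return "keep_both"             # probably distinct records

# Likely lands in the steward's queue rather than being merged automatically.
print(triage_pair({"name": "Jon Smith"}, {"name": "John Smith"}))
```

The design choice is simply that real time applies to the obvious cases, while the doubtful ones keep going through stewardship, however slow that feels.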

Collecting transactional records too soon after a transaction happens means you will capture unfinished transactions. For example, you may load an order from the CRM, but because it takes several minutes for that order to propagate to the ERP and be processed there, you won't get the matching manifest, which creates a discrepancy in your reports.

How to avoid it: If you need these frequent refreshes, acknowledge the fact that data integrity will sometimes be broken. Build your reporting and analytics to account for these discrepancies.
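One possible sketch of "accounting for these discrepancies", assuming hypothetical CRM order and ERP manifest feeds, is to label unmatched orders as pending rather than reporting them as final:

```python
# Minimal sketch: reconcile fast-arriving CRM orders against slower ERP manifests
# and label unmatched orders as "pending" instead of treating them as confirmed.
# Table shapes and field names are illustrative assumptions.

crm_orders = [
    {"order_id": "O-100", "amount": 250.0},
    {"order_id": "O-101", "amount": 90.0},
]
erp_manifests = {"O-100"}   # O-101 has not propagated to the ERP yet

report_rows = []
for order in crm_orders:
    status = "confirmed" if order["order_id"] in erp_manifests else "pending_erp"
    report_rows.append({**order, "status": status})

# Downstream reports can then exclude or call out pending rows explicitly.
confirmed_total = sum(r["amount"] for r in report_rows if r["status"] == "confirmed")
print(report_rows)
print("Confirmed revenue:", confirmed_total)
```

The point is not the specific join, but that in-flight transactions stay visible as "pending" instead of silently inflating or deflating your totals.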

In a typical real-time scenario, not all sources will be refreshed at the same frequency. This can be for technical reasons such as data volumes or available bandwidth, or for practical reasons -- for example, customer shipping addresses change far less often than package tracking statuses. But these differences in velocity create inconsistencies in your target systems, which are harder to spot than a data point that is simply missing, as in the previous case.

How to avoid it: Treat real-time reports as questionable. If you spot outliers or odd results, keep in mind that differences in data velocity may be playing tricks on you.
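For example, a minimal sketch of a freshness check, with the source names, timestamps, and skew threshold all assumed for illustration, could annotate a report whenever the sources it joins are out of sync:

```python
# Minimal sketch: record each source's last refresh time and warn when sources
# joined in a report are too far apart in freshness.
# Sources, timestamps, and the threshold are illustrative assumptions.

from datetime import datetime, timedelta

last_refresh = {
    "package_tracking": datetime(2024, 5, 1, 12, 5),   # refreshed every few minutes
    "customer_addresses": datetime(2024, 5, 1, 3, 0),  # refreshed nightly
}

def freshness_warning(sources, max_skew=timedelta(hours=1)):
    times = [last_refresh[s] for s in sources]
    skew = max(times) - min(times)
    if skew > max_skew:
        return f"Warning: sources out of sync by {skew}; treat joined results as questionable."
    return None

print(freshness_warning(["package_tracking", "customer_addresses"]))
```

Surfacing the skew alongside the numbers at least tells the reader how questionable "questionable" is.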

One theory at work in the world of data lakes is to throw every record you can find into the data lake and worry about it later. "Later" then means implementing data quality workflows inside the data lake, cleansing "dirty" records by copying them (after enrichment and deduplication) into a "clean" state.

The concern is that more and more users are gaining access to the data lake, which typically suffers badly from a lack of metadata and documentation. That makes it very difficult for an uninitiated user to tell which state a given record is in (dirty or clean).

How to avoid it: If you absolutely need to create a data swamp full of dirty data, keep it to yourself. Don't throw your dirty data into the shared data lake. Only share data that is in a reasonable state of cleanliness with your unsuspecting colleagues.
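If dirty data does end up in a shared lake, one possible sketch (with the paths and field names assumed purely for illustration) is to tag every record with an explicit quality state and keep the zones physically separate, so nobody has to rely on tribal knowledge:

```python
# Minimal sketch: tag each record with its quality state and write it to a
# matching zone of the lake, so "dirty" and "clean" are visible without asking.
# Directory layout and field names are illustrative assumptions.

import json
from datetime import datetime, timezone
from pathlib import Path

def write_to_lake(record, state, base=Path("lake")):
    assert state in {"raw", "clean"}, "unknown quality state"
    tagged = {
        **record,
        "_quality_state": state,
        "_ingested_at": datetime.now(timezone.utc).isoformat(),
    }
    zone = base / state
    zone.mkdir(parents=True, exist_ok=True)
    path = zone / f"{record['id']}.json"
    path.write_text(json.dumps(tagged))
    return path

print(write_to_lake({"id": "cust-42", "email": "ada@example.com"}, state="raw"))
```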

