When data gets big: Best practices for data preparation at scale

Today we work with data that has grown in diversity, scale, and complexity, and this applies not only to data scientists and academic researchers but to the rest of us as well. Business analysts across a spectrum of industries are asked to incorporate larger volumes of data into their work, data that is now pervasive thanks to the diminishing costs of collection and storage. Answering real analytic questions that drive business value means adapting methodologies to the reality of the data at hand. For this, new data preparation tools are gaining adoption, helping business users bring their domain expertise to bear on bigger, thornier data challenges. Based on our experience navigating these transitions, we'll share some best practices for evolving data workflows to handle increasing data volumes.

When presented with a spreadsheet, a business analyst might visually scan all the columns and rows, filter out known irrelevant values, and run computations to quality-check the data before loading it into a BI tool for visualization. But how would you deal with a dataset that is completely new to you? And what if, instead of scrolling through one thousand rows, you were confronted with the daunting prospect of inspecting one million?

Unlike in previous data projects, it is impractical in this scenario to work with all the data at once. Visual ballparking is no longer sufficient for assessing data quality, and when data volumes exceed desktop or system hardware capabilities, every edit the user attempts can slow a data tool to a crawl. Instead, structuring, cleaning, and aggregating bigger datasets means starting with a smaller, more manageable subset of the data, which enables fast exploration, iteration, and refinement. From there, an analyst's deep familiarity with the business questions at hand can accelerate understanding of the dataset and help evolve it, step by step, toward the desired end result.
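At that scale, quality assessment has to be computed rather than eyeballed. The sketch below, which assumes pandas and a DataFrame that already holds a manageable subset, shows one way to profile every column at once instead of scanning rows; the function name and the stand-in subset are illustrative, not part of any particular tool:

```python
import pandas as pd

def profile(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize each column so quality issues surface without manual inspection."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),                     # declared type of each column
        "missing": df.isna().sum(),                         # count of missing values
        "missing_pct": (df.isna().mean() * 100).round(2),   # share of missing values
        "unique": df.nunique(),                             # distinct values per column
    })

# Example with a small stand-in subset of a larger dataset.
subset = pd.DataFrame({"region": ["East", "West", None], "amount": [120.5, 98.0, 103.2]})
print(profile(subset))
```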

At bigger data volumes, modifying data values one by one is prohibitively time-consuming. Instead, it helps to abstract up a level and design data transformation rules that can be applied systematically to groups of columns or rows. For example, rather than changing the value of cell C25, you might define a generalized condition identifying values in column C that contain non-alphabetic symbols, and then apply a single edit that removes those symbol characters throughout the column. Once the transformations are designed, this approach lets ever more powerful compute systems do the heavy lifting of processing large amounts of data.
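To make the rule-based approach concrete, here is a minimal sketch in Python with pandas; the column name "C" and the sample values are illustrative, and the pattern treats anything other than letters and whitespace as a symbol:

```python
import pandas as pd

# Small stand-in for a much larger table; column "C" is a hypothetical name.
df = pd.DataFrame({"C": ["Acme Corp.", "Data#Works!", "7wData (HQ)", "Plain Name"]})

# Condition: values in column C that contain non-alphabetic symbols.
has_symbols = df["C"].str.contains(r"[^A-Za-z\s]", regex=True)

# One edit, applied across the whole column, removes the symbol characters.
df.loc[has_symbols, "C"] = df.loc[has_symbols, "C"].str.replace(r"[^A-Za-z\s]", "", regex=True)

print(df)
```

The same rule, once validated on a sample, can then be run by whatever engine processes the full dataset.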

Transformation rules are ideally designed on a relatively small subset of the data, which lets you explore with lightning-fast responsiveness and reach high-level conclusions sooner. Of course, at each stage of the preparation process, it's important to pick the right subset of the data.
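One practical way to obtain such a subset, assuming the full data sits in a file too large to load comfortably, is to sample rows while reading it in chunks. The file name and the 1% sampling fraction below are illustrative placeholders:

```python
import pandas as pd

# Read a large CSV in manageable chunks and keep a small random sample of each,
# so transformation rules can be designed interactively before a full-scale run.
sample_parts = []
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    sample_parts.append(chunk.sample(frac=0.01, random_state=42))

working_set = pd.concat(sample_parts, ignore_index=True)
print(f"Working subset: {len(working_set)} rows")
```

A simple random sample is only a starting point; for rules that depend on rare categories or edge cases, a targeted or stratified sample of those groups is often the better choice.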
