When data gets big: Best practices for data preparation at scale
- by 7wData
Today we work with data that has grown in diversity, scale, and complexity, and this applies not only to data scientists and academic researchers but to the rest of us as well. Business analysts across a spectrum of industries are being asked to include larger volumes of data in their work, data that is now pervasive thanks to the diminishing costs of collection and storage. Answering real analytic questions that drive business value means adapting methodologies to the reality of the data at hand. To that end, new data preparation tools are gaining adoption, helping business users bring their domain expertise to bear on bigger, thornier data challenges. Based on our experiences navigating these transitions, we'll share some best practices for evolving data workflows to handle increasing data volumes.
When presented with a spreadsheet, a business analyst might visually scan all the columns and rows, filter out known, irrelevant values, and run computations to quality check the data before loading it into a BI tool for visualization. But how might you deal with a dataset that’s totally new to you? What if, instead of scrolling through one thousand rows, you were confronted with the daunting prospect of inspecting one million rows?
In this scenario, unlike smaller data projects, it is impractical to work with all the data at once. Visual ballparking is no longer sufficient for assessing data quality, and when data volumes exceed desktop or system hardware capabilities, each attempt to edit the data can slow a data tool to a crawl. Instead, structuring, cleaning, and aggregating bigger datasets means starting with a smaller, more manageable subset of the data, which enables fast exploration, iteration, and refinement. From there, an analyst's deep familiarity with the business questions at hand can accelerate understanding of the dataset and its progressive evolution toward the desired end goal.
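To make this concrete, a quick programmatic profile of a sample can stand in for visually scanning a million rows. Here is a minimal sketch in pandas; the file name, row count, and columns are placeholders, not part of the original workflow:

```python
import pandas as pd

# "transactions.csv" is a placeholder for your large source file;
# read only the first 100,000 rows so exploration stays fast.
sample = pd.read_csv("transactions.csv", nrows=100_000)

# Programmatic checks replace visual ballparking:
print(sample.shape)          # row and column counts in the sample
print(sample.dtypes)         # inferred type of each column
print(sample.isna().mean())  # share of missing values per column
print(sample.describe())     # summary statistics for numeric columns
```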
At bigger data volumes, modifying data values one by one is prohibitively time-consuming. Instead, it's helpful to abstract up a level and design data transformation rules that can be applied systematically to groups of columns or rows. For example, rather than changing the value of cell C25, you might define a generalized condition identifying values in column C that contain non-alphabetic symbols, and then apply an edit to remove all symbol characters in the column. Once the transformations are designed, this approach leverages the ability of ever more powerful compute systems to process large amounts of data.
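A minimal sketch of that column-C example in pandas, assuming a text column (the sample values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for a much larger table; column "C" mixes letters
# with stray punctuation and digits.
df = pd.DataFrame({"C": ["acme#corp", "beta!!", "gamma7", "delta"]})

# One rule, applied to every row of the column at once: drop any
# character that is not a letter.
df["C"] = df["C"].str.replace(r"[^A-Za-z]", "", regex=True)

print(df["C"].tolist())  # ['acmecorp', 'beta', 'gamma', 'delta']
```

The point of the rule-based form is that it costs the same analyst effort whether it runs over a thousand rows or a million.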
Transformation rules are ideally designed on a relatively small subset of the data, which lets you explore the data with lightning responsiveness and reach high-level conclusions faster. Of course, at each stage of the preparation process, it's important to pick the right subset of the data: one representative enough that rules designed on it will generalize to the whole.
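For instance, taking only the first rows of a file can bias the subset, since files are often sorted by date or ID; a random sample drawn across the whole file is usually safer. A hedged sketch, with a placeholder file name and sampling fraction:

```python
import pandas as pd

# "large_dataset.csv" is a placeholder. Rather than reading just the
# head of the file, stream it in chunks and take a small random
# sample from each, so the subset reflects the whole file.
chunks = pd.read_csv("large_dataset.csv", chunksize=200_000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)

# Design and validate transformation rules interactively on `sample`;
# once they hold, re-run the exact same rules over the full file.
```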