When data gets big: Best practices for data preparation at scale
- by 7wData
Today we work with data that has grown in diversity, scale, and complexity, and this applies not only to data scientists and academic researchers but to the rest of us as well. Business analysts across a spectrum of industries are being asked to include larger volumes of data in their work, data that is now pervasive thanks to the diminishing costs of collection and storage. Answering real analytic questions that drive business value means adapting methodologies to the reality of the data at hand. To that end, new data preparation tools are gaining adoption, helping business users bring their domain expertise to bear on bigger, thornier data challenges. Based on our experiences navigating these transitions, we'll share some best practices for evolving data workflows to handle increasing data volumes.
When presented with a spreadsheet, a business analyst might visually scan all the columns and rows, filter out known, irrelevant values, and run computations to quality check the data before loading it into a BI tool for visualization. But how might you deal with a dataset that’s totally new to you? What if, instead of scrolling through one thousand rows, you were confronted with the daunting prospect of inspecting one million rows?
In this scenario, unlike smaller data projects, it is impractical to work with all the data at once. Visual ballparking is no longer sufficient for assessing data quality, and when data volumes exceed desktop or system hardware capabilities, each attempt to edit the data can slow a data tool to a crawl. Instead, structuring, cleaning, and aggregating bigger datasets means starting with a smaller, more manageable subset of the data, which enables fast exploration, iteration, and refinement. From there, an analyst's deep familiarity with the business questions at hand can accelerate understanding of the dataset and its progressive evolution toward the desired end goal.
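To make this concrete, a quick programmatic profile of a sample can stand in for visually scanning a million rows. Here is a minimal sketch in pandas; the file name, row count, and columns are placeholders, not part of the original workflow:

```python
import pandas as pd

# "transactions.csv" is a placeholder for your large source file;
# read only the first 100,000 rows so exploration stays fast.
sample = pd.read_csv("transactions.csv", nrows=100_000)

# Programmatic checks replace visual ballparking:
print(sample.shape)          # row and column counts in the sample
print(sample.dtypes)         # inferred type of each column
print(sample.isna().mean())  # share of missing values per column
print(sample.describe())     # summary statistics for numeric columns
```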
At bigger data volumes, modifying data values one by one is prohibitively time-consuming. Instead, it's helpful to abstract up a level and design data transformation rules that can be applied systematically to groups of columns or rows. For example, rather than changing the value of cell C25, you might define a generalized condition identifying values in column C that contain non-alphabetic symbols, and then apply an edit to remove all symbol characters in the column. Once the transformations are designed, this approach leverages the ability of ever more powerful compute systems to process large amounts of data.
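A minimal sketch of that column-C example in pandas, assuming a text column (the sample values are invented for illustration):

```python
import pandas as pd

# Toy stand-in for a much larger table; column "C" mixes letters
# with stray punctuation and digits.
df = pd.DataFrame({"C": ["acme#corp", "beta!!", "gamma7", "delta"]})

# One rule, applied to every row of the column at once: drop any
# character that is not a letter.
df["C"] = df["C"].str.replace(r"[^A-Za-z]", "", regex=True)

print(df["C"].tolist())  # ['acmecorp', 'beta', 'gamma', 'delta']
```

The point of the rule-based form is that it costs the same analyst effort whether it runs over a thousand rows or a million.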
Transformation rules are ideally designed on a relatively small subset of the data, which lets you explore the data with lightning responsiveness and reach high-level conclusions faster. Of course, at each stage of the preparation process, it's important to pick the right subset of the data: one representative enough that rules designed on it will generalize to the whole.
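For instance, taking only the first rows of a file can bias the subset, since files are often sorted by date or ID; a random sample drawn across the whole file is usually safer. A hedged sketch, with a placeholder file name and sampling fraction:

```python
import pandas as pd

# "large_dataset.csv" is a placeholder. Rather than reading just the
# head of the file, stream it in chunks and take a small random
# sample from each, so the subset reflects the whole file.
chunks = pd.read_csv("large_dataset.csv", chunksize=200_000)
sample = pd.concat(chunk.sample(frac=0.01, random_state=42) for chunk in chunks)

# Design and validate transformation rules interactively on `sample`;
# once they hold, re-run the exact same rules over the full file.
```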