Data preparation in machine learning: 6 key steps
- by 7wData
Getting the data right is the first step in any AI or machine learning project -- and it's often more time-consuming and complex than crafting the machine learning algorithms themselves. Advance planning to help streamline and improve data preparation in machine learning can save considerable work down the road. It can also lead to more accurate and adaptable algorithms.
"Data preparation is the action of gathering the data you need, massaging it into a format that's computer-readable and understandable, and asking hard questions of it to check it for completeness and bias," said Eli Finkelshteyn, founder and CEO of Constructor.io, which makes an AI-driven search engine for product websites.
It's tempting to focus only on the data itself, but it's a good idea to first consider the problem you're trying to solve. That can help simplify considerations about what kind of data to gather, how to ensure it fits the intended purpose and how to transform it into the appropriate format for a specific type of algorithm.
Good data preparation can lead to more accurate and efficient algorithms. It also makes it easier to pivot to new analytics problems and to adapt when model accuracy drifts, and it saves data scientists and business users considerable time and effort down the line.
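As a rough illustration of what "asking hard questions" of a data set for completeness and bias might look like in practice, here is a minimal Python sketch. It is not taken from any of the practitioners quoted here; the function names (`completeness_report`, `label_balance`) and the toy records with their field names are hypothetical, chosen only for illustration, and a real project would likely use a data frame library and far richer checks.

```python
from collections import Counter


def completeness_report(rows, columns):
    """Return the fraction of missing (None or empty-string) values per column."""
    total = len(rows)
    report = {}
    for col in columns:
        missing = sum(1 for row in rows if row.get(col) in (None, ""))
        report[col] = missing / total if total else 0.0
    return report


def label_balance(rows, label):
    """Return each label value's share of the rows that actually carry a label."""
    counts = Counter(
        row[label] for row in rows if row.get(label) not in (None, "")
    )
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()} if total else {}


# Hypothetical toy records; the field names are assumptions for illustration.
records = [
    {"age": 34, "region": "north", "churned": "no"},
    {"age": None, "region": "south", "churned": "no"},
    {"age": 51, "region": "", "churned": "yes"},
    {"age": 29, "region": "north", "churned": "no"},
]

missing = completeness_report(records, ["age", "region", "churned"])
balance = label_balance(records, "churned")
print(missing)   # share of missing values per column
print(balance)   # share of each outcome label
```

A high share of missing values in a column, or a heavily lopsided label distribution, does not by itself prove a problem, but either is a prompt to ask where the data came from and whether it fits the problem being solved.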
The importance of data preparation in machine learning
"Being a great data scientist is like being a great chef," surmised Donncha Carroll, a partner at consultancy Axiom Consulting Partners. "To create an exceptional meal, you must build a detailed understanding of each ingredient and think through how they'll complement one another to produce a balanced and memorable dish. For a data scientist, this process of discovery creates the knowledge needed to understand more complex relationships, what matters and what doesn't, and how to tailor the data preparation approach necessary to lay the groundwork for a great ML model."
Managers need to appreciate the ways in which data shapes machine learning application development differently compared to customary application development. "Unlike traditional rule-based programming, machine learning consists of two parts that make up the final executable algorithm -- the ML algorithm itself and the data to learn from," explained Felix Wick, corporate vice president of data science at supply chain management platform provider Blue Yonder. "But raw data are often not ready to be used in ML models. So, data preparation is at the heart of ML."
Data preparation consists of several steps, which consume more time than other aspects of machine learning application development. A 2021 study by data science platform vendor Anaconda found that data scientists spend an average of 22% of their time on data preparation, which is more than the average time spent on other tasks like deploying models, model training and creating data visualizations.
Although it is a time-intensive process, data scientists must pay attention to various considerations when preparing data for machine learning. Following are six key steps that are part of the process.
Data preparation for building machine learning models is a lot more than just cleaning and structuring data. In many cases, it's helpful to begin by stepping back from the data to think about the underlying problem you're trying to solve. "To build a successful ML model," Carroll advised, "you must develop a detailed understanding of the problem to inform what you do and how you do it." Start by spending time with the people who work within the domain and understand the problem space well. Synthesize what you learn from those conversations and use your own experience to create a set of hypotheses that describes the factors and forces involved. This simple step is often skipped or underinvested in, Carroll noted, even though it can make a significant difference in deciding what data to capture.