Data Preparation and Data Wrangling Best Practices – Part 1
- by 7wData
Rekha Sree is a Customer Success Architect who uses her expertise in data integration, data warehousing and big data to help drive customer success at Talend. Prior to joining Talend, Rekha worked at Target Corporation India Pvt Ltd for more than a decade, applying her extensive knowledge to build their enterprise and analytical data warehouse.
Talend Data Preparation Cloud is a self-service application that enables information workers to cut hours out of their workday by simplifying and speeding up the time-consuming process of preparing data for analysis or other data-driven tasks. If you are brand new to data preparation, take some time to go through my earlier blog, "An Introduction to Data Preparation", to get the basics and learn a little about how it can come in handy as a self-service data preparation tool. In this blog, I want to highlight some best practices that I've come across as I've worked with Talend Data Preparation. So, without further delay, let's jump into the topic.
A “best practice” for naming conventions really depends on the person or organization. However, following a naming convention consistently makes it significantly easier for subsequent users of the data to understand what the system is doing and how to fix or extend the source code for new business needs. In my experience, the best practice is primarily to follow the naming standards agreed upon for the folders. Here are a few suggestions to consider when coming up with naming conventions:
Typically, preparations and datasets are tied to a specific project. Hence the naming conventions for preparations and datasets could be set either globally at the organization level or at the project level. You should do your best to ensure that the naming conventions are strictly followed. Here are a few tips from my own experience:
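A convention is easier to follow strictly when it can be checked automatically. The sketch below is only an illustration of that idea in Python: the pattern `<PROJECT>_<KIND>_<description>` (for example `CRM_PREP_customer_dedup`) and the helper names `check_name` and `find_violations` are assumptions made up for this example, not a Talend standard or API. Substitute whatever convention your organization or project has agreed on.

```python
import re

# Hypothetical convention (an assumption for illustration, not a Talend rule):
#   <PROJECT>_<KIND>_<description>
#   PROJECT     : 2 to 10 uppercase letters, e.g. CRM
#   KIND        : PREP for a preparation, DS for a dataset
#   description : lowercase words or digits joined by underscores
NAME_PATTERN = re.compile(
    r"^(?P<project>[A-Z]{2,10})_(?P<kind>PREP|DS)_"
    r"(?P<description>[a-z0-9]+(?:_[a-z0-9]+)*)$"
)


def check_name(name):
    """Return the parsed parts of a conforming name, or None if it does not conform."""
    match = NAME_PATTERN.match(name)
    return match.groupdict() if match else None


def find_violations(names):
    """Return the names that do not follow the convention, in their original order."""
    return [name for name in names if check_name(name) is None]
```

Running `find_violations` over an exported list of preparation and dataset names gives a quick report of the items that need renaming, which is usually cheaper to do early in a project than after the names have spread.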
Now, let's talk about context variables. Context variables are user-defined variables provided by Talend whose value can be changed at runtime. Providing the values of the context variables at runtime allows jobs to be executed in different ways with different parameters. Context variables should also follow standard naming conventions. Here are a couple more suggestions around context variables:
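To make the idea of runtime values concrete, here is a minimal Python sketch of the same pattern: one job definition, executed with different parameter sets. It does not use any Talend API. The variable names (`input_dir`, `db_host`), the file name `customers.csv` and the functions `load_context` and `run_job` are all hypothetical and chosen only for illustration.

```python
# Default values for the (hypothetical) context variables of one job.
DEFAULT_CONTEXT = {
    "input_dir": "/data/dev/incoming",
    "db_host": "dev-db.example.com",
}


def load_context(defaults, overrides):
    """Merge runtime overrides into the defaults, rejecting unknown variable names."""
    unknown = sorted(set(overrides) - set(defaults))
    if unknown:
        raise KeyError(f"Unknown context variables: {unknown}")
    return {**defaults, **overrides}


def run_job(context):
    """Stand-in for a job: it only describes what it would do with the given context."""
    return (
        f"read {context['input_dir']}/customers.csv, "
        f"load into {context['db_host']}"
    )


# Same job, two executions with different runtime values.
dev_run = run_job(load_context(DEFAULT_CONTEXT, {}))
prod_run = run_job(load_context(DEFAULT_CONTEXT, {"db_host": "prod-db.example.com"}))
```

Rejecting unknown names in `load_context` is a deliberate choice in this sketch: a misspelled variable then fails loudly instead of silently falling back to a default, which is one practical reason to keep context variable names consistent.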
Folder structures are used to group items of similar categories or behavior. Because the right structure depends entirely on individual needs, I recommend defining the folder structure in the project’s initial phases. The screenshot below shows an example of a folder structure that might be used in a bank. Here the folders are divided by the unit of the module. Useful ways to organize folders include business modules, data sources, rules applied or intake areas.
There’s a saying that I quite like that goes, “It’s not about having a lot of data, it’s about having the right data”. Data selection is about finding the data that’s needed right now, but it should also make it easier to find data later when similar needs arise.