How (& Why) Data Scientists and Data Engineers Should Share a Platform
- by 7wData
Sharing one platform has some obvious benefits for Data Science and Data Engineering teams, but technical, language and process differences often make this difficult. Learn how one company implemented a single cloud platform for R, Python and other workloads, and some of the unexpected benefits they discovered along the way.
Attending analytic conferences really exposes the range and sophistication of analytic techniques that people with the right skills can apply to data. An example of this is the EARL Boston conference, which focuses on how to best use the R programming language to produce analytic outcomes. At the most recent conference, I led a session that was based on many conversations my industry colleagues and I have had with companies trying to accelerate their analytics programs. One specific problem tends to arise over and over again: how two distinct groups, Data Engineers and Data Scientists, can work more collaboratively despite the stark differences between their skills.
These teams often work independently. The Data Engineering team typically works on a shared platform, with a toolset and associated processes that optimize their flows, while the Data Science teams tend to have their own, separate set of tools and processes and generally work locally on their laptops. This creates inefficiencies, which I heard about firsthand from the Data Scientists at the EARL conference.
Many of them described extracting data from central systems with varying degrees of pain and compliance, then spending time refactoring that data to fit the analysis they wanted to do, and then (only then) starting the analysis process. This works, but it's not the most efficient overall flow: it leads to costly duplication of effort, as multiple users may extract the same data and waste time repeating the same refactoring or transformations. For these reasons, it's not surprising that more organizations are trying to have both teams work on a single platform.
The Challenges and Benefits of a Single Platform for Data Engineering & Data Science
While the concept of a single platform is a familiar topic in data strategy discussions, the flexibility of the cloud now makes it possible, though not necessarily easy. Ideally, everyone should be able to use their own tools and a variety of languages and be supported by a common underlying data and compute platform. This is challenging partly because data access has to be delivered securely across a variety of teams and locations, and partly because a common governance model has to span a disparate set of tools and processes.
Consider this real-world example from a relatively advanced Data Science team that I work with at a large corporation. The Data Engineering team predominantly uses Python for their data wrangling processes, while the Data Science team mostly prefers R.