Who manages data lakes and what skills are needed?
- by 7wData
One of the most common components of a modern data architecture is the data lake, a central repository into which data flows.
The concept of the data lake has evolved from a simple location for collecting data into a more organized approach known as the data lakehouse. Whether it's called a data lake or a data lakehouse, managing the technology effectively requires certain skills and IT professionals.
A data lake is a large open storage location that typically uses object storage as a unified repository for unstructured data coming from multiple sources. Those sources can include event streaming data, operational and transactional data, and databases. While data lakes can run in on-premises environments, they are more commonly built on cloud object storage services that provide large, scalable storage capacity, such as Amazon Simple Storage Service (S3), Google Cloud Storage or Microsoft Azure Data Lake Storage.
Data lakes first emerged to help enable big data workloads with the Apache Hadoop big data platform. A data lake architecture differs from a data warehouse in that warehouse data is transformed into a structured, organized format before it is stored.
A data warehouse enables users to more easily query the data and use it for data analytics and business intelligence use cases. Data warehouses also provide data governance and data management capabilities.
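The contrast between the two approaches can be sketched in a few lines of Python. This is only an illustrative toy under stated assumptions, not how any particular product works: a local temporary directory stands in for object storage, an in-memory SQLite database stands in for a warehouse, and the event records, the `orders` table and the function names are all hypothetical.

```python
import json
import sqlite3
import tempfile
from pathlib import Path

# Hypothetical raw events, as they might arrive from different sources.
RAW_EVENTS = [
    {"source": "orders_db", "order_id": 1, "amount": "19.99"},
    {"source": "clickstream", "page": "/home", "user": "u42"},
    {"source": "orders_db", "order_id": 2, "amount": "5.00"},
]


def land_in_lake(lake_root: Path, events: list) -> None:
    """Lake side: append each record unchanged under a per-source folder."""
    for event in events:
        target = lake_root / event["source"] / "events.jsonl"
        target.parent.mkdir(parents=True, exist_ok=True)
        with target.open("a", encoding="utf-8") as fh:
            fh.write(json.dumps(event) + "\n")


def load_warehouse(lake_root: Path, conn: sqlite3.Connection) -> None:
    """Warehouse side: transform raw order records into a typed table."""
    conn.execute(
        "CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)"
    )
    orders_file = lake_root / "orders_db" / "events.jsonl"
    with orders_file.open(encoding="utf-8") as fh:
        for line in fh:
            record = json.loads(line)
            # The transformation step: amounts arrive as strings, stored as numbers.
            conn.execute(
                "INSERT INTO orders VALUES (?, ?)",
                (record["order_id"], float(record["amount"])),
            )
    conn.commit()


def total_order_amount(events: list) -> float:
    """Land events in the toy lake, load the toy warehouse, and query it."""
    with tempfile.TemporaryDirectory() as tmp:
        lake_root = Path(tmp)
        land_in_lake(lake_root, events)
        conn = sqlite3.connect(":memory:")
        try:
            load_warehouse(lake_root, conn)
            (total,) = conn.execute("SELECT SUM(amount) FROM orders").fetchone()
        finally:
            conn.close()
    return total


if __name__ == "__main__":
    print(total_order_amount(RAW_EVENTS))
```

In this sketch the lake accepts every record as-is, including the clickstream event that has no place in the `orders` table, while the warehouse step keeps only what fits its schema and makes it queryable with plain SQL.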
The concept of the data lakehouse -- a term coined by Databricks -- is an attempt to bring together the best of data lake and data warehouse technologies. A data lakehouse aims to combine the ease of use and open nature of a data lake with the data warehouse's ability to easily execute queries against data.
A data lakehouse provides additional structure on top of a data lake -- often with the use of a data lake table format technology, such as Delta Lake, Apache Iceberg or Apache Hudi. It also uses a query engine technology, such as Apache Spark, Presto or Trino.
Managing data within an organization can be a multi-stakeholder effort. It can involve different job roles depending on the particular use case. Data warehouses are often managed by data warehouse managers and data warehouse analysts. Those two roles involve data management and data analytics skills, which are typically tied to a specific data warehouse vendor technology. Data lake management is often the domain of data engineers, who help design, build and maintain the data pipelines that bring data into data lakes.