Apache Hadoop 2022 • By Yves Mulkers

A Birds-Eye View of a Modern Data Stack


Apache Hadoop, BigQuery, Data Lake

Curated from streamsets.com →

The modern data stack is less like a stack and more like an ecosystem with many participants. This constellation of technologies coalesces around a few guiding principles.

The first principle of the modern data stack is complete customizability. Eschewing a one-size-fits-all solution, the modern data stack lets data teams pick and choose services across each layer. This means that the modern data stack can be as simple or as complicated as an organization’s requirements demand.

The second principle is that, to be part of the modern data stack, a solution must be cloud-native. The reasoning comes down to cost and time efficiency. On-prem systems need teams to implement, update, and manage them. Scaling on-prem systems can be difficult because even minor changes can cause outages. Cloud tools, on the other hand, offer out-of-the-box functionality that quickly gets users up and running. Scaling both up and down can often happen with the push of a button. Gone is the need to hire whole teams to perform maintenance. Instead, resources and focus can be redirected towards what really matters to a business.

The third principle holds that domain experts should be the ones acting on data, i.e., the people closest to the data should transform, store, define, and bring it to the surface for analysis. This contrasts with past models, in which data engineers or IT organizations supervised and controlled these processes on behalf of domain experts. Known throughout the stack as the democratization of data, this principle weaves through each layer of the modern data stack.

In summary, the principles of the modern data stack are complete customizability, cloud-native tooling, and the democratization of data.

The concept of data mesh arises from the tension between two competing ideologies: a centralized and a decentralized approach to data warehousing.

The mesh in data mesh describes how these decentralized, separate warehouses operate together to form the data strategy of an organization.

Several vendors claim to be purveyors of data mesh, but it is more a data strategy than a product.

Cloud data warehouses represent the heart of the modern data stack. Depending on the philosophy that your organization embraces, cloud data warehouses can represent either the primary place your organization consumes its data or one of many decentralized locations where your data is stored. Widely used cloud data warehouses today include Snowflake, Google BigQuery, Amazon Redshift, Databricks, and Azure Synapse.

Honorable mentions in the data stack include data lakes like Amazon S3 and HDFS. These alternative places to park your data do have their downsides, because it is difficult to wrangle raw data into actionable insights. What do I mean by this? A data warehouse is like folding your clothes neatly and putting them into a dresser, whereas a data lake is like tossing them all into a pile in your closet and shutting the door. You save some time on the front end when it comes to putting your data (or your clothes) away, but it is a lot easier to get ready for work when everything is orderly. Data lakehouses aim to offer the best of both worlds by adding that orderly layer on top of lake storage.
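To make the dresser-versus-pile analogy concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not how any particular product works: the standard library's sqlite3 stands in for a warehouse, a list of raw JSON strings stands in for a lake, and the table name, field names, and sample records are all hypothetical.

```python
import json
import sqlite3

# Hypothetical raw events, as they might land in a data lake (one JSON string each).
RAW_EVENTS = [
    '{"customer": "ana", "amount": "19.99"}',  # amount arrives as a string
    '{"customer": "ben", "amount": 5}',
    '{"customer": "ana"}',  # missing field: the lake accepts it anyway
]


def lake_total(raw_events, customer):
    """Lake style: structure is imposed at read time, so every reader handles the mess."""
    total = 0.0
    for line in raw_events:
        record = json.loads(line)
        if record.get("customer") == customer and "amount" in record:
            total += float(record["amount"])
    return total


def load_warehouse(raw_events):
    """Warehouse style: records are validated and typed once, on the way in."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
    )
    for line in raw_events:
        record = json.loads(line)
        # Incomplete records are skipped here; a real pipeline might quarantine them.
        if "customer" in record and "amount" in record:
            conn.execute(
                "INSERT INTO purchases (customer, amount) VALUES (?, ?)",
                (record["customer"], float(record["amount"])),
            )
    conn.commit()
    return conn


def warehouse_total(conn, customer):
    """Once the data is orderly, the question becomes a one-line query."""
    row = conn.execute(
        "SELECT COALESCE(SUM(amount), 0) FROM purchases WHERE customer = ?",
        (customer,),
    ).fetchone()
    return row[0]
```

Both paths give the same answer for the same customer. The difference is where the tidying happens: once at load time for the warehouse, or again in every query for the lake.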

In the modern data stack, data engineers are moving away from legacy data integration methods like scripting. No-code and low-code data integration tools lower the technical barrier to entry and allow non-technical domain experts to be in control of their own data.

A few tools in the data integration layer use APIs to move data from one location to another. These include Fivetran, Stitch, and Airbyte. The limitation of these tools is that they often function in only one direction; for example, you can move data with a connector from Salesforce to your data warehouse but not the other way around (see the section on reverse ETL).
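The one-directional flow described above can be sketched in a few lines of Python. This is a simplified illustration, not the API of Fivetran, Stitch, or Airbyte: the source rows, the `fetch_source_rows` function, and the `accounts` table are hypothetical, and sqlite3 stands in for the warehouse.

```python
import sqlite3

# Hypothetical rows that a source system's API (for example, a CRM) might return.
SOURCE_ROWS = [
    {"id": 1, "name": "Acme", "stage": "won"},
    {"id": 2, "name": "Globex", "stage": "open"},
]


def fetch_source_rows():
    """Stand-in for an API call to the source system."""
    return [dict(row) for row in SOURCE_ROWS]


def sync_to_warehouse(conn, rows):
    """Copy rows from source to warehouse. Nothing here writes back to the source."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS accounts "
        "(id INTEGER PRIMARY KEY, name TEXT, stage TEXT)"
    )
    # Replace on matching id so that re-running the sync does not duplicate rows.
    conn.executemany(
        "INSERT OR REPLACE INTO accounts (id, name, stage) "
        "VALUES (:id, :name, :stage)",
        rows,
    )
    conn.commit()
    return conn.execute("SELECT COUNT(*) FROM accounts").fetchone()[0]
```

A change made in the warehouse table stays in the warehouse; sending it back to the source system would need a separate flow in the opposite direction.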

