The modern data stack — a short intro
- by 7wData
Data has become a valuable asset in every company, as it forms the basis of predictions, personalization of services, optimization, and insights in general. While companies have been aggregating data for decades, the tech stack has greatly evolved and is now referred to as "the modern data stack".
The modern data stack is an architecture and a strategy that allows companies to become data-driven in every decision they make. The goal of the modern data stack is to set up the tools, processes and teams to build end-to-end data pipelines.
A data pipeline is a process where data flows from a source to a destination, typically with many intermediate steps. Historically this process was called ETL, referring to the 3 main steps of the process:
- E = Extract: getting the data out of the source
- T = Transform: transforming the raw source data into understandable data
- L = Load: loading the data into a data warehouse
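The three steps above can be sketched end to end. This is a minimal illustration, not a real pipeline: a CSV string stands in for the source and a plain list stands in for the warehouse table.

```python
import csv
import io

def extract(raw_csv: str) -> list[dict]:
    """E: read raw rows out of a source (a CSV string stands in for a real source)."""
    return list(csv.DictReader(io.StringIO(raw_csv)))

def transform(rows: list[dict]) -> list[dict]:
    """T: turn raw source data into understandable data (trim names, parse numbers)."""
    return [{"user": r["user"].strip().lower(), "amount": float(r["amount"])}
            for r in rows]

def load(rows: list[dict], warehouse: list[dict]) -> None:
    """L: append the transformed rows to the destination table."""
    warehouse.extend(rows)

warehouse: list[dict] = []
raw = "user,amount\nAlice ,10.5\nBOB,3\n"
load(transform(extract(raw)), warehouse)
```

In a real stack each function would be a separate tool (an extractor, a transformation engine, a warehouse loader), but the data flow is the same.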
The first step in a data pipeline is to extract data from the source. The challenge is that there are many different types of sources: databases and log files, but also business applications. Modern tools such as Stitch and Fivetran make it easy to extract data from a wide range of sources, including SaaS business applications. For the latter, the APIs of those SaaS applications are used to read the data incrementally.
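Incremental reads typically work by remembering a cursor (for example, the highest `updated_at` value seen so far) and asking the API only for records changed since then. A sketch under assumptions: `fetch_since` and the in-memory record list are hypothetical stand-ins for a real SaaS API.

```python
# Hypothetical "SaaS API": records carrying an updated_at timestamp.
RECORDS = [
    {"id": 1, "updated_at": "2024-01-01"},
    {"id": 2, "updated_at": "2024-02-01"},
    {"id": 3, "updated_at": "2024-03-01"},
]

def fetch_since(cursor: str) -> list[dict]:
    """Stand-in for an API call that returns only records changed after `cursor`."""
    return [r for r in RECORDS if r["updated_at"] > cursor]

def extract_incremental(state: dict) -> list[dict]:
    """Read records changed since the last run, then advance the saved cursor."""
    new = fetch_since(state.get("cursor", ""))
    if new:
        state["cursor"] = max(r["updated_at"] for r in new)
    return new

state: dict = {}
first = extract_incremental(state)   # everything on the first run
second = extract_incremental(state)  # nothing new on the second run
```

The `state` dictionary is what an extractor like Stitch or Fivetran persists between runs so it never re-reads unchanged data.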
For large databases, the concept of CDC is often used. CDC (change data capture) is a method where all changes that occur in a source database (inserts of new data, updates of existing data and deletions of data) are tracked and sent to a destination database (or data warehouse) to recreate the data. CDC avoids the need to make a daily or hourly full dump of the data, which would take too long to import.
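The core of CDC is replaying a stream of change events against the destination. A minimal sketch, assuming a simplified event format (`op` plus the affected row) and a destination table keyed by `id`:

```python
def apply_change(table: dict, event: dict) -> None:
    """Replay one CDC event (insert/update/delete) against a destination table."""
    op, row = event["op"], event["row"]
    if op in ("insert", "update"):
        table[row["id"]] = row
    elif op == "delete":
        table.pop(row["id"], None)

dest: dict = {}
change_log = [
    {"op": "insert", "row": {"id": 1, "name": "Ann"}},
    {"op": "insert", "row": {"id": 2, "name": "Bob"}},
    {"op": "update", "row": {"id": 1, "name": "Anna"}},
    {"op": "delete", "row": {"id": 2}},
]
for event in change_log:
    apply_change(dest, event)
# dest now mirrors the source table without ever taking a full dump
```

Real CDC tools read these events from the database's transaction log rather than from application code, but the replay logic is the same idea.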
Data transformations can be done in different ways, for example with visual tools where users drag and drop blocks to implement a pipeline that has multiple steps. Each step is one block in the visual workflow and applies one type of transformation.
Another popular and "modern" approach is to use SQL queries to define transformations. This makes sense because SQL is a powerful language known by many users, and it's also "declarative", meaning that users can define in a concise manner what they want to accomplish. The SQL queries are executed by the data warehouse to do the actual transformations of the data. Typically this means that data moves from one table to the next, until the final result is available in a set of golden tables. The most popular tool to implement pipelines using SQL queries is called "dbt".
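The pattern can be illustrated with a single declarative statement that materializes one table from another. Here sqlite3 stands in for the warehouse; in a real stack the same SQL would run inside Snowflake, BigQuery, etc., typically orchestrated by dbt.

```python
import sqlite3

con = sqlite3.connect(":memory:")  # sqlite3 as a stand-in warehouse
con.execute("CREATE TABLE raw_orders (customer TEXT, amount REAL)")
con.executemany("INSERT INTO raw_orders VALUES (?, ?)",
                [("alice", 10.0), ("bob", 5.0), ("alice", 7.5)])

# One declarative SQL statement moves data from the raw table to the next table.
con.execute("""
    CREATE TABLE orders_by_customer AS
    SELECT customer, SUM(amount) AS total
    FROM raw_orders
    GROUP BY customer
""")
totals = dict(con.execute("SELECT customer, total FROM orders_by_customer"))
```

Note that the query states *what* the result should be (totals per customer), not *how* to compute it; the warehouse engine decides the execution plan.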
Data scientists will often use programming languages to transform the data, for example Python or R.
The third step in a classic ETL pipeline is to load the data into tables in a data warehouse. A well-known pattern for organizing the loaded data is the so-called star schema, which defines the structure of the tables and how information is organized in these tables.
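In a star schema, one central fact table holds the measurable events and references surrounding dimension tables by key. A minimal sketch (again using sqlite3 in place of a warehouse; table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Fact table in the middle, dimension tables around it: the "star".
con.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT);
    CREATE TABLE fact_sales (
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        date_id     INTEGER REFERENCES dim_date(date_id),
        amount      REAL
    );
""")
con.execute("INSERT INTO dim_customer VALUES (1, 'Alice')")
con.execute("INSERT INTO dim_date VALUES (1, '2024-05-21')")
con.execute("INSERT INTO fact_sales VALUES (1, 1, 9.99)")

# Analytical questions are answered by joining the fact back to its dimensions.
row = con.execute("""
    SELECT c.name, d.day, f.amount
    FROM fact_sales f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_date d USING (date_id)
""").fetchone()
```

Keeping descriptive attributes in dimensions and numeric measures in the fact table keeps the fact table narrow, which matters when it holds millions of rows.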
Funnily enough, ETL does not cover the entire data pipeline. An ETL pipeline stops at the data warehouse as its final destination. The data warehouse is a central location to store data, but the end goal is typically either a BI tool that creates dashboards (Qlik, Tableau, Power BI, Looker) or a machine learning model that uses the data from the data warehouse to, for example, make predictions.
More recently, companies have adopted an ELT approach, switching the L and the T. This means that the data is extracted and then loaded into a central repository, but the transformation takes place later, if and when it's necessary. The switch from ETL to ELT is a result of the explosion in the amount of data being generated. Since the transformation step is the most complex and most costly step, it makes sense to wait and only transform the data that is actually required at a certain point in time.
ELT is considered a more modern approach compared to ETL.