The modern data stack — a short intro

Data has become a valuable asset in every company: it forms the basis for predictions, personalization of services, optimization and insights in general. While companies have been aggregating data for decades, the tech stack has evolved considerably and is now referred to as “the modern data stack”.

The modern data stack is an architecture and a strategy that allows companies to become data-driven in every decision they make. Its goal is to put in place the tools, processes and teams needed to build end-to-end data pipelines.

A data pipeline is a process in which data flows from a source to a destination, typically with many intermediate steps. Historically this process was called ETL, after its three main steps:

E = Extract: getting the data out of the source
T = Transform: transforming the raw source data into understandable data
L = Load: loading the data into a data warehouse
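To make the three steps concrete, here is a minimal ETL sketch in plain Python. The file name, column names and the SQLite “warehouse” are purely illustrative assumptions; real pipelines use dedicated tools, but the shape of the process is the same.

```python
# A minimal ETL sketch. "orders.csv", its columns and the SQLite table
# are hypothetical examples, not a prescribed layout.
import csv
import sqlite3

def extract(path):
    # E: read the raw rows out of the source file
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # T: turn raw strings into typed, understandable records
    return [(r["order_id"], float(r["amount_usd"])) for r in rows]

def load(records, db_path="warehouse.db"):
    # L: write the cleaned records into a warehouse table
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS orders (order_id TEXT, amount_usd REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?)", records)
    con.commit()
    con.close()

load(transform(extract("orders.csv")))
```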

The first step in a data pipeline is to extract data from the source. The challenge is that there are many different types of sources: databases, log files, and also business applications. Modern tools such as Stitch and Fivetran make it easy to extract data from a wide range of sources, including SaaS business applications. For the latter, the APIs of those SaaS applications are used to read the data incrementally.
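As an illustration of incremental extraction over an API, here is a hedged sketch. The endpoint, the updated_since parameter and the next_page cursor are all hypothetical, since every SaaS API differs — which is exactly the variety that tools like Stitch and Fivetran abstract away.

```python
# Sketch of incremental extraction from a SaaS API. Endpoint, parameter
# and cursor names are invented for illustration.
import requests

def extract_incremental(base_url, last_synced_at):
    # Ask the API only for records changed since the previous run,
    # following pagination until the source is exhausted.
    records = []
    url = f"{base_url}/contacts"
    params = {"updated_since": last_synced_at}
    while url:
        resp = requests.get(url, params=params, timeout=30)
        resp.raise_for_status()
        payload = resp.json()
        records.extend(payload["results"])
        url = payload.get("next_page")  # None when there are no more pages
        params = None  # the next-page URL already carries the cursor
    return records
```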

For large databases, the concept of CDC is often used. CDC (change data capture) is a method where all changes that occur in a source database (inserts of new data, updates of existing data and deletions of data) are tracked and sent to a destination database (or data warehouse) to recreate the data there. CDC avoids the need for a daily or hourly full dump of the data, which would take too long to import.
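A toy illustration of the destination side of CDC follows. The event format is made up; real CDC tools emit richer payloads, but applying inserts, updates and deletes to keep a replica in sync is the core idea.

```python
# Applying a stream of (hypothetical) CDC events to a destination table.
import sqlite3

def apply_change(con, event):
    op, key, data = event["op"], event["id"], event.get("data", {})
    if op == "insert":
        con.execute("INSERT INTO customers (id, name) VALUES (?, ?)",
                    (key, data["name"]))
    elif op == "update":
        con.execute("UPDATE customers SET name = ? WHERE id = ?",
                    (data["name"], key))
    elif op == "delete":
        con.execute("DELETE FROM customers WHERE id = ?", (key,))

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
for event in [
    {"op": "insert", "id": 1, "data": {"name": "Ada"}},
    {"op": "update", "id": 1, "data": {"name": "Ada L."}},
    {"op": "delete", "id": 1},
]:
    apply_change(con, event)
con.commit()
```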

Data transformations can be done in different ways, for example with visual tools where users drag and drop blocks to build a pipeline with multiple steps. Each step is one block in the visual workflow and applies one type of transformation.

Another popular and “modern” approach is to use SQL queries to define transformations. This makes sense because SQL is a powerful language known by many users, and it is also “declarative”, meaning that users can define in a concise manner what they want to accomplish. The SQL queries are executed by the data warehouse itself to do the actual transformations of the data. Typically this means that data moves from one table to the next, until the final result is available in a set of golden tables. The most popular tool for implementing pipelines with SQL queries is called “dbt”.
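The table-to-table flow that such SQL-based tools orchestrate can be sketched as below. SQLite stands in for the warehouse and all table and column names are invented; in dbt, each SELECT would live in its own model file rather than in one script.

```python
# Sketch of SQL transformations flowing from raw to golden tables.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE raw_orders (order_id TEXT, amount_usd TEXT, status TEXT);
INSERT INTO raw_orders VALUES ('o-1', '19.99', 'paid'), ('o-2', '5.00', 'refunded');

-- Step 1: a staging table that cleans and types the raw data
CREATE TABLE stg_orders AS
SELECT order_id, CAST(amount_usd AS REAL) AS amount_usd, status
FROM raw_orders;

-- Step 2: a "golden" table aggregating the staged data
CREATE TABLE fct_revenue AS
SELECT status, SUM(amount_usd) AS total_usd
FROM stg_orders
GROUP BY status;
""")
print(con.execute("SELECT * FROM fct_revenue").fetchall())
```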

Data scientists will often use programming languages such as Python or R to transform the data.
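A small example of such a transformation in Python with pandas; the column names are illustrative, and R users would express the same steps with, say, dplyr.

```python
# Typing, filtering and aggregating raw data with pandas.
import pandas as pd

raw = pd.DataFrame({
    "order_id": ["o-1", "o-2", "o-3"],
    "amount_usd": ["19.99", "5.00", "42.50"],
    "status": ["paid", "refunded", "paid"],
})

# Cast the raw strings to numbers, then aggregate, mirroring a SQL pipeline.
clean = raw.assign(amount_usd=raw["amount_usd"].astype(float))
revenue = clean[clean["status"] == "paid"].groupby("status")["amount_usd"].sum()
print(revenue)
```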

The third step in a classic ETL pipeline is to load the data into tables in a data warehouse. A well-known pattern for organizing the loaded data is the so-called star schema, which defines the structure of the tables and how information is organized across them.
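In a star schema, a central fact table of measures references the surrounding dimension tables. Here is a minimal example with invented tables; SQLite again stands in for the warehouse, and the layout is the point, not the engine.

```python
# A minimal star schema: one fact table pointing at three dimensions.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_date     (date_id INTEGER PRIMARY KEY, day TEXT, month TEXT, year INTEGER);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, country TEXT);
CREATE TABLE dim_product  (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);

-- The fact table holds the measures plus foreign keys to each dimension.
CREATE TABLE fct_sales (
    date_id     INTEGER REFERENCES dim_date(date_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    product_id  INTEGER REFERENCES dim_product(product_id),
    quantity    INTEGER,
    amount_usd  REAL
);
""")
```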

Strangely enough, ETL does not cover the entire data pipeline. An ETL pipeline stops at the data warehouse as its final destination. The data warehouse is a central location to store data, but the end goal is typically either a BI tool used to create dashboards (Qlik, Tableau, Power BI, Looker) or a machine learning model that uses the data from the warehouse to, for example, make predictions.

More recently, companies have adopted an ELT approach, swapping the L and the T. This means that the data is extracted and then loaded into a central repository, but the transformation takes place later, if and when it is necessary. The switch from ETL to ELT is a result of the explosion in the volume of data being generated. Since the transformation step is the most complex and most costly one, it makes sense to wait and only transform the data that is actually required at a given point in time.
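ELT in miniature could look like the following sketch: the raw payload is loaded untouched, and the transform runs later, only for the data a consumer actually asks for. Table and field names are illustrative, and the query relies on SQLite's JSON functions (standard in recent builds).

```python
# E + L first, T later and on demand.
import json
import sqlite3

con = sqlite3.connect(":memory:")

# E + L: land the raw records as-is, here as JSON blobs
con.execute("CREATE TABLE raw_events (payload TEXT)")
con.executemany("INSERT INTO raw_events VALUES (?)", [
    (json.dumps({"user": "ada", "action": "login"}),),
    (json.dumps({"user": "ada", "action": "purchase", "amount": 12.5}),),
])

# T (later): transform only what is needed right now
purchases = con.execute("""
    SELECT json_extract(payload, '$.user')   AS user,
           json_extract(payload, '$.amount') AS amount
    FROM raw_events
    WHERE json_extract(payload, '$.action') = 'purchase'
""").fetchall()
print(purchases)
```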

ELT is therefore considered a more modern approach than ETL.
