The Evolution of Data Warehousing to Modern Data Engineering
- by 7wData
The data industry has seen a great deal of evolution since the early days of traditional data warehousing. We now rely on the data engineer, as opposed to the ETL developer. DevOps has made its way into the data strategy and is a clear differentiator between data warehousing and modern data engineering.
Tools like Spark and Python have become crucial for the data engineer. Algorithms are starting to play a larger part in Business Intelligence and decision making. Soon enough we will be able to extract analytics without even knowing where that data is located.
Joe: Good question, and because I’m a consultant, I can say it depends. I can talk about the trends in what we’re doing with our clients, which is a pretty good sample of what’s going on in the industry and in the world.
So modern data engineering for us typically means being on the cloud. Nearly 100% of our work in 2017 has been either migrating to the cloud or building something from scratch in the cloud. If there is some legacy on-premises tech there, sometimes we extend it, but typically any new initiative where analytics is a central feature is going to be a cloud solution. Whether it’s AWS or Google or Azure or something else is a mixed bag. It really depends on the features, functions, and religions involved.
So, with that, once we have a cloud infrastructure, then typically what’s included is some kind of object data store. If it’s AWS, that would be S3; if it’s Google Cloud, it would be Google Cloud Storage, GCS; on Azure it would be what they call Blob Storage. Which flavor of the cloud doesn’t really matter that much. So now we have cloud object storage, and then we need some kind of queryable, BI-friendly environment. Typically that is still some kind of relational, MPP-type database, still in the cloud. On Google it would be BigQuery; on AWS it would be Redshift or Snowflake.
Then the final piece is the data transformations and the orchestration. Typically we’ve been using Spark for all of that: all of the ETL, all of the movement of data from the object data store to the relational database, is done in Python or SQL code, always on Spark. The analytics on top of it would still be Spark, and then on the relational database some kind of newer, lightweight BI tool. That’s really the infrastructure.
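The pipeline Joe describes, land raw files in an object store, transform them, then load a query-friendly relational table, can be sketched in miniature. This is a hypothetical stand-in using only the Python standard library: an in-memory CSV replaces the object store (S3/GCS), and SQLite replaces the MPP warehouse (Redshift/BigQuery); in practice each step would be a PySpark job.

```python
import csv
import io
import sqlite3

# Hypothetical raw extract, as it might land in the object store (e.g. S3).
RAW_CSV = """order_id,customer,amount
1,acme,120.50
2,globex,75.00
3,acme,19.99
"""

def transform(raw_csv):
    """Transform step: parse the raw file and type the columns.
    In the stack described above, this would be a Spark job reading
    from cloud object storage."""
    rows = []
    for rec in csv.DictReader(io.StringIO(raw_csv)):
        rows.append((int(rec["order_id"]), rec["customer"], float(rec["amount"])))
    return rows

def load(rows):
    """Load step: write into a queryable relational store.
    SQLite stands in here for an MPP warehouse such as Redshift or BigQuery."""
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE orders (order_id INT, customer TEXT, amount REAL)")
    con.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    return con

# Extract -> transform -> load, then a BI-style aggregate query on top.
con = load(transform(RAW_CSV))
total_by_customer = dict(
    con.execute("SELECT customer, SUM(amount) FROM orders GROUP BY customer")
)
```

The shape is the point, not the libraries: the raw layer stays cheap and immutable in object storage, while the relational layer exists purely to make the data queryable for BI tools.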
This all fits very neatly into what we call “the corporate data pyramid”, which we’ll get into. So that’s just the infrastructure side and then we have to build it, maintain it, and productionalize it.
In 2011, probably 20% of our business was doing this, and it has increased by 10% to 20% each year; now in 2018 it’s all of our business. But what has changed in the past year is that we’re now integrating the concept of DevOps with the analytics platform. And this is one of the big differences between traditional data warehousing and modern data engineering and analytics platforms: you used to have some kind of business application that generated all of the data, then the data would waterfall into a data warehouse, and then it would do reporting for human consumption. The really big thing that’s changed in the last couple of years is that the analytics platform is now tightly integrated and tightly coupled with the business applications. The business applications now depend on the analytics in order to function and to create the user experience, based on recommendation engines or scoring: things like propensity to buy, or propensity to do whatever we’re trying to measure.
So now we need different kinds of SLAs. What’s happening is that the development and deployment of change has to be very tightly coupled with the business applications.
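In practice, that tight coupling usually means the analytics code rides the same CI/CD pipeline as the application code, so a change to an ETL job ships under the same SLAs as a change to the application itself. A minimal sketch, assuming a GitHub Actions-style workflow; the job names, paths, and deploy script are hypothetical:

```yaml
# Hypothetical CI pipeline: analytics jobs are tested and deployed
# alongside the business application, not in a separate silo.
name: app-and-analytics
on: [push]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pytest app/tests   # business application tests
      - run: pytest etl/tests   # Spark ETL job tests
  deploy:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: ./deploy.sh app    # deploy the application (hypothetical script)
      - run: ./deploy.sh etl    # deploy the ETL jobs in the same release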