Evolution of the Modern Data Warehouse
- by 7wData
There are a lot of definitions of the data warehouse. I grabbed a random definition off the web. It fits the general understanding in the data management industry of what a data warehouse is, and what it isn’t.
If you’re looking at that definition and thinking, “That looks right to me,” then read on. Once upon a time, I probably would have agreed with this definition as well. But times have changed.
The processes and technologies of data warehousing have changed a lot in the last ten years, but as industry professionals, we often still think about data warehousing the same way. In our minds, we’re still using the same definition of a data warehouse that we used a decade ago.
So, in what ways is that definition wrong?
In the last decade, the data management field has been radically transformed. Yet many people assume that change left the data warehouse untouched. That is a misconception.
Doug Laney’s classic three V’s (volume, velocity, and variety) hit the data management industry with a tsunami of data, and with it, a sea change in the types of analytics we can do with that data. Old-school analytical databases had more data than they could handle affordably, so valuable data fell through the cracks. We had business demand for analysis of types of data we’d never even tried to deal with in the past: semi-structured data like JSON and Avro, log files from sensors and components, geospatial data, clickstream data, and on and on. We had data coming at us too fast for the old technology to take it in, scrub it, combine it with other data sets, and provide it to the business in a useful way.
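To make "semi-structured" concrete, here is a minimal sketch, purely illustrative and not from the original article (the event names and fields are hypothetical), of a JSON clickstream event and the lossy flattening a fixed relational schema forces on it:

```python
import json

# A hypothetical clickstream event. Note the nested objects and optional
# fields: every event can carry a slightly different shape, which is what
# makes this data "semi-structured" and awkward for a rigid table design.
event = json.loads("""
{
  "event": "page_view",
  "timestamp": "2014-06-01T12:34:56Z",
  "user": {"id": 42, "segment": "trial"},
  "page": "/pricing",
  "device": {"type": "mobile", "os": "android"},
  "experiments": ["new_nav", "pricing_v2"]
}
""")

# Flattening for a traditional warehouse table means deciding the columns up
# front; fields that appear in only some events end up NULL or get dropped.
row = {
    "event": event["event"],
    "timestamp": event["timestamp"],
    "user_id": event["user"]["id"],
    "page": event.get("page"),
    "device_type": event.get("device", {}).get("type"),
}
print(row)
```

Multiply that schema-wrangling by thousands of event types and millions of events per hour, and the strain on the old analytical databases becomes obvious.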
But that was only part of the problem.
Having all this data meant we could do things we’d never done before. Machine learning, data science, and artificial intelligence were all fields of study back in the ’90s when I was in college, but we didn’t yet have the capacity to put them to useful work.
The problem was also the promise.
Having tons of data in all these varieties of formats enabled new and exciting, and potentially industry-disrupting analyses. Early experiments showed machine learning could provide impressive improvements in existing systems, whole new systems for making organizations more successful, and even whole new industries and business models.
The challenge was figuring out a way to store and process all that data to get it into a good form for all those cool new types of advanced data analytics.
Hadoop to the Rescue … or Not.
Along came a cute yellow elephant with a lifeboat, promising to store all that data affordably, process it for us, and give us a great platform for fancy big data analytics like machine learning. It seemed like exactly what we needed. This was the new hotness.
Throw away your old and busted data warehouse that you’ve been running essential Business Intelligence on for decades.
What could go wrong?
Obviously, a lot of things. Dumping all of their important data onto a giant cluster of reasonably priced servers, alongside a bunch of other data, didn’t work out so well for a lot of companies. That was never the intention of the people who invented the data lake concept, but that was largely how it was used: as a dumping ground for data. It was a great place to archive years of data that wouldn’t fit in transactional systems and didn’t seem valuable enough to put in the data warehouse itself. Great. It was all stored. Then what?
Let’s not even talk about Business Intelligence. Hadoop vendors claimed data warehouse-like capabilities, but a query engine on top of a big pile of data does not a data warehouse make. It lacked concurrency, so only a handful of people could use it at once. It lacked security; it lacked governance. Above all, it lacked the reliability and speed that were the hallmarks of the data warehouse.
Data scientists were supposed to be the ones who could do amazing things on Hadoop.