Learn from Google’s Data Engineers: Dimensional Data Modeling is Dead
- by 7wData
My first day at Google was in 2019, and one of my biggest surprises was that I didn’t find any dimensional data marts. I was used to joining facts to dimensions, could rattle off normal forms, and preached the best practices of data modeling. I considered myself an expert on slowly-changing dimensions and how to apply them. Dimensional data modeling, popularized by Ralph Kimball in his 1996 book, is a method of organizing data within a data warehouse. While many benefits are preached, I believe it exists for three primary reasons: optimizing compute, organizing data by topic, and optimizing storage. The foundational needs that drove the advent of dimensional modeling have evolved over time. Let’s take this moment in history to revisit why dimensional modeling exists and how well its roots meet our needs today.
In the early days of computing, storage came at a premium: as much as $90,000 per gigabyte in 1985. At those prices, data needed to be organized in a way that minimized the number of times the same value was stored. To do this, database keys were used as pointers, connecting many records to a single stored copy of a value. Database normalization came about to describe how de-duplicated, and therefore how storage-optimized, a database had become. Instead of storing a long descriptive value every time a dimension appeared in a table, we could store the description just once and connect it to each record it is meant to describe.

Compute optimization was equally important: in 1985, the fastest computer put out 1.9 gigaflops of computational power for a whopping $32M. For reference, some of the fastest computers today put out over 400 petaflops of compute, roughly 200 million times more flops. In the early days of databases, reducing the number of operations and the average complexity of each operation could save companies millions. For example, instead of needing to compare long, complex string values to relate records, we could assign a single integer to each unique value and relate these integers instead of strings. To accomplish these optimizations, data was organized into topic-oriented data models. One common topic-oriented model was the star schema.
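To make the integer-key idea concrete, here is a minimal sketch in Python; the product data, names, and two-table layout are hypothetical, invented for illustration:

```python
# A toy "sales" dataset where the same long description repeats per row.
raw_rows = [
    ("Acme Anvil, 10lb, cast iron", 19.99),
    ("Acme Anvil, 10lb, cast iron", 19.99),
    ("Roadrunner Seed, 5lb bag", 4.50),
]

# Dimension side: each unique description is stored exactly once,
# identified by a small integer surrogate key.
product_key = {}   # description -> integer key
product_dim = []   # integer key -> description

# Fact side: rows keep only the cheap integer key plus the measure.
fact_rows = []
for description, price in raw_rows:
    if description not in product_key:
        product_key[description] = len(product_dim)
        product_dim.append(description)
    fact_rows.append((product_key[description], price))

# Relating records now means comparing integers, not long strings,
# and each description occupies storage once instead of once per row.
for product_id, price in fact_rows:
    print(product_dim[product_id], price)
```

This is the same trade a normalized database makes at scale: long values live once in a dimension, and every fact row carries only a compact key.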
The benefit of this model was that the center of the star, the fact table, contained values that were physically indexed and easily retrievable. The more costly values to retrieve were stored in dimension tables and could be selectively retrieved, saving on processing cost. If new dimensions arose relating to the fact, new dimension tables would have to be put in place, key relationships would need to be enforced, and normalization compliance would have to be maintained. In successful dimensional modeling, the source data tables were torn apart and distributed among many tables and, if done correctly, could be re-assembled back into the source table when necessary.
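As a rough illustration of that tear-apart-and-reassemble cycle, here is a toy star schema built with Python’s standard-library SQLite; the fact and dimension tables shown are invented for the example:

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# One fact table at the center of the star, with a key out to each dimension.
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, description TEXT);
CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);
CREATE TABLE fact_sales (
    product_id INTEGER REFERENCES dim_product(product_id),
    store_id   INTEGER REFERENCES dim_store(store_id),
    amount     REAL
);
INSERT INTO dim_product VALUES (1, 'Acme Anvil, 10lb, cast iron');
INSERT INTO dim_store   VALUES (1, 'Albuquerque');
INSERT INTO fact_sales  VALUES (1, 1, 19.99);
""")

# Re-assembling the original wide source table is a join from the
# fact table out to each of its dimensions.
rows = conn.execute("""
    SELECT p.description, s.city, f.amount
    FROM fact_sales f
    JOIN dim_product p USING (product_id)
    JOIN dim_store   s USING (store_id)
""").fetchall()
print(rows)  # [('Acme Anvil, 10lb, cast iron', 'Albuquerque', 19.99)]
```

Every new dimension means another table, another key relationship to enforce, and another join on the way back out.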
Database normalization is showing its age. The cost of 1 GB of Google Cloud storage per month is just 2 cents; at that rate, even a terabyte-scale table costs roughly $20 per month to store. The returns for breaking a long and/or wide table down into a star or snowflake schema are poor.