The Modern Datawarehouse

In the beginning were applications. Applications did a lot of repetitive work that humans would otherwise have had to do. As such, corporations were very attracted to computerized applications. Applications saved time, money and unnecessary work.

One day there were applications - Bill Inmon

People found that the applications were good because they alleviated the burden of much manual work that people were doing. Then came online applications. With online applications the corporation could do things that had never been done before, such as manage reservation systems and bank teller systems automatically.

The spread of applications

In short order there were applications everywhere.

spread of applications

Silos of information

Because of the way that applications were built, applications soon took the shape of silos of information. An application had little trouble communicating with other applications in its own silo. Sharing data and forming a collaborative effort with other silos of information was very difficult to do.

But trying to work cooperatively with applications outside of the silo was a real problem.

applications in silos

The spider’s web environment

Another way of portraying siloed systems was as a spider’s web environment. In the spider’s web environment data is extracted and is placed in many different locations.

silos of systems

In hindsight, there were many problems with the spider’s web environment. One problem was trying to do maintenance on the spider’s web. Another problem was that new applications just kept appearing. Another problem was that adding more technology, more hardware and more consultants only made the spider’s web problem worse, not better.

But the biggest problem with the spider’s web environment was that there was no integrity of data in the spider’s web. The same data element appeared in multiple places and had a different value in each place. No one knew what to believe and no one trusted any of their data.

The values of data simply could not be trusted even if they could be found in the spider’s web environment.

same data element different values

The data warehouse

Out of the miasma of the siloed, spider’s web environment came the architectural solution known as the data warehouse. The data warehouse called for the separation of operational, transaction based systems from analytical systems. Indeed, the structure and the properties of both kinds of systems were very different.

At the time that the data warehouse first appeared, the notion of a separation of systems was a radical idea. Conventional wisdom of the time dictated that the two types of processing be consolidated into a single data base. But the notion of a data warehouse contradicted that idea and met with stout resistance from the vendor and academic community.

operational transaction based systems

What is a data warehouse

So what was a data warehouse? A data warehouse was defined to be a –

  • Subject oriented
  • Integrated
  • Non-volatile
  • Time variant

collection of data in support of management’s decisions.
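Two of these properties, non-volatile and time variant, can be made concrete with a small sketch. The class and field names below (Snapshot, CustomerHistory, credit_limit) are hypothetical and chosen only for illustration. The idea shown is that rows are appended and never updated, and that every row carries a date so that the data can be read as of any moment in time.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Snapshot:
    """One immutable, dated record about a subject (here, a customer)."""
    customer_id: int
    as_of: date
    credit_limit: int


class CustomerHistory:
    """Append-only store: rows are added, never updated or deleted."""

    def __init__(self) -> None:
        self._rows: List[Snapshot] = []

    def append(self, row: Snapshot) -> None:
        self._rows.append(row)

    def as_of(self, customer_id: int, when: date) -> Optional[Snapshot]:
        """Return the latest snapshot dated on or before `when`, if any."""
        candidates = [
            r for r in self._rows
            if r.customer_id == customer_id and r.as_of <= when
        ]
        return max(candidates, key=lambda r: r.as_of, default=None)


history = CustomerHistory()
history.append(Snapshot(1, date(2020, 1, 1), 1000))
history.append(Snapshot(1, date(2021, 6, 1), 2500))

early = history.as_of(1, date(2020, 12, 31))  # the 2020 snapshot
late = history.as_of(1, date(2022, 1, 1))     # the 2021 snapshot
```

A change to the customer does not overwrite the old row. It adds a new dated row, which is what keeps the history available for analysis.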

Another way of saying it is that the data warehouse was the single version of the truth.

In a word the data warehouse was a source of data that was –

  • Correct
  • Complete
  • Accurate
  • Easy to access
  • Granular, able to be reshaped by anyone who needed to use the data.

single version of the truth

Historical data

Another aspect of the data warehouse was that the data warehouse stored historical data. Prior to the data warehouse historical data was jettisoned as soon as possible from the transaction processing environment in order to enhance the performance of the transaction processing the corporation was doing. But the data warehouse became the place where historical data fit nicely.

There was a great need for looking at historical data. To understand customers you needed to know their history, because customers are creatures of habit. Knowing the history of a customer was key to predicting the customer’s future behavior.

Another feature of the data warehouse was that it was designed for doing analytical processing, not operational processing. Online transaction processing was not anything that the data warehouse supported.

historical data non operational processing

The relational model

Underlying the data warehouse was the relational model. The relational model fit conveniently with the analytical needs of the end user. In addition, the relational model fit well with the DBMS that the data warehouse existed on.

The data in the data warehouse was granular. The granularity of the data in the data warehouse allowed the data to be reshaped for the purposes of analytical processing. The data in the data warehouse was like grains of sand. Sand could be heated and remade into many different forms. Like silicon, the grains of sand could be made into semiconductors, eyeglasses, body parts, and so forth.
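As a rough illustration of what reshaping granular data means, the sketch below keeps sales at the lowest level of detail and then summarizes the same rows in two different ways. The rows, the column layout, and the helper name total_by are all made up for this example.

```python
from collections import defaultdict

# Hypothetical granular rows: (customer, region, month, amount)
sales = [
    ("acme", "east", "2024-01", 120),
    ("acme", "east", "2024-02", 80),
    ("zenith", "west", "2024-01", 200),
    ("zenith", "west", "2024-02", 50),
]


def total_by(rows, column):
    """Sum the amount (index 3) grouped by the value at index `column`."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[column]] += row[3]
    return dict(totals)


by_region = total_by(sales, 1)  # {"east": 200, "west": 250}
by_month = total_by(sales, 2)   # {"2024-01": 320, "2024-02": 130}
```

Had the data been stored only as monthly totals, the regional view could not have been produced. Granular data leaves both options open.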

relational model

Parochial forms of data

Very quickly it was discovered that there needed to be parochial views of data coming from the data warehouse. Different organizations needed to look at the data in the data warehouse in different ways. From the data warehouse emanated data marts. The data marts organized data the way that different organizations needed to shape the data. Marketing had its data mart. Sales had its data mart. Finance had its data mart, and so forth.

Typically the data marts were built with the dimensional model of data and were built into star joins.
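For readers who have not seen a star join, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names (fact_sales, dim_product, dim_date) are assumptions made for the example, not a prescribed design. A central fact table holds the measurements, and each dimension table is joined to it by a key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE dim_date (date_id INTEGER PRIMARY KEY, year INTEGER);
    CREATE TABLE fact_sales (product_id INTEGER, date_id INTEGER, amount INTEGER);

    INSERT INTO dim_product VALUES (1, 'tools'), (2, 'toys');
    INSERT INTO dim_date VALUES (10, 2023), (11, 2024);
    INSERT INTO fact_sales VALUES (1, 10, 100), (1, 11, 150), (2, 11, 40);
""")

# The star join: the fact table in the middle, one join per dimension.
star_join = """
    SELECT p.category, d.year, SUM(f.amount)
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    JOIN dim_date AS d ON d.date_id = f.date_id
    GROUP BY p.category, d.year
    ORDER BY p.category, d.year
"""
rows = conn.execute(star_join).fetchall()
# rows == [("tools", 2023, 100), ("tools", 2024, 150), ("toys", 2024, 40)]
```

A marketing data mart and a finance data mart could share the same fact data and differ only in which dimensions they join and how they group the result.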

data marts from the datawarehouse

The need for integrated data

One of the essences of the data warehouse was that data needed to be integrated once the data was put into the data warehouse. Indeed, if data were not integrated upon being placed into the data warehouse, then there was no data warehouse. The net result of the integration of data was the creation of enterprise-wide data.

Integration needed to occur at many different levels. One level was at the semantic level. At the semantic level, data needed to have consistency of definition and name. In one place gender was designated as m,f. In another place gender was designated as male, female. In yet another place, gender was designated as 1,0. When data was placed into the data warehouse there needed to be a single designation of gender. And where gender was not designated as the enterprise norm, it needed to be converted.
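The gender example can be sketched in a few lines. Everything named here is hypothetical: the source system names, the choice of "M" and "F" as the enterprise norm, and in particular which of 1 and 0 stands for which gender, since that depends entirely on the application that produced the code.

```python
# Enterprise norm chosen for this sketch: "M" / "F" (an assumption).
# Each source system gets its own mapping, because a code such as 1 or 0
# only has meaning relative to the application that produced it.
SOURCE_MAPPINGS = {
    "app_a": {"m": "M", "f": "F"},
    "app_b": {"male": "M", "female": "F"},
    "app_c": {"1": "M", "0": "F"},  # assumed meaning of 1 and 0
}


def to_enterprise_gender(source, raw):
    """Convert a source-specific gender code to the enterprise norm."""
    key = str(raw).strip().lower()
    try:
        return SOURCE_MAPPINGS[source][key]
    except KeyError:
        # Refuse to guess: unmapped values must be resolved by a person.
        raise ValueError(f"no mapping for {raw!r} from {source!r}") from None


converted = [
    to_enterprise_gender("app_a", "f"),
    to_enterprise_gender("app_b", "Male"),
    to_enterprise_gender("app_c", 1),
]
# converted == ["F", "M", "M"]
```

The conversion fails loudly on a value it does not recognize. Passing unknown codes through silently would put unintegrated data into the data warehouse.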

The problem with integration was that integration required a lot of work and necessarily dealt with complexity. Integration was like planting tomatoes in your back yard in the springtime. You can’t plant tomatoes without getting your hands dirty. And you can’t do integration without getting your hands dirty. And consultants and vendors hated getting their hands dirty.

The design of the data warehouse and the formulation of the ETL that is needed to convert the data from its application foundation to its enterprise state is governed by a data model. It is possible to do this transformation without a data model. But it is strongly advised to use the guidance that is afforded by a data model.

data model used to control migration of data

That was yesterday

Those characteristics were what data warehouse was at the outset.

But a new day has dawned. The world is much more diverse today and much more complex than it was when data warehouse was conceived. There is new technology. There are new architectures. There are new types of data.

So what is a modern data warehouse today?

a new day has dawned

Today’s world

One way that the world has changed is that systems – data and programs – are much more distributed today than they were yesterday. Data, processing, and interfaces are much more likely to be found in multiple places and multiple formats than they were in an earlier day and age.

Today, for a variety of reasons, it is accepted that the data warehouse be placed on the cloud. There is no reason why the data warehouse should not be placed on the cloud. It fits comfortably there.

processing is distributed

Another significant change today is that there are different types of data found in the corporation. Of course, there is transaction based data that is generated by the business systems of the corporation. Transaction based data has been there from the beginning and hasn’t significantly changed.

But there is also textual data. Much very important information is wrapped up in the form of text. Call center conversations, medical records, corporate contracts, and many more types of documents come in the form of text. And management needs to consider those types of information in their decision-making process. So textual data is an integral part of the modern data warehouse.

Another type of data that has become available is analog/IoT data. Today’s machines generate a huge amount of data as a result of day-to-day machine processing. There is much value that can be found in this type of data. Analog/IoT data belongs in a modern data warehouse as well.

So – not only is data distributed today, but there are many different kinds of data that are available today that weren’t available in the early days of data warehouse. And all of these ingredients make up the modern data warehouse.

The distribution of data and the makeup of data itself have had an effect on what should be construed as a modern data warehouse.

todays type of data

Different volumes of data

One of the interesting things about the new types of data that have become available is that there is a very different volume of data associated with each of these types of data. There is much more textual data than there is transaction based data. And there is even more analog/IoT data than there is textual data.

The difference in volumes is measured in orders of magnitude. The difference in volumes of data is not a trivial thing.

The volume of data makes a big difference in how systems are built and configured. In addition, the volumes of data that are encountered entail the expenditure of capital.

All of these factors have played into the concept of the definition of a modern data warehouse.

relative volume of data

The relevancy to business

In addition to the significant differences in the volumes of data that come with the different sources of data, another relevant issue is the percentage of the data which has business value. When it comes to transaction based data, nearly all of the transaction based data has some business value. Some transaction based data has great business value; other transaction based data has limited business value. But in general most transaction based data has some amount of business value.

When it comes to textual data, some textual data has very great business value. But there is a lot of textual data that has no business value or business relevancy. When I ask my girlfriend for a date on Saturday night, there is no business relevancy in that exchange.

For analog/IoT data, there is a small fraction of analog/IoT data that has business value. But there is a tremendous amount of analog/IoT data that has no business value or relevancy. Because of this great disparity of business value across analog/IoT data, it is necessary to distill analog/IoT data and separate out the data that has business value and the data that does not. The analog/IoT data that has business value is placed in the modern data warehouse.

relative percentage of business value of data

Ingestion and absorption of data into the data warehouse

So how does data find its way into the modern data warehouse? Transaction data is usually loaded into the modern data warehouse through standard ETL. A lot of people try to use a technology called ELT, but ELT is a poor substitute for ETL because when people use ELT they conveniently forget to do the T, and without the transformation there is no data warehouse at all. Using ELT means that the consumer will ultimately be dissatisfied with what ELT produces.

People can load text into the modern data warehouse through textual ETL. Textual ETL was once a laborious, complex, and expensive task done through NLP. Today there is a commercialization of NLP called textual ETL. Unlike NLP, textual ETL is simple, fast and inexpensive.

Analog/IoT data is loaded into the modern data warehouse through distillation software. The distillation software separates the data that has no business relevance from the data that does. After distillation only the business relevant analog data finds its way into the data warehouse.
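What distillation looks like depends on the machine and the business. As one possible sketch, assume that a sensor reading is business relevant only when it falls outside a normal operating band. The function name distill, the band limits, and the sample readings are all invented for this example. Real distillation rules would come from the people who understand the equipment.

```python
def distill(readings, low, high):
    """Keep only readings outside the normal band [low, high]."""
    return [r for r in readings if not (low <= r["value"] <= high)]


# Hypothetical temperature readings from one machine.
readings = [
    {"sensor": "pump-1", "t": 0, "value": 70.1},
    {"sensor": "pump-1", "t": 1, "value": 70.3},
    {"sensor": "pump-1", "t": 2, "value": 96.8},  # the exception
    {"sensor": "pump-1", "t": 3, "value": 69.9},
]

relevant = distill(readings, low=60.0, high=80.0)
# Only the reading at t=2 survives and would be loaded into the warehouse.
```

Three of the four readings say nothing more than that the machine ran normally. Only the exception is carried forward.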

modern data warehouse

The modern data warehouse

So what does the modern data warehouse look like today? It is distributed. The modern data warehouse typically exists on multiple platforms. In a few cases the modern data warehouse still exists in a centralized form. But more often, the modern data warehouse is found in a decentralized form.

As such the modern data warehouse has more of a logical structure than a singular centralized structure. In order to control the distributed modern data warehouse, an analytical infrastructure is required.

modern datawarehouse

And what is the definition of the modern data warehouse? The modern data warehouse is –

  • Subject oriented
  • Integrated
  • Non-volatile
  • Time variant

collection of data in support of management’s decisions.

If that definition sounds familiar, it ought to sound familiar. It is the same definition that was created at the beginning of data warehouse. The definition of the data warehouse has not changed despite the fact that the technology, the types of technology, and the architecture of technology that support the data warehouse have drastically changed.

Data warehouse is still data warehouse even though the technical implementation of data warehouse has changed.

modern definition of datawarehouse

When will we no longer need the data warehouse?

One question that is asked is: when will data warehouse go away? Data warehouse will go away when people stop needing believable data. That is, data warehouse will go away when there is no longer a need for data that is accurate, complete, up to date, and easy to access.

Data warehouse is as essential to the making of good corporate decisions as oxygen and air are to life. When will oxygen not be needed? Oxygen will not be needed when there is no more life. But as long as there is life, oxygen and air are essential.

when will datawarehouse go away

The data lake

Recently some vendors have suggested that the analytical needs of the end user can be served by placing raw data into a data lake. What people find is that shortly the data lake turns into a data swamp or a data sewer. No one can find anything. No one knows what data means or is defined as. No one can relate one piece of data to another. The result of this confusion is that no one uses the data lake. The data lake just sits there. The data lake turns into a data swamp or a data sewer in short order.

What is needed is an analytical infrastructure in order to turn the data lake into something that is usable. When the data lake is turned into something that is useful it can be called a data lakehouse.

from data lake to data warehouse

The data warehouse as a foundation

Another issue associated with the data warehouse is the attempt of people to build all sorts of technology on top of data. Some of these technologies that build upon data are the data mesh, data marts, AI, ML, BI and others. Organizations that build these sophisticated tools on top of data warehouses enjoy great success because the data they operate on is vetted, believable data. But organizations that try to build these technologies on something other than a data warehouse build the foundation of their processing on sand. The first good wind or storm that comes up blows the technology over. Mush and loose sand do not make for a solid foundation for sophisticated technology.

firm foundation datawarehouse

Not building on a solid foundation is like the large apartment building in downtown San Francisco that is not built on bedrock. The 60-story building is falling over. You don’t want to be on Market Street or in Chinatown in San Francisco when the building topples. And you sure don’t want to be in the building itself.

So the underlying success of these technologies depends on operating on a foundation of reliable data.

The Data Stack

Many years ago, in the wild west of the United States, there were salesmen who sold general purpose tonics that purported to cure any ailment that a person might have. The conditions that could be cured included cancer, rheumatism, tuberculosis, asthma, and many others. Of course, these salesmen were con artists and their general purpose tonics cured nothing.

These people were known as “snake oil” salesmen.

Today there are people who try to redefine data warehouse by selling what is called the “modern data stack.” They use the word “modern” to imply that they know better what a data warehouse is and that they can actually produce results with their magical data stack.

So how can you tell if your modern data stack salesman is selling you snake oil or the real thing? Ask your data stack salesman to prove to you that his/her data stack can do several things.

The first thing the data stack must do is to integrate data. Vendors and consultants alike avoid the integration of data as if it were the plague. But data integration is an absolutely essential ingredient of any data warehouse, not just a modern one. If your data stack cannot do integration, then it is snake oil.

A second thing to ask your data stack salesman is if the data stack can read text and turn text into a data warehouse. Text and context must be included in the modern data warehouse. Text is a very legitimate aspect of a modern data warehouse and if your data stack cannot read text and find text and context and produce a data base that can be analyzed, then you are being sold snake oil.

A third aspect of a modern data warehouse is the inclusion of analog/IoT data. In order for analog/IoT data to be included into a data warehouse it first must be distilled. In order to be useful, analog/IoT data must be read, distilled, and organized for inclusion into a modern data warehouse. It is an essential ingredient of a modern data warehouse. If your data stack salesman cannot handle analog/IoT data properly, then you are being sold snake oil.

A fourth feature of a modern data warehouse is that of handling data lineage. Data lineage is as important as any other aspect of the modern data warehouse. The analyst never knows what he/she is dealing with unless they know all about data lineage and can support it in the data stack. If your data stack salesman cannot point out where data lineage is handled and incorporated into the modern data warehouse, then you are being sold snake oil.
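One simple way to picture data lineage is a value that carries a record of where it came from and what was done to it. The sketch below is only an illustration under that assumption. The class name Traced, the source label billing_app.invoice.amount, and the step descriptions are hypothetical, and real lineage tooling works at the level of tables and jobs rather than single values.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Traced:
    """A value together with the ordered list of steps that produced it."""
    value: object
    lineage: List[str] = field(default_factory=list)

    def apply(self, step, fn):
        """Return a new Traced value with `step` appended to its lineage."""
        return Traced(fn(self.value), self.lineage + [step])


# Hypothetical source field and transformation steps.
raw = Traced(" 1,250 ", ["extract: billing_app.invoice.amount"])
clean = raw.apply("strip whitespace", str.strip)
amount = clean.apply("parse integer", lambda s: int(s.replace(",", "")))

# amount.value is 1250, and amount.lineage lists the extract step
# followed by the two transformations, in order.
```

With the lineage attached, the analyst can answer the question of where a number came from without having to ask the person who built the load.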

The problem with snake oil salesmen is that they sell people what people believe to be a solution. Eventually people find out that they have not been sold a real solution and they blame the data warehouse. But it is the vendor of snake oil and the gullible, uneducated customer who are to blame, not the modern data warehouse. And in the meantime the snake oil salesman is off selling someone else a fake tonic.

Beware of the modern data warehouse data stack salesmen. Make them prove that they really aren’t selling snake oil. Buyer beware.

Bill Inmon

CEO at Forest Rim Technology

Bill Inmon is the CEO of Forest Rim Technology. Forest Rim builds technology for the disambiguation of text. Forest Rim can read text and turn the raw text into a standard data base.
