What is big data? Real-time analytics of disparate data at web scale
- by 7wData
Every day human beings eat, sleep, work, play, and produce data—lots and lots of data. According to IBM, the human race generates 2.5 quintillion (25 billion billion) bytes of data every day. That’s the equivalent of a stack of DVDs reaching to the moon and back, and encompasses everything from the texts we send and photos we upload to industrial sensor metrics and machine-to-machine communications.
That’s a big reason why “big data” has become such a common catch phrase. Simply put, when people talk about big data, they mean the ability to take large portions of this data, analyze it, and turn it into something useful.
But big data is much more than that. It’s about:
In the early days, the industry came up with an acronym to describe three of these four facets: VVV, for volume (the vast quantities), variety (the different kinds of data and the fact that data changes over time), and velocity (speed).
What the VVV acronym missed was the key notion that data did not need to be permanently changed (transformed) to be analyzed. That nondestructive analysis meant that organizations could both analyze the same pools of data for different purposes and could analyze data from sources gathered for different purposes.
By contrast, the data warehouse was purpose-built to analyze specific data for specific purposes, and the data was structured and converted to specific formats, with the original data essentially destroyed in the process, for that specific purpose—and no other—in what was called extract, transform, and load (ETL). Data warehousing’s ETL approach limited analysis to specific data for specific analyses. That was fine when all your data existed in your transaction systems, but not so much in today’s internet-connected world with data from everywhere.
However, don’t think for a moment that big data makes the data warehouse obsolete. Big Data systems let you work with unstructured data largely as it comes, but the type of query results you get is nowhere near the sophistication of the data warehouse. After all, the data warehouse is designed to get deep into data, and it can do that precisely because it has transformed all the data into a consistent format that lets you do things like build cubes for deep drilldown? Data warehousing vendors have spent many years optimizing their query engines to answer the queries typical of a business environment.
Big data lets you anayze much more data from more sources, but at less resolution. Thus, we will be living with both traditional data warehouses and the new style for some time to come.
To accomplish the four required facets of big data—volume, variety, nondestructive use, and speed—required several technology breakthroughs, including the development of a distributed file system (Hadoop), a method to make sense of disparate data on the fly (first Google’s MapReduce, and more recently Apache Spark), and a cloud/internet infrastructure for accessing and moving the data as needed.
Until about a dozen years ago, it wasn’t possible to manipulate more than a relatively small amount of data at any one time. (Well, we all thought our data warehouses were massive at the time. The context has shifted dramatically since then as the internet produced and connected data everywhere.) Limitations on the amount and location of data storage, computing power, and the ability to handle disparate data formats from multiple sources made the task all but impossible.
Then, sometime around 2003, researchers at Google developed MapReduce.
[Social9_Share class=”s9-widget-wrapper”]
Upcoming Events
From Text to Value: Pairing Text Analytics and Generative AI
21 May 2024
5 PM CET – 6 PM CET
Read More