If Big Data Is the New Crude, Data Virtualization Is the New Refinery
- by 7wData
Big data is like an abundant, expanding natural resource emerging from the modern data landscape. IoT (sensor), mobile, social, clickstream, web and open data are important contributors to the proliferation of data we’re witnessing today. Worldwide data is expected to increase tenfold by 2025—reaching a total of 163 ZB—according to a recent IDC-Seagate study.
Data is plentiful, but not necessarily useful in its raw, unrefined form. As with any natural resource, “crude” data must be refined before it can be harnessed for productive purposes, such as equipment maintenance, product innovation, competitive intelligence, marketing, data monetization and active health care. The refinement process can incorporate data exploration, preparation, correlation and contextualization, labeling and annotating, unification and integration, and application of security and governance policies. Metadata is also an important component, as it serves a role in both the input and output stages of the overall data-refinement process.
The extent to which data analysis contributes to unbiased conclusions, accurate predictions and insightful decision-making is constrained by the veracity of that data. If it hasn’t been provisioned for analysis, the data may suffer from fragmentation, minimal labeling and missing information. Such characteristics can be evident in electronic health records (EHRs), which illustrate the challenges of data refinement. One hurdle to gathering and analyzing EHR data is the scarcity of proper labeling and consistent semantics.
EHRs are designed primarily to fulfill patient-care, administrative and financial needs. The multipurpose objectives of EHRs—which don’t take into account data analysis per se—can create data fragmentation, which requires rectification before the data can be provisioned for analyses such as clinical research. Another challenge to building data sets from shared patient health records is the lack of standardization in how EHRs are implemented among health-care organizations, and even within the same health-care system. For example, distinct departments (e.g., radiology, orthopedics and internal medicine) in the same hospital may employ EHRs differently to satisfy their unique data-entry requirements, documentation and ordering needs, and preferences, thereby creating data silos.
Data security and privacy can also be impediments to analyzing regulated data, such as that in EHRs. The best approach to surmounting this obstacle is applying proper security and governance during the refinement process. Companies such as Google are experimenting with federated learning in their effort to advance analytics while ensuring privacy.
Data refinement is crucial to achieving reliable outcomes from data analysis, including meaningful conclusions, accurate predictions and informed decisions. Ideally, the process of refining raw data to produce complete and meaningful information does the following:
Modern analytics relies on data from myriad fragmented sources. Experience tells us that big data is not always amenable to replication and relocation, particularly when it is distributed across multiple systems. Data virtualization delivers the scale to work effectively with big data sources by offering an alternative paradigm: move the processing to the data.
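To make "move processing to the data" concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not how any particular data virtualization product works: two in-memory SQLite databases stand in for separate remote sources, and the `readings` table and the helper names `make_source` and `pushed_down_average` are hypothetical. Each source computes its own partial aggregate, so only a small summary row travels to the caller rather than the raw records.

```python
import sqlite3


def make_source(rows):
    """Create an in-memory SQLite database standing in for one remote source (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE readings (device TEXT, value REAL)")
    conn.executemany("INSERT INTO readings VALUES (?, ?)", rows)
    return conn


def pushed_down_average(sources):
    """Compute a global average by running the aggregate at each source."""
    total, count = 0.0, 0
    for conn in sources:
        # Only one (sum, count) row leaves each source; the raw rows stay put.
        partial_sum, partial_count = conn.execute(
            "SELECT COALESCE(SUM(value), 0), COUNT(*) FROM readings"
        ).fetchone()
        total += partial_sum
        count += partial_count
    # Combine the partial results; return None if no source holds any rows.
    return total / count if count else None


source_a = make_source([("pump-1", 10.0), ("pump-2", 20.0)])
source_b = make_source([("pump-3", 30.0)])
result = pushed_down_average([source_a, source_b])
print(result)
```

The design choice worth noting is that the average is assembled from per-source sums and counts rather than from per-source averages, which would give the wrong answer when the sources hold different numbers of rows.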