Data Science Demystified: The Data Modeling Proposition

Data science is largely an enigma to the enterprise. Although there’s an array of self-service options to automate its various processes, the actual work performed by data scientists (and how it’s achieved) is still a mystery to the average business user or C-level executive.
Data modeling is the foundation of this discipline, which is responsible for the adaptive, predictive analytics that are so critical to the current data ecosystem. Before data scientists can refine cognitive computing models or build applications with them to solve specific business problems, they must reconcile differences in data models so that different types of data can be used for a single use case.
Because statistical Artificial Intelligence deployments like machine learning intrinsically require huge quantities of data from diverse sources for optimum results, simply getting such heterogeneous data to conform to a homogeneous data model has been one of the most time-honored, and time-consuming, tasks in data science.
Contemporary developments in data modeling are automating this crucial aspect of data science. By combining technologies revolving around cloud computing, knowledge graphs, machine learning, and Natural Language Processing (NLP), organizations can automatically map even highly varied data to a common data model and drastically accelerate this part of data science without writing code.
According to Lore IO CEO Digvijay Lamba, “By predefining the model [it’s] possible for AI to run and kind of repeatedly map disparate data into the model. In the end that leads to a significantly lower cost compared to building it in-house with data integration, data unification infrastructure.”
Most of all, leveraging pre-built, common data models drastically decreases the time spent on this data science requisite, helping shift this discipline’s focus towards tuning the models empowering statistical AI, instead of modeling their training datasets.
The cloud is indispensable to rapidly deploying a common data model for the range of schemas and data types found in machine learning training data. By effectively renting a common data model via Software-as-a-Service, organizations can avail themselves of serverless computing options to further decrease their data science overhead. Once their data is replicated to a cloud object store (an S3 bucket, for example), they can specify their requirements for mapping this data to a predefined model that reconciles differences in schema, data structure, format, and other points of distinction.
Lamba mentioned that such horizontal models are characterized by “a lot of rich depth to what the attributes are, the entities are, [and] the relationships are.” Business rules about data requirements for particular use cases are the means by which the underlying system “uses a no-code UI to automatically map the company’s data into the target model,” Lamba said.
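To make the idea of mapping differently shaped sources into one predefined model more concrete, here is a minimal Python sketch. It is not how Lore IO or any other vendor implements the mapping; the Customer entity, the two source layouts, and every field name are hypothetical, and the mappings are written by hand, whereas the platforms described here infer them automatically.

```python
from dataclasses import dataclass
from datetime import date, datetime


@dataclass
class Customer:
    """Hypothetical entity in the predefined (common) data model."""
    customer_id: str
    full_name: str
    signup_date: date


# One mapping per source: target attribute -> (source field, converter).
# Both source layouts are invented for illustration.
CRM_MAPPING = {
    "customer_id": ("CustID", str),
    "full_name": ("Name", str.strip),
    "signup_date": ("Created", lambda v: datetime.strptime(v, "%m/%d/%Y").date()),
}
BILLING_MAPPING = {
    "customer_id": ("account_no", str),
    "full_name": ("holder", str.title),
    "signup_date": ("opened_on", date.fromisoformat),
}


def to_common_model(record, mapping):
    """Rename and convert one source record into the common Customer entity."""
    values = {
        target: convert(record[source])
        for target, (source, convert) in mapping.items()
    }
    return Customer(**values)


# The same customer as it might appear in two differently structured systems.
crm_row = {"CustID": 1042, "Name": "  Ada Lovelace ", "Created": "03/15/2021"}
billing_row = {"account_no": "1042", "holder": "ada lovelace", "opened_on": "2021-03-15"}

from_crm = to_common_model(crm_row, CRM_MAPPING)
from_billing = to_common_model(billing_row, BILLING_MAPPING)
```

Once both records conform to the same model, they compare as equal despite their different field names, types, and date formats, which is the property a downstream training dataset depends on.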
Natural language technologies (specifically NLP and Natural Language Querying) are an integral component of rapidly mapping differentiated data elements to a common model. These approaches enable users to tailor the unified model according to their own rules to perform what Lamba termed “declarative modeling.” As he put it, “The idea here is you describe these rules in your own language; you don’t worry about the data.” Competitive solutions in this space rely on machine learning to iteratively improve the mapping of that language to the various data elements, arranged in a semantic knowledge graph, that pertain to any of the model’s chosen attributes.
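As a rough illustration of resolving a user's own wording to an attribute of the common model, the sketch below uses plain string similarity from Python's standard library. This is a deliberately simple stand-in for the NLP, machine learning, and knowledge graph techniques described above, not a representation of them; the attribute names and their labels are hypothetical.

```python
import difflib

# Hypothetical model attributes, each with a few human-readable labels.
# A real system would hold these in a semantic knowledge graph.
MODEL_ATTRIBUTES = {
    "customer.full_name": ["customer name", "full name", "client name"],
    "customer.signup_date": ["signup date", "registration date", "date joined"],
    "order.total_amount": ["order total", "total amount", "purchase amount"],
}


def resolve_attribute(phrase, cutoff=0.6):
    """Return the model attribute whose label best matches the phrase, or None."""
    label_to_attr = {
        label: attr
        for attr, labels in MODEL_ATTRIBUTES.items()
        for label in labels
    }
    # String similarity is an assumption made for this sketch; it stands in
    # for the learned language-to-attribute mapping the article describes.
    matches = difflib.get_close_matches(
        phrase.lower().strip(), list(label_to_attr), n=1, cutoff=cutoff
    )
    return label_to_attr[matches[0]] if matches else None


name_attr = resolve_attribute("Customer Name")
date_attr = resolve_attribute("registration dates")
unknown_attr = resolve_attribute("zzz")
```

The user states a rule in familiar terms and the system decides which attribute is meant. A phrase that matches nothing returns None, which a production system would presumably surface for human review rather than guess.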


