Data Isn’t ‘Truth’
- by 7wData
It has become perhaps the most important guiding principle of today’s world of data science: “data is truth.” The statisticians, programmers and machine learning experts who acquire and analyze the vast oceans of data that power modern society are seen as uncovering undeniable underlying “truths” about human society through the power of unbiased data and unerring algorithms. Unfortunately, data scientists themselves too often conflate their work with the search for truth and fail to ask whether the data they are analyzing can actually answer the questions they ask of it. Why can’t data scientists be more like physical scientists, who see not “universal truths” but rather “current consensus understanding”?
Given the sheer density of statisticians in the data sciences, it is remarkable how poorly the field adheres to statistical best practices like normalization and characterizing data before analyzing it. Programmers in the data sciences, too, tend to lack the deep numerical methods and scientific computing backgrounds of their predecessors, making them dangerously unaware of the myriad traps that await numerically-intensive codes.
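As a minimal sketch of what “characterizing data before analyzing it” can look like in practice, the snippet below inspects the distribution of each column and applies z-score normalization before any modeling. The dataset, column names and thresholds are all hypothetical, invented purely for illustration; they do not come from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset for illustration only.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),  # heavily right-skewed
    "age": rng.integers(18, 90, size=1000).astype(float),  # roughly uniform
})

# Characterize first: summary statistics, missing values, skew.
print(df.describe())
print("missing values per column:\n", df.isna().sum())
print("skewness per column:\n", df.skew())

# Only then normalize (z-score), so columns on wildly different
# scales become comparable in downstream analysis.
normalized = (df - df.mean()) / df.std()
```

Even this cursory pass surfaces facts an analysis should account for, such as the strong skew of a lognormal-style income column, before any conclusions are drawn from the numbers.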
Most importantly, however, somewhere along the way data science became about pursuing “truth” rather than “evidence.”
We see piles of numbers as containing indisputable facts, rather than as a constructed reality capturing just one possible interpretation.
In contrast, the hard sciences are about running experiments to collect evidence, building theories to describe that evidence and arriving at temporary consensus, together with the willingness to allow today’s understanding to be readily upended by new evidence or descriptive theories.
Most importantly, all evidence in the hard sciences is treated as suspect and tainted by the conditions of its collection, requiring triangulation and replication. This is in marked opposition to the data sciences' habit of relying on single datasets and failing to run even the most basic of characterization tests.
In the sciences, all knowledge is accepted to be temporary, based on the limitations of experimentation, simulation and current theories. Experiments are run to gather evidence to either confirm or contradict current theories. In turn, theories are adjusted to fit the current available evidence. Experiments that appear to strongly contradict existing understanding are subjected to extensive replication until the preponderance of evidence leaves no other available conclusion but that current theory must be amended to account for this new information.
Even basic “laws” are viewed not as dogmatic undisputed truth, but rather evidentiary understanding that has withstood all attempts to refute it, but which may eventually be replaced by new knowledge.
The hard sciences are replete with disagreements, novel experiments that contradict existing theories and competing theories without an obvious winner. Yet physicists and chemists do not speak of “truth” and “fiction”; they work to gather evidence for or against each possible explanation.
Most importantly, the hard sciences balance three things: evidence already gathered through experimentation, new experiments designed to gather currently unavailable evidence, and theory to explain it all.
In contrast, data science has increasingly become about making use of the easiest obtainable data, not the data that best answers the question at hand.
In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data.
Much like deep learning, the broader world of data science has been marred by its fixation on free data, rather than the best data. Look across the output of any major company’s data science division and one will find that most of their analyses are based on whatever data the company already has at hand or can obtain freely from the web or cheaply from vendors or itself.