Data Isn’t ‘Truth’

It has become perhaps the most important guiding principle of today’s world of data science: “data is truth.” The statisticians, programmers and machine learning experts who acquire and analyze the vast oceans of data that power modern society are seen as uncovering undeniable underlying “truths” about human society through the power of unbiased data and unerring algorithms. Unfortunately, data scientists themselves too often conflate their work with the search for truth and fail to ask whether the data they are analyzing can actually answer the questions they ask of it. Why can’t data scientists be more like physical scientists, who speak not of “universal truths” but of “current consensus understanding”?

Given the sheer density of statisticians in the data sciences, it is remarkable how poorly the field adheres to statistical best practices like normalizing data and characterizing it before analysis. Programmers in the data sciences, too, tend to lack the deep numerical methods and scientific computing backgrounds of their predecessors, making them dangerously unaware of the myriad traps that await numerically intensive code.
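As one illustration of what “characterizing data before analyzing it” can look like in practice, here is a minimal plain-Python sketch (the helper names `characterize` and `z_normalize` are hypothetical, not from any particular library): profile a column for missingness, spread and outliers before modeling, and apply standard z-score normalization.

```python
import statistics

def characterize(values):
    """Basic pre-analysis profile: size, missingness, spread, outliers."""
    present = [v for v in values if v is not None]
    mean = statistics.mean(present)
    stdev = statistics.stdev(present)
    return {
        "n": len(values),
        "missing": len(values) - len(present),
        "mean": mean,
        "stdev": stdev,
        "min": min(present),
        "max": max(present),
        # Flag values more than 3 standard deviations from the mean.
        "outliers": [v for v in present if abs(v - mean) > 3 * stdev],
    }

def z_normalize(values):
    """Z-score normalization: rescale to zero mean and unit variance."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [(v - mean) / stdev for v in values]
```

Running a profile like this on every column before modeling surfaces the missing values and outliers that would otherwise silently skew downstream results; it is a few lines of effort, yet it is exactly the step the essay argues is routinely skipped.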

Most importantly, however, somewhere along the way data science became about pursuing “truth” rather than “evidence.”

We see piles of numbers as containing indisputable facts rather than a constructed reality capturing just one possible interpretation.

In contrast, the hard sciences are about running experiments to collect evidence, building theories to describe that evidence and arriving at temporary consensus, together with the willingness to allow today’s understanding to be readily upended by new evidence or descriptive theories.

Most importantly, all evidence in the hard sciences is treated as suspect and tainted by the conditions of its collection, requiring triangulation and replication. This is in marked opposition to the data sciences’ habit of relying on single datasets and failing to run even the most basic of characterization tests.

In the sciences, all knowledge is accepted to be temporary, based on the limitations of experimentation, simulation and current theories. Experiments are run to gather evidence to either confirm or contradict current theories. In turn, theories are adjusted to fit the current available evidence. Experiments that appear to strongly contradict existing understanding are subjected to extensive replication until the preponderance of evidence leaves no other available conclusion but that current theory must be amended to account for this new information.

Even basic “laws” are viewed not as dogmatic undisputed truth, but rather evidentiary understanding that has withstood all attempts to refute it, but which may eventually be replaced by new knowledge.

The hard sciences are replete with disagreements, novel experiments that contradict existing theories and competing theories without an obvious winner. Yet physicists and chemists do not speak of “truth” and “fiction”; they work to gather evidence for or against each possible explanation.

Most importantly, the hard sciences balance the evidence gathered through existing experiments, new experiments designed to gather currently unavailable evidence, and theory to explain it all.

In contrast, data science has increasingly become about making use of the easiest obtainable data, not the data that best answers the question at hand.

In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data.

Much like deep learning, the broader world of data science has been marred by its fixation on free data rather than the best data. Look across the output of any major company’s data science division and one will find that most of its analyses are based on whatever data the company already has on hand or can obtain freely from the web or cheaply from vendors.
