Data Isn’t ‘Truth’
- by 7wData
It has become perhaps the most important guiding principle of today’s world of data science: “data is truth.” The statisticians, programmers and machine learning experts who acquire and analyze the vast oceans of data that power modern society are seen as uncovering undeniable underlying “truths” about human society through the power of unbiased data and unerring algorithms. Unfortunately, data scientists themselves too often conflate their work with the search for truth and fail to ask whether the data they are analyzing can actually answer the questions they ask of it. Why can’t data scientists be more like physical scientists, who see not “universal truths” but rather “current consensus understanding”?
Given the sheer density of statisticians in the data sciences, it is remarkable how poorly the field adheres to statistical best practices like normalization and characterizing data before analyzing it. Programmers in the data sciences, too, tend to lack the deep numerical methods and scientific computing backgrounds of their predecessors, making them dangerously unaware of the myriad traps that await numerically-intensive codes.
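As a minimal sketch of what “characterizing data before analyzing it” can look like in practice, the snippet below inspects the distribution of each column and applies z-score normalization before any modeling. The dataset, column names and thresholds are all hypothetical, invented purely for illustration; they do not come from the article.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset for illustration only.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "income": rng.lognormal(mean=10, sigma=1, size=1000),  # heavily right-skewed
    "age": rng.integers(18, 90, size=1000).astype(float),  # roughly uniform
})

# Characterize first: summary statistics, missing values, skew.
print(df.describe())
print("missing values per column:\n", df.isna().sum())
print("skewness per column:\n", df.skew())

# Only then normalize (z-score), so columns on wildly different
# scales become comparable in downstream analysis.
normalized = (df - df.mean()) / df.std()
```

Even this cursory pass surfaces facts an analysis should account for, such as the strong skew of a lognormal-style income column, before any conclusions are drawn from the numbers.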
Most importantly, however, somewhere along the way data science became about pursuing “truth” rather than “evidence.”
We see piles of numbers as containing indisputable facts, rather than as a constructed reality capturing just one possible interpretation.
In contrast, the hard sciences are about running experiments to collect evidence, building theories to describe that evidence and arriving at temporary consensus, together with the willingness to allow today’s understanding to be readily upended by new evidence or descriptive theories.
Most importantly, all evidence in the hard sciences is treated as suspect and tainted by the conditions of its collection, requiring triangulation and replication. This is in marked opposition to the data sciences' habit of relying on single datasets and failing to run even the most basic of characterization tests.
In the sciences, all knowledge is accepted to be temporary, based on the limitations of experimentation, simulation and current theories. Experiments are run to gather evidence to either confirm or contradict current theories. In turn, theories are adjusted to fit the current available evidence. Experiments that appear to strongly contradict existing understanding are subjected to extensive replication until the preponderance of evidence leaves no other available conclusion but that current theory must be amended to account for this new information.
Even basic “laws” are viewed not as dogmatic undisputed truth, but rather evidentiary understanding that has withstood all attempts to refute it, but which may eventually be replaced by new knowledge.
The hard sciences are replete with disagreements, novel experiments that contradict existing theories and competing theories without an obvious winner. Yet physicists and chemists do not speak of “truth” and “fiction”; they work to gather evidence for or against each possible explanation.
Most importantly, the hard sciences balance three things: evidence already gathered through experimentation, new experiments designed to gather currently unavailable evidence, and theory to explain it all.
In contrast, data science has increasingly become about making use of the easiest obtainable data, not the data that best answers the question at hand.
In fact, much of the bias of deep learning comes from the reliance of the AI community on free data rather than paying to create minimally biased data.
Much like deep learning, the broader world of data science has been marred by its fixation on free data, rather than the best data. Look across the output of any major company’s data science division and one will find that most of their analyses are based on whatever data the company already has at hand or can obtain freely from the web or cheaply from vendors or itself.