Data Quality for AI: Garbage In, Garbage Out at Machine Speed
The first time I heard “garbage in, garbage out” was in a basement, in front of a beige terminal, from a man who ran a weekly sales report. If the input file was wrong on Monday, he caught it on Wednesday when the printout looked off, fixed the source on Thursday, and reran the report on Friday. The whole loop took a week. A bad row produced a bad number on a bad page, somebody noticed, and the world recovered.
That loop is gone. A model now reads the same bad row at three in the morning, scores ten thousand applicants against it before breakfast, sends decisions into a downstream system before the data team is awake, and posts the next batch into a dashboard the executive committee opens at nine. By the time anyone notices, the bad row is not a bad row. It is a week of bad decisions, half of them already acted on. The phrase did not change. The clock did.
The six dimensions, and what each one does to a model
Data quality has been described in six classical dimensions for longer than I have been in the field. They are not academic. Each one fails in a recognisable way once a model is downstream of it.
Accuracy. The value matches reality. When accuracy slips, the model learns a world that is not the one it has to act in, and it generalises the error confidently. A credit model trained on miscoded employment fields will not be a little wrong; it will be wrong in exactly the direction the miscoding pushed it.
Completeness. The value is there at all. Missing data does not just cost coverage. A model can treat absence as information. If “income” is missing more often for one neighbourhood than another, the model can learn the missingness pattern and reproduce the bias even after you impute, because imputation fills the gap without removing the pattern.
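One way to see whether missingness is carrying a signal is to measure it per group rather than per column. The sketch below is a minimal illustration, not a prescribed method; the function name, the row layout and the toy data are all hypothetical.

```python
from collections import defaultdict

def missing_rate_by_group(rows, group_key, field):
    """Return {group: share of rows where `field` is None or absent}."""
    totals = defaultdict(int)
    missing = defaultdict(int)
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row.get(field) is None:
            missing[group] += 1
    return {group: missing[group] / totals[group] for group in totals}

# Hypothetical toy data: income is missing more often in neighbourhood "B".
rows = [
    {"neighbourhood": "A", "income": 41000},
    {"neighbourhood": "A", "income": 38000},
    {"neighbourhood": "A", "income": None},
    {"neighbourhood": "A", "income": 52000},
    {"neighbourhood": "B", "income": None},
    {"neighbourhood": "B", "income": None},
    {"neighbourhood": "B", "income": 29000},
    {"neighbourhood": "B", "income": None},
]

rates = missing_rate_by_group(rows, "neighbourhood", "income")
for group, rate in sorted(rates.items()):
    print(f"{group}: {rate:.0%} of income values missing")
```

A single column-level completeness figure would report 50% here and hide the fact that the gap sits almost entirely in one group.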
Consistency. The same fact says the same thing in every system. The CRM says the customer is in Belgium, the billing system says France, the support tool says BE-FR. The model picks one, the agent acts on it, and you spend a quarter reconciling tickets.
Timeliness. The value is fresh enough to be true now. A feature store that lags two days is two days of stale predictions. For a fraud model, two days is a career.
Validity. The value sits inside the rules the schema promised. A date field that quietly accepts free text, a postcode field that accepts any string, a “country” column where someone has typed “EU”. The model swallows the invalid value and produces an output that looks confident and is structurally meaningless.
Uniqueness. Each entity appears once. Duplicate customers double-weight one person’s behaviour in the training set, then double-message them in production. The model is not wrong; the data told it the person mattered twice.
The reason these six dimensions matter more for AI than they did for the Wednesday-printout report is the blast radius. A 1% accuracy slip in a hand-reviewed report is one wrong line per hundred. A 1% accuracy slip in a model that scores a million records a day is ten thousand wrong decisions a day, some of which have already triggered downstream actions before lunch.
Quality SLA per dataset, not global
The mistake I keep walking into is the global data-quality target. “We aim for 99% accuracy across all our data.” It sounds responsible. It is meaningless. A 99% accuracy target on a marketing engagement table costs little and earns little. A 99% accuracy target on the table that feeds a credit model classed as high-risk under the EU AI Act is the difference between a defensible system and a regulator’s case study.
The pattern that works is the same one data governance has used for decades, now applied per dataset feeding a model:
- Name the dataset.
- Name the owner (a person, not a team).
- Name the consumer (which model, which decision, which blast radius).
- Set the quality bar per dimension, in numbers tied to that consumer.
- Set the freshness bar (how stale before it is unsafe).
- Set the action on breach (block, warn, degrade).
A fraud-feature table might carry an SLA of completeness above 99.5% and freshness under five minutes. A monthly marketing-segmentation table might live with completeness at 95% and freshness at 24 hours. Same company, same governance team, different SLAs because the consumers are different. This is the part the global-target crowd resists, because writing per-dataset SLAs is work. It is the work.
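The checklist above maps naturally onto a small record per dataset. What follows is a sketch under stated assumptions: the class and field names are invented for illustration, only completeness and freshness are modelled, the owner names are placeholders, and the breach actions assigned to each table are my guesses rather than anything the SLAs above specify.

```python
from dataclasses import dataclass
from datetime import timedelta
from enum import Enum

class BreachAction(Enum):
    BLOCK = "block"
    WARN = "warn"
    DEGRADE = "degrade"

@dataclass(frozen=True)
class DatasetSLA:
    dataset: str
    owner: str                 # a named person, not a team
    consumer: str              # which model, which decision
    min_completeness: float    # share of non-missing values, 0..1
    max_staleness: timedelta   # how stale before it is unsafe
    on_breach: BreachAction

def evaluate(sla, completeness, staleness):
    """Return the action to take on breach, or None if the dataset is within SLA."""
    breached = completeness < sla.min_completeness or staleness > sla.max_staleness
    return sla.on_breach if breached else None

# Thresholds taken from the two examples in the text; everything else is hypothetical.
fraud_sla = DatasetSLA(
    dataset="fraud_features",
    owner="placeholder.owner.one",
    consumer="fraud model, payment decisions",
    min_completeness=0.995,
    max_staleness=timedelta(minutes=5),
    on_breach=BreachAction.BLOCK,
)
marketing_sla = DatasetSLA(
    dataset="marketing_segments",
    owner="placeholder.owner.two",
    consumer="monthly segmentation model",
    min_completeness=0.95,
    max_staleness=timedelta(hours=24),
    on_breach=BreachAction.WARN,
)

print(evaluate(fraud_sla, completeness=0.990, staleness=timedelta(minutes=2)))
print(evaluate(marketing_sla, completeness=0.990, staleness=timedelta(hours=3)))
```

The same measured completeness of 99.0% breaches one SLA and comfortably passes the other, which is the whole argument for setting the bar per consumer.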
Profiling for the AI-specific concerns
Classical data profiling counts nulls, sniffs types, charts distributions. That is necessary. It is not enough for a dataset that will feed a model. Three AI-specific concerns sit on top of the classical pass.
Representativeness. Does the training set look like the population the model will see in production? If the historical loan data over-represents urban applicants and under-represents rural ones, the model will inherit that geography and call the gap a signal. Representativeness is not a number on the dataset; it is a comparison between the dataset and the live world the model is deployed into.
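Because representativeness is a comparison, the simplest check needs two inputs: the training set and a sample of live traffic. The sketch below compares category shares between the two. The function names and the urban/rural counts are made up for illustration, and a real check would cover more than one attribute.

```python
from collections import Counter

def shares(values):
    """Return {category: share of values} for a list of categorical values."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}

def share_gaps(train_values, live_values):
    """Return {category: live share minus training share}."""
    train, live = shares(train_values), shares(live_values)
    categories = set(train) | set(live)
    return {c: live.get(c, 0.0) - train.get(c, 0.0) for c in categories}

# Hypothetical counts: training data is 90% urban, live traffic is 70% urban.
train_region = ["urban"] * 90 + ["rural"] * 10
live_region = ["urban"] * 70 + ["rural"] * 30

gaps = share_gaps(train_region, live_region)
for category, gap in sorted(gaps.items()):
    print(f"{category}: {gap:+.0%} in production relative to training")
```

A positive gap means the group is more common in production than the model was trained to expect.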
Label correctness. Supervised models learn the labels you give them, errors included. Past some level of label noise, which depends on the task and the model, the model flips from “learning the task” to “learning the noise pattern”. Sampling a few hundred labelled rows by hand and re-checking them with a second annotator is one of the highest-leverage hours of work in the entire pipeline, and almost nobody schedules it.
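The re-check described above has two mechanical parts: drawing a reproducible sample, and scoring how often the second annotator agrees with the original label. The sketch below does both and also reports Cohen's kappa, which corrects raw agreement for chance. Kappa is my addition, not something the text asks for, and the function names and toy labels are hypothetical.

```python
import random
from collections import Counter

def draw_audit_sample(row_ids, k, seed=0):
    """Pick up to k distinct row ids, reproducibly, for a manual label re-check."""
    ids = list(row_ids)
    return random.Random(seed).sample(ids, min(k, len(ids)))

def agreement(original, recheck):
    """Return (observed agreement, Cohen's kappa) for two equal-length label lists."""
    n = len(original)
    observed = sum(a == b for a, b in zip(original, recheck)) / n
    c1, c2 = Counter(original), Counter(recheck)
    # Agreement expected by chance, given each annotator's label frequencies.
    expected = sum(c1[label] * c2[label] for label in set(c1) | set(c2)) / (n * n)
    kappa = 1.0 if expected == 1.0 else (observed - expected) / (1 - expected)
    return observed, kappa

# Hypothetical audit of ten rows; the second annotator disagrees on one.
original = ["good"] * 6 + ["bad"] * 4
recheck = ["good"] * 5 + ["bad"] * 5

sample_ids = draw_audit_sample(range(10_000), k=300)
observed, kappa = agreement(original, recheck)
print(f"audit size: {len(sample_ids)}, agreement: {observed:.0%}, kappa: {kappa:.2f}")
```

Disagreement between two annotators does not say which of them is right, so the rows where they differ still need adjudication.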
Drift baseline. The dataset today is the only fair baseline for the dataset tomorrow. If you do not profile the distributions (mean, variance, category mix) at the moment you train, you have nothing to compare against when production starts drifting. Drift detection without a baseline is just an alarm with no clock.
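Capturing that baseline can be as small as one function run at training time. The sketch below records mean, variance and category mix and serialises them to JSON so the snapshot can be stored next to the model artefact. The structure of the snapshot and the feature names are assumptions for illustration.

```python
import json
import statistics
from collections import Counter

def profile_numeric(values):
    """Summarise a numeric feature: count, mean and population variance."""
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "variance": statistics.pvariance(values),
    }

def profile_categorical(values):
    """Summarise a categorical feature as {category: share}."""
    total = len(values)
    return {category: n / total for category, n in Counter(values).items()}

def snapshot(numeric_features, categorical_features):
    """Build a JSON-serialisable baseline from {feature name: list of values}."""
    return {
        "numeric": {name: profile_numeric(v) for name, v in numeric_features.items()},
        "categorical": {name: profile_categorical(v) for name, v in categorical_features.items()},
    }

# Hypothetical training data with one numeric and one categorical feature.
baseline = snapshot(
    numeric_features={"amount": [10.0, 20.0, 30.0]},
    categorical_features={"channel": ["card", "card", "card", "transfer"]},
)
print(json.dumps(baseline, indent=2))
```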
These three sit alongside model risk validation; they do not replace it. Validation asks whether the model is fit for purpose. AI-specific profiling asks whether the data is fit to train one in the first place.
The cheap controls that catch 80%
A full data-quality programme is a multi-year exercise. You do not have multiple years. The good news is that three cheap controls catch most of what causes incidents, and they fit into any pipeline that already runs.
Schema validation at the door. Every dataset arriving at the boundary of the AI system gets validated against its declared schema: column presence, types, ranges, allowed values, nullability. If it fails, it does not enter. This catches the silent shape changes that cause most of the “the model started behaving strangely on Tuesday” tickets.
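A door check of this kind does not need a framework to get started. The sketch below validates one row against a declared schema covering the five properties listed above. The `Column` class, the `validate_row` function and the example schema are hypothetical, and a production pipeline would more likely lean on an existing validation library.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Column:
    dtype: type
    nullable: bool = False
    min_value: Optional[float] = None
    max_value: Optional[float] = None
    allowed: Optional[frozenset] = None

def validate_row(row, schema):
    """Return a list of violations for one row; an empty list means the row passes."""
    errors = []
    for name, col in schema.items():
        if name not in row:
            errors.append(f"{name}: column missing")
            continue
        value = row[name]
        if value is None:
            if not col.nullable:
                errors.append(f"{name}: null not allowed")
            continue
        if not isinstance(value, col.dtype):
            errors.append(f"{name}: expected {col.dtype.__name__}, got {type(value).__name__}")
            continue
        if col.min_value is not None and value < col.min_value:
            errors.append(f"{name}: {value} below minimum {col.min_value}")
        if col.max_value is not None and value > col.max_value:
            errors.append(f"{name}: {value} above maximum {col.max_value}")
        if col.allowed is not None and value not in col.allowed:
            errors.append(f"{name}: {value!r} not an allowed value")
    return errors

# Hypothetical schema for an applicant table.
schema = {
    "country": Column(str, allowed=frozenset({"BE", "FR"})),
    "age": Column(int, min_value=18, max_value=120),
    "income": Column(float, nullable=True),
}

good_row = {"country": "BE", "age": 34, "income": None}
bad_row = {"country": "EU", "age": "34"}

print(validate_row(good_row, schema))
print(validate_row(bad_row, schema))
```

Returning every violation rather than stopping at the first makes the rejection message useful to whoever owns the upstream table.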
Freshness checks on every input. A simple “this table was last updated at X” check, with X compared to the SLA. If the table is stale, the pipeline blocks or degrades to a documented fallback. The cost is one query per run. The catch rate is high, because stale-data incidents are common and they are the hardest to spot from output behaviour alone.
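As a sketch of the check, assuming the pipeline can obtain a timezone-aware last-updated timestamp for the table (how it gets one depends on the warehouse and is not shown): the function name and the two return values are illustrative choices.

```python
from datetime import datetime, timedelta, timezone

def freshness_gate(last_updated, max_staleness, now=None):
    """Return 'proceed' if the table is within its freshness bar, else 'fallback'."""
    now = now or datetime.now(timezone.utc)
    age = now - last_updated
    return "proceed" if age <= max_staleness else "fallback"

# Hypothetical run: a five-minute freshness bar, checked at a fixed point in time.
check_time = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = check_time - timedelta(minutes=3)
stale = check_time - timedelta(minutes=10)

print(freshness_gate(recent, timedelta(minutes=5), now=check_time))
print(freshness_gate(stale, timedelta(minutes=5), now=check_time))
```

Passing `now` explicitly keeps the check testable; in a live run it defaults to the current UTC time.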
Distribution sampling against a baseline. A nightly comparison of the feature distributions against the baseline snapshot taken at training time. Not statistical purity, just a useful nudge. If the mean of feature X has shifted by more than a defined band, a human looks. This is about the cheapest drift detector there is, and it catches the slow-drift class of failures that the other two miss.
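A minimal version of the nightly comparison, assuming the baseline mean and variance of the feature were recorded at training time. Expressing the band in baseline standard deviations, and the default of 0.5, are my assumptions; the right band is whatever the dataset's SLA says it is.

```python
import math
import statistics

def mean_shift_in_sd(baseline_mean, baseline_variance, current_values):
    """Return how far the current mean sits from the baseline mean, in baseline SDs."""
    current_mean = statistics.fmean(current_values)
    sd = math.sqrt(baseline_variance)
    if sd == 0:
        return 0.0 if current_mean == baseline_mean else math.inf
    return abs(current_mean - baseline_mean) / sd

def needs_review(baseline_mean, baseline_variance, current_values, band=0.5):
    """Flag the feature for a human look when the shift exceeds the band."""
    return mean_shift_in_sd(baseline_mean, baseline_variance, current_values) > band

# Hypothetical feature: baseline mean 100 and variance 400 (SD of 20).
quiet_night = [101.0, 99.0, 103.0]
drifted_night = [115.0, 125.0, 120.0]

print(needs_review(100.0, 400.0, quiet_night))
print(needs_review(100.0, 400.0, drifted_night))
```

This only watches the mean, so a change in spread or in category mix would pass unnoticed; those need their own comparison against the same baseline.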
There are more sophisticated tools: lineage-graph quality propagation, ML-based anomaly detection on quality metrics, ISO/IEC 8000 conformance audits. They earn their place once these three are running and the cheap incidents have stopped. ISO/IEC 8000 in particular is a useful reference for the vocabulary and the conformance pattern when you are ready to formalise. Until then, ship the three controls. They are the difference between machine-speed incidents and human-speed ones, which is the difference between a recoverable Tuesday and a board-level Wednesday.


