Why AI Governance Starts With Data Governance

I have been doing this work for fifteen years, and I keep walking into the same room. Different logo over the door, different industry, sometimes a different language. Same room. There is a beautiful AI roadmap on the wall, a steering committee with the right titles, a slide that says “responsible AI” in a clean sans-serif. And somewhere down the hall, in an office nobody invited to the kickoff, a data steward is staring at a spreadsheet of three hundred datasets, half of them without an owner, asking the same question they asked me in 2014: who is responsible for this thing.

That is the disconnect. The AI governance program is being built two floors above the data governance reality. And the floor in between is load-bearing.

The questions did not change. The blast radius did.

Data governance has always asked four questions. Who owns this dataset. Where did it come from. What does it actually mean. When does it expire. These are not new questions. The Venetian merchants asked them about shipping manifests. The first relational databases asked them in the 1970s. DAMA published the DMBOK around them. They are the boring, foundational questions of the discipline.

AI did not add a question. AI made each of those four questions far more expensive to get wrong.

A model trained on unclassified data will leak classified data on demand. It does not care that the spreadsheet was supposed to stay internal; it has memorised the patterns and will hand them back to anyone who asks the right way. A model fed lineage-less training sets cannot answer the EU AI Act’s data-governance requirements, because the auditor’s first question is “where did this data come from” and silence is not an acceptable response. A model that consumed last year’s definition of “active customer” while the rest of the business moved to a new one will make decisions a quarter behind reality, every day, at machine speed.

This is the move I am asking you to make. Stop thinking of data governance as a parallel discipline to AI governance. Start thinking of it as the foundation AI governance stands on. One is the building. The other is the slab.

The 2018 nice-to-have is the 2026 precondition

In 2018, a data governance program was a smart investment. You could ship a BI dashboard, a forecast, even a recommender, on data that was mostly right, sort of owned, and roughly documented. The cost of getting it slightly wrong was a bad chart in a quarterly review. Awkward, not existential.

In 2026, the same data feeds a model that takes actions. The recommender became an agent. The dashboard became a workflow. The forecast became a price the customer sees. The blast radius of bad data is no longer a slide deck; it is a refund queue, a regulator letter, a press cycle. The same nice-to-have program from 2018 is now the precondition for shipping the AI system at all.

I am being deliberate with that word. Precondition. Not “best practice”, not “recommended”, not “highly encouraged”. You cannot meet EU AI Act Article 10 without a working answer to where each training, validation, and testing dataset came from, how it was prepared, and what biases it carries. You cannot carry out the Map function of the NIST AI RMF without an inventory of the data the system depends on. The regulator did not write a data governance rule and call it AI law. They wrote the law, and the law turns out to require the data governance program you were going to build “next year”.

What an AI program actually inherits from data governance

When I sit with a CIO and we lay out what the AI program needs from the data side, the list is unromantic and short.

Figure: data governance (the slab) underneath AI governance (the building). The slab has five parts:

Ownership: a named human per dataset

Lineage: where the data came from, through what hops

Quality: what ‘fit for purpose’ means, measured

Classification: what is sensitive, what is allowed where

Meaning: the definition the business agreed, not the column name

Each of these is older than the AI conversation. Each of them is now load-bearing in a way it was not before.

Ownership answers “who do I call when this model misbehaves because the data drifted.” Without it, the on-call ticket bounces between three teams for two days while the model keeps making decisions.

Lineage answers “where did this training set come from, and can we prove every record was lawful to use.” Lineage is the difference between an audit response that takes a week and one that takes a quarter. We go deeper in Lineage and Provenance for AI Systems.

Quality answers “is the data fit for the decision this model is about to make.” Fit for purpose is a moving definition; the model fit it once at training time, and quality monitoring is what tells you it has stopped fitting. The full mechanics live in Data Quality for AI.

Classification answers “is it lawful to put this data through this model, in this jurisdiction, for this use.” This is the single most under-built control in the programs I see. People classify documents and forget to classify training sets.
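Classifying training sets can start as a lookup that denies by default. The sketch below is a minimal illustration, not a reference design: the labels, the `ALLOWED_USES` table, the dataset names, and the `may_use` helper are all hypothetical, and it leaves out the jurisdiction dimension entirely.

```python
from typing import Optional

# Hypothetical policy table: classification label -> uses that label may feed.
ALLOWED_USES = {
    "public": {"analytics", "model_training", "model_inference"},
    "internal": {"analytics", "model_training"},
    "confidential": {"analytics"},
}


def may_use(classification: Optional[str], use: str) -> bool:
    """Deny by default: unclassified data (None) or an unknown label is never allowed."""
    if classification is None:
        return False
    return use in ALLOWED_USES.get(classification, set())


# Hypothetical training sets and their current labels (None = never classified).
training_sets = {
    "support_tickets_2024": "confidential",
    "product_catalog": "public",
    "scraped_forum_dump": None,
}

blocked = sorted(
    name for name, label in training_sets.items()
    if not may_use(label, "model_training")
)
print(blocked)  # ['scraped_forum_dump', 'support_tickets_2024']
```

The design choice that matters here is the default: a training set nobody classified is treated as blocked, not as harmless.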

Meaning is the one nobody talks about until it bites. The business agreed three years ago that “active customer” means transacted in the last 90 days. Marketing’s model uses 180. Finance’s report uses 365. The AI agent that decides churn outreach is now operating on a definition nobody signed off, in a context nobody reviewed.
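The gap between those three definitions is easy to reproduce. This is a small sketch under assumptions: the `is_active` function, the sample dates, and the single agreed constant are made up for illustration and do not describe any real system.

```python
from datetime import date, timedelta

# Hypothetical: the one window the business actually signed off on.
AGREED_ACTIVE_WINDOW_DAYS = 90


def is_active(last_transaction: date, as_of: date, window_days: int) -> bool:
    """Return True if the customer transacted within the last `window_days` days."""
    return (as_of - last_transaction) <= timedelta(days=window_days)


as_of = date(2026, 1, 31)
last_transaction = as_of - timedelta(days=120)  # last purchase was 120 days ago

# Same customer, same day, three answers depending on whose definition is in use.
verdicts = {
    "agreed (90)": is_active(last_transaction, as_of, AGREED_ACTIVE_WINDOW_DAYS),
    "marketing (180)": is_active(last_transaction, as_of, 180),
    "finance (365)": is_active(last_transaction, as_of, 365),
}
print(verdicts)
# {'agreed (90)': False, 'marketing (180)': True, 'finance (365)': True}
```

A customer the business considers lapsed is still "active" to two of the three consumers, and a model trained on either of those inherits the disagreement.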

Why sequencing matters

The argument I am making is not “data governance is more important than AI governance.” It is “data governance comes first in time, because AI governance has nothing to stand on without it.”

Sequence matters because the cost curve is brutal. Retrofitting ownership, lineage, quality, classification, and meaning onto an AI system already in production is two to five times the work of doing it before training. I have watched a team spend six months reconstructing the provenance of a training set after an auditor asked, because nobody captured it at the time. Had the provenance been captured at ingestion, those six months would have been a metadata column.
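What "a metadata column" can look like at ingestion, as a rough sketch: the `with_provenance` helper, the `_provenance` field name, and the sample source and legal-basis labels are assumptions for illustration, not a standard.

```python
import hashlib
import json
from datetime import datetime, timezone


def with_provenance(record: dict, source: str, legal_basis: str) -> dict:
    """Return a copy of `record` with a hypothetical `_provenance` column attached."""
    # Hash the original content so later audits can show the record was not altered.
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return {
        **record,
        "_provenance": {
            "source": source,
            "legal_basis": legal_basis,
            "ingested_at": datetime.now(timezone.utc).isoformat(),
            "content_sha256": hashlib.sha256(payload).hexdigest(),
        },
    }


row = with_provenance(
    {"customer_id": 42, "country": "BE"},
    source="crm_export_v3",   # hypothetical upstream system name
    legal_basis="contract",   # hypothetical label from your own scheme
)
print(row["_provenance"]["source"])
```

Nothing here is sophisticated, which is the point: the information is cheap to write down when the record arrives and expensive to reconstruct afterwards.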

Sequence also matters culturally. The data stewards who have been asking the foundational questions for years are the people best placed to operate the AI controls. If you build the AI governance program above their heads, you lose the muscle memory of the only team that already knows how to run an inventory, chase a missing owner, and survive a quarterly review without theatre. Promote them in. Do not bypass them.

The honest sequencing for 2026

If you are starting an AI governance program now, the order that works is this.

First, audit the data your highest-risk AI use cases actually depend on. Not the data you wish they depended on. Inventory it. Find the owner or appoint one. Trace the lineage as far back as you can in two weeks; what you cannot trace, flag and freeze. Classify it against your existing scheme, or build a scheme if you do not have one. Write down the agreed business meaning of every field the model uses.
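That first step can be tracked with very little tooling. The sketch below assumes a hand-built inventory; the `DatasetRecord` fields, the `gaps` helper, and the example datasets and owner are hypothetical, and a real program would pull this from a catalog instead.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class DatasetRecord:
    """One inventory entry for a dataset a high-risk AI use case depends on."""
    name: str
    owner: Optional[str] = None           # a named human, not a team alias
    lineage_traced: bool = False          # could we trace it within the time box?
    classification: Optional[str] = None  # label from the existing scheme
    model_fields: tuple = ()              # fields the model actually uses
    field_meanings: dict = field(default_factory=dict)  # agreed business meaning


def gaps(ds: DatasetRecord) -> list:
    """List what is still missing for this dataset."""
    found = []
    if not ds.owner:
        found.append("owner")
    if not ds.lineage_traced:
        found.append("lineage")
    if not ds.classification:
        found.append("classification")
    undefined = [f for f in ds.model_fields if f not in ds.field_meanings]
    if undefined:
        found.append("meaning: " + ", ".join(undefined))
    return found


inventory = [
    DatasetRecord(
        name="churn_training_2025",
        owner="j.peeters",  # hypothetical owner
        lineage_traced=True,
        classification="internal",
        model_fields=("active_customer",),
        field_meanings={"active_customer": "transacted in the last 90 days"},
    ),
    DatasetRecord(name="web_clicks_raw", model_fields=("session_id",)),
]

report = {ds.name: gaps(ds) for ds in inventory}
# What cannot be traced gets flagged and frozen.
frozen = [ds.name for ds in inventory if not ds.lineage_traced]
print(report)
print(frozen)  # ['web_clicks_raw']
```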

Second, wire the AI controls on top: the named owner, the audit log, the human checkpoint, and the documented purpose, the four controls the pillar article walks through. Each of them is cheaper and more honest when the data underneath is governed.

Third, and only then, write the policy. The reverse order, policy first and data second, is how organisations end up with the comforting fiction of an AI program built on data nobody owns.

Data governance is the slab. AI governance is the building. You do not pour the second before the first has set.

Yves Mulkers

Yves Mulkers is the founder of 7wData and a widely followed voice in the data and AI community. He curates the 7wData and AI Beat newsletters, reaching hundreds of thousands of data and AI professionals, and writes on data strategy, analytics, AI, and the evolving data ecosystem.