Data Governance for AI: The Foundation Under Everything Else
I have been doing data governance work for fifteen years, since before anybody wanted to put the word “AI” in front of it. The questions have not changed. Who owns this dataset. Where did it come from. Can we legally use it for this purpose. What do these fields actually mean. Are they current. Is “customer” the same customer the sales team thinks it is. Nobody asked me to be poetic about this work for a long time, and then 2024 happened and suddenly every CDO in Europe wanted a coffee.
What changed is the cost of getting the answers wrong. A wrong field in a quarterly report is embarrassing. A wrong field in a feature feeding a model that decides who gets a loan is a regulatory event. AI did not invent the discipline. It raised the stakes by about three orders of magnitude.
What you will learn
- Why data governance sits underneath risk, compliance, agentic and security (not next to them)
- The four data governance questions every AI system has to answer
- How the EU AI Act made data governance non-optional for high-risk systems
- A starter operating model that fits the AI use cases you actually have
Why this hub sits under all the others
The four hubs next to this one (AI Risk Management, AI Compliance and Regulation, Governing Agentic AI, and AI Security) all ride on something most people skip: you cannot govern a system whose inputs you do not govern. The model is the visible thing. The data is the load-bearing wall behind the drywall. You can hang a beautiful picture on a wall that is about to collapse; the picture will be on the floor in six months.
I keep walking into the same scene. A team has built a model, they are proud of it, the demo lands well, and the board wants to put it in front of customers. Then the legal team asks where the training data came from. Silence. Then the risk team asks what changes if the upstream feed changes shape next quarter. Silence. Then somebody from the data office asks whether the “customer” entity in the training set is the same “customer” the rest of the business uses. Long silence. None of these are AI questions. They are data governance questions that AI made loud.
This is the load-bearing-wall metaphor I keep returning to in client rooms. Risk frameworks, compliance programs, agent guardrails and security controls are the rooms upstairs. They are useful, they are visible, they are where the boardroom attention lands. Data governance is the wall in the basement holding all of it up. If the wall is rotten, the rooms above do not stand, no matter how elegant the furniture.
The four questions every AI system has to answer
The whole discipline reduces to four questions. I have rewritten them on whiteboards in three languages over fifteen years; they still hold.
Who owns it. Not who built the pipeline, who is accountable for the field being correct. The steward is the person whose phone rings when something is wrong. Without a named steward, every dataset is everyone’s problem, which means it is no one’s. For AI systems, the steward’s responsibility extends past the training set into every retrain and every drift alert.
Where did it come from. Lineage. Source system, transformations applied, consent basis if it is personal data, contractual basis if it is third-party. For traditional reporting, lineage is nice-to-have. For AI, Article 10 of the EU AI Act makes it the legal floor for high-risk systems: you must be able to describe the data, its provenance, its representativeness, its biases. No lineage, no defence.
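A minimal sketch of what a lineage record could look like, assuming a simple in-house structure. The class name, the field names and the lineage_gaps helper are all hypothetical, not taken from any standard or tool; the point is only that each lineage question above becomes a field that is either answered or flagged.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LineageRecord:
    """Hypothetical lineage record; field names are illustrative only."""
    dataset: str
    source_system: Optional[str] = None
    # None means "not documented"; an empty list means "no transformations applied".
    transformations: Optional[list] = None
    contains_personal_data: bool = False
    consent_basis: Optional[str] = None      # expected when personal data is present
    third_party: bool = False
    contractual_basis: Optional[str] = None  # expected when the data is third-party


def lineage_gaps(rec: LineageRecord) -> list:
    """Return the lineage questions this record cannot yet answer."""
    gaps = []
    if not rec.source_system:
        gaps.append("source_system")
    if rec.transformations is None:
        gaps.append("transformations")
    if rec.contains_personal_data and not rec.consent_basis:
        gaps.append("consent_basis")
    if rec.third_party and not rec.contractual_basis:
        gaps.append("contractual_basis")
    return gaps
```

A record with an empty gap list is not proof of good lineage; it only shows that somebody has written an answer down for each question.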
What does it mean. Semantics. The field called revenue in the data warehouse is not necessarily the field called revenue the model was trained on. A model trained on revenue_net_returns and served revenue_gross is a quietly broken model that will pass every accuracy check on the training set and fail in ways nobody can debug in production. Master Data Management is the unglamorous discipline that keeps “customer” meaning the same thing across train, serve and report.
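One way to catch the revenue_net_returns versus revenue_gross problem is to compare features by their semantic definition rather than by column name. The sketch below assumes each feature carries a definition identifier recorded at training time; the contract dictionary and the semantic_mismatches function are hypothetical names for illustration.

```python
# Hypothetical contract captured at training time: feature name -> definition it was trained on.
TRAINING_CONTRACT = {
    "revenue": "revenue_net_returns",
    "customer_id": "customer_master_v2",
}


def semantic_mismatches(training_contract, serving_schema):
    """Return features whose serving definition differs from the training definition.

    The result maps feature name -> (trained_on, served). A feature that is
    missing at serve time shows up with served=None.
    """
    mismatches = {}
    for feature, trained_on in training_contract.items():
        served = serving_schema.get(feature)
        if served != trained_on:
            mismatches[feature] = (trained_on, served)
    return mismatches
```

Run as a pre-deployment check, a non-empty result is a reason to stop and ask the steward, not something to patch around in the pipeline.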
Can we use it for this. Purpose. GDPR calls this purpose limitation; most other major data laws have their own version. Data collected to ship an order cannot freely be used to train a propensity model. Data licensed from a third party for analytics may not be licensed for model training. AI multiplies this risk because models memorise; what went in can leak back out at inference, a new and poorly understood failure mode that the legal teams I work with are still wrapping their heads around.
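Whether a use is permitted is a legal judgment, not something code can decide. What code can do is record the outcome of that judgment and refuse by default. A minimal sketch, assuming a hand-maintained register of approved purposes; the dataset names, purpose labels and the can_use function are hypothetical.

```python
# Hypothetical register: dataset -> purposes that legal has approved.
ALLOWED_PURPOSES = {
    "order_history": {"order_fulfilment"},
    "vendor_demographics": {"analytics"},
}


def can_use(dataset, purpose, allowed=ALLOWED_PURPOSES):
    """Return True only if the purpose is explicitly approved for the dataset.

    Unknown datasets and unlisted purposes are denied by default.
    """
    return purpose in allowed.get(dataset, set())
```

The deny-by-default choice is deliberate: a dataset nobody has reviewed should block a training job rather than slip through it.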
What the EU AI Act actually demands
Article 10 of the EU AI Act is the clause that turned data governance from a CDO best-practice into a regulatory floor. For high-risk AI systems (the ones in the Annex III list: hiring, credit, education, critical infrastructure, certain biometric uses), providers must build training, validation and testing data sets that meet specific quality criteria. The data must be relevant, representative, free of errors as far as possible, and complete with respect to the intended purpose. The data governance practices must address design choices, collection processes, preparation operations like labelling and cleaning, the assumptions baked in, prior bias assessments, and gaps or shortcomings that have been identified.
That is not a wish-list. That is a regulator with the power to fine you several percent of global turnover telling you what your data governance evidence pack has to contain before a model goes live. Most of the teams I talk to in 2026 are reading Article 10 for the first time and realising their current data governance practice was built for reporting, not for answering a regulator’s question about why a particular slice of training data was over-represented.
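The governance practices listed above can be tracked as a plain checklist. This is a sketch under my own assumptions: the item keys are my paraphrase of the list in this section, not the legal wording of the Act, and missing_evidence is a hypothetical helper.

```python
# Paraphrased from the practices named in this section; not the Act's legal text.
ARTICLE_10_EVIDENCE = (
    "design_choices",
    "collection_processes",
    "preparation_operations",  # labelling, cleaning and similar steps
    "assumptions",
    "bias_assessment",
    "identified_gaps",
)


def missing_evidence(evidence_pack):
    """Return checklist items with no document reference in the evidence pack.

    evidence_pack maps an item key to a document reference (for example a
    link or file id); empty or absent values count as missing.
    """
    return [item for item in ARTICLE_10_EVIDENCE if not evidence_pack.get(item)]
```

A pack that passes this check still has to be read by a human; the function only tells you which conversations have not started yet.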
The reference standards exist if you want to borrow rather than invent. ISO 8000 is the international data-quality standard series, with characteristics like syntactic accuracy, completeness, and provenance defined in language a regulator will accept. The DAMA-DMBOK (Data Management Body of Knowledge, version 2) is the operating-model reference: eleven knowledge areas, including Data Governance, Data Quality, Data Architecture, and Master and Reference Data, that map cleanly onto the questions Article 10 asks. Neither was written with AI in mind. Both apply directly. The vocabulary the AI regulators are reaching for is the vocabulary the data management profession has been refining since the early 2000s.
What changes when the consumer is a model
Most data governance programs were built for a consumer who reads the data. A human analyst reads a number, raises an eyebrow if it looks off, and asks a question. That feedback loop is gone when the consumer is a model. The model does not raise an eyebrow. It absorbs whatever you feed it and produces a confident output that looks the same whether the input was clean or contaminated.
Four things change in practice.
Quality has to be measured continuously, not periodically. A monthly data-quality report is fine for a monthly report. A model retrained nightly on yesterday’s data needs quality signals fast enough that a contaminated batch can be stopped before the retrain runs. The deeper treatment is in Data quality for AI.
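A minimal sketch of a quality gate that runs before a retrain, assuming batches arrive as lists of dictionaries. The function names and the thresholds (a 2% null rate, 1,000 rows) are placeholder assumptions, not recommended values.

```python
def batch_quality_issues(rows, required_fields, max_null_rate=0.02, min_rows=1000):
    """Return a list of human-readable problems; an empty list means the batch passes."""
    issues = []
    if len(rows) < min_rows:
        issues.append(f"row count {len(rows)} is below minimum {min_rows}")
    if not rows:
        return issues  # nothing further to measure on an empty batch
    for field_name in required_fields:
        nulls = sum(1 for row in rows if row.get(field_name) is None)
        null_rate = nulls / len(rows)
        if null_rate > max_null_rate:
            issues.append(
                f"{field_name}: null rate {null_rate:.1%} exceeds {max_null_rate:.1%}"
            )
    return issues


def should_retrain(rows, required_fields, **thresholds):
    """Gate the retrain: proceed only when the batch has no quality issues."""
    return not batch_quality_issues(rows, required_fields, **thresholds)
```

Null rates and row counts are the crudest possible signals. The design point is where the check sits: in front of the retrain, with the power to stop it.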
Lineage has to reach every feature. Not every dataset, every feature in the model. You will be asked to defend why a specific prediction was made, and the answer has to trace back through every transformation to the original source. Tooling has caught up with this in the last three years, but the discipline of tagging at the feature level is still rare. See Data lineage and provenance for AI training sets.
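Feature-level lineage is, at its simplest, a graph you can walk backwards. The sketch below assumes lineage is stored as a mapping from each node to its direct upstream inputs; the feature and source names and the trace_sources function are hypothetical.

```python
# Hypothetical lineage graph: node -> its direct upstream inputs.
LINEAGE = {
    "debt_to_income": ["monthly_debt", "monthly_income"],
    "monthly_debt": ["core_banking.loans"],
    "monthly_income": ["payroll_feed.salary"],
}


def trace_sources(feature, lineage):
    """Walk upstream from a feature and return the set of original sources.

    A node with no recorded inputs is treated as a source. The seen set
    guards against cycles in a badly maintained graph.
    """
    sources, stack, seen = set(), [feature], set()
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        parents = lineage.get(node, [])
        if parents:
            stack.extend(parents)
        else:
            sources.add(node)
    return sources
```

Note the weakness in the assumption: an undocumented intermediate table looks exactly like a source here, which is why the tagging discipline matters more than the traversal.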
Reference data becomes a single point of failure. The customer dimension, the product hierarchy, the regional rollup. If your model joins on these and they shift definition mid-quarter, your model silently degrades. Master Data Management was a 2010s buzzword; in the AI era it is the load-bearing dependency. See Master Data Management as an AI foundation.
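One cheap way to notice a mid-quarter definition shift is to fingerprint the reference table when the model is trained and compare it before each scoring run. A sketch assuming the reference data fits in a JSON-serialisable dictionary; the function names and the region mapping in the usage note are hypothetical.

```python
import hashlib
import json


def fingerprint(reference_table):
    """Return a stable SHA-256 hash of a JSON-serialisable reference table.

    Keys are sorted so that ordering differences do not change the hash.
    """
    canonical = json.dumps(reference_table, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def reference_data_changed(trained_fingerprint, current_table):
    """True when the current reference data differs from what the model was trained on."""
    return fingerprint(current_table) != trained_fingerprint
```

If a regional rollup such as {"DE": "DACH", "AT": "DACH"} is fingerprinted at training time and Austria later moves to another region, the comparison flags the change. It tells you that something changed, not whether the change matters; that call belongs to the steward.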
The pipeline becomes part of the model. The model is not just the weights; it is the weights plus the pipeline that produces the features at serve time. Run the same pipeline at train and serve time, or you get training-serving skew, where the features the model sees in production differ from the ones it learned on. It is the most common quiet failure I see in production. Governance of the pipeline matters as much as governance of the dataset. See Governing the data pipeline that feeds your models.
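The simplest defence against training-serving skew is a single feature function that both paths call. A minimal sketch; the raw field names, the feature names and all three functions are hypothetical, and the model is assumed to be any callable that accepts a feature dictionary.

```python
def build_features(raw):
    """Single feature definition shared by the training and serving paths."""
    income = raw["monthly_income"]
    return {
        # Guard against zero or missing income rather than dividing by it.
        "debt_to_income": raw["monthly_debt"] / income if income else None,
        "tenure_years": raw["tenure_months"] / 12,
    }


def training_rows(records):
    """Training path: features come from build_features, nowhere else."""
    return [build_features(record) for record in records]


def serve(record, model):
    """Serving path: same build_features call, then the model."""
    return model(build_features(record))
```

The governance rule that goes with it is that nobody reimplements build_features in SQL for the batch job or in another language for the API. One definition, one owner.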
A starter operating model
You do not need to rebuild your data governance practice. You need to extend the one you have to cover AI use cases on purpose. A practical sequence I run with clients:
- Inventory every dataset feeding a live or planned AI system. Most teams have a model inventory and a dataset inventory and they do not join up. Join them. One row per (dataset, model) pair, owner named, lineage status flagged.
- Tier the datasets by the tier of the model that consumes them. A model classed as high-risk under the EU AI Act drags its input data up to high-tier governance. The tiering logic mirrors the one in AI Risk Management: blast radius and reversibility.
- Run an Article 10 readiness review on every high-tier dataset. Representativeness, bias assessment, lineage, purpose basis. Document what you have, document what is missing, fix the missing.
- Wire the steward into the model lifecycle. The dataset steward must sign off retraining, must receive drift alerts, must be on the incident response chain when the model misbehaves. This is the join that most organisations skip.
- Run quarterly reviews. The data does not stand still, the models do not stand still, the regulations do not stand still. A governance program is not a project, it is a rhythm.
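The first two steps of the sequence above, the joined inventory and the tiering, can be sketched in a few lines. Everything here is an assumption for illustration: the low, medium and high tier labels, the record shapes, and the build_inventory and dataset_tiers functions.

```python
# Hypothetical tier labels, ordered from least to most governed.
TIER_ORDER = {"low": 0, "medium": 1, "high": 2}


def build_inventory(models, datasets):
    """Join the model inventory to the dataset inventory.

    Returns one row per (dataset, model) pair with owner and lineage status.
    Datasets missing from the dataset inventory get owner=None.
    """
    rows = []
    for model in models:
        for name in model["datasets"]:
            meta = datasets.get(name, {})
            rows.append({
                "dataset": name,
                "model": model["model"],
                "model_tier": model["tier"],
                "owner": meta.get("owner"),
                "lineage_documented": meta.get("lineage_documented", False),
            })
    return rows


def dataset_tiers(inventory):
    """Each dataset inherits the highest tier of any model that consumes it."""
    tiers = {}
    for row in inventory:
        name, tier = row["dataset"], row["model_tier"]
        current = tiers.get(name, tier)
        tiers[name] = max(current, tier, key=TIER_ORDER.__getitem__)
    return tiers
```

Rows where owner is None or lineage_documented is False are the work queue for the readiness review that follows.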
This is the same operating model that built the original AI governance framework at the GRC layer, scoped down to the data tier. The two should compose, not compete. Where I see teams stumble is treating data governance as a separate program from AI governance, which puts the steward and the model risk owner in different meetings and defeats the whole point.
The freeing thing about all of this is that the work is not new. The profession has been refining these answers for two decades. The job in 2026 is to apply on purpose what was already true, to a workload that no longer forgives the cut corners.


