Data Lineage and Provenance for AI Training Sets


In 2018 I sat in a room with a bank’s data team and asked a question that ruined the afternoon. Where did this training dataset come from. The lead engineer pointed at a CSV on a shared drive. Who put it there. Silence. Which extract was it from. Silence. Had the customers in it consented to this use. Longer silence. Everyone in the room knew the model worked. Nobody could tell me what it had been taught with. We went home, opened a ticket, and called it a “documentation backlog.” That was the polite word for it.

That backlog is now a regulator’s question. Article 10 of the EU AI Act says training, validation, and testing data for high-risk systems must be relevant, representative, free of errors as far as possible, and have the appropriate statistical properties for the intended purpose. You cannot answer any of that if you cannot trace each dataset to its source, its consent basis, its preparation steps, and its exclusions. Lineage was an audit nicety eight years ago. It is now the precondition for putting a high-risk system on the market.

What lineage actually covers

People hear “lineage” and picture a diagram with arrows. The arrows are the easy part. The work is in what each arrow has to remember.

A useful lineage record for a training set answers four questions, end to end:

  1. Source. Where did each row come from. Which operational system, which API, which third party, which scrape, which synthetic generator. With consent basis, license, and the date the snapshot was taken.
  2. Transformations. Every join, filter, enrichment, deduplication, type coercion, imputation, and normalisation between the source and the file the model trained on. Code, not prose; the SQL or the notebook cell, version-controlled.
  3. Splits. How the data was partitioned into train, validation, and test. Random with a seed, stratified by which column, time-based with which cutoff. The split is part of the experiment; an undocumented split is an unreproducible result.
  4. Exclusions. What was deliberately removed and why. The under-eighteen rows dropped for consent reasons. The geographic regions held out because the law forbids the use. The class with too few samples. Exclusions are where bias quietly enters; they need to be visible.
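One way to picture those four questions as a single record is sketched below. It is an illustration, not a standard schema: every class and field name here (`LineageRecord`, `code_ref`, `rows_removed`, and so on) is an assumption made for the example. The one design choice worth noting is that "not answered yet" and "answered: nothing was excluded" are kept distinct.

```python
import json
from dataclasses import asdict, dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Source:
    system: str          # operational system, API, third party, scrape, generator
    consent_basis: str
    license: str
    snapshot_date: date


@dataclass
class Transformation:
    step: str            # join, filter, dedup, imputation, ...
    code_ref: str        # pointer to version-controlled code, not a prose summary


@dataclass
class Split:
    method: str          # "random", "stratified", or "time-based"
    params: dict         # seed, stratification column, or cutoff


@dataclass
class Exclusion:
    rule: str            # what was removed
    reason: str          # why it was removed
    rows_removed: int


@dataclass
class LineageRecord:
    dataset: str
    sources: List[Source] = field(default_factory=list)
    # None means "nobody has answered yet"; an empty list means "answered: none".
    transformations: Optional[List[Transformation]] = None
    split: Optional[Split] = None
    exclusions: Optional[List[Exclusion]] = None

    def unanswered(self) -> List[str]:
        """Name the questions this record cannot answer yet."""
        gaps = []
        if not self.sources:
            gaps.append("source")
        if self.transformations is None:
            gaps.append("transformations")
        if self.split is None:
            gaps.append("splits")
        if self.exclusions is None:
            gaps.append("exclusions")
        return gaps

    def to_json(self) -> str:
        # default=str serialises the snapshot date as ISO text
        return json.dumps(asdict(self), default=str, indent=2)
```

A record whose `unanswered()` list is non-empty is a dataset that should not reach a training run yet.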

If those four are wired into the pipeline rather than written up after the fact, you have lineage. If they live in someone’s head and a Confluence page from last year, you have hope.
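"Wired into the pipeline" can be as small as a step that returns its own record alongside its output. The sketch below does that for a random split, using only the standard library. The function name, the 80/10/10 default, and the record layout are assumptions for illustration; a stratified or time-based split would record its column or cutoff instead of a seed.

```python
import random


def split_with_record(row_ids, seed, fractions=(0.8, 0.1, 0.1)):
    """Randomly partition row ids and return the partition plus its own record."""
    ids = sorted(row_ids)              # fix input order so the seed alone decides the result
    random.Random(seed).shuffle(ids)   # private generator, no global random state touched

    n_train = int(len(ids) * fractions[0])
    n_val = int(len(ids) * fractions[1])
    parts = {
        "train": ids[:n_train],
        "validation": ids[n_train:n_train + n_val],
        "test": ids[n_train + n_val:],   # remainder, so no row is silently dropped
    }
    record = {
        "method": "random",
        "seed": seed,
        "fractions": list(fractions),
        "sizes": {name: len(rows) for name, rows in parts.items()},
    }
    return parts, record
```

Because the record is produced by the same call that produces the split, there is no separate write-up to forget.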

The Datasheets for Datasets pattern

In 2018 Timnit Gebru and co-authors, several of them at Microsoft Research, published Datasheets for Datasets, a paper that borrowed an idea from the electronics industry. Every component you solder onto a circuit board ships with a datasheet: operating voltage, tolerance, failure modes, recommended use. Datasets, the authors argued, should ship with the same.

The proposal is a structured questionnaire that travels with a dataset for its whole life. Motivation (why was this collected, who funded it). Composition (what is in each instance, how was it sampled, are there protected characteristics). Collection process (how, by whom, with what consent). Preprocessing (what was cleaned, labelled, dropped). Uses (what is it appropriate for, what is it not). Distribution (who can access it, under what license). Maintenance (who fixes errors, how is it versioned).

It reads like a tedious checklist until you try to fill one out for an inherited dataset and discover you cannot answer half the questions. That moment is the point. The datasheet does not produce the answers; it surfaces the absence of them. Once a team adopts the pattern, datasets without datasheets get refused at the door, which is the cultural shift that makes the rest of governance possible.
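A minimal sketch of "refused at the door", assuming the seven sections named above are stored as free-text fields. The class, the exception, and the `admit` function are hypothetical names; the real questionnaire in the paper is far more detailed than one string per section.

```python
from dataclasses import dataclass, fields


@dataclass
class Datasheet:
    motivation: str = ""
    composition: str = ""
    collection_process: str = ""
    preprocessing: str = ""
    uses: str = ""
    distribution: str = ""
    maintenance: str = ""

    def missing_sections(self):
        """List the sections nobody has filled in (blank or whitespace only)."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


class DatasheetIncomplete(Exception):
    """Raised when a dataset arrives with unanswered datasheet sections."""


def admit(dataset_name, sheet):
    """Let a dataset through only if every datasheet section has an answer."""
    missing = sheet.missing_sections()
    if missing:
        raise DatasheetIncomplete(
            f"{dataset_name}: unanswered sections: {', '.join(missing)}"
        )
    return dataset_name
```

The check cannot judge whether an answer is honest, only whether it exists. That is still enough to surface the inherited dataset nobody can describe.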

The Datasheets pattern maps almost one to one onto the EU AI Act’s Article 10 requirements. A team that runs datasheets honestly is most of the way to a conformity assessment on the data-governance limb of the law.

Model cards as the consumer-facing companion

Where a datasheet documents the dataset, a Model Card (Mitchell et al., Google Research, 2018) documents the model that came out of it. Same spirit, different reader. The datasheet is for the data engineer evaluating whether to use the dataset. The model card is for the person about to deploy or rely on the model.

A good model card states the intended use, the out-of-scope use, the training data summary (with a link back to the datasheet, which is what makes the chain a chain), the evaluation data, the performance metrics broken down by relevant subgroups, the known limitations, and the ethical considerations. It is the artefact that lets a deployer answer the EU AI Act’s transparency and human-oversight obligations without re-deriving them. Pair a datasheet with a model card and you have a documentation chain a regulator can follow from raw source to deployed prediction.
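The link back is the part that is easiest to lose, so here is a small sketch of it. The field names follow the list above; the `training_dataset` key format and the `trace` and `weakest_subgroup` helpers are assumptions for illustration, not part of the Model Cards paper.

```python
from dataclasses import dataclass, field


@dataclass
class ModelCard:
    model: str
    intended_use: str
    out_of_scope_use: str
    training_dataset: str    # key of the dataset version the model was trained on
    evaluation_data: str
    metrics_by_subgroup: dict = field(default_factory=dict)  # {subgroup: {metric: value}}
    limitations: list = field(default_factory=list)
    ethical_considerations: str = ""


def trace(card, datasheets):
    """Follow the chain from a model card back to its dataset's datasheet."""
    if card.training_dataset not in datasheets:
        raise KeyError(
            f"{card.model}: no datasheet found for {card.training_dataset!r}"
        )
    return datasheets[card.training_dataset]


def weakest_subgroup(card, metric):
    """Return (subgroup, value) for the subgroup scoring lowest on one metric."""
    scores = {
        group: values[metric]
        for group, values in card.metrics_by_subgroup.items()
        if metric in values
    }
    group = min(scores, key=scores.get)
    return group, scores[group]
```

If `trace` raises, the chain is broken, and that is worth knowing before the regulator finds it.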

The two patterns are now table stakes in serious ML teams. They are not the whole governance answer, but they are the smallest unit of documentation that turns “trust me” into “read the card.”

Where lineage capture actually fails

I have walked into enough ML platforms to know the failure modes are boringly consistent.

The manual ETL gap. The pipeline orchestrator captures lineage automatically for the steps it runs. Then a data scientist downloads the output, opens a notebook, does three transformations locally, and uploads the result. Those three transformations exist nowhere but on the laptop. The lineage chain has a hole the size of a workstation.

Copy-pasted CSVs. A vendor sends a dataset by email. Someone drops it into the lake with no source metadata. Six months later it is feeding a model and nobody remembers it came from the vendor, let alone which contract permitted the use.

Renamed and re-derived datasets. The same source is pulled three times by three teams, each producing a slightly different cleaned version under a different name. The model trains on a blend; the blend has no parent.

Shadow joins. A “small enrichment” with a side dataset never gets logged, because nobody thought of the side dataset as training data. It is. Anything that influences the model’s parameters is training data; if you joined on it, it counts.

The remedy is not heroic. It is to make the captured path the easy path, and the uncaptured path the hard one. Block laptop downloads of production training data; provide a sanctioned notebook environment whose every cell is logged. Refuse datasets without source metadata at intake. Make the dataset registry the only way to reference a dataset by name in training code. Pipeline discipline, not new tooling. This is the same principle that holds in the rest of the data pipeline for AI, and it leans on the same foundations as the data quality work one step upstream.
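Refusing at intake does not need much machinery. A hedged sketch: the list of required fields below is an assumption, and each team would set its own, but the shape is a check that runs before a file is allowed into the lake.

```python
# Hypothetical minimum: what must be known about a file before it enters the lake.
REQUIRED_SOURCE_FIELDS = ("source", "license_or_contract", "consent_basis", "received_on")


class IntakeRefused(ValueError):
    """Raised when a dataset arrives without the source metadata it needs."""


def check_intake(filename, metadata):
    """Return a clean intake record, or refuse the file and say what is missing."""
    missing = []
    for key in REQUIRED_SOURCE_FIELDS:
        value = metadata.get(key)
        if value is None or not str(value).strip():
            missing.append(key)
    if missing:
        raise IntakeRefused(f"{filename}: missing source metadata: {', '.join(missing)}")
    # Keep only the agreed fields so the intake record has a predictable shape.
    return {"filename": filename, **{key: metadata[key] for key in REQUIRED_SOURCE_FIELDS}}
```

The emailed vendor CSV fails this check on the day it arrives, when someone still remembers the answers, rather than six months later.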


Tooling categories, briefly

Three categories of tool do most of the work in 2026, and they are complementary, not competitive.

[Figure: operational sources (APIs, DBs, vendors) → pipeline orchestrator (emits run metadata) → column-level lineage tool (parses SQL, traces fields) → dataset registry (versions, datasheets, access, retention) → training run (model card links back to dataset version).]

Column-level lineage tools. These parse SQL and pipeline code to trace which output columns came from which input columns through which transformations. They answer “if this source field changes, what breaks downstream” and the inverse, “which sources fed this training column.” Useful for impact analysis and for the technical documentation an auditor asks for.
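Once the column-to-column edges exist, both questions are graph walks. The sketch below skips the hard part, extracting edges from SQL, and assumes they are already available as tuples. The column names and transformation labels are invented for the example.

```python
from collections import defaultdict

# Hypothetical edges: (input column, output column, transformation).
EDGES = [
    ("crm.customers.birth_date", "staging.customers.age", "date_diff"),
    ("staging.customers.age", "train.features.age_bucket", "bucketise"),
    ("core.accounts.balance", "train.features.balance_log", "log1p"),
    ("crm.customers.id", "train.features.customer_id", "copy"),
]


def build_index(edges):
    """Index the edges in both directions."""
    parents, children = defaultdict(set), defaultdict(set)
    for src, dst, _transformation in edges:
        parents[dst].add(src)
        children[src].add(dst)
    return parents, children


def _walk(start, neighbours):
    """Collect every node reachable from `start`, excluding `start` itself."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        for nxt in neighbours.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def sources_of(column, edges):
    """Which original source columns fed this column?"""
    parents, _ = build_index(edges)
    # A root is an ancestor that has no recorded parents of its own.
    return {c for c in _walk(column, parents) if c not in parents}


def impacted_by(column, edges):
    """If this column changes, which downstream columns are affected?"""
    _, children = build_index(edges)
    return _walk(column, children)
```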

Pipeline-orchestrator metadata. Modern orchestrators emit run metadata as a first-class output: inputs, outputs, parameters, code version, runtime. Treat that metadata as data, store it, query it. Most teams already have most of what they need here; they just do not harvest it.
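"Store it, query it" can start with a single table. The sketch below uses SQLite from the standard library and assumes the orchestrator's run metadata has already been collected into plain Python values. The table layout and function names are illustrative and not tied to any particular orchestrator.

```python
import json
import sqlite3


def open_store(path=":memory:"):
    """Create (or open) a small SQLite table for harvested run metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS runs (
               run_id TEXT PRIMARY KEY,
               task TEXT, code_version TEXT,
               inputs TEXT, outputs TEXT, params TEXT,
               runtime_s REAL)"""
    )
    return conn


def record_run(conn, run_id, task, code_version, inputs, outputs, params, runtime_s):
    """Persist one run; lists and dicts are stored as JSON text."""
    conn.execute(
        "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?, ?)",
        (run_id, task, code_version,
         json.dumps(inputs), json.dumps(outputs), json.dumps(params), runtime_s),
    )
    conn.commit()


def runs_producing(conn, dataset):
    """Return every recorded run that lists `dataset` among its outputs."""
    rows = conn.execute(
        "SELECT run_id, task, code_version, inputs, outputs FROM runs ORDER BY run_id"
    ).fetchall()
    return [
        {"run_id": run_id, "task": task, "code_version": version,
         "inputs": json.loads(inputs)}
        for run_id, task, version, inputs, outputs in rows
        if dataset in json.loads(outputs)
    ]
```

Asking `runs_producing` for a training table returns the run, the code version, and the inputs that produced it, which is most of what a lineage question asks.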

Dataset registries. A catalogue of every dataset that can legally be used for training, with version, datasheet, owner, access list, retention policy, and a link to the lineage graph that produced it. The registry is what makes “use only registered datasets” an enforceable rule rather than a wish.
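An in-memory sketch of what "enforceable" means here: training code asks the registry for a dataset by name and version, and anything unregistered raises. All names are hypothetical, and a real registry would sit behind a service with access control rather than a dictionary. The content hash is an added assumption so that a silently modified file can be caught.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class RegisteredDataset:
    name: str
    version: str
    sha256: str          # fingerprint of the exact bytes that were registered
    owner: str
    datasheet_ref: str
    retention: str


class UnregisteredDataset(LookupError):
    """Raised when training code asks for a dataset the registry does not know."""


class DatasetRegistry:
    def __init__(self):
        self._entries = {}

    def register(self, name, version, content, owner, datasheet_ref, retention):
        key = (name, version)
        if key in self._entries:
            # A registered version is immutable; changed data needs a new version.
            raise ValueError(f"{name}@{version} is already registered")
        entry = RegisteredDataset(
            name, version, hashlib.sha256(content).hexdigest(),
            owner, datasheet_ref, retention,
        )
        self._entries[key] = entry
        return entry

    def resolve(self, name, version):
        """The only sanctioned way for training code to reference a dataset."""
        try:
            return self._entries[(name, version)]
        except KeyError:
            raise UnregisteredDataset(f"{name}@{version} is not registered") from None

    def verify(self, name, version, content):
        """Check that the bytes about to be trained on match what was registered."""
        return self.resolve(name, version).sha256 == hashlib.sha256(content).hexdigest()
```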

None of these three on its own gives you the evidence Article 10 asks for. Together, wired through the pipeline rather than bolted on the side, they do.

The honest position

Lineage in 2026 is not a documentation project. It is a precondition for shipping a high-risk system, and a precondition for trusting your own retraining decisions. Mise en place: get the ingredients labelled, weighed, and traceable before the cooking starts. The teams that wired this in quietly over the last two years are now boring to audit. The ones still treating it as a backlog will discover, the hard way, what an Article 10 inspection actually asks for.

Yves Mulkers

Yves Mulkers is the founder of 7wData and a widely followed voice in the data and AI community. He curates the 7wData and AI Beat newsletters, reaching hundreds of thousands of data and AI professionals, and writes on data strategy, analytics, AI, and the evolving data ecosystem.