Governing the Data Pipeline That Feeds Your Models
Most of the AI failures I have watched started in the pipeline, not the model. The model is the part on the slide. The pipeline is the part nobody wants to put their name on. It is also the leaky part: the place where a schema changed at 2 a.m. and nobody noticed, where the upstream system shipped null where it used to ship a string, where a freshness lag turned a real-time score into a stale guess. The model did what it was told. The pipeline told it something different from yesterday.
I keep meeting AI programs that have a model risk function, a compliance function, and a data governance committee, and zero named owner for the pipeline that connects all three. That gap is where the incidents happen. This article is about closing it.
What governance on a pipeline actually means
People hear “pipeline governance” and picture a spreadsheet of DAGs. That is inventory, not governance. Governance on a pipeline is four things that have to be in place before the pipeline is allowed to feed a production model.
Schema contracts. A written, machine-checked promise about the shape of the data each stage produces: columns, types, nullability, allowed values, primary key. The contract is enforced at the boundary, not in a wiki. If the upstream breaks the contract, the pipeline fails closed instead of quietly passing the breakage downstream. dbt’s model contracts and Great Expectations both do a serviceable job here; the tool matters less than the discipline of declaring the shape before the data is consumed.
Freshness SLAs. A declared answer to “how stale is too stale” for every dataset the model touches. A churn model trained on data refreshed monthly can probably tolerate a day of lag. A fraud model cannot tolerate ten minutes. The SLA is per dataset, not per pipeline, and it is alerted on.
Quality gates inline. Volume, distribution, null rates, range checks, referential integrity. The gates run on every batch, not on a quarterly audit. A batch that fails a gate stops; it does not get silently replaced by yesterday’s batch and shipped under the same name.
Lineage by default. Every dataset knows what produced it and what consumes it, captured automatically by the orchestrator, not by a human filling in a form six months later. OpenLineage gives this a vendor-neutral spec; Airflow ships a provider for it, and integrations of varying maturity exist for other orchestrators, Dagster among them, so check what yours supports and turn it on. Lineage is the answer to the only question that matters during an incident: what else is wrong because this is wrong? The deep dive on this is in Data Lineage for AI.
Four controls. None of them exotic. All of them missing in most of the pipelines I have audited in the last year.
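The lineage question above, what else is wrong because this is wrong, reduces to a graph traversal once the edges are captured. The following is a minimal sketch, not how any particular orchestrator stores lineage: the dataset names and the `LINEAGE` mapping are hypothetical, and a real system would read the edges from its lineage backend rather than from a hand-written dictionary.

```python
from collections import deque

# Hypothetical lineage edges: each dataset maps to the datasets that consume it.
LINEAGE = {
    "raw.orders": ["staging.orders"],
    "staging.orders": ["features.customer_spend", "reports.daily_revenue"],
    "features.customer_spend": ["models.churn_v3"],
    "reports.daily_revenue": [],
    "models.churn_v3": [],
}


def downstream_of(dataset, edges):
    """Return every dataset reachable from `dataset`, in breadth-first order."""
    seen = set()
    order = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for consumer in edges.get(current, []):
            if consumer not in seen:
                seen.add(consumer)
                order.append(consumer)
                queue.append(consumer)
    return order


# Everything that may be wrong if staging.orders is wrong.
impacted = downstream_of("staging.orders", LINEAGE)
print(impacted)
# → ['features.customer_spend', 'reports.daily_revenue', 'models.churn_v3']
```

Breadth-first order lists the nearest consumers first, which is usually the order you want to check them in during an incident.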
Data contracts: the upstream-downstream agreement
The most useful idea to enter data engineering in the last five years is the data contract, and it is the cheapest control on this list. A contract is a written promise between the team that produces a dataset and the teams that consume it. It says: this is the shape, this is the meaning, this is the freshness, this is what you can rely on, this is what changes require notice.
Before contracts, every breaking change was a surprise. An upstream team renames a column because nothing in their world depends on the old name; downstream, a feature pipeline silently produces nulls for two weeks; downstream of that, a model retrains on bad features and ships a worse version of itself; downstream of that, a recommendation feed drops conversion by a measurable amount and nobody connects the dots back to the rename. I have walked into that exact incident three times in different sectors. The fix was the same every time: a contract, enforced at the publish boundary, that would have failed the rename instead of letting it through.
Contracts work because they put the upstream team’s name on the promise. Without one, the consuming team carries all the risk of an upstream change they cannot see coming. With one, the producing team owns the breakage they cause. That is governance at the right altitude: the cost of a change lands on the person making the change, not on three teams downstream who find out from a dashboard turning red.
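To make "enforced at the publish boundary" concrete, here is a minimal sketch in plain Python. It stands in for what a tool such as dbt or Great Expectations would do and does not use either tool's API. The `Column` class, the `ORDERS_CONTRACT` list, and the column names are hypothetical. The point is the behavior: a renamed column raises before anything is published, instead of turning into nulls downstream.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    nullable: bool = False


class ContractViolation(Exception):
    """Raised when a batch does not match the declared contract."""


# Hypothetical contract the producing team has signed up to.
ORDERS_CONTRACT = [
    Column("order_id", int),
    Column("customer_email", str),
    Column("coupon_code", str, nullable=True),
]


def enforce_contract(rows, contract):
    """Return rows unchanged if every row matches the contract, else raise."""
    expected = {col.name for col in contract}
    for i, row in enumerate(rows):
        actual = set(row)
        if actual != expected:
            missing = sorted(expected - actual)
            extra = sorted(actual - expected)
            raise ContractViolation(f"row {i}: missing={missing} unexpected={extra}")
        for col in contract:
            value = row[col.name]
            if value is None:
                if not col.nullable:
                    raise ContractViolation(f"row {i}: {col.name} is null")
            elif not isinstance(value, col.dtype):
                raise ContractViolation(
                    f"row {i}: {col.name} is {type(value).__name__}, "
                    f"expected {col.dtype.__name__}"
                )
    return rows


# The rename scenario: upstream now ships `email` instead of `customer_email`.
renamed_batch = [{"order_id": 1, "email": "a@example.com", "coupon_code": None}]
try:
    enforce_contract(renamed_batch, ORDERS_CONTRACT)
except ContractViolation as exc:
    print(f"publish blocked: {exc}")
```

The check fails closed by raising, so the publish step needs no extra logic to stop; the producer sees the failure in their own job, not in a consumer's dashboard.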
Pipeline observability: the four signals to watch
You cannot govern what you cannot see, and you cannot watch a pipeline by tailing logs. Modern pipeline observability has settled on four signals, and these are the minimum I expect to see instrumented before a pipeline is allowed near a model.
Volume. How many rows arrived this batch, compared to the rolling baseline. A pipeline that usually moves a million rows and moved ten thousand has a problem the orchestrator did not raise, because technically the job succeeded. Volume is the cheapest early warning there is.
Freshness. How old is the newest row, compared to the SLA? “The job ran on time” means nothing if the upstream stopped writing two hours ago.
Distribution. Are the numeric ranges, categorical mixes, and null rates inside the historical envelope? A revenue column that suddenly shows ninety percent zeros is not a model bug; it is an extraction bug, and you want to know before the model retrains on it.
Schema drift. Did a column appear, disappear, change type, or quietly change meaning? The contract from the previous section is what gives this signal teeth; without it, you detect drift but cannot decide whether it is allowed.
Four signals, watched on every batch, alerted on when they cross a threshold. None of this is research. All of it is in the open-source toolkit already; you just have to turn it on and decide who gets paged.
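Two of these signals, volume and schema drift, fit in a few lines each. The sketch below is illustrative rather than a recommended detector: the three-sigma threshold, the function names, and the sample numbers are all assumptions, and a structural diff like this cannot see a column that keeps its name and type but quietly changes meaning.

```python
from statistics import mean, stdev


def volume_alert(history, current, max_sigma=3.0):
    """True when `current` is more than `max_sigma` standard deviations
    from the mean of `history` (which needs at least two batches)."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > max_sigma


def schema_drift(previous, current):
    """Compare two {column_name: type_name} mappings."""
    shared = set(previous) & set(current)
    return {
        "added": sorted(set(current) - set(previous)),
        "removed": sorted(set(previous) - set(current)),
        "retyped": sorted(c for c in shared if previous[c] != current[c]),
    }


# Hypothetical rolling baseline of roughly a million rows per batch.
recent_counts = [1_000_000, 980_000, 1_020_000, 995_000, 1_005_000]
print(volume_alert(recent_counts, 10_000))     # → True: the job "succeeded" but moved 1% of the data
print(volume_alert(recent_counts, 1_010_000))  # → False: within normal variation

yesterday = {"order_id": "int", "amount": "float", "customer_email": "str"}
today = {"order_id": "int", "amount": "str", "email": "str"}
print(schema_drift(yesterday, today))
# → {'added': ['email'], 'removed': ['customer_email'], 'retyped': ['amount']}
```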
The cheap controls that catch most failures
If I had to pick the three controls that catch the majority of the pipeline-rooted model failures I have seen, none of them would be sophisticated.
A row-count and null-rate alert on every input table. Five lines of SQL, a scheduled check, a Slack channel. This single control would have caught most of the silent-corruption incidents in my last twelve months of audits.
A schema-diff check between yesterday and today. Block the pipeline if the schema changed and was not approved. The approval can be one engineer typing yes. The point is that no schema change reaches the model without a human pausing for thirty seconds.
A freshness alarm tied to the SLA per dataset. If the newest row is older than the SLA, page someone. Half of the “the model is broken” tickets I get pulled into turn out to be “the upstream stopped writing,” and the model team learned about it from a business stakeholder rather than from their own monitoring.
Three controls. A weekend of work to wire up. They will not catch everything; they will catch most of what actually hurts. Then you go after the harder failure modes with Data Quality for AI.
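As a rough illustration of how small the first and third controls are, the sketch below runs one SQL query for row count, null rate, and newest row, then compares the results to thresholds. Everything specific here is an assumption: the `customers` table, its `email` and `loaded_at` columns, the thresholds, and the use of an in-memory SQLite database as a stand-in for your warehouse. Delivery of the alert (Slack, pager) is left out.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

# One query covers row count, null rate, and freshness for a hypothetical table.
CHECK_SQL = """
SELECT
    COUNT(*) AS row_count,
    AVG(CASE WHEN email IS NULL THEN 1.0 ELSE 0.0 END) AS email_null_rate,
    MAX(loaded_at) AS newest_row
FROM customers
"""


def check_table(conn, now, min_rows, max_null_rate, freshness_sla):
    """Return a list of human-readable problems; an empty list means healthy."""
    row_count, null_rate, newest = conn.execute(CHECK_SQL).fetchone()
    problems = []
    if row_count < min_rows:
        problems.append(f"row count {row_count} below minimum {min_rows}")
    # AVG returns NULL on an empty table, so guard against None.
    if null_rate is not None and null_rate > max_null_rate:
        problems.append(f"email null rate {null_rate:.0%} above {max_null_rate:.0%}")
    if newest is None or now - datetime.fromisoformat(newest) > freshness_sla:
        problems.append(f"stale: newest row is {newest}")
    return problems


# Demo data: four rows, half with null emails, loaded three hours ago.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER, email TEXT, loaded_at TEXT)")
now = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
loaded = (now - timedelta(hours=3)).isoformat()
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "a@example.com", loaded), (2, None, loaded),
     (3, None, loaded), (4, "d@example.com", loaded)],
)

problems = check_table(conn, now, min_rows=3, max_null_rate=0.1,
                       freshness_sla=timedelta(hours=1))
for problem in problems:
    print(problem)
```

With this demo data the check reports two problems, the null rate and the staleness, while the row count passes. Timestamps are stored as ISO 8601 strings in UTC so that `MAX` on the text column returns the newest row.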
Where the pipeline meets the model
Two gates, non-negotiable, sit at the boundary between the pipeline and the model.
Validation before training. Before a training run is allowed to consume a dataset, the dataset passes the contract, the quality gates, and a freshness check. If any of those fail, the training run does not start. This is how you prevent the “we retrained on six weeks of half-broken data” incident that costs a quarter to recover from. It is also a compliance ask: the EU AI Act’s data-governance article (Article 10, which covers high-risk systems) wants evidence that training data met your declared quality bar, and you cannot produce that evidence retroactively.
Validation before inference. Before each batch (or each real-time call, depending on the system) is scored, the same checks run on the input. Drift detection sits here too. If today’s input distribution does not look like the training distribution, the model is being asked to extrapolate, and you want that flagged before the decision lands in a downstream system. This gate is also where the model risk function and the data pipeline function have to actually talk to each other, because the failure can originate on either side and the symptoms look identical from the outside.
These two gates are the difference between a model that fails loudly when its inputs change and a model that fails silently, producing worse decisions for weeks before anyone reads the postmortem.
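One way to sketch the inference-side gate is a drift score plus a function that refuses to proceed when any check fails. The example below uses the Population Stability Index as the drift measure; that is a choice made for this illustration, not the only option, and the 0.2 threshold, the ten equal-width buckets, and the names `psi`, `gate`, and `GateFailed` are all assumptions.

```python
import math


class GateFailed(Exception):
    """Raised when one or more pre-scoring checks fail."""


def psi(expected, actual, bins=10):
    """Population Stability Index between a training sample (`expected`)
    and a scoring sample (`actual`), using equal-width buckets taken
    from the training sample's range."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def bucket_shares(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            # Values outside the training range land in the edge buckets.
            counts[min(max(idx, 0), bins - 1)] += 1
        # A small floor keeps empty buckets from producing log(0).
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = bucket_shares(expected), bucket_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


def gate(checks):
    """`checks` maps a check name to True (passed) or False (failed)."""
    failures = [name for name, passed in checks.items() if not passed]
    if failures:
        raise GateFailed("blocked by: " + ", ".join(failures))


# Hypothetical feature values: training sample versus a shifted scoring batch.
training = [i / 100 for i in range(1000)]
todays_batch = [v + 5 for v in training]

drift_score = psi(training, todays_batch)
try:
    gate({
        "contract": True,   # assumed result of the contract check
        "freshness": True,  # assumed result of the freshness check
        "drift": drift_score < 0.2,
    })
except GateFailed as exc:
    print(exc)
```

The same `gate` function works on the training side: pass it the contract, quality, and freshness results and let the exception stop the run before it starts.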


