Audit Logging and Traceability for Agentic Systems
A risk officer at a mid-size insurer asked me last month what should have been a simple question. An autonomous agent had touched a claim, the claim had paid out, and a regulator wanted to know why. She had the final action in the ledger and the customer email at the start of the chain. Between those two endpoints sat about forty minutes of agent activity nobody had captured. The trace was gone.
She did not have a security problem. She had an evidence problem. In 2026, the evidence problem is the harder of the two.
If you cannot reconstruct what the agent did, you cannot defend it, fix it, or learn from it. Audit logging for an autonomous actor is harder than for a human-driven workflow because the agent does not narrate itself. A human leaves a paper trail by habit (the ticket touched, the email sent, the field edited). An agent leaves nothing unless you instrument it to. That instrumentation is the work.
The agent does not narrate itself
In a classic API-driven workflow, the request and the response are the audit log. Add a timestamp and a user ID and you have most of what compliance asks for. The shape is shallow and honest: one input, one output, one actor.
An agent is the opposite. A single user prompt fans out into a tree of internal steps: a planning call to the model, a retrieval lookup, a tool call to a CRM, a second model call that interprets the tool output, a decision to escalate or proceed, a third tool call that moves money or sends a message. By the time the agent returns “done”, twenty internal events have happened and the only one most systems record is the last one. The other nineteen are the ones a regulator, an incident reviewer, or a downstream engineer will eventually need.
The mental model I keep coming back to is the live recording. A studio album hides the takes and the punch-ins. A live recording captures everything between the count-in and the last cymbal. You do not need it for the radio edit. You need it the day somebody claims the band played a wrong note. Agents need live recordings, on by default, archived honestly.
What to log: the agent trace, not the agent answer
A workable agent trace captures six things per step, not three.
Model version and configuration. Weights hash, system prompt hash, temperature, tool list as presented. A model that behaves correctly today behaves differently tomorrow if any of these change quietly. “We use GPT something” is not a version.
The full input as the model received it. System prompt, conversation so far, retrieved documents, tool outputs fed back in. Not the user’s original message: the actual context window the model saw at that step. Most teams skip this one and most regret it.
Intermediate reasoning, where the model exposes it. Chain-of-thought, planner output, the agent’s narration of what it intends to do next. Treat it as evidence of intent, not of fact. Reasoning traces lie sometimes; they are still the best window you have.
Every tool call and its result. Tool name, exact arguments, raw response, latency. If the agent called your CRM with the wrong customer ID, that line is your investigation.
Overrides and human interventions. When a human approved a step, when a guardrail blocked one, when a fallback fired. These are the moments the system did not run on autopilot, and usually the moments that matter.
The termination condition. Why did the agent stop? Goal achieved, step budget exhausted, guardrail triggered, timeout, human cancellation, error. This one is getting more attention in the discussions I follow, and it deserves it. If you do not record it, an agent that stops for the wrong reason (or does not stop when it should) goes unnoticed until you discover it looped for nine hours overnight.
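The six items above map onto a per-step record fairly directly. What follows is a minimal sketch, not a standard schema: every class and field name (TraceStep, ToolCall, Termination, context_window and so on) is an assumption made for illustration, and the weights hash is optional because hosted models usually do not expose one.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field
from enum import Enum
from typing import Any, Optional


class Termination(Enum):
    """Why the agent stopped (hypothetical value set, taken from the list above)."""
    GOAL_ACHIEVED = "goal_achieved"
    STEP_BUDGET_EXHAUSTED = "step_budget_exhausted"
    GUARDRAIL_TRIGGERED = "guardrail_triggered"
    TIMEOUT = "timeout"
    HUMAN_CANCELLED = "human_cancelled"
    ERROR = "error"


def sha256_hex(text: str) -> str:
    """Hash a prompt or config blob so silent changes become visible."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass
class ToolCall:
    name: str
    arguments: dict[str, Any]   # exact arguments as sent
    raw_response: str           # unparsed response as received
    latency_ms: float


@dataclass
class TraceStep:
    trace_id: str
    step_index: int
    # 1. Model version and configuration
    model_id: str
    system_prompt_sha256: str
    temperature: float
    tools_presented: list[str]
    # 2. The full input as the model received it at this step
    context_window: list[dict[str, str]]
    # 3. Intermediate reasoning, where exposed (evidence of intent, not fact)
    reasoning: Optional[str] = None
    # 4. Every tool call and its result
    tool_calls: list[ToolCall] = field(default_factory=list)
    # 5. Overrides and human interventions (approvals, blocks, fallbacks)
    interventions: list[str] = field(default_factory=list)
    # 6. Termination condition, set only on the final step
    termination: Optional[Termination] = None
    weights_sha256: Optional[str] = None  # often unavailable for hosted models

    def to_json_line(self) -> str:
        """Serialise to one JSON line suitable for an append-only log."""
        record = asdict(self)
        if self.termination is not None:
            record["termination"] = self.termination.value
        return json.dumps(record, sort_keys=True)
```

Writing one JSON line per step keeps the log append-only and easy to ship to whatever store you already use; the storage choice itself is outside this sketch.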
Forensic replay is the actual requirement
A log you cannot replay is a log you cannot defend. The acceptance test for an agent trace is not “can we read it” but “can we re-run it step by step and reproduce the agent’s behaviour.” That is the forensic-replay requirement, and the bar EU AI Act Article 12 is quietly raising for high-risk systems through its record-keeping obligation. Logs must enable traceability across the lifecycle, which in practice means you can sit a regulator at a screen and walk them through the decision.
Replay will not be perfect, because model outputs are not fully deterministic. What you can reproduce exactly is the inputs the model saw, the tools it called, and the decisions it took. That is enough for an investigation and a post-incident review. It is the same pattern incident response runbooks call for in classic security work: snapshot first, then act. Here, the snapshot is the agent trace.
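One way to make the replay test concrete is to feed each recorded context back into the decision step and flag where the replayed decision differs from the recorded one. The sketch below assumes a hypothetical trace shape (step_index, context_window, decision) and a hypothetical decide callable that wraps the model call. Tool results come from the recording, never from live systems, so replay cannot move money a second time.

```python
from typing import Any, Callable

# Hypothetical recorded step: the context the model saw and the decision it took.
RecordedStep = dict[str, Any]
Decide = Callable[[list[dict[str, str]]], dict[str, Any]]


def replay(trace: list[RecordedStep], decide: Decide) -> list[dict[str, Any]]:
    """Re-run `decide` on each recorded context and collect divergences.

    The recorded context already contains the tool outputs that were fed
    back to the model, so no live tool is called during replay.
    """
    divergences = []
    for step in trace:
        replayed = decide(step["context_window"])
        recorded = step["decision"]
        if replayed != recorded:
            divergences.append(
                {
                    "step_index": step["step_index"],
                    "recorded": recorded,
                    "replayed": replayed,
                }
            )
    return divergences
```

Because the model is not deterministic, a divergence is a prompt for investigation rather than proof of a fault. An empty list means the recorded inputs still lead to the recorded decisions.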
Observability for agents is not APM with a new label
Traditional application performance monitoring asks “is the system up and fast?” Agent observability asks “is it doing the right thing, and can we prove it.” APM tells you the CRM call took 800ms. Agent observability tells you the agent called the CRM with the wrong customer ID, got a polite empty result, and confidently moved on.
The emerging standard worth tracking is the OpenTelemetry semantic conventions for GenAI, which give a common shape for spans across model providers, tool calls, and agent steps. If you are starting fresh, start there. If you already run APM, treat the GenAI conventions as the schema layer on top of your existing pipeline. The point is to make every agent step a span that is still there to inspect the day after the agent ran.
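To show what "every agent step is a span" looks like, here is a dependency-free sketch that records span-shaped dictionaries rather than calling the OpenTelemetry SDK. The gen_ai.* attribute keys and the span naming follow my reading of the GenAI conventions, which are still evolving, so check them against the current spec before relying on them. The span helper, the SPANS list, the model and agent names, and the app.tool.arguments key are all hypothetical.

```python
import time
import uuid
from contextlib import contextmanager

SPANS = []  # stand-in for a real exporter; finished spans are appended here


@contextmanager
def span(name, attributes, parent_id=None):
    """Record one span-shaped dict with timing and a parent link."""
    record = {
        "span_id": uuid.uuid4().hex[:16],
        "parent_id": parent_id,
        "name": name,
        "attributes": dict(attributes),
        "start_ns": time.time_ns(),
    }
    try:
        yield record
    finally:
        record["end_ns"] = time.time_ns()
        SPANS.append(record)


def demo_agent_run():
    """One agent invocation with a model call and a tool call as child spans."""
    root_attrs = {
        "gen_ai.operation.name": "invoke_agent",
        "gen_ai.agent.name": "claims_agent",
    }
    with span("invoke_agent claims_agent", root_attrs) as root:
        chat_attrs = {
            "gen_ai.operation.name": "chat",
            "gen_ai.request.model": "example-model",
            "gen_ai.request.temperature": 0.0,
        }
        with span("chat example-model", chat_attrs, parent_id=root["span_id"]):
            pass  # the model call would happen here

        tool_attrs = {
            "gen_ai.operation.name": "execute_tool",
            "gen_ai.tool.name": "crm_lookup",
        }
        with span("execute_tool crm_lookup", tool_attrs,
                  parent_id=root["span_id"]) as tool_span:
            # Custom, non-standard attribute holding the exact arguments sent.
            tool_span["attributes"]["app.tool.arguments"] = '{"customer_id": "C-1"}'
```

In a real deployment the same names and attributes would be set on spans created through the OpenTelemetry SDK, and the exporter would replace the in-memory list.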
Retention, sampling, and the privacy-vs-evidence balance
Three operational questions decide whether the program survives contact with reality.
How long you keep the traces. This depends on the sector. Healthcare records are kept for years. Financial decision logs should match the retention period of the underlying transaction. The EU AI Act sets its own floor for high-risk systems. Ask your DPO and your sector regulator before you ask your storage budget.
What you log at full fidelity versus what you sample. Full fidelity for high-stakes actions: anything that moves money, modifies a person’s record, sends external communication, triggers a downstream legal effect. Sampled for routine low-stakes work. The trap is treating sampling as default and full fidelity as exception; for the actions a regulator cares about, the ratio is the other way around. The NIST AI RMF Measure and Manage functions push the same logic: instrument enough to manage what matters.
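The full-fidelity-by-default rule for high-stakes actions can be written as a small policy function. This is a sketch under assumptions: the action labels, the "full" and "summary" levels, and the 10% default sample rate are illustrative, not recommendations. Sampling is keyed on a hash of the trace ID so that a whole run is either kept or summarised, never half of each.

```python
import hashlib

# Hypothetical labels for the high-stakes categories named above.
HIGH_STAKES_ACTIONS = {
    "move_money",
    "modify_person_record",
    "send_external_communication",
    "trigger_legal_effect",
}


def capture_level(action_kinds: set[str], trace_id: str,
                  sample_rate: float = 0.1) -> str:
    """Return "full" or "summary" for a trace.

    Any high-stakes action forces full fidelity. Everything else is sampled
    deterministically, so the same trace_id always gets the same answer.
    """
    if action_kinds & HIGH_STAKES_ACTIONS:
        return "full"
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")   # uniform in [0, 2**64)
    threshold = int(sample_rate * 2**64)
    return "full" if bucket < threshold else "summary"
```

One practical wrinkle: you often do not know a run is high-stakes until the agent requests the tool. A common workaround is to buffer the trace in full and apply the policy when the run ends, or to promote the trace to full the moment a high-stakes tool is requested.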
Privacy versus evidence. Agent traces capture personal data because agents act on it. The same log that defends you in front of a regulator is a target for a breach. The pattern that works is the one secure logging has used for a decade: encrypt at rest, tokenise the identifying fields, separate the keys from the logs, give the audit team a controlled-access path to the joined view.
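The tokenisation step in that pattern can be sketched with a keyed hash. The field names, the tok_ prefix, and the in-memory vault mapping are assumptions for illustration. In practice the key would come from a key management service held apart from the log store, and the vault would sit behind the audit team's controlled-access path. Encryption at rest is handled by the storage layer and is not shown.

```python
import hashlib
import hmac

# Hypothetical set of fields treated as identifying.
IDENTIFYING_FIELDS = {"customer_id", "email", "name"}


def tokenise(value: str, key: bytes) -> str:
    """Replace a value with a stable keyed token (HMAC-SHA256, truncated)."""
    mac = hmac.new(key, value.encode("utf-8"), hashlib.sha256)
    return "tok_" + mac.hexdigest()[:32]


def redact_record(record: dict[str, str],
                  key: bytes) -> tuple[dict[str, str], dict[str, str]]:
    """Split a record into a log-safe copy and a token-to-value vault mapping.

    The log-safe copy goes to the trace store. The vault mapping is stored
    separately and is what allows the joined view to be rebuilt on request.
    """
    safe: dict[str, str] = {}
    vault: dict[str, str] = {}
    for field_name, value in record.items():
        if field_name in IDENTIFYING_FIELDS:
            token = tokenise(value, key)
            safe[field_name] = token
            vault[token] = value
        else:
            safe[field_name] = value
    return safe, vault
```

Because the token is a keyed hash, the same customer maps to the same token across traces, which keeps the logs searchable without exposing the identifier to anyone who holds only the logs.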
The honest position
Audit logging is the boring control. It is also the only one that converts “the agent acted correctly” into “we can prove the agent acted correctly,” which is the only sentence that matters once the question is no longer hypothetical. Most agent programs I walk into in 2026 have the planning, the tools, and the prompts in good shape and the trace in nothing like good shape. I would now treat the trace as the first thing to build, not the last. The rest of the agent earns the right to run in production once the recording is on.


