# The AI Audit Trail That Survives an Inspection: capture, seal, replay

> You can't audit a model that retrains — you audit the trail. What to capture, how to seal it tamper-evident, and how to replay one decision when the auditor asks.

**Author:** André Queiroz (dezotech) · **Published:** 2026-06-01
**Source:** https://dezotech.com/en/insights/ai-audit-trail-capture-seal-replay

---

Pick one decision your AI made three months ago — a credit limit, a fraud
flag, a denied claim — and prove it. Not re-run it: prove it. The exact input,
the model version that produced it, the data it drew on, the human who signed
off. Most teams I walk into can't, and they don't find out until an auditor, a
regulator, or a lawsuit asks. By then the trail has gone cold. The
uncomfortable part of AI governance in 2026 is that the thing an inspection
asks for was never the model — it's the trail, and almost nobody builds it on
purpose.

<PullStat value="~26,000" caption="families a Dutch tax algorithm wrongly flagged as benefit fraudsters (2005–2019). A civil servant reviewed each flag — but was shown no reason for the score, so no one could reconstruct why a family was picked. The cabinet resigned over it (Amnesty International, Xenophobic Machines, 2021)." />

This isn't hypothetical. Inside the Dutch tax authority, a self-learning
algorithm spent years scoring childcare-benefit claims for fraud risk. There
was a human in the loop — a civil servant reviewed each flag — but, as Amnesty
International documented in its 2021 report *Xenophobic Machines*, that person
was shown no reason for the score. The system's inputs and calculations weren't
visible, and it adapted over time with no meaningful oversight. So when the
decisions turned out to be catastrophically wrong (roughly 26,000 families
branded fraudsters, many forced to repay benefits in full), no one could
reconstruct why any one family had been singled out. Amnesty called the result "a black hole of
accountability." The government fell on 15 January 2021. The model wasn't the
scandal. The missing trail was.

That's the lesson worth copying before you're the case study: a human in the
loop without a trail is a blind signature. The reviewer signs, but can't see
what they're signing — and neither can you, six months later, when it matters.

## You don't audit the model. You audit the trail.

Here's the shift most teams miss. You cannot inspect a modern AI model the way
an accountant inspects a ledger. A system that retrains is a moving target — the
weights that produced last Tuesday's decision may not exist today. Auditing the
math after the fact is a category error, and chasing it burns the budget you
need elsewhere.

What you *can* audit — what an inspection actually tests — is the governance
around the decision: who used what, on which data, who approved it, and whether
you can show all of it for one specific case. That's a documentation problem,
not a data-science one. It's also the good news, because a documentation problem
is one you can actually pass. The trail I build to pass it does three things,
in order: capture, seal, replay.

## Capture: write down what you can't reconstruct later

An audit-trail entry isn't a log line. It's the minimum set of facts that lets
you answer the auditor's question for one decision — captured *at the moment the
decision happens*, because you cannot rebuild it afterward if you didn't record
it then.

<Checklist heading="Every consequential AI decision should capture" items={[
  'The exact input the model received — the real payload, not a summary (redact or hash PII).',
  'The model and version that produced it — a pinned identifier, not "latest".',
  'The data sources the decision drew on, with their versions, so lineage is provable.',
  'The output, plus any score or confidence behind it.',
  'The named human who reviewed or approved it — a person, not a service account.',
  'A timestamp and a correlation ID that ties the entry back to the request.'
]} />
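To make the checklist concrete, here is a minimal sketch of one trail entry in Python. Several pieces are hypothetical:

- the names `DecisionRecord`, `capture` and `INPUT_HASH_KEY`
- the `svc-` prefix used to spot service accounts
- the choice to store a keyed HMAC of the payload instead of the payload itself (hashing is one of the two PII options the checklist names)

The point is the shape: every field is filled at decision time, and the two most common gaps (an unpinned model version and a non-human approver) are rejected up front.

```python
import hashlib
import hmac
import json
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical key: in practice it lives in a secrets manager, not in code.
INPUT_HASH_KEY = b"replace-with-a-managed-secret"

@dataclass(frozen=True)
class DecisionRecord:
    correlation_id: str             # ties the entry back to the originating request
    decided_at: str                 # ISO 8601 timestamp in UTC
    model_id: str
    model_version: str              # pinned identifier, never "latest"
    data_versions: Dict[str, str]   # data source -> dataset version used
    input_digest: str               # keyed hash of the exact payload (no PII in clear)
    output: str
    score: Optional[float]          # score or confidence behind the output, if any
    approver: str                   # the named person who reviewed or approved it

def digest_payload(payload: dict) -> str:
    # Canonical JSON (sorted keys, no whitespace) so equal payloads hash equally.
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(INPUT_HASH_KEY, canonical, hashlib.sha256).hexdigest()

def capture(correlation_id, model_id, model_version, data_versions,
            payload, output, score, approver) -> DecisionRecord:
    # Reject the two gaps that most often make a record unreplayable.
    if model_version.strip().lower() in ("", "latest"):
        raise ValueError("model_version must be a pinned identifier")
    if approver.startswith("svc-"):  # hypothetical service-account naming convention
        raise ValueError("approver must be a named person, not a service account")
    return DecisionRecord(
        correlation_id=correlation_id,
        decided_at=datetime.now(timezone.utc).isoformat(),
        model_id=model_id,
        model_version=model_version,
        data_versions=dict(data_versions),
        input_digest=digest_payload(payload),
        output=output,
        score=score,
        approver=approver,
    )
```

A keyed hash (rather than plain SHA-256) matters for PII: an attacker who can guess likely inputs can't confirm them without the key. Yet you can still show that a payload retained elsewhere is byte-for-byte the one the model saw.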

The Dutch reviewers had the output and a flag. They were missing every other
line. That gap is the difference between a record and a rumor — and you only
discover which one you have on the day someone asks you to prove a decision.

## Seal: a trail you can edit is not evidence

Capturing the facts isn't enough if anyone can quietly change them afterward.
Evidence has to be tamper-evident: append-only storage, each entry chained to
the hash of the one before it, so a single altered record breaks the chain and
shows it. If that sounds like exotic blockchain talk, it isn't — and the
precedent is older than most AI teams.
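The chaining described above fits in a few lines. This is a minimal in-memory sketch in Python, assuming SHA-256 over canonical JSON. `HashChainLog`, `entry_hash` and `GENESIS` are hypothetical names, and a production trail would write to append-only storage rather than a Python list.

```python
import hashlib
import json
from typing import List, Optional

GENESIS = "0" * 64  # stands in for the "previous hash" of the very first entry

def entry_hash(prev_hash: str, record: dict) -> str:
    # Canonical JSON so the same record always serializes to the same bytes.
    body = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode()).hexdigest()

class HashChainLog:
    """Hypothetical in-memory chain; real storage would be append-only (WORM)."""

    def __init__(self) -> None:
        self.entries: List[dict] = []  # exposed only so the demo can tamper with it

    def append(self, record: dict) -> str:
        prev = self.entries[-1]["hash"] if self.entries else GENESIS
        digest = entry_hash(prev, record)
        self.entries.append({"record": dict(record), "prev_hash": prev, "hash": digest})
        return digest

    def verify(self) -> Optional[int]:
        """Return the index of the first broken entry, or None if the chain holds."""
        prev = GENESIS
        for i, entry in enumerate(self.entries):
            if entry["prev_hash"] != prev or entry_hash(prev, entry["record"]) != entry["hash"]:
                return i
            prev = entry["hash"]
        return None
```

Editing a record breaks its own hash. Recomputing that hash only moves the break to the next entry, whose stored `prev_hash` no longer matches. Chaining makes tampering visible, not impossible: someone with write access could rewrite every later entry too. That is why the latest hash is worth copying periodically to somewhere the writer can't reach.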

Since 1997, the SEC's Rule 17a-4 has required broker-dealers that keep records
electronically to use **WORM** format (write once, read many: non-rewritable,
non-erasable). In
October 2022 it modernized the rule and added an *audit-trail alternative*: a
system that lets you recreate the original record if it's later modified or
deleted. That phrase is the whole game. Regulated finance settled the "how do we
trust the log" question decades ago, and AI governance is now inheriting the
answer. You don't need to invent immutability. You need to apply a standard that
already exists to a new class of decision.
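One way to read "recreate the original record if it's later modified or deleted" in code is simple: never overwrite anything. Every change or deletion is appended as a new version, so the first version is always recoverable. This is a hypothetical Python sketch of that idea (`AuditedStore` and `Version` are illustrative names), not the rule's technical specification.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Dict, List, Optional

@dataclass(frozen=True)
class Version:
    seq: int                 # 0 is the original record
    action: str              # "create", "modify" or "delete"
    data: Optional[dict]     # None for a deletion
    actor: str               # who made the change
    at: str                  # ISO 8601 timestamp in UTC

class AuditedStore:
    """Hypothetical store: changes append versions, nothing is overwritten."""

    def __init__(self) -> None:
        self._versions: Dict[str, List[Version]] = {}

    def _append(self, record_id: str, action: str, data: Optional[dict], actor: str) -> None:
        history = self._versions.setdefault(record_id, [])
        history.append(Version(
            seq=len(history),
            action=action,
            data=None if data is None else dict(data),
            actor=actor,
            at=datetime.now(timezone.utc).isoformat(),
        ))

    def _require_live(self, record_id: str) -> None:
        history = self._versions.get(record_id)
        if not history or history[-1].action == "delete":
            raise KeyError(f"no live record {record_id!r}")

    def create(self, record_id: str, data: dict, actor: str) -> None:
        if record_id in self._versions:
            raise KeyError(f"{record_id!r} already exists")
        self._append(record_id, "create", data, actor)

    def modify(self, record_id: str, data: dict, actor: str) -> None:
        self._require_live(record_id)
        self._append(record_id, "modify", data, actor)

    def delete(self, record_id: str, actor: str) -> None:
        self._require_live(record_id)
        self._append(record_id, "delete", None, actor)  # a tombstone, not an erase

    def current(self, record_id: str) -> Optional[dict]:
        data = self._versions[record_id][-1].data
        return None if data is None else dict(data)

    def original(self, record_id: str) -> dict:
        # Recreate the record exactly as first written, whatever happened since.
        return dict(self._versions[record_id][0].data)

    def history(self, record_id: str) -> List[Version]:
        return list(self._versions[record_id])
```

A deletion here is just another version with no data. The record disappears from `current`, while `original` and `history` still show what was written, by whom and when.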

## Replay: reconstruct one decision, on demand

The test of the whole system is the replay. Pick one real decision, months old,
and walk it end to end — input, model version, data, the human who approved it,
the checkpoint where someone could have said no. If you can produce that in
minutes, you pass. If it takes a week of digging across five systems, you've
already failed; the auditor just hasn't arrived yet.

Build the trail so a replay is a *query*, not an archaeology project. That one
design constraint — "could I replay this in front of an inspector tomorrow?" —
quietly forces every other decision about what to capture and how to store it.
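A replay becomes a query once every entry is keyed by the correlation ID captured at decision time. Here is a minimal sketch using Python's built-in `sqlite3`, assuming a single hypothetical `decision_trail` table. Two triggers reject `UPDATE` and `DELETE`, so the table is also append-only at the database layer.

```python
import json
import sqlite3
from typing import Optional

SCHEMA = """
CREATE TABLE IF NOT EXISTS decision_trail (
    correlation_id TEXT PRIMARY KEY,  -- indexed: the replay lookup key
    decided_at     TEXT NOT NULL,
    model_version  TEXT NOT NULL,
    data_versions  TEXT NOT NULL,     -- JSON object: source -> version
    input_digest   TEXT NOT NULL,
    output         TEXT NOT NULL,
    approver       TEXT NOT NULL
);
CREATE TRIGGER IF NOT EXISTS trail_no_update BEFORE UPDATE ON decision_trail
BEGIN SELECT RAISE(ABORT, 'decision_trail is append-only'); END;
CREATE TRIGGER IF NOT EXISTS trail_no_delete BEFORE DELETE ON decision_trail
BEGIN SELECT RAISE(ABORT, 'decision_trail is append-only'); END;
"""

def open_trail(path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row  # rows can be read by column name
    conn.executescript(SCHEMA)
    return conn

def record(conn: sqlite3.Connection, entry: dict) -> None:
    conn.execute(
        "INSERT INTO decision_trail VALUES (:correlation_id, :decided_at, "
        ":model_version, :data_versions, :input_digest, :output, :approver)",
        {**entry, "data_versions": json.dumps(entry["data_versions"], sort_keys=True)},
    )
    conn.commit()

def replay(conn: sqlite3.Connection, correlation_id: str) -> Optional[dict]:
    """Return everything the trail holds for one decision, or None."""
    row = conn.execute(
        "SELECT * FROM decision_trail WHERE correlation_id = ?", (correlation_id,)
    ).fetchone()
    if row is None:
        return None
    result = dict(row)
    result["data_versions"] = json.loads(result["data_versions"])
    return result
```

In front of an inspector, `replay(conn, "req-123")` is one indexed lookup that returns the model version, data versions, input digest, output and approver together. That is the opposite of digging across five systems.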

<InlineCTA href="/en/contact" cta="Test one replay in 30 min" heading="Can you replay your last consequential AI decision?">
Bring one real decision your system made and can't afford to get wrong. In 30
minutes I'll walk it with you — input, version, approver — and show you exactly
where the trail breaks. No slides, no pitch.
</InlineCTA>

## The frameworks that give the trail a shape

You don't have to invent the controls from scratch. Three frameworks already
define them, and they stack rather than compete — the trick is knowing which
job each one does:

| Framework | What it gives you | Where it fits |
| --- | --- | --- |
| **NIST AI RMF** (AI 100-1) | The *what & why* of AI risk — four functions run iteratively: Govern, Map, Measure, Manage. | Decide which decisions are consequential enough to need a trail. |
| **ISO/IEC 42001:2023** | The *how* — the first certifiable AI management system, on a plan-do-check-act cycle. | Turn the trail into an audited, repeatable process a third party can certify. |
| **ISACA ITAF, 5th ed.** (26 Feb 2026) | The *who checks it* — IT-audit standards updated to fold in AI/ML audit guidance, automation and "digital trust" across planning, fieldwork and reporting. | Know what an IT auditor will actually test and sample. |

They reinforce each other: the risks you map and measure under NIST become the
documented evidence ISO 42001 expects, and harmonizing the two keeps you from
rebuilding compliance from zero in every jurisdiction. One caution worth saying
out loud to a board — ISO 42001 is a *voluntary* certification you pursue for
trust; the EU AI Act is *state enforcement* that can stop you operating. Passing
one doesn't excuse the other.

<Callout heading="Capture → Seal → Replay">
The spine of every trail I build: record the facts that answer the auditor's
question at the moment of the decision; seal them so no one can rewrite history;
and make reconstructing one decision a query you run in minutes, not a project.
The frameworks tell you which decisions deserve it — the trail is what passes.
</Callout>

## The pushback you'll hear — and the answer

**"Logging every decision is too expensive."** You don't log every decision —
you log the *consequential* ones, the calls with legal or financial weight. A
hash and a handful of structured fields per decision is cheap. The expensive
path is the other one: the morning a regulator asks about a decision and you
have nothing to show but the model.

**"We have explainability tools — isn't that the trail?"** No, and conflating
them is how teams get caught. Explainability tells you why a model *tends* to
behave a certain way; the trail proves what happened in *one specific case*,
with a name attached. An auditor doesn't want a feature-importance plot. They
want the input, the version, and the approver for the decision in front of them.

**"Our cloud provider already logs everything."** Provider logs tell you a
request happened — not which decision it produced, on whose authority, against
which version of the data. Infrastructure logs are not a decision trail, and an
inspector knows the difference on sight.

## Why this is the work now

The clock is the forcing function, and it's worth getting the dates right — I
see good teams plan around the wrong one. The EU AI Act's transparency duties
(Article 50) are binding from **2 August 2026**: if a user interacts with AI,
you disclose it. The high-risk obligations (Annex III) — under the Digital
Omnibus, a provisional political agreement from May 2026, **not yet formally
adopted** — would move from **2 August 2026** to **2 December 2027.** Plan for
the August date until that adoption is real: if it goes through you've gained
runway; if it stalls, you're already compliant. Either way the penalty ceiling
is unchanged (**€15M or 3% of worldwide annual turnover, whichever is higher**),
and a moved deadline is room to build the trail, not permission to skip it.

The pattern across every regime is the same, and it's the whole argument of this
piece: regulators stopped demanding a provably unbiased model and started
demanding a documented, replayable trail. That's a target you can hit.

## FAQ

<FAQ items={[
  {
    q: 'What should an AI audit trail contain?',
    a: "For each consequential decision: the exact input, the model and version that produced the output, the data sources and their version, the output and any score, the named human who approved it, and a timestamp tied to the request. Capture it at the moment of the decision — you can't reconstruct it later if you didn't record it then."
  },
  {
    q: 'How do you make an AI audit trail tamper-proof?',
    a: "Append-only storage with each entry chained to the hash of the previous one, so any later edit breaks the chain visibly. Regulated finance has required this for decades: the SEC's Rule 17a-4 mandated WORM (write once, read many) storage from 1997 and, since its October 2022 amendment, also accepts an audit-trail system that can recreate the original record if it's modified or deleted."
  },
  {
    q: 'Can you audit an AI model directly?',
    a: 'Not usefully. A model that retrains is a moving target — the weights behind a past decision may no longer exist, so inspecting the math after the fact is a category error. What you audit is the governance trail around the decision: who used what, on which data, and who approved it. NIST AI RMF and ISO/IEC 42001 are built around exactly that.'
  }
]} />

An audit isn't a test of how clever your model is. It's one question, asked
about one decision: who approved this, on what basis, and can you show me? Build
the trail that answers it and the inspection becomes paperwork instead of panic.
This is the other half of [the pre-audit
playbook](/en/insights/adopt-ai-without-failing-your-next-audit) — its first
checklist item is the immutable trail; this is how you build it. And it's where
[the shadow-AI plan](/en/insights/shadow-ai-90-day-containment-plan) is heading
the whole time: the reason to drag unsanctioned AI into the light is so it lands
on a trail you can replay.