Every AI agent shipping to production right now is running on a lie. The lie is that the tone the agent produces is the thing you should evaluate. It isn’t. Production doesn’t fail on tone. Production fails on what the agent did, in what order, under what precondition.
I spent the last three months reconstructing production-agent incidents from public postmortems. Fourteen of them, each paired with a trace and a contract. All 14 were caught by between one and fourteen lines of YAML. Here are five.
Replit, July 2025. The DROP DATABASE.
A coding agent running routine maintenance issued a DROP DATABASE against production. The destructive call was the third tool in the trajectory. Nothing in the agent’s loop blocked it. Replit had to restore from backup, and the postmortem was public.
The trace is 28 spans. The contract that catches it is one line.
banned_tools: [drop_database, truncate_table]
That’s it. A banned-tools list. The Gate fires before drop_database dispatches and raises a RuntimeError the agent can’t swallow. The warm-path cost is 0.01 ms. P99 is 0.03 ms. The incident becomes a single line in The Ledger and never reaches the DB.
Someone will say “a banned-tool list is too blunt; what if we legitimately need drop_database at test time?” Good. That’s why contracts are scoped. Point the Gate at env: production and the staging agent still works.
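A minimal sketch of what an env-scoped banned-tools check might look like inside the dispatcher. The function names, span shapes, and error message here are illustrative assumptions, not the actual Gate API:

```python
# Illustrative sketch of a banned-tools gate, scoped to production.
# BANNED mirrors the one-line contract; gate_check runs in the
# dispatcher before the tool call is actually dispatched.
BANNED = {"drop_database", "truncate_table"}

def gate_check(tool_name: str, env: str) -> None:
    """Raise RuntimeError so the agent loop cannot swallow the block
    as an ordinary tool error and retry its way around it."""
    if env == "production" and tool_name in BANNED:
        raise RuntimeError(
            f"contract violation: {tool_name} is banned in {env}"
        )

gate_check("drop_database", env="staging")  # scoped out: no-op
try:
    gate_check("drop_database", env="production")
except RuntimeError as e:
    print(e)  # contract violation: drop_database is banned in production
```

The point of raising rather than returning an error payload is that the destructive call never dispatches at all; there is no code path where the model gets to argue with the verdict.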
Air Canada, February 2024. The invented bereavement policy.
A customer-service bot told a grieving passenger that Air Canada refunded bereavement-fare differences retroactively. This policy did not exist. The passenger took it to small-claims court and won. The ruling said the airline was bound by what its own agent said.
Read the transcript. Every turn was polite, confident, coherent. An LLM-as-judge scoring tone would have given this agent five stars. A trajectory contract catches it:
contracts:
  - every policy_claim must be preceded by kb_lookup
  - kb_lookup.output.confidence >= 0.85
Two contract lines. The agent is allowed to assert a policy only after it actually looked the policy up in the knowledge base, with a confidence threshold. The claim never leaves the dispatcher.
DataTalks.Club, March 2026. The 1.94M-row terraform destroy.
A DevOps agent with terraform access ran terraform destroy against a production workspace. 1.94 million rows of customer data were wiped. The incident was posted on HN with the full trace.
The contract is three lines.
banned_tools: [terraform_destroy]
contracts:
  - any terraform_apply in production requires human_approval
You can argue the agent shouldn’t have had terraform access at all. Sure. In the world where it does, this contract makes the difference between “agent did something it shouldn’t have” and “agent tried something it shouldn’t have and was blocked by a line in a config file.”
FTC v. Air AI, March 2026. The TCPA consent bypass.
An outbound-sales voice agent placed calls to phone numbers without valid TCPA consent. The FTC settled the case on 24 March 2026. Statutory exposure per call is $500–$1,500. Reading the complaint is instructive: the agent’s scripts were fine. The failure was structural: the trajectory hit dial_phone without first checking consent_verified.
contracts:
  - dial_phone requires consent_verified == true
  - consent_verified must be produced by fetch_consent_record (not inferred)
The second line matters. LLM-based systems like to “infer” consent from contextual cues. A deterministic contract doesn’t care what the model inferred. It cares what got called.
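A provenance check like the second line can be sketched as a search for the producing span, not the value. The span shape and tool names below are illustrative assumptions:

```python
# Illustrative provenance check: consent_verified counts only if a
# fetch_consent_record span actually produced it. A value the model
# "inferred" never shows up as a dispatcher span, so it never passes.
def consent_ok(trace: list[dict]) -> bool:
    return any(
        span["tool"] == "fetch_consent_record"
        and span["output"].get("consent_verified") is True
        for span in trace
    )

def gate_dial(trace: list[dict]) -> None:
    if not consent_ok(trace):
        raise RuntimeError(
            "dial_phone blocked: consent_verified not produced by fetch_consent_record"
        )

gate_dial([{"tool": "fetch_consent_record",
            "output": {"consent_verified": True}}])  # real provenance: passes

inferred = [{"tool": "llm_reasoning", "output": {"consent_verified": True}}]
try:
    gate_dial(inferred)
except RuntimeError as e:
    print(e)  # dial_phone blocked: consent_verified not produced by fetch_consent_record
```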
Amazon Kiro, 2025. The helpful delete.
An agent with filesystem access executed a delete of files the user had asked it to refactor. The agent had interpreted the instruction charitably. It had explained its plan clearly. It had asked no clarifying question.
requires_prior_work:
  delete_file:
    required: [diff_preview, user_confirm]
Three lines. delete_file cannot fire until both diff_preview and user_confirm are on tape.
“But what if the agent forges a user_confirm?” The Gate checks the trace, not the agent’s claim. If user_confirm didn’t actually arrive at the dispatcher, it isn’t in the trace, and the check fails.
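That property falls out naturally if the check only ever reads the dispatcher’s record. A sketch, with an illustrative contract table and span shape:

```python
# Illustrative requires_prior_work check: a tool dispatches only if its
# required predecessors already exist as real spans in the trace. The
# agent asserting "the user confirmed" changes nothing; only spans the
# dispatcher recorded count.
REQUIRED_BEFORE = {"delete_file": {"diff_preview", "user_confirm"}}

def gate(tool: str, trace: list[dict]) -> None:
    needed = REQUIRED_BEFORE.get(tool, set())
    seen = {span["tool"] for span in trace}
    missing = needed - seen
    if missing:
        raise RuntimeError(
            f"{tool} blocked: missing prior work {sorted(missing)}"
        )

gate("delete_file", [{"tool": "diff_preview"},
                     {"tool": "user_confirm"}])  # both on tape: passes
try:
    gate("delete_file", [{"tool": "diff_preview"}])
except RuntimeError as e:
    print(e)  # delete_file blocked: missing prior work ['user_confirm']
```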
What fourteen of these add up to
14/14 caught. Raw benchmark results committed to benchmarks/results/. Reproducible regression suite under scenarios/. The Gate runs in the dispatcher at 0.01 ms median. $0 API cost on the hot path.
This is the thing nobody tells you about agent reliability: the hard part is not writing the YAML. The hard part is the corpus. Every incident you reconstruct into a contract is a vocabulary entry for the whole field. The DL community’s shared language around “vanishing gradients” or “exploding activations” didn’t come from one company. It came from dozens of teams publishing their failure modes and the moves that fixed them.
One line of YAML would have prevented Replit’s DROP DATABASE. That is a deeply unsatisfying sentence. It is also true.
Agent reliability needs the same convention DL built. If you have a trace, send it. If you have an incident, write it up. The corpus is the moat, and the moat is shared.