Every AI agent shipping to production right now is running on a lie. The lie is that the tone the agent produces is the thing you should evaluate. It isn’t. Production doesn’t fail on tone. Production fails on what the agent did, in what order, under what precondition.
I spent the last three months reconstructing production-agent incidents from public postmortems. Fourteen of them, each paired with a trace and a contract. All 14 were caught by between one and fourteen lines of YAML. Here are five.
Replit, July 2025. The DROP DATABASE.
A coding agent running routine maintenance issued a DROP DATABASE against production. The destructive call was the third tool in the trajectory. Nothing in the agent’s loop blocked it. Replit had to restore from backup, and the postmortem was public.
The trace is 28 spans. The contract that catches it is one line.
banned_tools: [drop_database, truncate_table]
That’s it. A banned-tools list. The Gate fires before drop_database dispatches and raises a RuntimeError the agent can’t swallow. The warm-path cost is 0.01 ms. P99 is 0.03 ms. The incident becomes a single line in The Ledger and never reaches the DB.
Someone will say “a banned-tool list is too blunt; what if we legitimately need drop_database at test time?” Good. That’s why contracts are scoped. Point the Gate at env: production and the staging agent still works.
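A minimal sketch of what an env-scoped banned-tools check might look like inside the dispatcher. The function names, span shapes, and error message here are illustrative assumptions, not the actual Gate API:

```python
# Illustrative sketch of a banned-tools gate, scoped to production.
# BANNED mirrors the one-line contract; gate_check runs in the
# dispatcher before the tool call is actually dispatched.
BANNED = {"drop_database", "truncate_table"}

def gate_check(tool_name: str, env: str) -> None:
    """Raise RuntimeError so the agent loop cannot swallow the block
    as an ordinary tool error and retry its way around it."""
    if env == "production" and tool_name in BANNED:
        raise RuntimeError(
            f"contract violation: {tool_name} is banned in {env}"
        )

gate_check("drop_database", env="staging")  # scoped out: no-op
try:
    gate_check("drop_database", env="production")
except RuntimeError as e:
    print(e)  # contract violation: drop_database is banned in production
```

The point of raising rather than returning an error payload is that the destructive call never dispatches at all; there is no code path where the model gets to argue with the verdict.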
Air Canada, February 2024. The invented bereavement policy.
A customer-service bot told a grieving passenger that Air Canada refunded bereavement-fare differences retroactively. This policy did not exist. The passenger took it to small-claims court and won. The ruling said the airline was bound by what its own agent said.
Read the transcript. Every turn was polite, confident, coherent. An LLM-as-judge scoring tone would have given this agent five stars. A trajectory contract catches it:
contracts:
  - every policy_claim must be preceded by kb_lookup
  - kb_lookup.output.confidence >= 0.85
Two contract lines. The agent is allowed to assert a policy only after it actually looked the policy up in the knowledge base, with a confidence threshold. The claim never leaves the dispatcher.
DataTalks.Club, March 2026. The 1.94M-row terraform destroy.
A DevOps agent with terraform access ran terraform destroy against a production workspace. 1.94 million rows of customer data were wiped. The incident was posted on HN with the full trace.
The contract is three lines.
banned_tools: [terraform_destroy]
contracts:
  - any terraform_apply in production requires human_approval
You can argue the agent shouldn’t have had terraform access at all. Sure. In the world where it does, this contract makes the difference between “agent did something it shouldn’t have” and “agent tried something it shouldn’t have and was blocked by a line in a config file.”
FTC v. Air AI, March 2026. The TCPA consent bypass.
An outbound-sales voice agent placed calls to phone numbers without valid TCPA consent. The FTC settled the case on 24 March 2026. Statutory exposure per call is $500–$1,500. Reading the complaint is instructive: the agent’s scripts were fine. The failure was structural: the trajectory hit dial_phone without first checking consent_verified.
contracts:
  - dial_phone requires consent_verified == true
  - consent_verified must be produced by fetch_consent_record (not inferred)
The second line matters. LLM-based systems like to “infer” consent from contextual cues. A deterministic contract doesn’t care what the model inferred. It cares what got called.
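A provenance check like the second line can be sketched as a search for the producing span, not the value. The span shape and tool names below are illustrative assumptions:

```python
# Illustrative provenance check: consent_verified counts only if a
# fetch_consent_record span actually produced it. A value the model
# "inferred" never shows up as a dispatcher span, so it never passes.
def consent_ok(trace: list[dict]) -> bool:
    return any(
        span["tool"] == "fetch_consent_record"
        and span["output"].get("consent_verified") is True
        for span in trace
    )

def gate_dial(trace: list[dict]) -> None:
    if not consent_ok(trace):
        raise RuntimeError(
            "dial_phone blocked: consent_verified not produced by fetch_consent_record"
        )

gate_dial([{"tool": "fetch_consent_record",
            "output": {"consent_verified": True}}])  # real provenance: passes

inferred = [{"tool": "llm_reasoning", "output": {"consent_verified": True}}]
try:
    gate_dial(inferred)
except RuntimeError as e:
    print(e)  # dial_phone blocked: consent_verified not produced by fetch_consent_record
```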
Amazon Kiro, 2025. The helpful delete.
An agent with filesystem access executed a delete of files the user had asked it to refactor. The agent had interpreted the instruction charitably. It had explained its plan clearly. It had asked no clarifying question.
requires_prior_work:
  delete_file:
    required: [diff_preview, user_confirm]
Three lines. delete_file cannot fire until both diff_preview and user_confirm are on tape.
“But what if the agent forges a user_confirm?” The Gate checks the trace, not the agent’s claim. If user_confirm didn’t actually arrive at the dispatcher, it isn’t in the trace, and the check fails.
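That property falls out naturally if the check only ever reads the dispatcher’s record. A sketch, with an illustrative contract table and span shape:

```python
# Illustrative requires_prior_work check: a tool dispatches only if its
# required predecessors already exist as real spans in the trace. The
# agent asserting "the user confirmed" changes nothing; only spans the
# dispatcher recorded count.
REQUIRED_BEFORE = {"delete_file": {"diff_preview", "user_confirm"}}

def gate(tool: str, trace: list[dict]) -> None:
    needed = REQUIRED_BEFORE.get(tool, set())
    seen = {span["tool"] for span in trace}
    missing = needed - seen
    if missing:
        raise RuntimeError(
            f"{tool} blocked: missing prior work {sorted(missing)}"
        )

gate("delete_file", [{"tool": "diff_preview"},
                     {"tool": "user_confirm"}])  # both on tape: passes
try:
    gate("delete_file", [{"tool": "diff_preview"}])
except RuntimeError as e:
    print(e)  # delete_file blocked: missing prior work ['user_confirm']
```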
What fourteen of these add up to
14/14 caught. Raw benchmark results committed to benchmarks/results/. Reproducible regression suite under scenarios/. The Gate runs in the dispatcher at 0.01 ms median. $0 API cost on the hot path.
This is the thing nobody tells you about agent reliability: the hard part is not writing the YAML. The hard part is the corpus. Every incident you reconstruct into a contract is a vocabulary entry for the whole field. The DL community’s shared language around “vanishing gradients” or “exploding activations” didn’t come from one company. It came from dozens of teams publishing their failure modes and the moves that fixed them.
One line of YAML would have prevented Replit’s DROP DATABASE. That is a deeply unsatisfying sentence. It is also true.
Agent reliability needs the same convention DL built. If you have a trace, send it. If you have an incident, write it up. The corpus is the moat, and the moat is shared.