In 2026, four separate academic groups published the same conclusion from four different angles: deterministic trajectory verification is the right abstraction for agent safety. The papers are ToolGate, Solver-Aided Agent Verification, TraceSafe, and Verifiably Safe Agents. If you’re building agent infrastructure this year and you haven’t read them, you should.
I’ve spent the last two weeks reading them, side by side, against the fewwords codebase. Here’s what each one proved, where the overlap is, and why none of them is a product you can pip install on a Tuesday.
ToolGate
ToolGate is the most product-adjacent of the four. The paper proposes a pre-execution gate that checks tool calls against a typed contract at dispatch. Types over tools. Preconditions over arguments. The prototype is written in Python, the evaluation is on synthetic benchmarks, and the wall-clock overhead quoted is 12 ms per call.
What ToolGate got right: the insertion point. The gate sits where we ship ours — between the agent’s decision and the dispatcher.
Where the product gap is: 12 ms is too much. The hot path can’t absorb it, so teams end up sampling, which defeats the guarantee. Also, ToolGate is a research prototype; there are no adapters for real trace formats (OpenAI, OTel, LangGraph). Before you can use it, you have to write the plumbing.
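The insertion point is easy to sketch. Here is a minimal pre-execution gate in Python, assuming a hypothetical `Contract` of argument types plus preconditions; the names are illustrative, not ToolGate’s actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical contract shape: a type map plus argument preconditions.
@dataclass
class Contract:
    arg_types: dict[str, type]
    preconditions: list[Callable[[dict[str, Any]], bool]]

def gate(contract: Contract, args: dict[str, Any]) -> bool:
    """Check a tool call against its contract before dispatch."""
    for name, expected in contract.arg_types.items():
        if not isinstance(args.get(name), expected):
            return False  # type violation: block the call
    return all(pre(args) for pre in contract.preconditions)

# Example: a file-delete tool that must stay inside /tmp.
delete_contract = Contract(
    arg_types={"path": str},
    preconditions=[lambda a: a["path"].startswith("/tmp/")],
)

assert gate(delete_contract, {"path": "/tmp/scratch.txt"})  # allowed
assert not gate(delete_contract, {"path": "/etc/passwd"})   # blocked
```

The point of the sketch: the check sits between the agent’s decision and the dispatcher, so a failed check means the call never executes.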
Solver-Aided Agent Verification
The most theoretically dense of the four. Compiles agent contracts to an SMT formula and discharges to Z3. Soundness proofs. A formal semantics of the trace. The paper’s headline result: verification is decidable for a class of agent programs that includes most real-world ones, at cost ~100 ms per check.
What it got right: deciding that you want real proofs, not LLM verdicts. The reasoning has to terminate and be checkable.
Where the gap is: 100 ms is off by four orders of magnitude for production. The paper acknowledges this; future work is to compile the SMT discharge into specialised checkers for common contract shapes. That specialised-checker layer is exactly where fewwords lives.
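The specialised-checker direction is concrete enough to sketch. For a common contract shape like a numeric interval bound, you don’t need a solver at dispatch time; you compile the contract once into a direct comparison. A toy illustration (my naming, not the paper’s):

```python
def compile_interval_contract(lo: float, hi: float):
    """Compile the contract lo <= x <= hi into a specialised checker.

    An SMT discharge would assert (and (<= lo x) (<= x hi)) and invoke
    the solver on every check; because the contract shape is known up
    front, the compiled closure does the same work as one comparison.
    """
    def check(x: float) -> bool:
        return lo <= x <= hi
    return check

# Compile once, check many times on the hot path.
spend_limit = compile_interval_contract(0.0, 100.0)

assert spend_limit(42.5)
assert not spend_limit(250.0)
```

The trade is generality for latency: the solver handles arbitrary formulas, the compiled checker handles one known shape at hot-path speed.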
TraceSafe
TraceSafe studies the trace as a linear structure and checks temporal-logic properties with Büchi automata. Ordering properties. Liveness. Fairness. The paper’s contribution is a careful decomposition of which temporal-logic fragments are checkable at what cost.
What it got right: the idea that ordering is a first-class property and needs a first-class logic. fewwords compiles ordering contracts to Büchi automata that are literally the construction TraceSafe describes.
Where the gap is: it’s a paper, not a library. The implementation is a research prototype in OCaml. If you want the construction in Python, adapted to real trace formats, accepting plain-English source and emitting a compiled automaton you can check in 0.01 ms — we shipped that. They didn’t, because that isn’t what academic incentives reward.
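Büchi automata accept infinite words; over a finite trace, what you actually run is a safety monitor derived from the same construction. A minimal sketch of an ordering contract, “no write before auth,” as an explicit automaton (illustrative only, not TraceSafe’s OCaml prototype or fewwords’ compiler output):

```python
# States: "start" (auth not yet seen), "authed", "violated" (trap).
TRANSITIONS = {
    ("start", "auth"): "authed",
    ("start", "write"): "violated",  # write before auth: ordering violated
    ("authed", "auth"): "authed",
    ("authed", "write"): "authed",
}

def check_ordering(trace: list[str]) -> bool:
    """Run the monitor over a finite trace of tool-call names."""
    state = "start"
    for event in trace:
        # Events with no transition leave the state unchanged.
        state = TRANSITIONS.get((state, event), state)
        if state == "violated":
            return False
    return True

assert check_ordering(["auth", "write", "write"])
assert not check_ordering(["write", "auth"])
```

Checking one event is a dictionary lookup, which is why a compiled automaton can run on every call rather than on a sample.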
Verifiably Safe Agents
The most ambitious of the four. Proposes a full program logic for agents, with a soundness proof, an embedded compositionality argument, and worked examples on a toy agent. The authors are clear that their goal is foundations, not shipping.
What it got right: the insight that agent contracts need a compositional story. You want to be able to say “if agent A satisfies contract X and agent B satisfies contract Y, then the composition satisfies X and Y.” fewwords doesn’t yet have this in closed form; we have per-agent baselines and contract packs that intersect at the Gate.
Where the gap is: it’s a program-logic paper. The toolchain to apply it in practice is three years of engineering away.
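The compositional claim has a simple operational shadow even without a program logic: if the Gate checks the intersection of contract packs, a call that passes the composed gate satisfies both agents’ contracts by construction. A sketch with hypothetical names, not fewwords’ actual contract-pack API:

```python
from typing import Any, Callable

Check = Callable[[dict[str, Any]], bool]

def intersect(*packs: list[Check]) -> Check:
    """Compose contract packs by intersection: every check must pass."""
    def gate(call: dict[str, Any]) -> bool:
        return all(check(call) for pack in packs for check in pack)
    return gate

agent_a = [lambda c: c["tool"] != "shell"]     # A's contract: never touch the shell
agent_b = [lambda c: c.get("spend", 0) <= 10]  # B's contract: cap spend at 10

composed = intersect(agent_a, agent_b)
assert composed({"tool": "search", "spend": 3})     # satisfies both X and Y
assert not composed({"tool": "shell", "spend": 3})  # violates A's contract
```

Intersection gives you “X and Y hold” for free; what the paper’s program logic adds, and what this sketch does not, is reasoning about contracts that mention each other.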
Head to head
| Paper | Insertion | Latency | Real adapters | Ships | fewwords role |
|---|---|---|---|---|---|
| ToolGate | Pre-exec gate | 12 ms | No | Prototype | We’re 1,200× faster, real adapters |
| Solver-Aided | SMT post-hoc | ~100 ms | No | Prototype | Their “specialised checker” is our engine |
| TraceSafe | LTL over trace | n/a (OCaml) | No | Prototype | Their Büchi construction is in our compiler |
| Verifiably Safe | Program logic | n/a | No | Foundations | Their compositionality is our per-agent baseline |
| fewwords | Pre-exec gate | 0.01 ms | Six formats | MIT | — |
Why academia doesn’t ship
This is a feature of the system, not a bug. A research group wins by proving a new thing. A product wins by plumbing an old thing into every framework, at latency that fits the hot path, with error messages a junior engineer can read. Academia doesn’t reward adapters and error messages. It rewards theorems.
Both sides are necessary. The papers validate the axis. The engineering closes the gap. If I were writing a YC app (I am writing a YC app), the line that matters is: four independent papers in one year converged on the abstraction we built on. We’re the repo-invite-today, PyPI-after-five-paying-customers version of a research consensus.
Read them
All four are on arXiv. All four are short enough to read in an hour. If you build agent infrastructure in 2026 and you haven’t read them, you are flying blind on what the formal community has already worked out.
If after reading them you think we’re wrong about something, email. The corpus is a shared moat and so is the argument.