In 2026, four separate academic groups published the same conclusion from four different angles: deterministic trajectory verification is the right abstraction for agent safety. The papers are ToolGate, Solver-Aided Agent Verification, TraceSafe, and Verifiably Safe Agents. If you’re building agent infrastructure this year and you haven’t read them, you should.
I’ve spent the last two weeks reading them, side by side, against the fewwords codebase. Here’s what each one proved, where the overlap is, and why none of them is a product you can pip install on a Tuesday.
ToolGate
ToolGate is the most product-adjacent of the four. The paper proposes a pre-execution gate that checks tool calls against a typed contract at dispatch. Types over tools. Preconditions over arguments. The prototype is written in Python, the evaluation is on synthetic benchmarks, and the wall-clock overhead quoted is 12 ms per call.
What ToolGate got right: the insertion point. The gate sits where we ship ours — between the agent’s decision and the dispatcher.
Where the product gap is: 12 ms is too much. The hot path can’t absorb it, so teams end up sampling, which defeats the guarantee. Also, ToolGate is a research prototype; there are no adapters for real trace formats (OpenAI, OTel, LangGraph). Before you can use it, you have to write the plumbing.
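The insertion point is easy to sketch. Here is a minimal pre-execution gate in Python, assuming a hypothetical `Contract` of argument types plus preconditions; the names are illustrative, not ToolGate’s actual API:

```python
from dataclasses import dataclass
from typing import Any, Callable

# Hypothetical contract shape: a type map plus argument preconditions.
@dataclass
class Contract:
    arg_types: dict[str, type]
    preconditions: list[Callable[[dict[str, Any]], bool]]

def gate(contract: Contract, args: dict[str, Any]) -> bool:
    """Check a tool call against its contract before dispatch."""
    for name, expected in contract.arg_types.items():
        if not isinstance(args.get(name), expected):
            return False  # type violation: block the call
    return all(pre(args) for pre in contract.preconditions)

# Example: a file-delete tool that must stay inside /tmp.
delete_contract = Contract(
    arg_types={"path": str},
    preconditions=[lambda a: a["path"].startswith("/tmp/")],
)

assert gate(delete_contract, {"path": "/tmp/scratch.txt"})  # allowed
assert not gate(delete_contract, {"path": "/etc/passwd"})   # blocked
```

The point of the sketch: the check sits between the agent’s decision and the dispatcher, so a failed check means the call never executes.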
Solver-Aided Agent Verification
The most theoretically dense of the four. Compiles agent contracts to an SMT formula and discharges to Z3. Soundness proofs. A formal semantics of the trace. The paper’s headline result: verification is decidable for a class of agent programs that includes most real-world ones, at cost ~100 ms per check.
What it got right: deciding that you want real proofs, not LLM verdicts. The reasoning has to terminate and be checkable.
Where the gap is: 100 ms is off by four orders of magnitude for production. The paper acknowledges this; future work is to compile the SMT discharge into specialised checkers for common contract shapes. That specialised-checker layer is exactly where fewwords lives.
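The specialised-checker direction is concrete enough to sketch. For a common contract shape like a numeric interval bound, you don’t need a solver at dispatch time; you compile the contract once into a direct comparison. A toy illustration (my naming, not the paper’s):

```python
def compile_interval_contract(lo: float, hi: float):
    """Compile the contract lo <= x <= hi into a specialised checker.

    An SMT discharge would assert (and (<= lo x) (<= x hi)) and invoke
    the solver on every check; because the contract shape is known up
    front, the compiled closure does the same work as one comparison.
    """
    def check(x: float) -> bool:
        return lo <= x <= hi
    return check

# Compile once, check many times on the hot path.
spend_limit = compile_interval_contract(0.0, 100.0)

assert spend_limit(42.5)
assert not spend_limit(250.0)
```

The trade is generality for latency: the solver handles arbitrary formulas, the compiled checker handles one known shape at hot-path speed.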
TraceSafe
TraceSafe studies the trace as a linear structure and checks temporal-logic properties with Büchi automata. Ordering properties. Liveness. Fairness. The paper’s contribution is a careful decomposition of which temporal-logic fragments are checkable at what cost.
What it got right: the idea that ordering is a first-class property and needs a first-class logic. fewwords compiles ordering contracts to Büchi automata that are literally the construction TraceSafe describes.
Where the gap is: it’s a paper, not a library. The implementation is a research prototype in OCaml. If you want the construction in Python, adapted to real trace formats, accepting plain-English source and emitting a compiled automaton you can check in 0.01 ms — we shipped that. They didn’t, because that isn’t what academic incentives reward.
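Büchi automata accept infinite words; over a finite trace, what you actually run is a safety monitor derived from the same construction. A minimal sketch of an ordering contract, “no write before auth,” as an explicit automaton (illustrative only, not TraceSafe’s OCaml prototype or fewwords’ compiler output):

```python
# States: "start" (auth not yet seen), "authed", "violated" (trap).
TRANSITIONS = {
    ("start", "auth"): "authed",
    ("start", "write"): "violated",  # write before auth: ordering violated
    ("authed", "auth"): "authed",
    ("authed", "write"): "authed",
}

def check_ordering(trace: list[str]) -> bool:
    """Run the monitor over a finite trace of tool-call names."""
    state = "start"
    for event in trace:
        # Events with no transition leave the state unchanged.
        state = TRANSITIONS.get((state, event), state)
        if state == "violated":
            return False
    return True

assert check_ordering(["auth", "write", "write"])
assert not check_ordering(["write", "auth"])
```

Checking one event is a dictionary lookup, which is why a compiled automaton can run on every call rather than on a sample.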
Verifiably Safe Agents
The most ambitious of the four. Proposes a full program logic for agents, with a soundness proof, an embedded compositionality argument, and worked examples on a toy agent. The authors are clear that their goal is foundations, not shipping.
What it got right: the insight that agent contracts need a compositional story. You want to be able to say “if agent A satisfies contract X and agent B satisfies contract Y, then the composition satisfies X and Y.” fewwords doesn’t yet have this in closed form; we have per-agent baselines and contract packs that intersect at the Gate.
Where the gap is: it’s a program-logic paper. The toolchain to apply it in practice is three years of engineering away.
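The compositional claim has a simple operational shadow even without a program logic: if the Gate checks the intersection of contract packs, a call that passes the composed gate satisfies both agents’ contracts by construction. A sketch with hypothetical names, not fewwords’ actual contract-pack API:

```python
from typing import Any, Callable

Check = Callable[[dict[str, Any]], bool]

def intersect(*packs: list[Check]) -> Check:
    """Compose contract packs by intersection: every check must pass."""
    def gate(call: dict[str, Any]) -> bool:
        return all(check(call) for pack in packs for check in pack)
    return gate

agent_a = [lambda c: c["tool"] != "shell"]     # A's contract: never touch the shell
agent_b = [lambda c: c.get("spend", 0) <= 10]  # B's contract: cap spend at 10

composed = intersect(agent_a, agent_b)
assert composed({"tool": "search", "spend": 3})     # satisfies both X and Y
assert not composed({"tool": "shell", "spend": 3})  # violates A's contract
```

Intersection gives you “X and Y hold” for free; what the paper’s program logic adds, and what this sketch does not, is reasoning about contracts that mention each other.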
Head to head
| Paper | Insertion | Latency | Real adapters | Ships | fewwords role |
|---|---|---|---|---|---|
| ToolGate | Pre-exec gate | 12 ms | No | Prototype | We’re 1,200× faster, real adapters |
| Solver-Aided | SMT post-hoc | ~100 ms | No | Prototype | Their “specialised checker” is our engine |
| TraceSafe | LTL over trace | n/a (OCaml) | No | Prototype | Their Büchi construction is in our compiler |
| Verifiably Safe | Program logic | n/a | No | Foundations | Their compositionality is our per-agent baseline |
| fewwords | Pre-exec gate | 0.01 ms | Six formats | MIT | — |
Why academia doesn’t ship
This is a feature of the system, not a bug. A research group wins by proving a new thing. A product wins by plumbing an old thing into every framework, at latency that fits the hot path, with error messages a junior engineer can read. Academia doesn’t reward adapters and error messages. It rewards theorems.
Both sides are necessary. The papers validate the axis. The engineering closes the gap. If I were writing a YC app (I am writing a YC app), the line that matters is: four independent papers in one year converged on the abstraction we built on. We’re the repo-invite-today, PyPI-after-five-paying-customers version of a research consensus.
Read them
All four are on arXiv. All four are short enough to read in an hour. If you build agent infrastructure in 2026 and you haven’t read them, you are flying blind on what the formal community has already worked out.
If after reading them you think we’re wrong about something, email. The corpus is a shared moat and so is the argument.