For a week I paid Anthropic to grade agent trajectories for me.
$8.55 in API costs. 1,980 real trajectories from τ-bench. One reconciled reference built from Claude Opus 4.7 plus a human pass over the hard cases. Here is what Claude Sonnet 4.6 agreed on, where it disagreed, and what happened when I ran the same traces through fewwords.
The setup
τ-bench is the public agent-evaluation benchmark most people use if they’re publishing. I pulled 1,980 trajectories across the airline and retail domains. Every trajectory has a ground-truth action sequence. I built a reconciled reference by running Opus 4.7 over each trace with a detailed rubric, then hand-resolving disagreements on a 200-case random subsample. The numbers I’m about to report are agreement with that reference.
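For concreteness, the reconciliation step is just a merge where the human pass wins. A minimal sketch, assuming the Opus verdicts and the human overrides live in JSON files keyed by trace id (file names and verdict labels are illustrative, not what's in the repo):

```python
import json

def build_reference(opus_verdicts_path, human_overrides_path):
    """Merge model verdicts with human-resolved hard cases.

    opus_verdicts: {trace_id: "pass" | "fail"} from the Opus rubric run.
    human_overrides: {trace_id: "pass" | "fail"} for the 200-case subsample
    that was hand-checked and, where needed, corrected.
    """
    with open(opus_verdicts_path) as f:
        reference = json.load(f)
    with open(human_overrides_path) as f:
        overrides = json.load(f)
    reference.update(overrides)  # human verdicts win on the audited cases
    return reference
```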
Then I scored every one of the 1,980 traces two ways:
- Claude Sonnet 4.6 as trajectory judge. One prompt per trace. Rubric included. Temperature 0. Cost: $8.55 total, median 4.3 s per trace.
- fewwords. Contracts committed under benchmarks/tau-bench/contracts/. Same 1,980 traces. Median guard latency 0.11 ms. $0 API.
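On the judge side, each trace is one API call. A rough sketch with the Anthropic Python SDK; the model id, rubric text, and output format below are placeholders, not the exact prompt committed in the repo:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_trace(trace_text: str, rubric: str) -> str:
    """One prompt per trace, rubric included, temperature 0."""
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder model id
        max_tokens=1024,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"{rubric}\n\nTrajectory:\n{trace_text}\n\n"
                       'Respond with JSON: {"verdict": "pass" or "fail", "reason": "..."}',
        }],
    )
    return response.content[0].text
```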
What Sonnet 4.6 did
Sonnet 4.6 agreed with the reference on 54–62% of traces, depending on how strictly you parse its JSON. That range is not a rounding error. It’s the delta between “parse errors as wrong” (54%) and “parse errors as unknown” (62%). 8% of Sonnet’s outputs had a malformed field the benchmark script couldn’t read cleanly.
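Concretely, the two ends of that range are just two counting policies over the same verdicts. A sketch, assuming malformed judge outputs have already been mapped to None:

```python
def agreement(judge_verdicts, reference, strict=True):
    """judge_verdicts[trace_id] is "pass", "fail", or None (malformed JSON).

    strict=True  -> a parse error counts as disagreement ("wrong"): ~54%
    strict=False -> a parse error drops out of the denominator ("unknown"): ~62%
    """
    scored = [
        (tid, v) for tid, v in judge_verdicts.items()
        if strict or v is not None
    ]
    matches = sum(1 for tid, v in scored if v == reference[tid])
    return matches / len(scored)
```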
More useful than a headline number is the distribution:
| Failure class (sample) | Sonnet 4.6 agreement |
|---|---|
| Obvious violations (banned tool, order inversion) | ~82% |
| Missing prior-work (finish before search) | ~61% |
| Schema-thin outputs ({"results": []}) | ~44% |
| Prompt-injection in tool args | ~38% |
| Latency-pattern drift | n/a (judge cannot see this) |
Sonnet is fine at the obvious violations. It is markedly worse at the subtle sequencing failures. It is materially worse at prompt-injection, because it has to read the input string the way a human would, which is exactly what prompt-injection defeats. And it cannot see latency-pattern drift at all: the timing data is not in the trace text it reads, and a statistical baseline doesn't fit in a prompt.
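A deterministic check has no such problem; latency drift is arithmetic over timestamps the judge never sees. A sketch of one such check, with field names and threshold as my assumptions rather than the actual fewwords contract:

```python
def latency_drift(steps, baseline_p95_ms: float, factor: float = 3.0) -> bool:
    """Flag a trace whose inter-step gaps blow past a baseline p95.

    steps: ordered list of dicts with "started_at" / "ended_at" millisecond
    timestamps (field names are illustrative). Returns True if any gap
    between consecutive tool calls exceeds factor * baseline_p95_ms.
    """
    gaps = [
        nxt["started_at"] - cur["ended_at"]
        for cur, nxt in zip(steps, steps[1:])
    ]
    return any(gap > factor * baseline_p95_ms for gap in gaps)
```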
What fewwords did
Same 1,980 traces. Same reference. 80–82% agreement. Parse errors: 0. Median latency: 0.11 ms. Hard upper bound: 1 ms. API cost: $0. The contracts are in the repo. The results are in the repo. The τ-bench config that reproduces the run is in the repo.
| Failure class | fewwords agreement | Sonnet 4.6 agreement |
|---|---|---|
| Obvious violations | 99% | 82% |
| Missing prior-work | 91% | 61% |
| Schema-thin outputs | 97% | 44% |
| Prompt-injection in tool args | 94% | 38% |
| Latency drift | 86% | n/a |
Why the gap
Three reasons.
One. Sonnet 4.6 is a language model. Same fabric as the agent under test. When the agent’s reasoning is persuasive, the judge finds it persuasive. This is the self-grading problem. Swapping models doesn’t fix it; it shifts where the agreement fails.
Two. The judge is nondeterministic. Same trace, same prompt, same temperature, and a fraction of verdicts drift across runs. Real benchmark runs need three seeds per trace to estimate stability. A deterministic rule doesn’t have this problem because it is, by definition, stable.
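Putting a number on that drift looks something like the following, where judge_fn is whatever wraps the API call (names are illustrative):

```python
def verdict_flip_rate(judge_fn, traces, runs: int = 3) -> float:
    """Fraction of traces whose verdict is not identical across repeated runs."""
    unstable = 0
    for trace in traces:
        verdicts = {judge_fn(trace) for _ in range(runs)}
        if len(verdicts) > 1:
            unstable += 1
    return unstable / len(traces)
```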
Three. The judge has no shape that matches “this tool call is allowed right now, given the sequence that came before.” You can prompt it to emulate that shape, but you’re asking a general system to implement a specialised one in 4KB of context. Specialised systems win that fight every time.
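For contrast, the specialised shape is tiny. This is not fewwords' contract syntax, just a minimal sketch of the kind of sequencing rule it encodes, with illustrative tool names:

```python
BANNED_TOOLS = {"transfer_funds"}               # illustrative
REQUIRES_PRIOR = {"finish": {"search_orders"}}  # finish needs a prior search

def check_sequence(tool_calls):
    """Walk the trajectory once; fail on a banned tool or missing prior work.

    tool_calls: ordered list of tool names, e.g. ["search_orders", "finish"].
    Returns (ok, reason).
    """
    seen = set()
    for name in tool_calls:
        if name in BANNED_TOOLS:
            return False, f"banned tool: {name}"
        missing = REQUIRES_PRIOR.get(name, set()) - seen
        if missing:
            return False, f"{name} called before {sorted(missing)}"
        seen.add(name)
    return True, "ok"
```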
What this is not
This is not “Claude is bad at agent eval.” Claude Sonnet 4.6 is extremely good at open-ended judgment. It is an excellent LLM-as-judge where the rubric is “is this response helpful and on-tone.” That’s just not the rubric production agents fail on.
This is also not a claim that fewwords is finished. 80–82% is the number, not 100. The 18–20% gap is real failures, and I’ve written about what the benchmark doesn’t measure yet. The point is not that we’re perfect; it’s that the right unit of analysis is the trajectory, not the turn, and the right way to check a trajectory is deterministic, not probabilistic.
Reproducibility
benchmarks/tau-bench/ contains the trace loader, the reference-builder script, the Sonnet judge prompt, the fewwords contract pack, and the per-class breakdown code. The raw per-trace outputs are committed under benchmarks/results/. The total cost to rerun is $8.55 on Anthropic plus the compute to run fewwords, which is zero.
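If you want to recompute the per-class tables yourself, the breakdown is a straightforward group-by over the per-trace results. The field names below are my assumptions about that file's schema, not a guarantee of what's committed:

```python
from collections import defaultdict

def per_class_agreement(rows):
    """rows: dicts with "failure_class" and boolean "agrees_with_reference"."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in rows:
        cls = row["failure_class"]
        totals[cls] += 1
        hits[cls] += row["agrees_with_reference"]
    return {cls: hits[cls] / totals[cls] for cls in totals}
```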
$8.55, 1,980 traces, roughly 39,000× lower median latency (0.11 ms vs 4.3 s), and about 20 percentage points more agreement with the reference. The methodology is in the repo. If you find a bug in it, I want to know.