For a week I paid Anthropic to grade agent trajectories for me.
$8.55 in API costs. 1,980 real trajectories from τ-bench. One reconciled reference built from Claude Opus 4.7 plus a human pass over the hard cases. Here is what Claude Sonnet 4.6 agreed on, where it disagreed, and what happened when I ran the same traces through fewwords.
The setup
τ-bench is the public agent-evaluation benchmark most people use if they’re publishing. I pulled 1,980 trajectories across the airline and retail domains. Every trajectory has a ground-truth action sequence. I built a reconciled reference by running Opus 4.7 over each trace with a detailed rubric, then hand-resolving disagreements on a 200-case random subsample. The numbers I’m about to report are agreement with that reference.
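For concreteness, the reconciliation step is just a merge where the human pass wins. A minimal sketch, assuming the Opus verdicts and the human overrides live in JSON files keyed by trace id (file names and verdict labels are illustrative, not what's in the repo):

```python
import json

def build_reference(opus_verdicts_path, human_overrides_path):
    """Merge model verdicts with human-resolved hard cases.

    opus_verdicts: {trace_id: "pass" | "fail"} from the Opus rubric run.
    human_overrides: {trace_id: "pass" | "fail"} for the 200-case subsample
    that was hand-checked and, where needed, corrected.
    """
    with open(opus_verdicts_path) as f:
        reference = json.load(f)
    with open(human_overrides_path) as f:
        overrides = json.load(f)
    reference.update(overrides)  # human verdicts win on the audited cases
    return reference
```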
Then I scored every one of the 1,980 traces two ways:
- Claude Sonnet 4.6 as trajectory judge. One prompt per trace. Rubric included. Temperature 0. Cost: $8.55 total, median 4.3 s per trace.
- fewwords. Contracts committed under benchmarks/tau-bench/contracts/. Same 1,980 traces. Median guard latency 0.11 ms. $0 API.
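On the judge side, each trace is one API call. A rough sketch with the Anthropic Python SDK; the model id, rubric text, and output format below are placeholders, not the exact prompt committed in the repo:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_trace(trace_text: str, rubric: str) -> str:
    """One prompt per trace, rubric included, temperature 0."""
    response = client.messages.create(
        model="claude-sonnet-4-6",  # placeholder model id
        max_tokens=1024,
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"{rubric}\n\nTrajectory:\n{trace_text}\n\n"
                       'Respond with JSON: {"verdict": "pass" or "fail", "reason": "..."}',
        }],
    )
    return response.content[0].text
```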
What Sonnet 4.6 did
Sonnet 4.6 agreed with the reference on 54–62% of traces, depending on how strictly you parse its JSON. That range is not a rounding error. It’s the delta between “parse errors as wrong” (54%) and “parse errors as unknown” (62%). 8% of Sonnet’s outputs had a malformed field the benchmark script couldn’t read cleanly.
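Concretely, the two ends of that range are just two counting policies over the same verdicts. A sketch, assuming malformed judge outputs have already been mapped to None:

```python
def agreement(judge_verdicts, reference, strict=True):
    """judge_verdicts[trace_id] is "pass", "fail", or None (malformed JSON).

    strict=True  -> a parse error counts as disagreement ("wrong"): ~54%
    strict=False -> a parse error drops out of the denominator ("unknown"): ~62%
    """
    scored = [
        (tid, v) for tid, v in judge_verdicts.items()
        if strict or v is not None
    ]
    matches = sum(1 for tid, v in scored if v == reference[tid])
    return matches / len(scored)
```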
More useful than a headline number is the distribution:
| Failure class (sample) | Sonnet 4.6 agreement |
|---|---|
| Obvious violations (banned tool, order inversion) | ~82% |
| Missing prior-work (finish before search) | ~61% |
| Schema-thin outputs ({"results": []}) | ~44% |
| Prompt-injection in tool args | ~38% |
| Latency-pattern drift | n/a (judge cannot see this) |
Sonnet is fine at the obvious violations. It is markedly worse at the subtle sequencing failures. It is materially worse at prompt-injection, because it has to read the input string the way a human would, which is exactly what prompt-injection defeats. And it cannot see latency-pattern drift at all: the timing data is not in the trace text it reads, and a statistical baseline doesn't fit in a prompt.
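A deterministic check has no such problem; latency drift is arithmetic over timestamps the judge never sees. A sketch of one such check, with field names and threshold as my assumptions rather than the actual fewwords contract:

```python
def latency_drift(steps, baseline_p95_ms: float, factor: float = 3.0) -> bool:
    """Flag a trace whose inter-step gaps blow past a baseline p95.

    steps: ordered list of dicts with "started_at" / "ended_at" millisecond
    timestamps (field names are illustrative). Returns True if any gap
    between consecutive tool calls exceeds factor * baseline_p95_ms.
    """
    gaps = [
        nxt["started_at"] - cur["ended_at"]
        for cur, nxt in zip(steps, steps[1:])
    ]
    return any(gap > factor * baseline_p95_ms for gap in gaps)
```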
What fewwords did
Same 1,980 traces. Same reference. 80–82% agreement. Parse errors: 0. Median latency: 0.11 ms. Hard upper bound: 1 ms. API cost: $0. The contracts are in the repo. The results are in the repo. The τ-bench config that reproduces the run is in the repo.
| Failure class | fewwords agreement | Sonnet 4.6 agreement |
|---|---|---|
| Obvious violations | 99% | 82% |
| Missing prior-work | 91% | 61% |
| Schema-thin outputs | 97% | 44% |
| Prompt-injection in tool args | 94% | 38% |
| Latency drift | 86% | n/a |
Why the gap
Three reasons.
One. Sonnet 4.6 is a language model. Same fabric as the agent under test. When the agent’s reasoning is persuasive, the judge finds it persuasive. This is the self-grading problem. Swapping models doesn’t fix it; it shifts where the agreement fails.
Two. The judge is nondeterministic. Same trace, same prompt, same temperature, and a fraction of verdicts drift across runs. Real benchmark runs need three seeds per trace to estimate stability. A deterministic rule doesn’t have this problem because it is, by definition, stable.
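Putting a number on that drift looks something like the following, where judge_fn is whatever wraps the API call (names are illustrative):

```python
def verdict_flip_rate(judge_fn, traces, runs: int = 3) -> float:
    """Fraction of traces whose verdict is not identical across repeated runs."""
    unstable = 0
    for trace in traces:
        verdicts = {judge_fn(trace) for _ in range(runs)}
        if len(verdicts) > 1:
            unstable += 1
    return unstable / len(traces)
```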
Three. The judge has no shape that matches “this tool call is allowed right now, given the sequence that came before.” You can prompt it to emulate that shape, but you’re asking a general system to implement a specialised one in 4KB of context. Specialised systems win that fight every time.
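For contrast, the specialised shape is tiny. This is not fewwords' contract syntax, just a minimal sketch of the kind of sequencing rule it encodes, with illustrative tool names:

```python
BANNED_TOOLS = {"transfer_funds"}               # illustrative
REQUIRES_PRIOR = {"finish": {"search_orders"}}  # finish needs a prior search

def check_sequence(tool_calls):
    """Walk the trajectory once; fail on a banned tool or missing prior work.

    tool_calls: ordered list of tool names, e.g. ["search_orders", "finish"].
    Returns (ok, reason).
    """
    seen = set()
    for name in tool_calls:
        if name in BANNED_TOOLS:
            return False, f"banned tool: {name}"
        missing = REQUIRES_PRIOR.get(name, set()) - seen
        if missing:
            return False, f"{name} called before {sorted(missing)}"
        seen.add(name)
    return True, "ok"
```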
What this is not
This is not “Claude is bad at agent eval.” Claude Sonnet 4.6 is extremely good at open-ended judgment. It is an excellent LLM-as-judge where the rubric is “is this response helpful and on-tone.” That’s just not the rubric production agents fail on.
This is also not a claim that fewwords is finished. 80–82% is the number, not 100. The 18–20% gap is real failures, and I’ve written about what the benchmark doesn’t measure yet. The point is not that we’re perfect; it’s that the right unit of analysis is the trajectory, not the turn, and the right way to check a trajectory is deterministic, not probabilistic.
Reproducibility
benchmarks/tau-bench/ contains the trace loader, the reference-builder script, the Sonnet judge prompt, the fewwords contract pack, and the per-class breakdown code. The raw per-trace outputs are committed under benchmarks/results/. The total cost to rerun is $8.55 on Anthropic plus the compute to run fewwords, which is zero.
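If you want to recompute the per-class tables yourself, the breakdown is a straightforward group-by over the per-trace results. The field names below are my assumptions about that file's schema, not a guarantee of what's committed:

```python
from collections import defaultdict

def per_class_agreement(rows):
    """rows: dicts with "failure_class" and boolean "agrees_with_reference"."""
    totals, hits = defaultdict(int), defaultdict(int)
    for row in rows:
        cls = row["failure_class"]
        totals[cls] += 1
        hits[cls] += row["agrees_with_reference"]
    return {cls: hits[cls] / totals[cls] for cls in totals}
```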
$8.55, 1,980 traces, roughly 39,000× lower median latency (0.11 ms vs 4.3 s), and about 20 percentage points more agreement with the reference. The methodology is in the repo. If you find a bug in it, I want to know.