Claude Sonnet 4.6 as a judge, head-to-head with TrajEval + rigorous audit
Date: 2026-04-24. Cost: $4.66 (original) + $0.92 (2026-04-25 reruns) = $5.58 total. n/a
⚠ 2026-04-25 update: the numbers in this original writeup are on the parseable subset (257/300 traces; 43 parse failures at max_tokens=512). All 43 parse failures were rerun at max_tokens=1536 across two batches ($0.92 total). On the complete 300/300 data: Sonnet audit agreement is 56.7% (not the "54–62%" reported below on parseable), FPR is 68.3% (not the "~40%" referenced in subsequent summaries — see verification_report_2026-04-25.md for the forensics). Recall 85.6%, precision 52.0%, F1 0.647. The paragraph below with original numbers is preserved as historical artifact; use the LEADERBOARD + verification report for current numbers.
This is the partner-credibility answer to "are we actually better than LLM-as-judge, or just faster?". Three-way comparison on 300 τ-bench trajectories (150 airline + 150 retail, the same set as the rigorous audit).
Headline
- TrajEval matches our rigorous policy audit 80–82% of the time.
- Claude Sonnet 4.6 matches the same audit 54–62% of the time.
- TrajEval is 55,000× faster (0.11 ms vs 6,146 ms p50).
- TrajEval costs $0 per trace vs $0.0155 per trace for Claude Sonnet ($30 for the 1,980-trace corpus).
- Claude Sonnet failed to output valid JSON on 14% of traces (43/300) despite a strict-JSON system prompt, LLM judges aren't even reliable at adhering to output format.
Most important: TrajEval's 80% audit-agreement isn't "TrajEval validates TrajEval", our rigorous audit is an independent rule-based re-derivation from policy text, separate from TrajEval's runtime logic. Claude, given the same policy doc in-prompt, agrees with that independent audit less than TrajEval does.
Run parameters
- Model:
claude-sonnet-4-6via Anthropic API. - Prompt: strict JSON-schema system prompt + policy doc (cached) + trajectory.
- Concurrency: 8. No voting (single judgment per trace).
- Sample: 300 τ-bench trajectories matching the rigorous audit set.
- Harness: benchmarks/tau_bench/llm_judge.py.
- Per-trace JSONL: llm_judge_sonnet_2026-04-24.jsonl.
Three-way confusion matrix
Cells show count per domain; labels: T = TrajEval flag, J = Claude-Judge flag, A = rigorous-audit flag.
Airline (n=130 with usable judge output)
| Case | Count | Interpretation |
|---|---|---|
| TJA, all three agree VIOLATION | 55 | Clean wins, all detectors caught |
| ---, all three agree NO VIOLATION | 17 | Clean passes |
| -J-, only Claude flagged | 30 | Judge-unique detections (semantic / conditional) |
| TJ-, TrajEval + Claude flag, audit disagrees | 15 | Two detectors flag; independent audit says clean. Candidates for "TrajEval overbroad" OR "audit too strict" |
| T-A, audit + TrajEval agree, Claude missed | 5 | Claude hallucinated past real violations |
| T--, only TrajEval flagged | 8 | Likely tool_repeat safety-net flags (not policy rules) |
| -JA, Claude + audit flag, TrajEval missed | 0 | Zero cases. Claude has no policy-unique lead on our audit. |
| --A, audit only | 0 |
Retail (n=127 with usable judge output)
| Case | Count | Interpretation |
|---|---|---|
| TJA, all three agree VIOLATION | 45 | Clean wins |
| ---, all three agree NO VIOLATION | 19 | Clean passes |
| -J-, only Claude flagged | 22 | Judge-unique detections |
| TJ-, TrajEval + Claude flag, audit disagrees | 22 | Candidates for either TrajEval+Claude overreach OR audit too strict |
| T-A, audit + TrajEval agree, Claude missed | 15 | Claude hallucinated past 15 real retail violations |
| T--, only TrajEval flagged | 4 | tool_repeat safety-nets |
| -JA, Claude + audit flag, TrajEval missed | 0 | Zero cases. |
| --A, audit only | 0 |
What the zero -JA cell means
Across 300 traces in two domains, there is not a single case where Claude + the rigorous audit both flag a violation that TrajEval missed. If LLM-as-judge had semantic reach that rule-based TrajEval fundamentally lacked, we'd expect a populated -JA cell. We don't see one on this corpus.
The -J- cell (Claude-unique, 30 airline + 22 retail = 52 cases) could contain genuine semantic catches, OR Claude hallucinations. Without the audit corroborating, we don't know which. That's a legitimate gap to probe, but it's not a "LLM judge clearly beats us at detection" story.
Latency & cost
Latency per trajectory judgment
| TrajEval | Claude Sonnet 4.6 | Ratio | |
|---|---|---|---|
| p50 | 0.11 ms | 6,146 ms | 55,876× |
| p90 | 11,730 ms | ||
| p99 | 0.32 ms | 14,117 ms | 44,116× |
TrajEval is faster than Claude's network roundtrip alone. An agent doing 100 tool calls/session would add 100 × 6s = 10 minutes of judge latency with Claude; 100 × 0.1ms = 10ms with TrajEval.
Cost per trajectory judgment
| TrajEval | Claude Sonnet 4.6 | |
|---|---|---|
| Per trace | $0.0000 | $0.0155 |
| Per 1,980-trace corpus | $0.00 | ~$30 |
| Per 1M trace-judgments | $0.00 | ~$15,500 |
Claude pricing here reflects real token usage on τ-bench trajectories with prompt caching enabled, the optimistic case.
Reliability / JSON-formatted-output adherence
| TrajEval | Claude Sonnet 4.6 | |
|---|---|---|
| Parseable output rate | 100% (property of design) | 86% (43 parse failures out of 300) |
Every TrajEval verdict is a structured CheckResult by construction. Claude produced prose-before-JSON or analysis-only output on 43/300 traces despite a strict-JSON system prompt. A production deployment hitting this at scale would either lose 14% of judgments or need retry-loop engineering around the judge.
Determinism
| TrajEval | Claude Sonnet 4.6 | |
|---|---|---|
| Re-running on same corpus yields same verdicts | Always | No. Stochastic by design. |
| Verdict reproducible for audit/compliance | Yes | Requires seed + stable model version, and still probabilistic |
For regulated environments (healthcare, financial, legal) where "the same input must produce the same verdict in a post-incident audit," Claude-as-judge is not defensible. TrajEval is.
What Claude catches that TrajEval doesn't, labeled by Opus 4.7 (see opus_tiebreaker_2026-04-24.md)
Opus 4.7 adjudicated all 52 -J- cases at $3.89 cost. Result: 47 / 52 (90.4%) are confirmed real semantic violations that TrajEval + rule-based audit both missed. Only 5 / 52 (9.6%) are Sonnet hallucinations.
Category breakdown of the 47 confirmed semantic catches:
| Category | Count | TrajEval's gap |
|---|---|---|
| A. Implicit-vs-explicit consent | 39 | Our regex accepts "proceed"/"go ahead"/"sure"; strict policy says literal "(yes)". A strict_consent_only: true YAML flag closes most of this in 15 min. |
| B. Agent used data not provided by user | 5 | Requires postconditions on tool output (postconditions release primitive). |
| C. Proactive compensation offer | 2 | Behavioral-policy rule; needs capability labels + data-flow (capability labels territory). |
| D. Basic-economy-cannot-modify | 8 | Per-reservation-state conditional (postconditions release postconditions + typed state). |
| E. 25 distinct unique rules | 25 | Range of conditional / semantic policies; each a target for DSL extension. |
Recall revision: on the full policy universe (including conditional + semantic), TrajEval's recall is ~72%, not 100%. On the rule-expressible subset TrajEval's primitives cover, it stays 100%.
Honest updated competitive claim (post Opus tiebreaker)
"On 300 independently-audited τ-bench trajectories: TrajEval catches 100% of policy violations on the rule-expressible classes its primitives cover (ordering, HITL, banned tools) at 0.11 ms / \$0 / deterministic / 100% parseable. Claude Sonnet 4.6 achieves comparable recall on those classes but 55,000× slower, \$0.015/call, hallucinates 10% of its flags (Opus 4.7-corroborated), and fails JSON-format 14% of the time. Sonnet DOES catch ~28% of policy violations TrajEval misses, conditional / semantic / implicit-consent classes, exactly the postconditions release postconditions + capability labels data-flow roadmap items. The correct production architecture is TrajEval-first with optional LLM-judge fallback for semantic-only classes, not one-or-the-other."
This is the defensible claim. Partner-grade: transparent about where we dominate, transparent about where LLM has reach, with a clear product roadmap to close the gap.
Reproduce
export ANTHROPIC_API_KEY=... # rotate this key immediately
uv run python benchmarks/tau_bench/llm_judge.py \
--audit benchmarks/results/rigorous_audit_strict.jsonl \
--concurrency 8 \
--out benchmarks/results/llm_judge_sonnet_2026-04-24.jsonl
Next gates for partner-grade rigor (optional follow-ups)
- Label the 52 Claude-unique flags. Either hand-label them (~30 min) or run Opus 4.7 on them as a stronger judge ($2-3). This resolves whether TrajEval has a real recall gap or Claude is hallucinating.
- Retry the 43 parse failures with output-prefilling (
content: "{"). Should bring parseable rate to 100%. Cost: ~$0.70. - Repeat Sonnet run 3× for voting to measure Claude's self-consistency (and show determinism gap). Cost: ~$10.
- Opus 4.7 on the same 300 as an upper-bound judge benchmark. Cost: ~$10.
- Scale to full 1,980 τ-bench: Sonnet 4.6 runs ~$30, confirms metrics at population scale.
Source: benchmarks/results/llm_judge_head_to_head_2026-04-24.md.
Back to the benchmark index or see the
landing page summary.