Benchmark writeup

Claude Sonnet 4.6 as a judge, head-to-head with TrajEval + rigorous audit

Date: 2026-04-24. Cost: $4.66 (original) + $0.92 (2026-04-25 reruns) = $5.58 total.

⚠ 2026-04-25 update: the numbers in this original writeup cover only the parseable subset (257/300 traces; 43 parse failures at max_tokens=512). All 43 parse failures were rerun at max_tokens=1536 across two batches ($0.92 total). On the complete 300/300 data: Sonnet audit agreement is 56.7% (not the "54–62%" reported below on the parseable subset), FPR is 68.3% (not the "~40%" referenced in subsequent summaries; see verification_report_2026-04-25.md for the forensics). Recall 85.6%, precision 52.0%, F1 0.647. The body below, with the original numbers, is preserved as a historical artifact; use the LEADERBOARD + verification report for current numbers.
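
For transparency on how those numbers are derived: agreement, FPR, precision, recall, and F1 all come from a per-trace confusion matrix of the judge's flags against the rigorous-audit flags. A minimal sketch (the field names are illustrative, not the schema of the JSONL artifacts):

```python
def judge_vs_audit_metrics(judge: list[bool], audit: list[bool]) -> dict:
    """Confusion-matrix metrics for one judge against the rigorous audit.

    `judge` / `audit` are per-trace violation flags; the names are
    illustrative, not the actual JSONL schema.
    """
    tp = sum(j and a for j, a in zip(judge, audit))
    fp = sum(j and not a for j, a in zip(judge, audit))
    fn = sum(not j and a for j, a in zip(judge, audit))
    tn = sum(not j and not a for j, a in zip(judge, audit))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "agreement": (tp + tn) / len(judge),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # share of audit-clean traces the judge flags
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```

As a consistency check, F1 = 2 · 0.520 · 0.856 / (0.520 + 0.856) ≈ 0.647, matching the reported value.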

This is the partner-credibility answer to "are we actually better than LLM-as-judge, or just faster?". It is a three-way comparison on 300 τ-bench trajectories (150 airline + 150 retail, the same set as the rigorous audit).

Headline

  1. TrajEval matches our rigorous policy audit 80–82% of the time.
  2. Claude Sonnet 4.6 matches the same audit 54–62% of the time.
  3. TrajEval is 55,000× faster (0.11 ms vs 6,146 ms p50).
  4. TrajEval costs $0 per trace vs $0.0155 per trace for Claude Sonnet ($30 for the 1,980-trace corpus).
  5. Claude Sonnet failed to output valid JSON on 14% of traces (43/300) despite a strict-JSON system prompt; LLM judges aren't even reliable at adhering to the output format.

Most important: TrajEval's 80% audit-agreement isn't "TrajEval validates TrajEval": our rigorous audit is an independent rule-based re-derivation from policy text, separate from TrajEval's runtime logic. Claude, given the same policy doc in-prompt, agrees with that independent audit less than TrajEval does.

Run parameters

Three-way confusion matrix

Cells show counts per domain; labels: T = TrajEval flag, J = Claude-Judge flag, A = rigorous-audit flag. A short tabulation sketch follows the two tables.

Airline (n=130 with usable judge output)

| Case | Count | Interpretation |
|------|------:|----------------|
| TJA: all three agree VIOLATION | 55 | Clean wins; all detectors caught it |
| ---: all three agree NO VIOLATION | 17 | Clean passes |
| -J-: only Claude flagged | 30 | Judge-unique detections (semantic / conditional) |
| TJ-: TrajEval + Claude flag, audit disagrees | 15 | Two detectors flag; the independent audit says clean. Candidates for "TrajEval overbroad" OR "audit too strict" |
| T-A: TrajEval + audit agree, Claude missed | 5 | Claude hallucinated past real violations |
| T--: only TrajEval flagged | 8 | Likely tool_repeat safety-net flags (not policy rules) |
| -JA: Claude + audit flag, TrajEval missed | 0 | Zero cases. Claude has no policy-unique lead over our audit. |
| --A: audit only | 0 | |

Retail (n=127 with usable judge output)

| Case | Count | Interpretation |
|------|------:|----------------|
| TJA: all three agree VIOLATION | 45 | Clean wins |
| ---: all three agree NO VIOLATION | 19 | Clean passes |
| -J-: only Claude flagged | 22 | Judge-unique detections |
| TJ-: TrajEval + Claude flag, audit disagrees | 22 | Candidates for either TrajEval + Claude overreach OR audit too strict |
| T-A: TrajEval + audit agree, Claude missed | 15 | Claude hallucinated past 15 real retail violations |
| T--: only TrajEval flagged | 4 | tool_repeat safety-nets |
| -JA: Claude + audit flag, TrajEval missed | 0 | Zero cases |
| --A: audit only | 0 | |
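
The cell counts in both tables reduce to a three-character pattern per trace. A minimal tabulation sketch, assuming per-trace boolean flags with illustrative key names (not the artifact schema):

```python
from collections import Counter

def three_way_cells(traces: list[dict]) -> Counter:
    """Count T/J/A pattern strings like 'TJA', '-J-', 'T--' from per-trace flags.

    Assumes each trace dict has boolean keys 'trajeval', 'judge', 'audit';
    the key names are illustrative, not the artifact schema.
    """
    cells = Counter()
    for t in traces:
        pattern = (
            ("T" if t["trajeval"] else "-")
            + ("J" if t["judge"] else "-")
            + ("A" if t["audit"] else "-")
        )
        cells[pattern] += 1
    return cells

# e.g. three_way_cells(airline_traces)["-JA"] == 0 is the key claim discussed next
```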

What the zero -JA cell means

Across 300 traces in two domains, there is not a single case where Claude + the rigorous audit both flag a violation that TrajEval missed. If LLM-as-judge had semantic reach that rule-based TrajEval fundamentally lacked, we'd expect a populated -JA cell. We don't see one on this corpus.

The -J- cell (Claude-unique, 30 airline + 22 retail = 52 cases) could contain genuine semantic catches OR Claude hallucinations. Without the audit corroborating, we don't know which. That's a legitimate gap to probe (it is resolved below with the Opus 4.7 tiebreaker), but it's not an "LLM judge clearly beats us at detection" story.

Latency & cost

Latency per trajectory judgment

| Percentile | TrajEval | Claude Sonnet 4.6 | Ratio |
|------------|---------:|------------------:|------:|
| p50 | 0.11 ms | 6,146 ms | 55,876× |
| p90 | | 11,730 ms | |
| p99 | 0.32 ms | 14,117 ms | 44,116× |

TrajEval is faster than Claude's network roundtrip alone. An agent doing 100 tool calls per session would add 100 × ~6 s ≈ 10 minutes of judge latency with Claude, versus 100 × ~0.1 ms ≈ 10 ms with TrajEval.
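
The same arithmetic, parameterized so you can plug in your own session shape (latencies taken from the p50 row above):

```python
# Back-of-envelope judge overhead per agent session, using the p50 latencies above.
CLAUDE_P50_MS = 6_146.0
TRAJEVAL_P50_MS = 0.11

def session_overhead_seconds(judgments_per_session: int, per_judgment_ms: float) -> float:
    return judgments_per_session * per_judgment_ms / 1000.0

print(session_overhead_seconds(100, CLAUDE_P50_MS) / 60)  # ~10.2 minutes of Claude judge latency
print(session_overhead_seconds(100, TRAJEVAL_P50_MS))     # ~0.011 s (about 11 ms) with TrajEval
```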

Cost per trajectory judgment

| | TrajEval | Claude Sonnet 4.6 |
|---|---:|---:|
| Per trace | $0.0000 | $0.0155 |
| Per 1,980-trace corpus | $0.00 | ~$30 |
| Per 1M trace-judgments | $0.00 | ~$15,500 |

Claude pricing here reflects real token usage on τ-bench trajectories with prompt caching enabled (the optimistic case).
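
The corpus and per-million figures in the table are straight multiples of that measured per-trace price:

```python
# Cost projection from the measured per-trace price (Sonnet 4.6, prompt caching on).
COST_PER_TRACE_USD = 0.0155

print(f"1,980-trace corpus: ${COST_PER_TRACE_USD * 1_980:,.2f}")      # ≈ $30.69, i.e. ~$30
print(f"1M trace-judgments: ${COST_PER_TRACE_USD * 1_000_000:,.0f}")  # $15,500
```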

Reliability / JSON-formatted-output adherence

| | TrajEval | Claude Sonnet 4.6 |
|---|---|---|
| Parseable output rate | 100% (property of the design) | 86% (43 parse failures out of 300) |

Every TrajEval verdict is a structured CheckResult by construction. Claude produced prose-before-JSON or analysis-only output on 43/300 traces despite a strict-JSON system prompt. A production deployment hitting this at scale would either lose 14% of judgments or need retry-loop engineering around the judge.
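
For a sense of what that retry-loop engineering looks like, here is the shape of a salvage wrapper a deployment would likely bolt onto the judge. This is an illustrative sketch, not part of the benchmark harness, and it only handles the prose-before-JSON failure mode:

```python
import json
import re

def salvage_json(raw: str) -> dict | None:
    """Try to pull a JSON object out of prose-contaminated judge output."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost brace span (first '{' to last '}') in the text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```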

Determinism

| | TrajEval | Claude Sonnet 4.6 |
|---|---|---|
| Re-running the same corpus yields the same verdicts | Always | No; stochastic by design |
| Verdict reproducible for audit/compliance | Yes | Requires a pinned seed + stable model version, and is still probabilistic |

For regulated environments (healthcare, financial, legal) where "the same input must produce the same verdict in a post-incident audit," Claude-as-judge is not defensible. TrajEval is.

What Claude catches that TrajEval doesn't, labeled by Opus 4.7 (see opus_tiebreaker_2026-04-24.md)

Opus 4.7 adjudicated all 52 -J- cases at a cost of $3.89. Result: 47/52 (90.4%) are confirmed real semantic violations that TrajEval and the rule-based audit both missed. Only 5/52 (9.6%) are Sonnet hallucinations.

Category breakdown of the 47 confirmed semantic catches:

| Category | Count | TrajEval's gap |
|----------|------:|----------------|
| A. Implicit-vs-explicit consent | 39 | Our regex accepts "proceed" / "go ahead" / "sure"; the strict policy requires a literal "yes". A strict_consent_only: true YAML flag closes most of this in 15 min (sketch after this table). |
| B. Agent used data not provided by the user | 5 | Requires postconditions on tool output (postconditions release primitive). |
| C. Proactive compensation offer | 2 | Behavioral-policy rule; needs capability labels + data-flow (capability-labels territory). |
| D. Basic-economy-cannot-modify | 8 | Per-reservation-state conditional (postconditions release: postconditions + typed state). |
| E. Other distinct rules (25 in total) | 25 | A range of conditional / semantic policies; each a target for DSL extension. |
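
To make category A concrete: the gap is a consent regex that accepts soft affirmations where the policy demands a literal "yes". A hedged sketch of what the proposed strict_consent_only toggle could look like; the function and flag are roadmap illustrations, not shipped TrajEval API:

```python
import re

# Current behavior: soft affirmations count as consent.
LOOSE_CONSENT = re.compile(r"\b(yes|proceed|go ahead|sure|ok(ay)?)\b", re.IGNORECASE)
# Strict mode: only a literal "yes" satisfies the policy.
STRICT_CONSENT = re.compile(r"\byes\b", re.IGNORECASE)

def user_consented(message: str, strict_consent_only: bool = False) -> bool:
    """Category-A fix sketch: under strict mode, only a literal 'yes' counts as consent."""
    pattern = STRICT_CONSENT if strict_consent_only else LOOSE_CONSENT
    return bool(pattern.search(message))
```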

Recall revision: on the full policy universe (including conditional + semantic), TrajEval's recall is ~72%, not 100%. On the rule-expressible subset TrajEval's primitives cover, it stays 100%.

Honest updated competitive claim (post Opus tiebreaker)

"On 300 independently-audited τ-bench trajectories: TrajEval catches 100% of policy violations on the rule-expressible classes its primitives cover (ordering, HITL, banned tools) at 0.11 ms / \$0 / deterministic / 100% parseable. Claude Sonnet 4.6 achieves comparable recall on those classes but 55,000× slower, \$0.015/call, hallucinates 10% of its flags (Opus 4.7-corroborated), and fails JSON-format 14% of the time. Sonnet DOES catch ~28% of policy violations TrajEval misses, conditional / semantic / implicit-consent classes, exactly the postconditions release postconditions + capability labels data-flow roadmap items. The correct production architecture is TrajEval-first with optional LLM-judge fallback for semantic-only classes, not one-or-the-other."

This is the defensible claim. Partner-grade: transparent about where we dominate, transparent about where the LLM has reach, and with a clear product roadmap to close the gap.
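
A minimal sketch of that TrajEval-first tiering; trajeval_check and llm_judge are illustrative placeholders, not real APIs:

```python
from typing import Callable

def tiered_verdict(
    trace: dict,
    trajeval_check: Callable[[dict], bool],  # deterministic, ~0.1 ms, $0
    llm_judge: Callable[[dict], bool],       # stochastic, ~6 s, ~$0.015 per trace
    semantic_classes_enabled: bool = True,
) -> dict:
    """Run the deterministic checker first; escalate only clean traces to the LLM judge."""
    if trajeval_check(trace):
        return {"violation": True, "source": "trajeval"}
    if semantic_classes_enabled and llm_judge(trace):
        # Semantic-only flag: keep it labeled as uncorroborated by the rule layer.
        return {"violation": True, "source": "llm_judge", "needs_corroboration": True}
    return {"violation": False, "source": "trajeval"}
```

The judge's latency and cost are paid only on traces the deterministic layer passes clean, and its semantic-only flags stay tagged as uncorroborated.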

Reproduce

```bash
export ANTHROPIC_API_KEY=...   # rotate this key immediately
uv run python benchmarks/tau_bench/llm_judge.py \
    --audit benchmarks/results/rigorous_audit_strict.jsonl \
    --concurrency 8 \
    --out benchmarks/results/llm_judge_sonnet_2026-04-24.jsonl
```

Next gates for partner-grade rigor (optional follow-ups)

  1. Label the 52 Claude-unique flags. Either hand-label them (~30 min) or run Opus 4.7 on them as a stronger judge ($2-3). Done: the Opus 4.7 tiebreaker above adjudicated all 52 and confirmed a real semantic-class recall gap (47 genuine catches, 5 hallucinations).
  2. Retry the 43 parse failures with output prefilling (content: "{"); see the sketch after this list. Should bring the parseable rate to 100%. Cost: ~$0.70. (The 43 failures have since been rerun at max_tokens=1536; see the 2026-04-25 update at the top.)
  3. Repeat the Sonnet run 3× with voting to measure Claude's self-consistency (and to show the determinism gap). Cost: ~$10.
  4. Run Opus 4.7 on the same 300 traces as an upper-bound judge benchmark. Cost: ~$10.
  5. Scale to the full 1,980-trace τ-bench: a Sonnet 4.6 run costs ~$30 and confirms the metrics at population scale.
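
For item 2, output prefilling means seeding the assistant turn with an opening brace so the model has to continue the JSON object rather than preface it with prose. A sketch with the Anthropic Python SDK; the model ID, prompt contents, and trajectory variable are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

trajectory_text = "<serialized trajectory goes here>"  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; substitute the judge model under test
    max_tokens=1536,
    system="Return ONLY a JSON object with keys 'violation' and 'reason'.",
    messages=[
        {"role": "user", "content": trajectory_text},
        {"role": "assistant", "content": "{"},  # prefill: the reply must continue the JSON object
    ],
)
verdict_json = "{" + response.content[0].text  # re-attach the prefilled brace before parsing
```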

Source: benchmarks/results/llm_judge_head_to_head_2026-04-24.md. Back to the benchmark index or see the landing page summary.

Raw per-trace JSONL artifacts (the inputs you'd spot-check to sanity-check our numbers) are downloadable on the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access — email for clone access. The fully-reproducible leaderboard with multi-rater Fleiss’ kappa lands by 2026-05-15.