Benchmark writeup

Claude Sonnet 4.6 as a judge, head-to-head with TrajEval + rigorous audit

Date: 2026-04-24. Cost: $4.66 (original) + $0.92 (2026-04-25 reruns) = $5.58 total.

⚠ 2026-04-25 update: the numbers in this original writeup cover only the parseable subset (257/300 traces; 43 parse failures at max_tokens=512). All 43 parse failures were rerun at max_tokens=1536 across two batches ($0.92 total). On the complete 300/300 data: Sonnet audit agreement is 56.7% (not the "54–62%" reported below on the parseable subset), FPR is 68.3% (not the "~40%" referenced in subsequent summaries; see verification_report_2026-04-25.md for the forensics). Recall 85.6%, precision 52.0%, F1 0.647. The body below, with the original numbers, is preserved as a historical artifact; use the LEADERBOARD + verification report for current numbers.
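
For transparency on how those numbers are derived: agreement, FPR, precision, recall, and F1 all come from a per-trace confusion matrix of the judge's flags against the rigorous-audit flags. A minimal sketch (the field names are illustrative, not the schema of the JSONL artifacts):

```python
def judge_vs_audit_metrics(judge: list[bool], audit: list[bool]) -> dict:
    """Confusion-matrix metrics for one judge against the rigorous audit.

    `judge` / `audit` are per-trace violation flags; the names are
    illustrative, not the actual JSONL schema.
    """
    tp = sum(j and a for j, a in zip(judge, audit))
    fp = sum(j and not a for j, a in zip(judge, audit))
    fn = sum(not j and a for j, a in zip(judge, audit))
    tn = sum(not j and not a for j, a in zip(judge, audit))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return {
        "agreement": (tp + tn) / len(judge),
        "fpr": fp / (fp + tn) if fp + tn else 0.0,  # share of audit-clean traces the judge flags
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall) if precision + recall else 0.0,
    }
```

As a consistency check, F1 = 2 · 0.520 · 0.856 / (0.520 + 0.856) ≈ 0.647, matching the reported value.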

This is the partner-credibility answer to "are we actually better than LLM-as-judge, or just faster?". It is a three-way comparison on 300 τ-bench trajectories (150 airline + 150 retail, the same set as the rigorous audit).

Headline

  1. TrajEval matches our rigorous policy audit 80–82% of the time.
  2. Claude Sonnet 4.6 matches the same audit 54–62% of the time.
  3. TrajEval is 55,000× faster (0.11 ms vs 6,146 ms p50).
  4. TrajEval costs $0 per trace vs $0.0155 per trace for Claude Sonnet ($30 for the 1,980-trace corpus).
  5. Claude Sonnet failed to output valid JSON on 14% of traces (43/300) despite a strict-JSON system prompt; LLM judges aren't even reliable at adhering to the output format.

Most important: TrajEval's 80% audit-agreement isn't "TrajEval validates TrajEval": our rigorous audit is an independent rule-based re-derivation from policy text, separate from TrajEval's runtime logic. Claude, given the same policy doc in-prompt, agrees with that independent audit less than TrajEval does.

Run parameters

Three-way confusion matrix

Cells show counts per domain; labels: T = TrajEval flag, J = Claude-Judge flag, A = rigorous-audit flag. A short tabulation sketch follows the two tables.

Airline (n=130 with usable judge output)

| Case | Count | Interpretation |
|------|------:|----------------|
| TJA: all three agree VIOLATION | 55 | Clean wins; all detectors caught it |
| ---: all three agree NO VIOLATION | 17 | Clean passes |
| -J-: only Claude flagged | 30 | Judge-unique detections (semantic / conditional) |
| TJ-: TrajEval + Claude flag, audit disagrees | 15 | Two detectors flag; the independent audit says clean. Candidates for "TrajEval overbroad" OR "audit too strict" |
| T-A: TrajEval + audit agree, Claude missed | 5 | Claude hallucinated past real violations |
| T--: only TrajEval flagged | 8 | Likely tool_repeat safety-net flags (not policy rules) |
| -JA: Claude + audit flag, TrajEval missed | 0 | Zero cases. Claude has no policy-unique lead over our audit. |
| --A: audit only | 0 | |

Retail (n=127 with usable judge output)

| Case | Count | Interpretation |
|------|------:|----------------|
| TJA: all three agree VIOLATION | 45 | Clean wins |
| ---: all three agree NO VIOLATION | 19 | Clean passes |
| -J-: only Claude flagged | 22 | Judge-unique detections |
| TJ-: TrajEval + Claude flag, audit disagrees | 22 | Candidates for either TrajEval + Claude overreach OR audit too strict |
| T-A: TrajEval + audit agree, Claude missed | 15 | Claude hallucinated past 15 real retail violations |
| T--: only TrajEval flagged | 4 | tool_repeat safety-nets |
| -JA: Claude + audit flag, TrajEval missed | 0 | Zero cases |
| --A: audit only | 0 | |
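
The cell counts in both tables reduce to a three-character pattern per trace. A minimal tabulation sketch, assuming per-trace boolean flags with illustrative key names (not the artifact schema):

```python
from collections import Counter

def three_way_cells(traces: list[dict]) -> Counter:
    """Count T/J/A pattern strings like 'TJA', '-J-', 'T--' from per-trace flags.

    Assumes each trace dict has boolean keys 'trajeval', 'judge', 'audit';
    the key names are illustrative, not the artifact schema.
    """
    cells = Counter()
    for t in traces:
        pattern = (
            ("T" if t["trajeval"] else "-")
            + ("J" if t["judge"] else "-")
            + ("A" if t["audit"] else "-")
        )
        cells[pattern] += 1
    return cells

# e.g. three_way_cells(airline_traces)["-JA"] == 0 is the key claim discussed next
```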

What the zero -JA cell means

Across 300 traces in two domains, there is not a single case where Claude + the rigorous audit both flag a violation that TrajEval missed. If LLM-as-judge had semantic reach that rule-based TrajEval fundamentally lacked, we'd expect a populated -JA cell. We don't see one on this corpus.

The -J- cell (Claude-unique, 30 airline + 22 retail = 52 cases) could contain genuine semantic catches OR Claude hallucinations. Without the audit corroborating, we don't know which. That's a legitimate gap to probe (it is resolved below with the Opus 4.7 tiebreaker), but it's not an "LLM judge clearly beats us at detection" story.

Latency & cost

Latency per trajectory judgment

| Percentile | TrajEval | Claude Sonnet 4.6 | Ratio |
|------------|---------:|------------------:|------:|
| p50 | 0.11 ms | 6,146 ms | 55,876× |
| p90 | | 11,730 ms | |
| p99 | 0.32 ms | 14,117 ms | 44,116× |

TrajEval is faster than Claude's network roundtrip alone. An agent doing 100 tool calls per session would add 100 × ~6 s ≈ 10 minutes of judge latency with Claude, versus 100 × ~0.1 ms ≈ 10 ms with TrajEval.
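
The same arithmetic, parameterized so you can plug in your own session shape (latencies taken from the p50 row above):

```python
# Back-of-envelope judge overhead per agent session, using the p50 latencies above.
CLAUDE_P50_MS = 6_146.0
TRAJEVAL_P50_MS = 0.11

def session_overhead_seconds(judgments_per_session: int, per_judgment_ms: float) -> float:
    return judgments_per_session * per_judgment_ms / 1000.0

print(session_overhead_seconds(100, CLAUDE_P50_MS) / 60)  # ~10.2 minutes of Claude judge latency
print(session_overhead_seconds(100, TRAJEVAL_P50_MS))     # ~0.011 s (about 11 ms) with TrajEval
```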

Cost per trajectory judgment

| | TrajEval | Claude Sonnet 4.6 |
|---|---:|---:|
| Per trace | $0.0000 | $0.0155 |
| Per 1,980-trace corpus | $0.00 | ~$30 |
| Per 1M trace-judgments | $0.00 | ~$15,500 |

Claude pricing here reflects real token usage on τ-bench trajectories with prompt caching enabled (the optimistic case).
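
The corpus and per-million figures in the table are straight multiples of that measured per-trace price:

```python
# Cost projection from the measured per-trace price (Sonnet 4.6, prompt caching on).
COST_PER_TRACE_USD = 0.0155

print(f"1,980-trace corpus: ${COST_PER_TRACE_USD * 1_980:,.2f}")      # ≈ $30.69, i.e. ~$30
print(f"1M trace-judgments: ${COST_PER_TRACE_USD * 1_000_000:,.0f}")  # $15,500
```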

Reliability / JSON-formatted-output adherence

| | TrajEval | Claude Sonnet 4.6 |
|---|---|---|
| Parseable output rate | 100% (property of the design) | 86% (43 parse failures out of 300) |

Every TrajEval verdict is a structured CheckResult by construction. Claude produced prose-before-JSON or analysis-only output on 43/300 traces despite a strict-JSON system prompt. A production deployment hitting this at scale would either lose 14% of judgments or need retry-loop engineering around the judge.
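
For a sense of what that retry-loop engineering looks like, here is the shape of a salvage wrapper a deployment would likely bolt onto the judge. This is an illustrative sketch, not part of the benchmark harness, and it only handles the prose-before-JSON failure mode:

```python
import json
import re

def salvage_json(raw: str) -> dict | None:
    """Try to pull a JSON object out of prose-contaminated judge output."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        pass
    # Fall back to the outermost brace span (first '{' to last '}') in the text.
    match = re.search(r"\{.*\}", raw, flags=re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            return None
    return None
```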

Determinism

| | TrajEval | Claude Sonnet 4.6 |
|---|---|---|
| Re-running the same corpus yields the same verdicts | Always | No; stochastic by design |
| Verdict reproducible for audit/compliance | Yes | Requires a pinned seed + stable model version, and is still probabilistic |

For regulated environments (healthcare, financial, legal) where "the same input must produce the same verdict in a post-incident audit," Claude-as-judge is not defensible. TrajEval is.

What Claude catches that TrajEval doesn't, labeled by Opus 4.7 (see opus_tiebreaker_2026-04-24.md)

Opus 4.7 adjudicated all 52 -J- cases at a cost of $3.89. Result: 47/52 (90.4%) are confirmed real semantic violations that TrajEval and the rule-based audit both missed. Only 5/52 (9.6%) are Sonnet hallucinations.

Category breakdown of the 47 confirmed semantic catches:

| Category | Count | TrajEval's gap |
|----------|------:|----------------|
| A. Implicit-vs-explicit consent | 39 | Our regex accepts "proceed" / "go ahead" / "sure"; the strict policy requires a literal "yes". A strict_consent_only: true YAML flag closes most of this in 15 min (sketch after this table). |
| B. Agent used data not provided by the user | 5 | Requires postconditions on tool output (postconditions release primitive). |
| C. Proactive compensation offer | 2 | Behavioral-policy rule; needs capability labels + data-flow (capability-labels territory). |
| D. Basic-economy-cannot-modify | 8 | Per-reservation-state conditional (postconditions release: postconditions + typed state). |
| E. Other distinct rules (25 in total) | 25 | A range of conditional / semantic policies; each a target for DSL extension. |
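
To make category A concrete: the gap is a consent regex that accepts soft affirmations where the policy demands a literal "yes". A hedged sketch of what the proposed strict_consent_only toggle could look like; the function and flag are roadmap illustrations, not shipped TrajEval API:

```python
import re

# Current behavior: soft affirmations count as consent.
LOOSE_CONSENT = re.compile(r"\b(yes|proceed|go ahead|sure|ok(ay)?)\b", re.IGNORECASE)
# Strict mode: only a literal "yes" satisfies the policy.
STRICT_CONSENT = re.compile(r"\byes\b", re.IGNORECASE)

def user_consented(message: str, strict_consent_only: bool = False) -> bool:
    """Category-A fix sketch: under strict mode, only a literal 'yes' counts as consent."""
    pattern = STRICT_CONSENT if strict_consent_only else LOOSE_CONSENT
    return bool(pattern.search(message))
```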

Recall revision: on the full policy universe (including conditional + semantic), TrajEval's recall is ~72%, not 100%. On the rule-expressible subset TrajEval's primitives cover, it stays 100%.

Honest updated competitive claim (post Opus tiebreaker)

"On 300 independently-audited τ-bench trajectories: TrajEval catches 100% of policy violations on the rule-expressible classes its primitives cover (ordering, HITL, banned tools) at 0.11 ms / \$0 / deterministic / 100% parseable. Claude Sonnet 4.6 achieves comparable recall on those classes but 55,000× slower, \$0.015/call, hallucinates 10% of its flags (Opus 4.7-corroborated), and fails JSON-format 14% of the time. Sonnet DOES catch ~28% of policy violations TrajEval misses, conditional / semantic / implicit-consent classes, exactly the postconditions release postconditions + capability labels data-flow roadmap items. The correct production architecture is TrajEval-first with optional LLM-judge fallback for semantic-only classes, not one-or-the-other."

This is the defensible claim. Partner-grade: transparent about where we dominate, transparent about where the LLM has reach, and with a clear product roadmap to close the gap.
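
A minimal sketch of that TrajEval-first tiering; trajeval_check and llm_judge are illustrative placeholders, not real APIs:

```python
from typing import Callable

def tiered_verdict(
    trace: dict,
    trajeval_check: Callable[[dict], bool],  # deterministic, ~0.1 ms, $0
    llm_judge: Callable[[dict], bool],       # stochastic, ~6 s, ~$0.015 per trace
    semantic_classes_enabled: bool = True,
) -> dict:
    """Run the deterministic checker first; escalate only clean traces to the LLM judge."""
    if trajeval_check(trace):
        return {"violation": True, "source": "trajeval"}
    if semantic_classes_enabled and llm_judge(trace):
        # Semantic-only flag: keep it labeled as uncorroborated by the rule layer.
        return {"violation": True, "source": "llm_judge", "needs_corroboration": True}
    return {"violation": False, "source": "trajeval"}
```

The judge's latency and cost are paid only on traces the deterministic layer passes clean, and its semantic-only flags stay tagged as uncorroborated.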

Reproduce

```bash
export ANTHROPIC_API_KEY=...   # rotate this key immediately
uv run python benchmarks/tau_bench/llm_judge.py \
    --audit benchmarks/results/rigorous_audit_strict.jsonl \
    --concurrency 8 \
    --out benchmarks/results/llm_judge_sonnet_2026-04-24.jsonl
```

Next gates for partner-grade rigor (optional follow-ups)

  1. Label the 52 Claude-unique flags. Either hand-label them (~30 min) or run Opus 4.7 on them as a stronger judge ($2-3). Done: the Opus 4.7 tiebreaker above adjudicated all 52 and confirmed a real semantic-class recall gap (47 genuine catches, 5 hallucinations).
  2. Retry the 43 parse failures with output prefilling (content: "{"); see the sketch after this list. Should bring the parseable rate to 100%. Cost: ~$0.70. (The 43 failures have since been rerun at max_tokens=1536; see the 2026-04-25 update at the top.)
  3. Repeat the Sonnet run 3× with voting to measure Claude's self-consistency (and to show the determinism gap). Cost: ~$10.
  4. Run Opus 4.7 on the same 300 traces as an upper-bound judge benchmark. Cost: ~$10.
  5. Scale to the full 1,980-trace τ-bench: a Sonnet 4.6 run costs ~$30 and confirms the metrics at population scale.
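
For item 2, output prefilling means seeding the assistant turn with an opening brace so the model has to continue the JSON object rather than preface it with prose. A sketch with the Anthropic Python SDK; the model ID, prompt contents, and trajectory variable are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

trajectory_text = "<serialized trajectory goes here>"  # placeholder

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; substitute the judge model under test
    max_tokens=1536,
    system="Return ONLY a JSON object with keys 'violation' and 'reason'.",
    messages=[
        {"role": "user", "content": trajectory_text},
        {"role": "assistant", "content": "{"},  # prefill: the reply must continue the JSON object
    ],
)
verdict_json = "{" + response.content[0].text  # re-attach the prefilled brace before parsing
```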

Source: benchmarks/results/llm_judge_head_to_head_2026-04-24.md. Back to the benchmark index or see the landing page summary.

Raw per-trace JSONL artifacts (the inputs you'd spot-check to sanity-check our numbers) are downloadable on the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access — email for clone access. The fully-reproducible leaderboard with multi-rater Fleiss’ kappa lands by 2026-05-15.