Benchmark writeup

Opus 4.7 tiebreaker on Claude Sonnet's 52 unique flags

Date: 2026-04-24. Cost: $3.89. Git SHA: de24619 + follow-up commits.

The Claude Sonnet 4.6 head-to-head surfaced 52 traces where Sonnet flagged a violation but both our independent rigorous audit AND TrajEval said no-violation. This run uses Claude Opus 4.7 as an independent tiebreaker: is Sonnet catching real semantic violations that rule-based systems miss, or hallucinating?

Headline

Implications:
1. TrajEval + our rule-based audit have a real ~28% recall gap on conditional/semantic policy classes, exactly the postconditions release (postconditions + typed state) and capability labels (capability labels + data-flow) roadmap items.
2. LLM-as-judge (even Sonnet 4.6) has a ~10% hallucination rate that a stronger judge must corroborate.

Per-domain

Domain n Opus AGREES Sonnet (semantic catch) Opus DISAGREES (Sonnet hallucinated)
Airline 30 29 (96.7%) 1 (3.3%)
Retail 22 18 (81.8%) 4 (18.2%)
Total 52 47 (90.4%) 5 (9.6%)

Retail has higher hallucination rate (18% vs 3%). Likely because retail policies have more conditional nuance (order-state-dependent rules) that Sonnet gets wrong more often than airline policies.

What TrajEval (and our audit) actually missed

Categorized from the 47 confirmed semantic catches:

Category Count Example
A. Implicit-vs-explicit consent 39 User: "Let's go with Option 1." / "I'd like to use the gift card." → policy text says "explicit confirmation (yes)". Our audit's consent regex accepts phrases like "proceed" and "go ahead"; strict policy reading wants literal "yes".
B. Agent used data not provided by user 5 Agent used DOB=1986-03-14 taken from existing reservation, not from user turn. Airline policy: "collect first name, last name, DOB for each passenger from the user."
C. Proactive compensation offer 2 Agent offered $50 certificate for delay. Policy: "Do not proactively offer these unless the user complains about the situation and explicitly asks for some compensation."
D. Basic-economy-cannot-modify 8 Conditional: if reservation cabin = basic economy, agent cannot call update_reservation_flights. Requires reading post-tool state.
E. Other single-occurrence rules 25 "exchange tool can only be called once, all items in one list", "cabin class must match across all flights in same reservation", "certificate rate: $50 for delay (not $100)", "payment method must match user's stated method", "cannot change destination when modifying flight", etc.

Category A is the most interesting finding methodologically. Our audit and TrajEval's user_consent assertion both accept "proceed" / "go ahead" / "sure" as consent. The policy text says "explicit (yes)". If a customer runs TrajEval and expects literal "(yes)" enforcement, we're currently too permissive. This is fixable with a strict_consent_only YAML flag, small change.

Categories B, C, D, E are all structural: rules TrajEval's current DSL can't express. These are the postconditions release postconditions + capability labels capability/data-flow primitives on the roadmap.

What Sonnet hallucinated (all 5 cases, Opus's explanations)

  1. airline/sonnet-35-new/28/5: Sonnet claimed consent was missing before a cancel. Opus: agent listed reservation details, got "oui, please go ahead", valid consent even though not in English.
  2. retail/gpt-4o/113/2: Sonnet claimed missing "remind customer to confirm they have provided all items." Opus: this is an advisory line in the policy, not a hard requirement, and the agent did confirm details.
  3. retail/gpt-4o/95/3: Sonnet claimed write fired before consent. Opus: no write tool was called, trajectory ended at user confirmation step.
  4. retail/sonnet-35-new/2/1: Sonnet claimed auth not performed at start. Opus: agent DID authenticate correctly mid-conversation after user answered a T-shirt question first, policy doesn't mandate auth first before any interaction.
  5. retail/sonnet-35-new/93/7: Sonnet claimed payment method issue. Opus: user confirmed payment method implicitly in "yes" response to listed details.

The revised recall picture on the 300-trace audit

Combining: all-three-agree TJA + audit+TrajEval-only T-A + Opus-confirmed semantic catches -J- (47 real):

Airline (n=150) Retail (n=150) Combined
Real policy violations (Opus-corroborated) 55 + 5 + 29 = 89 45 + 15 + 18 = 78 167
TrajEval caught 55 + 5 = 60 45 + 15 = 60 120
TrajEval recall on full policy universe 67.4% 76.9% 71.9%
TrajEval recall on rule-expressible classes only 100% 100% 100%

Revised honest recall:
- On policy classes TrajEval's primitives cover (HITL-explicit, ordering, banned): ~100% recall.
- On the full policy universe (including conditional + semantic): ~72% recall.
- The 28% gap is entirely conditional-policy / semantic-content / implicit-vs-explicit-consent.

Operational numbers stay

TrajEval Sonnet 4.6 Opus 4.7
Latency p50 0.11 ms 6,146 ms 6,310 ms
Cost / trace $0 $0.0155 $0.0748
JSON parse rate 100% 86% 100% (0 parse errors on 52)
Determinism yes no no

Opus 4.7 fixed the parse-rate issue, 0 errors on 52 traces. Paying 5× the cost and similar latency for stronger reasoning.

The big implication for the pitch

Before this tiebreaker, we had three possible narratives:
- (a) TrajEval dominates LLM-as-judge. Now falsified: 47 real semantic catches Sonnet made that TrajEval missed.
- (b) TrajEval and LLM-as-judge are equivalent. Also falsified: different coverage profiles.
- (c) TrajEval + LLM-as-judge are complementary. TrajEval does the 72% of policy classes it covers with 100% precision + 100% recall at 0.11 ms / $0 / deterministic. LLM-as-judge does the remaining 28% (semantic/conditional) but hallucinates 10% of the time + is 55,000× slower., Supported by the data.

An ensemble architecture (TrajEval-first, optional LLM-judge on the 20-30% of traces where semantic policies matter) is the correct production pattern. Not "pick one or the other."

Revised pitch sentence

"On 300 independently-audited τ-bench trajectories: TrajEval catches 100% of policy violations on the rule-expressible classes it covers (ordering, HITL, banned tools) at 0.11 ms / $0 / deterministic, vs Claude Sonnet 4.6 which achieves similar recall but 55,000× slower, at $0.015/call, non-deterministic, and hallucinates on ~10% of flags (Opus 4.7 confirmed). TrajEval misses ~28% of violations in conditional / semantic / implicit-consent classes that require runtime state or semantic interpretation, exactly the postconditions release postconditions + capability labels data-flow roadmap items. The right production architecture is TrajEval-first (deterministic, fast, free) with an optional LLM-judge fallback for semantic-only checks."

That's a partner-grade claim. Honest, specific, and it directly maps to the product roadmap.

Cost summary (this session)

Run Purpose Cost
Claude Sonnet 4.6 × 300 traces Head-to-head judge $4.66
Opus 4.7 × 52 -J- traces Tiebreaker on Claude-unique flags $3.89
Total LLM-judge spend $8.55

Each result is reproducible: benchmarks/tau_bench/llm_judge.py + benchmarks/tau_bench/opus_tiebreaker.py. JSONL per-trace outputs committed for spot-checking.

Next gates

  1. Confirmation: 72% of the 47 semantic catches are category A (implicit-vs-explicit consent). A strict_consent_only: true YAML flag that requires literal "yes"/"yeah"/"yep" would close this gap in 15 minutes and recover ~28 of the 47 for TrajEval. Remaining 19 need postconditions release primitives.
  2. Product decision: does TrajEval want to default to strict (policy-literal) or lenient (common-sense) consent? Affects default contract behavior for all customers. Probably: default lenient (pragmatic) with a strict_consent_only opt-in flag (compliance-heavy customers).
  3. postconditions release prioritization: the "B/C/D/E" categories (20 cases) map to specific primitives, postconditions on tool output, capability labels, state-conditional rules. Now sized by real demand.

Source: benchmarks/results/opus_tiebreaker_2026-04-24.md. Back to the benchmark index or see the landing page summary.

Raw per-trace JSONL artifacts (the inputs you'd spot-check to sanity-check our numbers) are downloadable on the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access — email for clone access. The fully-reproducible leaderboard with multi-rater Fleiss’ kappa lands by 2026-05-15.