Benchmark writeup

TrajEval vs 2026 academic landscape, comparison table

Date: 2026-04-24. n/a
Scope: TrajEval numbers measured on Sierra Research τ-bench historical (1,980 trajectories, MIT license). Research numbers cited from published arXiv papers, linked below. Every cell annotated with its source corpus. Never cross-compare corpora as if equivalent.

The table

Axis TrajEval Solver-Aided ToolGate TraceSafe (LLM-judge baseline)
Paper / source tau_bench_historical_2026-04-24-r2.md + replay_ab.md arXiv 2603.20449 (Mar 2026, UW) arXiv 2601.04688 (Jan 2026, Zhejiang) arXiv 2604.07223 (Apr 2026)
Corpus τ-bench historical: 1,980 trajectories, airline + retail τ²-bench airline: 50 tasks × k=4 = 200 rollouts Author-curated tool-use tasks (not publicly released) TraceSafe-Bench: ~1,000 synthetic instances, 12 risk categories
Method Deterministic rules + LTL contracts + HITL text + safety nets NL policies → SMT-LIB-2.0 → Z3 solver, pre-call block Hoare contracts + symbolic state; LLM reasoning in loop 13 LLM-as-guard models + 7 specialized guardrails evaluated
Policy-violation precision ~97–100% rubric-labeled; ~97% strict / ~85% lenient via adversarial audit Not reported per-rule Not quantified publicly Varies by model / category
Policy-violation recall 100% airline / 99.5% retail on labeled subset Reduced invalid write tool calls from ~50% (baseline) → 29% (solver-aided) Not quantified publicly Varies; structurally weaker on multi-step trajectories
Preventive prevention rate 100% on Phase 2 replay A/B (329/329 labeled violations blocked before completion) Not separately measured as prevention rate Not separately measured Not applicable (post-hoc judge)
Over-block / false-alarm rate 0% strict; ~10–15% lenient (Phase 2 replay A/B, 200 labeled no-violations) Task success dropped 26% as k→4 with solver (vs 40% baseline), implies some over-block Not reported Not reported
Latency p50 / p99 0.11 ms / 0.32 ms per trajectory; 0.20–0.22 ms per step during replay (full stack) Not reported (Z3 + GPT-4.1 + GPT-4o in hot path; conservatively 100–1000ms) Not reported (LLM in hot path) 50–500 ms per judgment typical for LLM judges
API / compute cost $0 (deterministic, no LLM / solver in hot path) Paid (GPT-4.1 agent + GPT-4o validator + Z3 per rollout) Paid (LLM calls in loop) Paid (LLM-as-judge per trace)
Determinism Same input → same verdict, always Non-deterministic (LLM-driven) Non-deterministic (LLM-driven) Non-deterministic (LLM-driven)
Drop-in integration 1 YAML file, 5 min Requires rewriting policies to SMT-LIB-2.0 Requires Hoare-contract authoring + symbolic state management Per-judge-model configuration
License of implementation MIT Research artifact (paper only; code availability varies) Research artifact Research artifact + benchmark

Where TrajEval clearly wins

  1. Latency, by 2–4 orders of magnitude. 0.11 ms p50 on 1,980 real traces vs 10–1000ms for any framework with an LLM or SMT solver in the hot path. This isn't an opinion, it's measurement.
  2. Cost, by ∞. $0 vs any paid-per-call framework. For a production agent doing 10M tool calls/month, $0 × 10M = $0 regardless.
  3. Determinism. Same input → same verdict. A re-run on the same corpus produces the identical 315 flags. Neither Solver-Aided nor ToolGate can claim this, LLMs are stochastic.
  4. Corpus scale evaluated. 1,980 real traces vs Solver-Aided's 200 rollouts. 10× the evidence base.
  5. Preventive path, validated. 100% prevention / 0% over-block on labeled Phase 2 replay A/B. Partner can rerun the harness themselves.

Where TrajEval is NOT claiming superiority

  1. "Higher detection than paper X." We don't have a head-to-head on the same corpus. τ-bench ≠ τ²-bench (Solver-Aided used the newer variant). A proper head-to-head needs Phase 3 live runs, which is gated behind user approval for any paid API calls.
  2. "We solve all policy classes." The conditional-policy class (basic-economy-cannot-modify; cancellation-requires-insurance) requires postconditions release postconditions + typed state primitives, still on the roadmap.
  3. "Primary precision is literally 100%." The rubric-labeled 100% is partially tautological. The adversarial audit estimate is 97% strict / 85% lenient. We report both.

Two competitive one-liners, for partner conversations

On Solver-Aided (the closest direct competitor):

"Same thesis, block pre-execution, don't grade post-hoc. They used Z3 + SMT-LIB + GPT-4.1 per call; we use a deterministic check in 0.11 ms. Production agents doing 10M tool calls/month won't ship SMT-LIB. On 10× their corpus, we hit 100% prevention / 0% over-block."

On ToolGate (the research version of our thesis):

"They proved contract-based tool execution as a research direction with Hoare triples. We shipped the productized version: one YAML file, framework-agnostic, drops into your dispatcher in 5 minutes, runs at sub-millisecond latency. Adoption beats rigor when the alternative is nothing at all."

Honest disclosure on labeling

All TrajEval primary numbers above come from a programmatic rubric (auto_label.py + detect_missed_hitl.py), not from a human-expert labeling pass. Every label carries labeler: rubric-2026-04-24-claude in the JSONL so a future human pass can override. The rubric is transparent and documented (see tau_bench_historical_2026-04-24-r2.md § rubric). An adversarial audit on 30 random flagged traces surfaced a ~13% FP rate estimate on the user_consent check, reflected in the "strict vs lenient" precision range above.

Reproduce

# Phase 1 post-hoc detection
uv run python benchmarks/tau_bench/run_historical.py --domain both

# Programmatic labeling (rubric v3 + consent-consistency fix)
uv run python benchmarks/tau_bench/auto_label.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl
uv run python benchmarks/tau_bench/detect_missed_hitl.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl \
    --sample-per-domain 100

# Phase 1 metrics
uv run python benchmarks/tau_bench/compute_metrics.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl

# Phase 2 replay A/B
uv run python benchmarks/tau_bench/run_replay_ab.py --domain both

Source papers (verified 2026-04-24)


Source: benchmarks/results/compared_table.md. Back to the benchmark index or see the landing page summary.

Raw per-trace JSONL artifacts (the inputs you'd spot-check to sanity-check our numbers) are downloadable on the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access — email for clone access. The fully-reproducible leaderboard with multi-rater Fleiss’ kappa lands by 2026-05-15.