Benchmark writeup

TrajEval vs 2026 academic landscape, comparison table

Date: 2026-04-24. n/a
Scope: TrajEval numbers measured on Sierra Research τ-bench historical (1,980 trajectories, MIT license). Research numbers cited from published arXiv papers, linked below. Every cell annotated with its source corpus. Never cross-compare corpora as if equivalent.

The table

Axis	TrajEval	Solver-Aided	ToolGate	TraceSafe (LLM-judge baseline)
Paper / source	tau_bench_historical_2026-04-24-r2.md + replay_ab.md	arXiv 2603.20449 (Mar 2026, UW)	arXiv 2601.04688 (Jan 2026, Zhejiang)	arXiv 2604.07223 (Apr 2026)
Corpus	τ-bench historical: 1,980 trajectories, airline + retail	τ²-bench airline: 50 tasks × k=4 = 200 rollouts	Author-curated tool-use tasks (not publicly released)	TraceSafe-Bench: ~1,000 synthetic instances, 12 risk categories
Method	Deterministic rules + LTL contracts + HITL text + safety nets	NL policies → SMT-LIB-2.0 → Z3 solver, pre-call block	Hoare contracts + symbolic state; LLM reasoning in loop	13 LLM-as-guard models + 7 specialized guardrails evaluated
Policy-violation precision	~97–100% rubric-labeled; ~97% strict / ~85% lenient via adversarial audit	Not reported per-rule	Not quantified publicly	Varies by model / category
Policy-violation recall	100% airline / 99.5% retail on labeled subset	Reduced invalid write tool calls from ~50% (baseline) → 29% (solver-aided)	Not quantified publicly	Varies; structurally weaker on multi-step trajectories
Preventive prevention rate	100% on Phase 2 replay A/B (329/329 labeled violations blocked before completion)	Not separately measured as prevention rate	Not separately measured	Not applicable (post-hoc judge)
Over-block / false-alarm rate	0% strict; ~10–15% lenient (Phase 2 replay A/B, 200 labeled no-violations)	Task success dropped 26% as k→4 with solver (vs 40% baseline), implies some over-block	Not reported	Not reported
Latency p50 / p99	0.11 ms / 0.32 ms per trajectory; 0.20–0.22 ms per step during replay (full stack)	Not reported (Z3 + GPT-4.1 + GPT-4o in hot path; conservatively 100–1000ms)	Not reported (LLM in hot path)	50–500 ms per judgment typical for LLM judges
API / compute cost	$0 (deterministic, no LLM / solver in hot path)	Paid (GPT-4.1 agent + GPT-4o validator + Z3 per rollout)	Paid (LLM calls in loop)	Paid (LLM-as-judge per trace)
Determinism	Same input → same verdict, always	Non-deterministic (LLM-driven)	Non-deterministic (LLM-driven)	Non-deterministic (LLM-driven)
Drop-in integration	1 YAML file, 5 min	Requires rewriting policies to SMT-LIB-2.0	Requires Hoare-contract authoring + symbolic state management	Per-judge-model configuration
License of implementation	MIT	Research artifact (paper only; code availability varies)	Research artifact	Research artifact + benchmark

Where TrajEval clearly wins

Latency, by 2–4 orders of magnitude. 0.11 ms p50 on 1,980 real traces vs 10–1000ms for any framework with an LLM or SMT solver in the hot path. This isn't an opinion, it's measurement.
Cost, by ∞. $0 vs any paid-per-call framework. For a production agent doing 10M tool calls/month, $0 × 10M = $0 regardless.
Determinism. Same input → same verdict. A re-run on the same corpus produces the identical 315 flags. Neither Solver-Aided nor ToolGate can claim this, LLMs are stochastic.
Corpus scale evaluated. 1,980 real traces vs Solver-Aided's 200 rollouts. 10× the evidence base.
Preventive path, validated. 100% prevention / 0% over-block on labeled Phase 2 replay A/B. Partner can rerun the harness themselves.

Where TrajEval is NOT claiming superiority

"Higher detection than paper X." We don't have a head-to-head on the same corpus. τ-bench ≠ τ²-bench (Solver-Aided used the newer variant). A proper head-to-head needs Phase 3 live runs, which is gated behind user approval for any paid API calls.
"We solve all policy classes." The conditional-policy class (basic-economy-cannot-modify; cancellation-requires-insurance) requires postconditions release postconditions + typed state primitives, still on the roadmap.
"Primary precision is literally 100%." The rubric-labeled 100% is partially tautological. The adversarial audit estimate is 97% strict / 85% lenient. We report both.

Two competitive one-liners, for partner conversations

On Solver-Aided (the closest direct competitor):

"Same thesis, block pre-execution, don't grade post-hoc. They used Z3 + SMT-LIB + GPT-4.1 per call; we use a deterministic check in 0.11 ms. Production agents doing 10M tool calls/month won't ship SMT-LIB. On 10× their corpus, we hit 100% prevention / 0% over-block."

On ToolGate (the research version of our thesis):

"They proved contract-based tool execution as a research direction with Hoare triples. We shipped the productized version: one YAML file, framework-agnostic, drops into your dispatcher in 5 minutes, runs at sub-millisecond latency. Adoption beats rigor when the alternative is nothing at all."

Honest disclosure on labeling

All TrajEval primary numbers above come from a programmatic rubric (auto_label.py + detect_missed_hitl.py), not from a human-expert labeling pass. Every label carries labeler: rubric-2026-04-24-claude in the JSONL so a future human pass can override. The rubric is transparent and documented (see tau_bench_historical_2026-04-24-r2.md § rubric). An adversarial audit on 30 random flagged traces surfaced a ~13% FP rate estimate on the user_consent check, reflected in the "strict vs lenient" precision range above.

Reproduce

# Phase 1 post-hoc detection
uv run python benchmarks/tau_bench/run_historical.py --domain both

# Programmatic labeling (rubric v3 + consent-consistency fix)
uv run python benchmarks/tau_bench/auto_label.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl
uv run python benchmarks/tau_bench/detect_missed_hitl.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl \
    --sample-per-domain 100

# Phase 1 metrics
uv run python benchmarks/tau_bench/compute_metrics.py \
    --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl

# Phase 2 replay A/B
uv run python benchmarks/tau_bench/run_replay_ab.py --domain both

Source papers (verified 2026-04-24)

Solver-Aided: arXiv 2603.20449, Winston, Winston, Just (UW). SMT-LIB + Z3, pre-call block, τ²-bench airline.
ToolGate: arXiv 2601.04688, Liu et al (Zhejiang). Hoare-style contracts + symbolic state.
TraceSafe: arXiv 2604.07223, Chen et al. Benchmark of 13 LLM guards + 7 specialized.
Towards Verifiably Safe: arXiv 2601.08012, CMU/NCSU/UCLA. Position paper, STPA-based spec derivation.

Source: benchmarks/results/compared_table.md. Back to the benchmark index or see the landing page summary.