TrajEval vs. the 2026 academic landscape: comparison table
Date: 2026-04-24.
Scope: TrajEval numbers measured on Sierra Research τ-bench historical (1,980 trajectories, MIT license). Research numbers cited from published arXiv papers, linked below. Every cell annotated with its source corpus. Never cross-compare corpora as if equivalent.
The table
| Axis | TrajEval | Solver-Aided | ToolGate | TraceSafe (LLM-judge baseline) |
|---|---|---|---|---|
| Paper / source | tau_bench_historical_2026-04-24-r2.md + replay_ab.md | arXiv 2603.20449 (Mar 2026, UW) | arXiv 2601.04688 (Jan 2026, Zhejiang) | arXiv 2604.07223 (Apr 2026) |
| Corpus | τ-bench historical: 1,980 trajectories, airline + retail | τ²-bench airline: 50 tasks × k=4 = 200 rollouts | Author-curated tool-use tasks (not publicly released) | TraceSafe-Bench: ~1,000 synthetic instances, 12 risk categories |
| Method | Deterministic rules + LTL contracts + HITL text + safety nets | NL policies → SMT-LIB-2.0 → Z3 solver, pre-call block | Hoare contracts + symbolic state; LLM reasoning in loop | 13 LLM-as-guard models + 7 specialized guardrails evaluated |
| Policy-violation precision | ~97–100% rubric-labeled; ~97% strict / ~85% lenient via adversarial audit | Not reported per-rule | Not quantified publicly | Varies by model / category |
| Policy-violation recall | 100% airline / 99.5% retail on labeled subset | Reduced invalid write tool calls from ~50% (baseline) → 29% (solver-aided) | Not quantified publicly | Varies; structurally weaker on multi-step trajectories |
| Prevention rate (preventive mode) | 100% on Phase 2 replay A/B (329/329 labeled violations blocked before completion) | Not separately measured as prevention rate | Not separately measured | Not applicable (post-hoc judge) |
| Over-block / false-alarm rate | 0% strict; ~10–15% lenient (Phase 2 replay A/B, 200 labeled no-violations) | Task success dropped to 26% at k=4 with the solver (vs 40% baseline), implying some over-blocking | Not reported | Not reported |
| Latency p50 / p99 | 0.11 ms / 0.32 ms per trajectory; 0.20–0.22 ms per step during replay (full stack) | Not reported (Z3 + GPT-4.1 + GPT-4o in hot path; conservatively 100–1000ms) | Not reported (LLM in hot path) | 50–500 ms per judgment typical for LLM judges |
| API / compute cost | $0 (deterministic, no LLM / solver in hot path) | Paid (GPT-4.1 agent + GPT-4o validator + Z3 per rollout) | Paid (LLM calls in loop) | Paid (LLM-as-judge per trace) |
| Determinism | Same input → same verdict, always | Non-deterministic (LLM-driven) | Non-deterministic (LLM-driven) | Non-deterministic (LLM-driven) |
| Drop-in integration | 1 YAML file, 5 min | Requires rewriting policies to SMT-LIB-2.0 | Requires Hoare-contract authoring + symbolic state management | Per-judge-model configuration |
| License of implementation | MIT | Research artifact (paper only; code availability varies) | Research artifact | Research artifact + benchmark |
Where TrajEval clearly wins
- Latency, by 2–4 orders of magnitude. 0.11 ms p50 on 1,980 real traces vs 10–1000 ms for any framework with an LLM or SMT solver in the hot path. This isn't an opinion; it's a measurement.
- Cost, by ∞. $0 vs any paid-per-call framework. For a production agent doing 10M tool calls/month, $0 × 10M = $0 regardless.
- Determinism. Same input → same verdict. A re-run on the same corpus produces the identical 315 flags. Neither Solver-Aided nor ToolGate can claim this; LLM-driven pipelines are stochastic.
- Corpus scale evaluated. 1,980 real traces vs Solver-Aided's 200 rollouts. 10× the evidence base.
- Preventive path, validated. 100% prevention / 0% over-block on labeled Phase 2 replay A/B. Partner can rerun the harness themselves.
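The determinism and latency claims rest on keeping the hot path free of LLM and solver calls. A minimal sketch of what a deterministic pre-call check can look like (rule names, tool names, and record fields here are hypothetical illustrations, not TrajEval's actual API):

```python
# Hypothetical deterministic rule: block write-class tool calls that lack
# an explicit user-consent step earlier in the trajectory.
WRITE_TOOLS = {"cancel_booking", "modify_reservation", "issue_refund"}

def check_call(trajectory: list[dict], call: dict) -> str:
    """Return 'allow' or 'block'. Pure function: same input, same verdict."""
    if call["tool"] in WRITE_TOOLS:
        consented = any(
            step.get("type") == "user_consent" and step.get("tool") == call["tool"]
            for step in trajectory
        )
        if not consented:
            return "block"
    return "allow"

traj = [{"type": "user_message", "text": "cancel my flight"}]
call = {"tool": "cancel_booking", "args": {"id": "BK123"}}

# Determinism: repeated evaluation always yields the same verdict.
verdicts = {check_call(traj, call) for _ in range(1000)}
print(verdicts)  # {'block'}
```

Because the check is a pure function over in-memory data, sub-millisecond per-step latency and bit-identical re-runs fall out for free; no sampling temperature, no solver timeout.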
Where TrajEval is NOT claiming superiority
- "Higher detection than paper X." We don't have a head-to-head on the same corpus: τ-bench ≠ τ²-bench (Solver-Aided used the newer variant). A proper head-to-head needs Phase 3 live runs, which are gated behind user approval for any paid API calls.
- "We solve all policy classes." The conditional-policy class (basic-economy-cannot-modify; cancellation-requires-insurance) requires release postconditions and typed state primitives, both still on the roadmap.
- "Primary precision is literally 100%." The rubric-labeled 100% is partly tautological, since the labels come from a programmatic rubric rather than independent human judgment. The adversarial audit estimate is 97% strict / 85% lenient. We report both.
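One way to read the strict/lenient split: an adversarial audit yields a three-way label per flagged trace, and contested cases swing between the two bounds. A hedged sketch of that arithmetic (the exact audit protocol lives in the rubric doc; the sample counts below are illustrative, not the actual audit data):

```python
from collections import Counter

def precision_bounds(audit_labels: list[str]) -> tuple[float, float]:
    """Strict / lenient precision from a three-way adversarial audit.

    'tp' = confirmed true positive, 'fp' = confirmed false positive,
    'contested' = auditors disagreed. Strict credits contested flags as
    correct; lenient counts them against precision.
    """
    c = Counter(audit_labels)
    n = len(audit_labels)
    strict = (c["tp"] + c["contested"]) / n
    lenient = c["tp"] / n
    return strict, lenient

# Illustrative 30-trace audit sample: 25 confirmed, 4 contested, 1 clear FP.
labels = ["tp"] * 25 + ["contested"] * 4 + ["fp"]
s, l = precision_bounds(labels)
print(f"strict={s:.0%} lenient={l:.0%}")  # strict=97% lenient=83%
```

Reporting the pair as a range, rather than picking one number, is what keeps the precision claim honest.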
Two competitive one-liners, for partner conversations
On Solver-Aided (the closest direct competitor):
"Same thesis: block pre-execution, don't grade post-hoc. They put Z3 + SMT-LIB + GPT-4.1 in the per-call path; we use a deterministic check in 0.11 ms. Production agents doing 10M tool calls/month won't ship SMT-LIB. On 10× their corpus, we hit 100% prevention / 0% over-block."
On ToolGate (the research version of our thesis):
"They proved contract-based tool execution as a research direction with Hoare triples. We shipped the productized version: one YAML file, framework-agnostic, drops into your dispatcher in 5 minutes, runs at sub-millisecond latency. Adoption beats rigor when the alternative is nothing at all."
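The "one YAML file" claim refers to declarative policy authoring. A hypothetical sketch of what such a file could look like (field names, rule ids, and the LTL syntax below are illustrative, not the actual TrajEval schema):

```yaml
# trajeval.yaml — hypothetical policy file (illustrative schema)
policies:
  - id: consent-before-write
    description: Write-class tool calls require prior explicit user consent.
    applies_to: [cancel_booking, modify_reservation, issue_refund]
    require:
      event: user_consent
      before: tool_call
    on_violation: block          # preventive mode: block pre-execution

  - id: no-double-refund
    description: LTL contract — a refund is never issued twice for one booking.
    ltl: "G(issue_refund -> X G(!issue_refund))"
    on_violation: flag           # post-hoc mode: flag for review
```

The design point versus ToolGate: rules like these compile to deterministic checks at load time, so the dispatcher integration stays a config drop rather than contract authoring in a proof-adjacent formalism.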
Honest disclosure on labeling
All TrajEval primary numbers above come from a programmatic rubric (auto_label.py + detect_missed_hitl.py), not from a human-expert labeling pass. Every label carries labeler: rubric-2026-04-24-claude in the JSONL so a future human pass can override. The rubric is transparent and documented (see tau_bench_historical_2026-04-24-r2.md § rubric). An adversarial audit on 30 random flagged traces surfaced a ~13% FP rate estimate on the user_consent check, reflected in the "strict vs lenient" precision range above.
Reproduce
```bash
# Phase 1 post-hoc detection
uv run python benchmarks/tau_bench/run_historical.py --domain both

# Programmatic labeling (rubric v3 + consent-consistency fix)
uv run python benchmarks/tau_bench/auto_label.py \
  --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl
uv run python benchmarks/tau_bench/detect_missed_hitl.py \
  --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl \
  --sample-per-domain 100

# Phase 1 metrics
uv run python benchmarks/tau_bench/compute_metrics.py \
  --jsonl benchmarks/results/tau_bench_historical_<DATE>.jsonl

# Phase 2 replay A/B
uv run python benchmarks/tau_bench/run_replay_ab.py --domain both
```
Source papers (verified 2026-04-24)
- Solver-Aided: arXiv 2603.20449, Winston, Winston, Just (UW). SMT-LIB + Z3, pre-call block, τ²-bench airline.
- ToolGate: arXiv 2601.04688, Liu et al. (Zhejiang). Hoare-style contracts + symbolic state.
- TraceSafe: arXiv 2604.07223, Chen et al. Benchmark of 13 LLM guards + 7 specialized.
- Towards Verifiably Safe: arXiv 2601.08012, CMU/NCSU/UCLA. Position paper, STPA-based spec derivation.
Source: benchmarks/results/compared_table.md.