Benchmark writeup

Internal benchmark, baseline 2026-04-24

First entry in the τ-bench head-to-head "Benchmark-ASAP" track. Established as the reference point for all subsequent changes (postconditions, capability labels, Z3 linter).

Run

uv run trajeval benchmark benchmarks/traces/ --config benchmarks/config.yml
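
If you want to run the same invocation from CI or a script rather than an interactive shell, a thin subprocess wrapper works; the command line above is the only interface assumed here:

```python
import subprocess

# Same invocation as above, suitable for a CI step or cron job.
result = subprocess.run(
    ["uv", "run", "trajeval", "benchmark", "benchmarks/traces/",
     "--config", "benchmarks/config.yml"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```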

Result

| Metric | Value |
|---|---|
| Detection rate (recall on failures) | 10/10 (100%) |
| False positive rate (on clean traces) | 0/10 (0%) |
| Precision | 100% |
| F1 | 1.00 |
| Total latency | 136 ms |
| Per-trace latency | 6.8 ms |
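
For readers who want to check the arithmetic, the headline numbers follow directly from the confusion counts in the table (10 seeded failures, 10 clean traces):

```python
# Confusion counts from the table above: 10 seeded failures, 10 clean traces.
tp, fn = 10, 0   # failures detected / failures missed
fp, tn = 0, 10   # clean traces flagged / clean traces passed

precision = tp / (tp + fp)                                 # 1.00
recall    = tp / (tp + fn)                                 # 1.00 (detection rate)
f1        = 2 * precision * recall / (precision + recall)  # 1.00

per_trace_ms = 136 / (tp + fn + fp + tn)  # 136 ms over 20 traces = 6.8 ms
print(precision, recall, f1, per_trace_ms)
```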

What this number is, and isn't

Is: the detection rate of TrajEval's full assertion stack over its own 20-trace unit corpus at the current HEAD. It confirms no regression from the earlier claims (v0.1.0 README: 100% / 0% / F1=1.00 / <50 ms total).

Isn't: a comparison against any competitor or external benchmark. Latency here (6.8 ms/trace) is the full-stack detection path, not the hot-path guard (which benchmarks separately at 0.01 ms median, see incident_guard_bench.py and benchmarks/results/incident_*.json).
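
For context on how a per-trace figure like this is produced, here is a minimal timing-harness sketch; `detect` and `traces` are hypothetical stand-ins, not TrajEval's actual API:

```python
import statistics
import time

def per_trace_latency_ms(detect, traces, repeats=5):
    """Median wall-clock latency per trace, in milliseconds.

    `detect` is whatever callable is being timed -- the full assertion
    stack for the 6.8 ms figure, the hot-path guard for the 0.01 ms
    figure. Repeating the loop guards against one-off scheduler noise.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for trace in traces:
            detect(trace)
        elapsed_ms = (time.perf_counter() - start) * 1000
        samples.append(elapsed_ms / len(traces))
    return statistics.median(samples)
```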

Next (pending work for the τ-bench head-to-head)

  1. TauBench integration: an adapter from the TauBench trajectory format to Trace (a rough sketch follows this list). Publish TrajEval numbers on the same benchmark Solver-Aided used.
  2. TraceSafe-Bench availability check: arXiv 2604.07223 (April 2026). If released, run against the 12 risk categories; if not, contact the authors.
  3. benchmarks/results/compared_table.md: a partner-ready table of TrajEval vs best LLM-judge vs ToolGate vs Solver-Aided across detection × precision × recall × latency.
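
A rough sketch of what the adapter in item 1 might look like. The TauBench field names (`task_id`, `actions`, `name`, `kwargs`, `observation`) and the `Trace`/`Step` shapes below are assumptions standing in for the real schemas, not the actual formats:

```python
from dataclasses import dataclass, field
from typing import Any

# Target shape -- stand-ins for TrajEval's real Trace/Step types (assumed).
@dataclass
class Step:
    tool: str
    args: dict[str, Any]
    observation: str

@dataclass
class Trace:
    trace_id: str
    steps: list[Step] = field(default_factory=list)

def from_taubench(traj: dict[str, Any]) -> Trace:
    """Map one TauBench trajectory dict onto a Trace.

    Field names here are guesses at the TauBench schema; adjust once
    the integration pins down the real format.
    """
    return Trace(
        trace_id=str(traj.get("task_id", "unknown")),
        steps=[
            Step(
                tool=a["name"],
                args=a.get("kwargs", {}),
                observation=str(a.get("observation", "")),
            )
            for a in traj.get("actions", [])
        ],
    )
```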

Tracking: the internal task plan.


Source: benchmarks/results/internal_baseline_2026-04-24.md.

Raw per-trace JSONL artifacts (the inputs to spot-check against our numbers) are downloadable from the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access; email for clone access. The fully reproducible leaderboard with multi-rater Fleiss' kappa lands by 2026-05-15.
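
Fleiss' kappa measures agreement among a fixed number of raters labeling the same items, generalizing Cohen's kappa beyond two raters. A self-contained reference implementation of the standard formula; the 3-rater example data is made up for illustration and says nothing about our rater setup:

```python
from typing import Sequence

def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa over N subjects x k categories.

    ratings[i][j] = number of raters assigning subject i to category j.
    Every row must sum to the same rater count n.
    """
    N = len(ratings)
    n = sum(ratings[0])   # raters per subject
    k = len(ratings[0])   # categories

    # Per-category proportion of all assignments.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]

    # Per-subject agreement, then mean observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Expected agreement by chance.
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Example: 3 raters labeling 4 traces as failure (col 0) or clean (col 1).
print(fleiss_kappa([[3, 0], [3, 0], [0, 3], [2, 1]]))  # ~0.63
```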