Benchmark writeup

Internal benchmark, baseline 2026-04-24

First entry in the τ-bench head-to-head "Benchmark-ASAP" track. Established as the reference point for all subsequent changes (postconditions, capability labels, Z3 linter).

Run

uv run trajeval benchmark benchmarks/traces/ --config benchmarks/config.yml
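
If you want to run the same invocation from CI or a script rather than an interactive shell, a thin subprocess wrapper works; the command line above is the only interface assumed here:

```python
import subprocess

# Same invocation as above, suitable for a CI step or cron job.
result = subprocess.run(
    ["uv", "run", "trajeval", "benchmark", "benchmarks/traces/",
     "--config", "benchmarks/config.yml"],
    capture_output=True, text=True, check=True,
)
print(result.stdout)
```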

Result

| Metric | Value |
|---|---|
| Detection rate (recall on failures) | 10/10 (100%) |
| False positive rate (on clean traces) | 0/10 (0%) |
| Precision | 100% |
| F1 | 1.00 |
| Total latency | 136 ms |
| Per-trace latency | 6.8 ms |
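
For readers who want to check the arithmetic, the headline numbers follow directly from the confusion counts in the table (10 seeded failures, 10 clean traces):

```python
# Confusion counts from the table above: 10 seeded failures, 10 clean traces.
tp, fn = 10, 0   # failures detected / failures missed
fp, tn = 0, 10   # clean traces flagged / clean traces passed

precision = tp / (tp + fp)                                 # 1.00
recall    = tp / (tp + fn)                                 # 1.00 (detection rate)
f1        = 2 * precision * recall / (precision + recall)  # 1.00

per_trace_ms = 136 / (tp + fn + fp + tn)  # 136 ms over 20 traces = 6.8 ms
print(precision, recall, f1, per_trace_ms)
```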

What this number is, and isn't

Is: the detection rate of TrajEval's full assertion stack over its own 20-trace unit corpus at the current HEAD. It confirms no regression from the earlier claims (v0.1.0 README: 100% / 0% / F1=1.00 / <50 ms total).

Isn't: a comparison against any competitor or external benchmark. Latency here (6.8 ms/trace) is the full-stack detection path, not the hot-path guard (which benchmarks separately at 0.01 ms median, see incident_guard_bench.py and benchmarks/results/incident_*.json).
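
For context on how a per-trace figure like this is produced, here is a minimal timing-harness sketch; `detect` and `traces` are hypothetical stand-ins, not TrajEval's actual API:

```python
import statistics
import time

def per_trace_latency_ms(detect, traces, repeats=5):
    """Median wall-clock latency per trace, in milliseconds.

    `detect` is whatever callable is being timed -- the full assertion
    stack for the 6.8 ms figure, the hot-path guard for the 0.01 ms
    figure. Repeating the loop guards against one-off scheduler noise.
    """
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        for trace in traces:
            detect(trace)
        elapsed_ms = (time.perf_counter() - start) * 1000
        samples.append(elapsed_ms / len(traces))
    return statistics.median(samples)
```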

Next (pending work for the τ-bench head-to-head)

  1. TauBench integration: an adapter from the TauBench trajectory format to Trace (a rough sketch follows this list). Publish TrajEval numbers on the same benchmark Solver-Aided used.
  2. TraceSafe-Bench availability check: arXiv 2604.07223 (April 2026). If released, run against the 12 risk categories; if not, contact the authors.
  3. benchmarks/results/compared_table.md: a partner-ready table of TrajEval vs best LLM-judge vs ToolGate vs Solver-Aided across detection × precision × recall × latency.
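
A rough sketch of what the adapter in item 1 might look like. The TauBench field names (`task_id`, `actions`, `name`, `kwargs`, `observation`) and the `Trace`/`Step` shapes below are assumptions standing in for the real schemas, not the actual formats:

```python
from dataclasses import dataclass, field
from typing import Any

# Target shape -- stand-ins for TrajEval's real Trace/Step types (assumed).
@dataclass
class Step:
    tool: str
    args: dict[str, Any]
    observation: str

@dataclass
class Trace:
    trace_id: str
    steps: list[Step] = field(default_factory=list)

def from_taubench(traj: dict[str, Any]) -> Trace:
    """Map one TauBench trajectory dict onto a Trace.

    Field names here are guesses at the TauBench schema; adjust once
    the integration pins down the real format.
    """
    return Trace(
        trace_id=str(traj.get("task_id", "unknown")),
        steps=[
            Step(
                tool=a["name"],
                args=a.get("kwargs", {}),
                observation=str(a.get("observation", "")),
            )
            for a in traj.get("actions", [])
        ],
    )
```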

Tracking: the internal task plan.


Source: benchmarks/results/internal_baseline_2026-04-24.md.

Raw per-trace JSONL artifacts (the inputs to spot-check against our numbers) are downloadable from the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access; email for clone access. The fully reproducible leaderboard with multi-rater Fleiss' kappa lands by 2026-05-15.
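
Fleiss' kappa measures agreement among a fixed number of raters labeling the same items, generalizing Cohen's kappa beyond two raters. A self-contained reference implementation of the standard formula; the 3-rater example data is made up for illustration and says nothing about our rater setup:

```python
from typing import Sequence

def fleiss_kappa(ratings: Sequence[Sequence[int]]) -> float:
    """Fleiss' kappa over N subjects x k categories.

    ratings[i][j] = number of raters assigning subject i to category j.
    Every row must sum to the same rater count n.
    """
    N = len(ratings)
    n = sum(ratings[0])   # raters per subject
    k = len(ratings[0])   # categories

    # Per-category proportion of all assignments.
    p = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]

    # Per-subject agreement, then mean observed agreement.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N

    # Expected agreement by chance.
    P_e = sum(pj * pj for pj in p)

    return (P_bar - P_e) / (1 - P_e)

# Example: 3 raters labeling 4 traces as failure (col 0) or clean (col 1).
print(fleiss_kappa([[3, 0], [3, 0], [0, 3], [2, 1]]))  # ~0.63
```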