Internal benchmark, baseline 2026-04-24
First entry in the τ-bench head-to-head "Benchmark-ASAP" track. Established as the reference point for all subsequent changes (postconditions, capability labels, Z3 linter).
Run
uv run trajeval benchmark benchmarks/traces/ --config benchmarks/config.yml
- Git SHA: n/a
- Date: 2026-04-24
- Corpus: 20 traces (10 clean + 10 violation), see benchmarks/README.md
- Config: benchmarks/config.yml
Result
| Metric | Value |
|---|---|
| Detection rate (recall on failures) | 10/10, 100% |
| False positive rate (on clean traces) | 0/10, 0% |
| Precision | 100% |
| F1 | 1.00 |
| Total latency | 136ms |
| Per-trace latency | 6.8ms |
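For transparency, the precision/recall/F1 rows are straight arithmetic over the confusion counts from the 20-trace corpus (10 violation, 10 clean). A minimal sketch of that computation, independent of the benchmark harness:

```python
# Confusion counts from this run: all 10 violation traces flagged,
# no clean trace flagged.
tp, fn, fp = 10, 0, 0

recall = tp / (tp + fn)                             # 10/10 = 1.00
precision = tp / (tp + fp)                          # 10/10 = 1.00
f1 = 2 * precision * recall / (precision + recall)  # 1.00

print(f"recall={recall:.0%} precision={precision:.0%} f1={f1:.2f}")
```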
What this number is, and isn't
Is: the detection rate of TrajEval's full assertion stack over its own 20-trace unit corpus at the current HEAD. Confirms no regression against the earlier claims (v0.1.0 README: 100% / 0% / F1=1.00 / <50ms total).
Isn't: a comparison against any competitor or external benchmark. The latency here (6.8ms/trace) covers the full-stack detection path, not the hot-path guard (which benchmarks separately at 0.01ms median; see incident_guard_bench.py and benchmarks/results/incident_*.json).
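To spot-check the latency rows on another machine, the same CLI invocation from the Run section can be timed externally. A minimal sketch; note that this measures wall clock including uv/Python startup, so it will read well above the harness-internal 136ms figure:

```python
import subprocess
import time

# Re-run the exact benchmark command and time it end to end. Process
# startup is included, so this is an upper bound on the internal number.
cmd = ["uv", "run", "trajeval", "benchmark", "benchmarks/traces/",
       "--config", "benchmarks/config.yml"]
start = time.perf_counter()
subprocess.run(cmd, check=True)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"wall clock: {elapsed_ms:.0f}ms total, "
      f"{elapsed_ms / 20:.1f}ms per trace (20-trace corpus)")
```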
Next (pending work for the τ-bench head-to-head)
- TauBench integration: adapter from the TauBench trajectory format → Trace (a sketch follows this list). Publish TrajEval numbers on the same benchmark Solver-Aided used.
- TraceSafe-Bench availability check, arXiv 2604.07223 (April 2026). If released: run against the 12 risk categories. If not: contact the authors.
- benchmarks/results/compared_table.md, a partner-ready table: TrajEval vs best LLM-judge vs ToolGate vs Solver-Aided across detection × precision × recall × latency.
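The TauBench adapter from the first item, sketched. Every name here is an assumption: the τ-bench turn keys (role, content, tool_call, observation) are guesses at that format, and Step/Trace are stand-ins for whatever trajeval's trace model actually exposes, not confirmed API:

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    role: str                     # "user" | "assistant" | "tool"
    content: str
    tool_name: str | None = None  # set only for tool-call steps
    tool_args: dict = field(default_factory=dict)

@dataclass
class Trace:
    steps: list[Step]

def from_taubench(trajectory: list[dict]) -> Trace:
    """Map a list of τ-bench-style turns onto the Trace model above."""
    steps = []
    for turn in trajectory:
        if "tool_call" in turn:   # assumed τ-bench key for tool invocations
            call = turn["tool_call"]
            steps.append(Step(role="tool",
                              content=turn.get("observation", ""),
                              tool_name=call["name"],
                              tool_args=call.get("arguments", {})))
        else:
            steps.append(Step(role=turn["role"],
                              content=turn.get("content", "")))
    return Trace(steps=steps)

# Tiny usage example with a fabricated two-turn trajectory:
trace = from_taubench([
    {"role": "user", "content": "Cancel my order"},
    {"tool_call": {"name": "cancel_order", "arguments": {"id": "42"}},
     "observation": "order 42 cancelled"},
])
assert len(trace.steps) == 2 and trace.steps[1].tool_name == "cancel_order"
```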
Tracking: the internal task plan.
Source: benchmarks/results/internal_baseline_2026-04-24.md.