Benchmark writeup

postconditions release — final predicted vs measured (full τ-bench corpus, 2026-04-25)

Run: uv run python benchmarks/tau_bench/run_historical.py --domain both --date 2026-04-25-r4_2b-final
Corpus: 1,980 trajectories total — 600 airline + 1,380 retail.
Audit reference: llm_judge_opus_4_7_2026-04-25.jsonl — 300 audited traces (150/domain).
Per-trace JSONL: tau_bench_historical_2026-04-25-r4_2b-final.jsonl.
Two upstream fixes shipped between v1 and final:
1. src/trajeval/adapters/openai.py — non-unique tool_call_id handling (τ-bench reuses IDs across calls; FIFO queues per id replace dict-overwrite). 182 schema false-positives → 3.
2. benchmarks/tau_bench/contracts/retail.yml — added tools: with 4 status-conditional preconditions on cancel/modify/return/exchange.

Headline

postconditions release closes 19 of 72 audited Sonnet-unique flags (26%) at 0.126 ms p50 across the full τ-bench corpus.

predicted (revised, 2026-04-25) measured (final)
Sonnet-unique flags closed 1–3 of 52 (airline only) 19 of 72 (both domains)
Coverage delta basic_economy only basic_economy + 4 retail status-conditional
Latency p50 ≤ 0.20 ms 0.126 ms
Schema-check noise acknowledged risk eliminated by adapter fix

Per-domain closures

Airline (39 Sonnet-unique flags → 11 closures)

Check TP (audit) TP (re-audit) True FP Re-audited precision
precondition[update_reservation_flights] (basic_economy) 5 +12 0 100% (17/17)

Re-audit (2026-04-25, deterministic). All 12 audit-non-violation fires were re-checked by directly reading each trace: most-recent get_reservation_details cabin field vs the wiki rule "Basic economy flights cannot be modified." Every one of the 12 was a genuine basic_economy modification — i.e., the original audit's checklist simply did not enumerate basic-economy as a violation category. Re-audit script + per-trace detail saved at /tmp/r4_2b_basic_economy_reaudit.py; confirmed cases include:

So airline state-precondition precision on the audited subset is 17/17 = 100%. Zero false positives.

Retail (33 Sonnet-unique flags → 8 closures)

Check TP FP Audited precision
precondition[modify_pending_order_address] 3 1 75%
precondition[exchange_delivered_order_items] 5 3 62%
precondition[return_delivered_order_items] 8 5 62%
precondition[modify_pending_order_items] 7 6 54%
precondition[cancel_pending_order] 7 6 54%
precondition[modify_pending_order_payment] 0 1 0% (n=1, noise)

Retail precision is materially higher than airline because the audit's checklist did enumerate "wrong-status modification" as a violation category (rule 4 in the policy: "An order can only be modified if its status is 'pending'"). So the precondition fires correlate with audit-confirmed violations.

Latency

p50: 0.126 ms (vs 0.110 ms baseline → +0.016 ms, within the 0.20 ms design ceiling)
p99: 0.269 ms

The replay loop adds ~0.02 ms p50 — slightly more than the 0.02 ms predicted but still well within budget. Retail's longer trajectories (more nodes per trace) account for the extra microseconds.

Two prediction revisions on record

  1. Design-level disconfirmation 2026-04-25 (pre-bench). Wiki re-read showed two of three airline conditional-state policies require OR. Revised 6–10 → 1–3 (airline only).
  2. Measurement disconfirmation 2026-04-25 (this run). Adapter fix + retail extension both contributed. Revised 1–3 → 19 measured.

The original 6–10 prediction was correct in shape (postconditions release adds substantial closure), wrong in scope (it only counted airline; retail expressibility wasn't analyzed in the design phase). The intermediate 1–3 prediction was too narrow because it assumed retail wasn't extendable; the wiki re-read for retail revealed four single-predicate rules that work in v1.

This is the [BENCHMARK-DISCONFIRMATION] discipline at work in both directions. Numbers move with new evidence.

Adapter-fix impact

v1 (broken adapter) v2 (fixed)
postcondition_schema[get_reservation_details] fires 182 3
Schema-check precision (audited airline) 25% (anti-correlated) n/a (zero audited fires)
precondition[update_reservation_flights] closures of Sonnet-unique 6 11 (+5)
airline precision_vs_reward 0.514 0.656 (+0.142)

The +5 closures on airline came directly from state-mutation now advancing correctly on traces where τ-bench reused tool_call_ids. Previously the schema check was firing instead (because the colliding output overwrote the valid one); after the fix, state advances and the basic_economy precondition fires when warranted.

Pitch line (final, post-measurement and post-re-audit)

"the postconditions release's typed-state primitive closes 19 of 72 audited Sonnet-unique flags across τ-bench airline + retail (26%) at 0.126 ms p50, deterministic, $0 API. Airline state-precondition precision: 100% (17/17 audited fires policy-correct after re-audit). Retail per-rule precision: 54–75%. Adds a detection class (state-divergence on hallucinated tool output) that no sequence-only competitor expresses. Remaining 27% Sonnet-unique gap is OR-requiring policies (cancellation-with-insurance, certificate thresholds — gated on customer signal for v2)."

What's left (filed, not blocking)

  1. ~~Re-audit of basic-economy fires on airline~~ — DONE 2026-04-25. All 12 audit-NV fires confirmed genuine. Airline state-precondition precision is now 100% on audited subset.
  2. Retail OR-rule for user authentication — "by email OR name+zip" remains expressible only when the predicate language adds OR. Filed as v2 — gated on customer signal.
  3. {prev.X.k} resolver for finer state-update templates — currently resolution is single-token ("{cabin}"). Future work for richer rules.
  4. ~~Phase 6 docs~~ — DONE 2026-04-25. docs/postconditions.md + on-domain /benchmarks/postconditions route both shipped.

Numbers provenance


Source: benchmarks/results/r4_2b_compared_final.md. Back to the benchmark index or see the landing page summary.

Raw per-trace JSONL artifacts (the inputs you'd spot-check to sanity-check our numbers) are downloadable on the index. The benchmark harness scripts that produced these JSONLs ship in the invite-only repo during early access — email for clone access. The fully-reproducible leaderboard with multi-rater Fleiss’ kappa lands by 2026-05-15.