Benchmarks: every number, every trace, every reason.
fewwords was run against 1,980 real τ-bench trajectories (Sierra Research, MIT) and head-to-head against Claude Sonnet 4.6, with Opus 4.7 as tiebreaker. Everything below is the primary evidence: markdown writeups with methodology + numbers, plus raw per-trace JSONL so you can spot-check any individual verdict.
Summary:

| | fewwords | Claude Sonnet 4.6 |
| --- | --- | --- |
| Latency (p50) | 0.14 ms | 6,146 ms |
| Cost | $0 | $0.0155 |
| Reliability | deterministic | hallucinates ~10% |
| Output parsing | 100% parseable | fails JSON 14% of the time |
| Agreement with independent policy audit | 80–82% | 54–62% |

fewwords had a measured ~28% recall gap on conditional + semantic policy classes; the typed-state postconditions release closes 19 of the 72 audited Sonnet-unique flags (26%) at 0.126 ms p50 (see the postconditions writeup).
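If "typed-state postcondition" is unfamiliar: it is a deterministic predicate over a typed snapshot of environment state, checked after a tool call instead of asking an LLM judge. The sketch below is illustrative only; the `OrderState` class, field names, and predicate are assumptions for exposition, not fewwords' actual engine.

```python
# Hypothetical sketch of a typed-state postcondition check.
# All names here are illustrative, not the fewwords API.
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderState:
    order_id: str
    status: str      # e.g. "pending", "cancelled", "refunded"
    refunded: bool


def cancel_postcondition(before: OrderState, after: OrderState) -> bool:
    """A cancel tool call must flip status and must not silently refund."""
    return (
        after.order_id == before.order_id
        and after.status == "cancelled"
        and after.refunded == before.refunded
    )


before = OrderState("W123", "pending", False)
after = OrderState("W123", "cancelled", False)
assert cancel_postcondition(before, after)  # deterministic, no API call
```

Because the predicate is plain code over typed state, it is exact on the policy classes it encodes; the recall gap above is the set of conditional/semantic policies that have not (yet) been encoded this way.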
Writeups
Raw artifacts
Per-trace JSONL outputs. Every record has the fewwords verdict + the judge's reasoning + Opus's tiebreaker when applicable. These are the inputs a partner would spot-check to verify our numbers.
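A spot-check script might look like the following. The filename and field names (`fewwords_verdict`, `judge_reasoning`, `opus_tiebreaker`, `trace_id`) are assumptions about the schema, so check them against the actual artifacts before running.

```python
# Illustrative spot-check of the per-trace JSONL artifacts.
import json
from collections import Counter

verdicts = Counter()
with open("taubench_per_trace.jsonl") as f:  # hypothetical filename
    for line in f:
        record = json.loads(line)
        verdicts[record["fewwords_verdict"]] += 1
        # Tiebroken records are the interesting ones to read by hand.
        if record.get("opus_tiebreaker") is not None:
            print(record.get("trace_id"), record["judge_reasoning"][:120])

print(verdicts)  # overall verdict distribution to compare against the writeups
```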
Reproducing
The fewwords source is public (MIT); the benchmark harness runs from a clone:
pip install git+https://github.com/abhishek5878/fewwords.git
# Benchmark harness scripts live in the invite-only repo during early access.
# Email abhishekvyas02032001@gmail.com for clone access; the
# raw per-trace JSONL artifacts are publicly downloadable from the
# /benchmarks index above.
The fully reproducible leaderboard with multi-rater Fleiss' kappa lands by 2026-05-15.
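For readers who want to sanity-check the agreement math ahead of that release, Fleiss' kappa fits in a few lines. This is a generic sketch of the statistic itself, not our scoring harness; the toy ratings matrix is made up.

```python
# Minimal Fleiss' kappa: rows are items (traces), columns are categories,
# and each cell counts the raters who chose that category for that item.
# Assumes every item was rated by the same number of raters n.
import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    N, _ = counts.shape
    n = counts[0].sum()                                  # raters per item
    p_j = counts.sum(axis=0) / (N * n)                   # category proportions
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()            # observed vs. chance
    return (P_bar - P_e) / (1 - P_e)


# Toy example: 4 traces, 3 raters, verdict categories {pass, fail}.
ratings = np.array([[3, 0], [0, 3], [2, 1], [3, 0]])
print(round(fleiss_kappa(ratings), 3))  # 0.625
```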
Total cost to re-derive every LLM-judged number: $8.55. The postconditions writeup added zero API spend (deterministic re-run on the same corpus).