Benchmarks: every number, every trace, every reason.
fewwords was run against 1,980 real τ-bench trajectories (Sierra Research, MIT) and head-to-head against Claude Sonnet 4.6, with Opus 4.7 as tiebreaker. Everything below is the primary evidence: markdown writeups with methodology + numbers, plus raw per-trace JSONL so you can spot-check any individual verdict.
Summary:

| | fewwords | Claude Sonnet 4.6 |
| --- | --- | --- |
| Latency (p50) | 0.14 ms | 6,146 ms |
| Cost | $0 | $0.0155 |
| Reliability | deterministic | hallucinates ~10% |
| Output parsing | 100% parseable | fails JSON 14% of the time |
| Agreement with independent policy audit | 80–82% | 54–62% |

fewwords had a measured ~28% recall gap on conditional + semantic policy classes; the typed-state postconditions release closes 19 of the 72 audited Sonnet-unique flags (26%) at 0.126 ms p50 (see the postconditions writeup).
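If "typed-state postcondition" is unfamiliar: it is a deterministic predicate over a typed snapshot of environment state, checked after a tool call instead of asking an LLM judge. The sketch below is illustrative only; the `OrderState` class, field names, and predicate are assumptions for exposition, not fewwords' actual engine.

```python
# Hypothetical sketch of a typed-state postcondition check.
# All names here are illustrative, not the fewwords API.
from dataclasses import dataclass


@dataclass(frozen=True)
class OrderState:
    order_id: str
    status: str      # e.g. "pending", "cancelled", "refunded"
    refunded: bool


def cancel_postcondition(before: OrderState, after: OrderState) -> bool:
    """A cancel tool call must flip status and must not silently refund."""
    return (
        after.order_id == before.order_id
        and after.status == "cancelled"
        and after.refunded == before.refunded
    )


before = OrderState("W123", "pending", False)
after = OrderState("W123", "cancelled", False)
assert cancel_postcondition(before, after)  # deterministic, no API call
```

Because the predicate is plain code over typed state, it is exact on the policy classes it encodes; the recall gap above is the set of conditional/semantic policies that have not (yet) been encoded this way.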
Writeups
Raw artifacts
Per-trace JSONL outputs. Every record has the fewwords verdict + the judge's reasoning + Opus's tiebreaker when applicable. These are the inputs a partner would spot-check to verify our numbers.
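A spot-check script might look like the following. The filename and field names (`fewwords_verdict`, `judge_reasoning`, `opus_tiebreaker`, `trace_id`) are assumptions about the schema, so check them against the actual artifacts before running.

```python
# Illustrative spot-check of the per-trace JSONL artifacts.
import json
from collections import Counter

verdicts = Counter()
with open("taubench_per_trace.jsonl") as f:  # hypothetical filename
    for line in f:
        record = json.loads(line)
        verdicts[record["fewwords_verdict"]] += 1
        # Tiebroken records are the interesting ones to read by hand.
        if record.get("opus_tiebreaker") is not None:
            print(record.get("trace_id"), record["judge_reasoning"][:120])

print(verdicts)  # overall verdict distribution to compare against the writeups
```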
Reproducing
The fewwords source is public (MIT); the benchmark harness runs from a clone:
pip install git+https://github.com/abhishek5878/fewwords.git
# Benchmark harness scripts live in the invite-only repo during early access.
# Email abhishekvyas02032001@gmail.com for clone access; the
# raw per-trace JSONL artifacts are publicly downloadable from the
# /benchmarks index above.
The fully reproducible leaderboard with multi-rater Fleiss' kappa lands by 2026-05-15.
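For readers who want to sanity-check the agreement math ahead of that release, Fleiss' kappa fits in a few lines. This is a generic sketch of the statistic itself, not our scoring harness; the toy ratings matrix is made up.

```python
# Minimal Fleiss' kappa: rows are items (traces), columns are categories,
# and each cell counts the raters who chose that category for that item.
# Assumes every item was rated by the same number of raters n.
import numpy as np


def fleiss_kappa(counts: np.ndarray) -> float:
    N, _ = counts.shape
    n = counts[0].sum()                                  # raters per item
    p_j = counts.sum(axis=0) / (N * n)                   # category proportions
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_bar, P_e = P_i.mean(), (p_j ** 2).sum()            # observed vs. chance
    return (P_bar - P_e) / (1 - P_e)


# Toy example: 4 traces, 3 raters, verdict categories {pass, fail}.
ratings = np.array([[3, 0], [0, 3], [2, 1], [3, 0]])
print(round(fleiss_kappa(ratings), 3))  # 0.625
```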
Total cost to re-derive every LLM-judged number: $8.55. The postconditions writeup added zero API spend (deterministic re-run on the same corpus).