The receipts

Benchmarks: every number, every trace, every reason.

fewwords was run against 1,980 real τ-bench trajectories (Sierra Research, MIT) and head-to-head against Claude Sonnet 4.6, with Opus 4.7 as tiebreaker. Everything below is the primary evidence: markdown writeups with methodology and numbers, plus raw per-trace JSONL so you can spot-check any individual verdict.

Summary:

- fewwords: 0.14 ms p50, $0, deterministic, 100% parseable output.
- Claude Sonnet 4.6: 6,146 ms p50, $0.0155, hallucinates ~10% of the time, fails to emit parseable JSON 14% of the time.
- Against an independent policy audit, fewwords agrees 80–82%; Claude agrees 54–62% on the same audit.
- fewwords had a measured ~28% recall gap on conditional and semantic policy classes; the typed-state postconditions release closes 19 of 72 audited Sonnet-unique flags (26%) at 0.126 ms p50. See the postconditions writeup.
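For the latency headline, p50 is just the median of the per-trace timings. A minimal sketch of how you could re-derive it from a timing log, assuming a hypothetical one-object-per-line format with a `latency_ms` field (illustrative, not the real schema):

```python
import statistics

# Hypothetical per-trace timing records; in practice you'd json.loads()
# each line of the downloaded JSONL. The field name is an assumption.
records = [
    {"latency_ms": 0.12},
    {"latency_ms": 0.14},
    {"latency_ms": 0.17},
    {"latency_ms": 0.13},
    {"latency_ms": 0.15},
]

latencies = [r["latency_ms"] for r in records]
p50 = statistics.median(latencies)  # p50 is the 50th percentile, i.e. the median
print(f"p50 latency: {p50:.2f} ms")  # prints: p50 latency: 0.14 ms
```

Swap in the real artifact and field name and the same three lines reproduce any p50 figure above.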

Writeups

Raw artifacts

Per-trace JSONL outputs. Every record contains the fewwords verdict, the judge's reasoning, and Opus's tiebreaker when applicable. These are the inputs a partner would spot-check to verify our numbers.
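A spot-check loop over those records can be sketched in a few lines. The field names below (`fewwords_verdict`, `judge_verdict`, `tiebreaker`) are assumptions standing in for the real schema; the point is the shape of the check, treating the tiebreaker verdict as the reference when one exists:

```python
import json

# Stand-in JSONL lines; in practice, iterate over the downloaded file.
# Field names are hypothetical, not the published schema.
lines = [
    '{"trace_id": "t1", "fewwords_verdict": "pass", "judge_verdict": "pass"}',
    '{"trace_id": "t2", "fewwords_verdict": "fail", "judge_verdict": "pass", "tiebreaker": "pass"}',
    '{"trace_id": "t3", "fewwords_verdict": "pass", "judge_verdict": "fail", "tiebreaker": "pass"}',
]

agree = 0
for line in lines:
    rec = json.loads(line)
    # If Opus broke a tie, its verdict is the reference; otherwise the judge's is.
    reference = rec.get("tiebreaker", rec["judge_verdict"])
    if rec["fewwords_verdict"] == reference:
        agree += 1

print(f"agreement: {agree}/{len(lines)}")  # prints: agreement: 2/3
```

Running the same loop over the full artifacts is the sanity check we'd expect a partner to do.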

Reproducing

Source is public, MIT-licensed. To reproduce in a clone:

pip install git+https://github.com/abhishek5878/fewwords.git
# Benchmark harness scripts live in the invite-only repo during early access.
# Email abhishekvyas02032001@gmail.com for clone access; the
# raw per-trace JSONL artifacts are publicly downloadable from the
# /benchmarks index above.

The fully-reproducible leaderboard with multi-rater Fleiss' kappa lands by 2026-05-15.

Total cost to re-derive every LLM-judged number: $8.55. The postconditions writeup added zero API spend (deterministic re-run on the same corpus).
