Stop your AI agent
from doing
something stupid.
Write the rules once. We block bad calls before they run. Sign every receipt.
LLM-as-judge asks another AI if your AI looked okay. fewwords checks
what your AI actually did against the rules you wrote. No AI grading
the AI. Sub-millisecond. Works with whatever agent stack you’re
already on.
Runs alongside the tools you already own.
Different question, different insertion point.
The LLM observability and agent-tooling space has nine shipping products that grade what happened. fewwords decides what’s allowed to happen. Every buyer asks the same four questions on the call. We put the answers on a page.
validate_payment passes every string check. The trajectory contract catches it.
Every team hits the same wall.
The shape never changes.
Agents that pass every LLM-graded eval still skip the one step that matters. It’s not a model problem. It’s an evaluation-unit problem.
LLM-as-judge scores the turn.
We score the trajectory.
Same trace. Two verdicts. Tone is the wrong unit of analysis for autonomous agents. Pre-execution contracts on the tool sequence catch what post-hoc grading never will.
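To make that concrete, here is a minimal sketch of a trajectory contract. The `preconditions` / `requires` keys are assumptions about the schema, shown only to convey the shape; `banned_tools` is the one key that appears verbatim elsewhere on this page:

```yaml
# Illustrative sketch. "preconditions" / "requires" are assumed key
# names, not the shipped assertion types.
preconditions:
  submit_payment:
    requires: [validate_payment]  # block the call if validation never ran
banned_tools: [drop_database, delete_environment]
```

The check runs on the tool sequence itself, before dispatch, which is why no LLM is needed to evaluate it.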
- 6,146 ms median per judgment vs our 0.14 ms. ~44,000× slower.
- Hallucinates on ~10% of its flags (Opus 4.7-corroborated).
- Fails JSON output on 14% of traces despite a strict prompt.
- Non-deterministic: same input, different verdict on re-run.
- $0.0155 per trace. At 10M calls/mo that’s $155K.
- Deterministic YAML checked against the tool sequence.
- 0.14 ms median on 1,980 τ-bench traces (full check stack).
- Pre-execution guard in the dispatcher, or blocking in CI.
- Every violation cites the exact node where it fired.
- No LLM in the hot path. 100% JSON-parseable. Same input, same verdict, forever.
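As a sketch of what “cites the exact node” might look like, here is a hypothetical verdict report; every field name below is an assumption, not the documented output format:

```yaml
# Hypothetical violation report; field names are assumptions.
verdict: violation
rule: preconditions.submit_payment   # which assertion fired
node: 7                              # exact trace node where it fired
tool: submit_payment
missing_prerequisites: [validate_payment]
deterministic: true                  # same trace, same report, every run
```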
Third-party benchmark.
τ-bench, 1,980 trajectories, head-to-head vs Claude Sonnet 4.6.
Sierra Research’s τ-bench is the public benchmark that the 2026 academic landscape (ToolGate, Solver-Aided, TraceSafe) converges on most closely. We ran every airline + retail trajectory through fewwords, independently audited 300 of them by re-reading the raw messages against the policy docs, and compared against Claude Sonnet 4.6 with Opus 4.7 as tiebreaker. All code + results on GitHub; every run reproducible.
| Metric | fewwords | Claude Sonnet 4.6 | Claude Opus 4.7 |
|---|---|---|---|
| Latency p50 | 0.14 ms | 6,146 ms | 6,310 ms |
| Cost per trace | $0.00 | $0.0155 | $0.0748 |
| Audit agreement (300-trace rigorous audit) | 80% | 57% | 82% |
| Precision on flagged violations (audited subset, post-postconditions) | airline 100%, retail 54–75% | not measured here | not measured here |
| Coverage of τ-bench classes (today) | 26% | universal LLM | universal LLM |
| Hallucinates a violation | 0% (deterministic) | ~10% | lower |
| JSON output adherence | 100% (by design) | 86% | 100% |
| Determinism | Yes | No | No |
fewwords misses ~28% of policy violations in the “conditional + semantic” long tail. An Opus 4.7-corroborated pass on 300 traces categorized the gap: 75% is implicit-vs-explicit consent (the user says “let’s go with option 1”; policy demands a literal “yes”), 17% is conditional state (“basic economy cannot be modified”), and 8% is data-flow (the agent used data the user never provided).
Those three categories map directly to the next three primitives we ship (strict_consent_only, postconditions + typed state, capability labels). The right production pattern is fewwords-first with an optional LLM-judge fallback for semantic-only classes, not one-or-the-other.
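A hedged sketch of how those three primitives could read in a contract. Only the primitive names come from this page; the YAML layout around them is guessed:

```yaml
# Sketch only: layout and value syntax are assumptions.
strict_consent_only:
  cancel_booking: true    # literal "yes" required, not "let's go with option 1"
postconditions:
  modify_flight: 'state.fare_class != "basic_economy"'  # conditional state
capability_labels:
  send_email: [user_provided_data_only]                 # data-flow guard
```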
Send us one real trace. We’ll write the contract.
One sanitized agent trace from your production run. We
read it against your graph.py, write a
tailored .trajeval.yml, and send back a
one-page report showing exactly what it would’ve
caught.
15 minutes of your time. Abhishek reads every submission personally and replies by email — typically within two business days during early access. No contract, no NDA, no commit. Just the writeup.
Same agent, same task. Different trajectory.
Pick a vertical. The left pane shows a trace that LLM-as-judge would happily score “helpful.” The right pane shows what the same agent does when a trajectory contract is enforced at the dispatcher.
Questions a contract answers.
LLM-as-judge never asks.
Every silent-skip incident is an “obvious”
precondition that nobody wrote down. Each question below maps
to an assertion type in .trajeval.yml. One line.
Zero ambiguity at runtime.
Did validate_payment actually run? A terminal tool (submit_payment, finish, confirm) can’t fire until its prerequisites have actually run. Blocked at the dispatcher, not explained in a postmortem.
Was the action approved? When tool_output.approved == false, the next action simply can’t fire.
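In .trajeval.yml terms, each of those could plausibly be one line. The keys below are illustrative stand-ins for the real assertion types:

```yaml
# Illustrative one-liners; exact assertion names may differ.
preconditions: { submit_payment: [validate_payment] }  # "did validation run?"
block_if: { tool_output.approved: false }              # "was it approved?"
```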
A YAML loader takes an evening.
The reliability engineering below it takes months.
Your best engineer can wrap a check loop around a trace JSON in a sprint. That’s the 10%. The 90% that actually matters in prod (adapters, corpus, guard performance, synthesis) is what’s underneath.
Pick your vertical. Paste a trace. See what’s missing.
Each vertical page preloads 2–4 genuine traces and a matched reference corpus. The “what we noticed” card surfaces tools your paste is missing: the silent-skip finding your judge grades green.
12 real production incidents. Reconstructed. Each one catchable.
Every incident on /prove ships with a public postmortem link, the reconstructed trace JSON, and the 1–14 line YAML contract that catches it. Real failures, reconstructed transparently, not observed in a customer log. Sophisticated buyers ask the question; the answer is in the repo.
- Multi-rater Fleiss’ kappa. The audit was done by the founder. External blind raters are being commissioned; Fleiss’ kappa across all three publishes by 2026-05-15, before the public leaderboard launches.
- Underwriter LOI. No paid LOI from an insurance underwriter yet. Insurance integration is a forecast, not a booking.
- “100% recall on covered classes”: we used to cite this without context. It’s technically true (recall on rule-expressible classes is 100%) but, read alone, it sounds tautological. The audit-agreement number above (80%) and the precision (69.5%) are the honest framing: of 300 traces, fewwords agrees with the auditor on 239 verdicts; of the 200 it flags, 139 are policy-correct and 61 are over-flags, mostly the tool_repeat safety-net layer firing on retry-storms that aren’t themselves policy violations.
The agent that ran DROP DATABASE on production.
A Replit coding agent wiped the production Postgres during routine maintenance. The destructive call was the third tool in the trace. Nothing in the agent’s loop blocked it. A tone grade, had one been run, would have come back fine. The agent was articulate about what it was doing.
The contract is two lines of YAML:
banned_tools: [drop_database, delete_environment]
dangerous_input_patterns: { execute_sql: ['\bDROP\s+TABLE\b', '\bDROP\s+DATABASE\b'] }
Checked in 0.01 ms. Non-zero exit from the dispatcher. Postmortem never written. Full trace + contract on /prove.
Drop into the dispatcher. Ship a contract.
The runtime guard is the real surface: a sub-millisecond pre-execution check that blocks the bad call before it fires. The hosted playground below is for when you just want to see the shape.
curl -X POST https://fewword-ai.fly.dev/v1/analyze -H 'Content-Type: application/json' -d @trace.json
Not ready yet?
Leave an email. We’ll ping you when PyPI ships, when the LangGraph auto-hook lands, or when a major vertical template drops. No marketing blast.