Stop your AI agent
from doing
something stupid.
Write the rules once. We block bad calls before they run. Sign every receipt.
LLM-as-judge asks another AI if your AI looked okay. fewwords checks
what your AI actually did against the rules you wrote. No AI grading
the AI. Sub-millisecond. Works with whatever agent stack you’re
already on.
Runs alongside the tools you already own.
Different question, different insertion point.
The LLM observability and agent-tooling space has nine shipping products that grade what happened. fewwords decides what’s allowed to happen. Every buyer asks the same four questions on the call. We put the answers on a page.
validate_payment passes every string check. The trajectory contract catches it.
Every team hits the same wall.
The shape never changes.
Agents that pass every LLM-graded eval still skip the one step that matters. It’s not a model problem. It’s an evaluation-unit problem.
LLM-as-judge scores the turn.
We score the trajectory.
Same trace. Two verdicts. Tone is the wrong unit of analysis for autonomous agents. Pre-execution contracts on the tool sequence catch what post-hoc grading never will.
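To make that concrete, here is a minimal sketch of a trajectory contract. The `preconditions` / `requires` keys are assumptions about the schema, shown only to convey the shape; `banned_tools` is the one key that appears verbatim elsewhere on this page:

```yaml
# Illustrative sketch. "preconditions" / "requires" are assumed key
# names, not the shipped assertion types.
preconditions:
  submit_payment:
    requires: [validate_payment]  # block the call if validation never ran
banned_tools: [drop_database, delete_environment]
```

The check runs on the tool sequence itself, before dispatch, which is why no LLM is needed to evaluate it.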
- 6,146 ms median per judgment vs our 0.14 ms. ~44,000× slower.
- Hallucinates on ~10% of its flags (Opus 4.7-corroborated).
- Fails JSON output on 14% of traces despite a strict prompt.
- Non-deterministic: same input, different verdict on re-run.
- $0.0155 per trace. At 10M calls/mo that’s $155K.
- Deterministic YAML checked against the tool sequence.
- 0.14 ms median on 1,980 τ-bench traces (full check stack).
- Pre-execution guard in the dispatcher, or blocking in CI.
- Every violation cites the exact node where it fired.
- No LLM in the hot path. 100% JSON-parseable. Same input, same verdict, forever.
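As a sketch of what “cites the exact node” might look like, here is a hypothetical verdict report; every field name below is an assumption, not the documented output format:

```yaml
# Hypothetical violation report; field names are assumptions.
verdict: violation
rule: preconditions.submit_payment   # which assertion fired
node: 7                              # exact trace node where it fired
tool: submit_payment
missing_prerequisites: [validate_payment]
deterministic: true                  # same trace, same report, every run
```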
Third-party benchmark.
τ-bench, 1,980 trajectories, head-to-head vs Claude Sonnet 4.6.
Sierra Research’s τ-bench is the public benchmark that the 2026 academic landscape (ToolGate, Solver-Aided, TraceSafe) converges on most closely. We ran every airline + retail trajectory through fewwords, independently audited 300 of them by re-reading the raw messages against the policy docs, and compared against Claude Sonnet 4.6 with Opus 4.7 as tiebreaker. All code + results on GitHub; every run reproducible.
| Metric | fewwords | Claude Sonnet 4.6 | Claude Opus 4.7 |
|---|---|---|---|
| Latency p50 | 0.14 ms | 6,146 ms | 6,310 ms |
| Cost per trace | $0.00 | $0.0155 | $0.0748 |
| Audit agreement (300-trace rigorous audit) | 80% | 57% | 82% |
| Precision on flagged violations (audited subset, post-postconditions) | airline 100%, retail 54–75% | not measured here | not measured here |
| Coverage of τ-bench classes (today) | 26% | universal LLM | universal LLM |
| Hallucinates a violation | 0% (deterministic) | ~10% | lower |
| JSON output adherence | 100% (by design) | 86% | 100% |
| Determinism | Yes | No | No |
fewwords misses ~28% of policy violations in the “conditional + semantic” long tail. An Opus 4.7-corroborated pass on 300 traces categorized the gap: 75% is implicit-vs-explicit consent (the user says “let’s go with option 1”; policy demands a literal “yes”), 17% is conditional state (“basic economy cannot be modified”), and 8% is data-flow (the agent used data the user never provided).
Those three categories map directly to the next three primitives we ship (strict_consent_only, postconditions + typed state, capability labels). The right production pattern is fewwords-first with an optional LLM-judge fallback for semantic-only classes, not one-or-the-other.
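A hedged sketch of how those three primitives could read in a contract. Only the primitive names come from this page; the YAML layout around them is guessed:

```yaml
# Sketch only: layout and value syntax are assumptions.
strict_consent_only:
  cancel_booking: true    # literal "yes" required, not "let's go with option 1"
postconditions:
  modify_flight: 'state.fare_class != "basic_economy"'  # conditional state
capability_labels:
  send_email: [user_provided_data_only]                 # data-flow guard
```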
Send us one real trace. We’ll write the contract.
One sanitized agent trace from your production run. We
read it against your graph.py, write a
tailored .trajeval.yml, and send back a
one-page report showing exactly what it would’ve
caught.
15 minutes of your time. Abhishek reads every submission personally and replies by email — typically within two business days during early access. No contract, no NDA, no commit. Just the writeup.
Same agent, same task. Different trajectory.
Pick a vertical. The left pane shows a trace that LLM-as-judge would happily score “helpful.” The right pane shows what the same agent does when a trajectory contract is enforced at the dispatcher.
Questions a contract answers.
LLM-as-judge never asks.
Every silent-skip incident is an “obvious”
precondition that nobody wrote down. Each question below maps
to an assertion type in .trajeval.yml. One line.
Zero ambiguity at runtime.
Did validate_payment actually run? A terminal tool (submit_payment, finish, confirm) can’t fire until its prerequisites have actually run. Blocked at the dispatcher, not explained in a postmortem.
Was the action approved? When tool_output.approved == false, the next action simply can’t fire.
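In .trajeval.yml terms, each of those could plausibly be one line. The keys below are illustrative stand-ins for the real assertion types:

```yaml
# Illustrative one-liners; exact assertion names may differ.
preconditions: { submit_payment: [validate_payment] }  # "did validation run?"
block_if: { tool_output.approved: false }              # "was it approved?"
```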
A YAML loader takes an evening.
The reliability engineering below it takes months.
Your best engineer can wrap a check loop around a trace JSON in a sprint. That’s the 10%. The 90% that actually matters in prod (adapters, corpus, guard performance, synthesis) is what’s underneath.
Pick your vertical. Paste a trace. See what’s missing.
Each vertical page preloads 2–4 genuine traces and a matched reference corpus. The “what we noticed” card surfaces tools your paste is missing: the silent-skip finding your judge grades green.
12 real production incidents. Reconstructed. Each one catchable.
Every incident on /prove ships with a public postmortem link, the reconstructed trace JSON, and the 1–14 line YAML contract that catches it. Real failures, reconstructed transparently, not observed in a customer log. Sophisticated buyers ask the question; the answer is in the repo.
- Multi-rater Fleiss’ kappa. The audit was done by the founder. External blind raters are being commissioned; Fleiss’ kappa across all three publishes by 2026-05-15, before the public leaderboard launches.
- Underwriter LOI. No paid LOI from an insurance underwriter yet. Insurance integration is a forecast, not a booking.
- “100% recall on covered classes”: we used to cite this without context. It’s technically true (recall on rule-expressible classes is 100%) but, read alone, it sounds tautological. The audit-agreement number above (80%) and the precision (69.5%) are the honest framing: of 300 traces, fewwords agrees with the auditor on 239 verdicts; of the 200 it flags, 139 are policy-correct and 61 are over-flags, mostly the tool_repeat safety-net layer firing on retry-storms that aren’t themselves policy violations.
The agent that ran DROP DATABASE on production.
A Replit coding agent wiped the production Postgres during routine maintenance. The destructive call was the third tool in the trace. Nothing in the agent’s loop blocked it. A tone grade, had one been run, would have come back fine. The agent was articulate about what it was doing.
The contract is two lines of YAML:
banned_tools: [drop_database, delete_environment]
dangerous_input_patterns: { execute_sql: ['\bDROP\s+TABLE\b', '\bDROP\s+DATABASE\b'] }
Checked in 0.01 ms. Non-zero exit from the dispatcher. Postmortem never written. Full trace + contract on /prove.
Drop into the dispatcher. Ship a contract.
The runtime guard is the real surface: a sub-millisecond pre-execution check that blocks the bad call before it fires. The hosted playground below is for when you just want to see the shape.
curl -X POST https://fewword-ai.fly.dev/v1/analyze -H 'Content-Type: application/json' -d @trace.json
Not ready yet?
Leave an email. We’ll ping you when PyPI ships, when the LangGraph auto-hook lands, or when a major vertical template drops. No marketing blast.