The honest answer to the question every careful reader asks: our drift detection today is check-fire-rate, not learned trajectory shape.
The pitch sometimes makes it sound like we already have a per-agent learned distribution of “normal” trajectories. We don’t. Not yet. Here’s the exact gap, what I’m doing about it, and why I’m saying it in public.
The claim I sometimes make
fewwords catches drift in your agent’s behaviour.
The claim I can actually back
fewwords catches statistically significant changes in the rate at which your contracts fire, batch over batch.
These two claims are not the same. The first implies a richer, per-agent learned baseline that notices the shape of an agent's typical trajectory. The second is a simpler, aggregate measure: if your rule fired on 3% of traces last week and 14% this week, with p < 0.01 and an absolute delta of 11 percentage points, the CLI exits non-zero and your CI blocks the release.
The second is real and shipping. The first is on the roadmap, branded internally as AP4, and not yet built.
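For the curious, the shipping mechanism is small enough to sketch in full. A minimal version, assuming a two-sided two-proportion z-test; the function name, thresholds, and trace counts here are illustrative, not fewwords's actual API:

```python
import math
import sys

def fire_rate_gate(fired_prev, n_prev, fired_curr, n_curr,
                   p_threshold=0.01, min_delta=0.05):
    """Two-proportion z-test on a contract's fire rate, batch over batch.
    Gate only when the change is both statistically significant and large
    enough in absolute terms to matter."""
    p1, p2 = fired_prev / n_prev, fired_curr / n_curr
    pooled = (fired_prev + fired_curr) / (n_prev + n_curr)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_prev + 1 / n_curr))
    z = (p2 - p1) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value
    return p_value < p_threshold and abs(p2 - p1) >= min_delta

# 3% of 2,000 traces last week, 14% of 2,000 this week: the gate fires
# and the non-zero exit blocks the release in CI.
if fire_rate_gate(fired_prev=60, n_prev=2000, fired_curr=280, n_curr=2000):
    sys.exit(1)
```

The absolute-delta floor matters as much as the p-value: at high trace volumes, a statistically significant but operationally trivial wiggle should not block a release.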
Why the gap isn’t critical yet
For the first wave of customers, check-fire-rate is enough. The failures we catch in the corpus are blunt: banned tools, missing prior-work, schema-thin outputs, order inversions. You do not need a per-agent learned distribution to catch `DROP DATABASE`. You need a banned-tools list.
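A banned-tools contract really is about this much code. A sketch, with a hypothetical trace schema and made-up tool names:

```python
# Hypothetical contract: fail any trace that calls a banned tool.
BANNED_TOOLS = {"sql.drop_database", "shell.rm_rf"}  # illustrative names

def banned_tool_violations(trace: dict) -> list[dict]:
    """Return every tool call in the trace that hits the banned list."""
    return [call for call in trace["tool_calls"]
            if call["name"] in BANNED_TOOLS]
```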
The per-agent learned baseline becomes important later, when the failures the customer cares about are subtler than “agent called a banned tool.” “Our agent used to take 6 tool calls on average for this workflow and now takes 11, and nothing is outright wrong, but something has changed.” That’s the shape-of-trajectory question, and that’s where AP4 lands.
What I’d build
Three pieces.
Sequence distribution per agent. A per-(`agent_id`, `workflow_type`) distribution over typical tool sequences. Not just frequency: position in the sequence, pairwise lag between tools, depth of the call tree. You want it to be the kind of distribution you can compute from 5k traces and update online.
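A minimal sketch of what that could look like, standing in positional counts and adjacent-pair counts for the full pairwise-lag and call-tree statistics; the class and field names are mine, not AP4's:

```python
from collections import defaultdict

class TrajectoryBaseline:
    """Per-(agent_id, workflow_type) distribution over tool sequences.
    Tracks which tool appears at which position and which tool follows
    which; cheap to build from ~5k traces and to update one trace at a
    time. Call-tree depth and longer lags omitted for brevity."""

    def __init__(self):
        self.position = defaultdict(lambda: defaultdict(int))  # pos -> tool -> count
        self.bigram = defaultdict(lambda: defaultdict(int))    # prev -> next -> count
        self.n_traces = 0

    def update(self, tools):
        """Online update from one trace's ordered tool names."""
        self.n_traces += 1
        for i, tool in enumerate(tools):
            self.position[i][tool] += 1
        for prev, nxt in zip(tools, tools[1:]):
            self.bigram[prev][nxt] += 1

# One baseline per (agent_id, workflow_type) pair.
baselines = defaultdict(TrajectoryBaseline)
baselines[("agent-7", "refund-flow")].update(["search", "fetch", "summarise"])
```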
Drift as distributional distance, not count. Once you have the distribution, drift is a distance between this week’s empirical distribution and last week’s, with per-class breakdowns for which dimensions are moving. MMD between sequence embeddings, or KL between discretised trajectories. You pick the one that’s robust on your data.
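Here is the KL variant in sketch form, assuming a deliberately naive discretisation (the first k tool names) and Laplace smoothing over the union support; swap in MMD over embeddings if your trajectories don't discretise cleanly:

```python
import math
from collections import Counter

def discretise(tools, k=4):
    """Collapse a trajectory to a coarse key: its first k tool names.
    One of many possible discretisations."""
    return tuple(tools[:k])

def trajectory_kl(this_week, last_week, alpha=0.5):
    """Smoothed KL(this week || last week) over discretised trajectories.
    Laplace smoothing keeps keys unseen in one batch finite."""
    p = Counter(map(discretise, this_week))
    q = Counter(map(discretise, last_week))
    keys = set(p) | set(q)
    n_p = sum(p.values()) + alpha * len(keys)
    n_q = sum(q.values()) + alpha * len(keys)
    return sum(
        ((p[k] + alpha) / n_p)
        * math.log(((p[k] + alpha) / n_p) / ((q[k] + alpha) / n_q))
        for k in keys
    )
```

The per-key terms inside that sum are also your per-class breakdown: sort them and you can name which trajectory shapes are doing the moving.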
Anomaly as likelihood, not rule. For any individual trajectory, the likelihood under the baseline is a score. Low-likelihood traces go to a human-review queue with the top-3 nearest-baseline matches attached. This is the feature that makes The Ledger useful instead of overwhelming at volume.
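A sketch of the scoring and queueing, reusing the bigram baseline from the first piece; the likelihood threshold and the similarity measure are placeholders:

```python
import difflib
import math

def log_likelihood(tools, baseline, alpha=0.5):
    """Log-probability of one trajectory under the bigram baseline,
    with additive smoothing so unseen transitions score low but finite."""
    vocab = set(baseline.bigram) | {t for row in baseline.bigram.values() for t in row}
    score = 0.0
    for prev, nxt in zip(tools, tools[1:]):
        row = baseline.bigram.get(prev, {})
        total = sum(row.values()) + alpha * max(len(vocab), 1)
        score += math.log((row.get(nxt, 0) + alpha) / total)
    return score

def nearest_baseline_matches(tools, common_sequences, k=3):
    """Top-k most similar known-good sequences, attached to each queued trace."""
    return sorted(common_sequences,
                  key=lambda s: difflib.SequenceMatcher(None, tools, s).ratio(),
                  reverse=True)[:k]

def review_queue(traces, baseline, threshold=-25.0):
    """Low-likelihood traces, worst first, for the human-review queue."""
    scored = [(log_likelihood(t, baseline), t) for t in traces]
    return sorted([st for st in scored if st[0] < threshold],
                  key=lambda st: st[0])
```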
Why I’m not there yet
Two reasons.
One. The right thing to do when you’re pre-revenue is to ship the blunt primitive that catches 80% of the cases a customer has right now, then use the revenue to fund the subtler primitive that catches the long tail. Shipping a learned per-agent baseline against a customer corpus of ten traces is noise.
Two. AP4 needs labelled data at scale. The check-fire-rate mechanism produces exactly that: every override, every false-positive flag, every “mark as intended” is a label. The right order is ship the blunt thing, earn the labels, then train on them. Running that in reverse is how research prototypes stay research prototypes.
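For concreteness, the kind of record that mechanism throws off for free might look like this; the field names are hypothetical, not fewwords's actual schema:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class ContractLabel:
    """One human verdict on one contract firing: exactly the supervision
    a learned per-agent baseline trains on."""
    trace_id: str
    contract_id: str
    fired: bool
    verdict: str          # "override" | "false_positive" | "mark_as_intended"
    labelled_at: datetime
```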
Why saying this publicly is the right move
Senior engineers can smell overclaiming. If the pitch says “we catch drift” and the engineer reads the code and sees check-fire-rate, that engineer stops reading the rest of the pitch. Everything after the first caught lie is discounted. A founder who pre-emptively names the limit is one you read the rest of.
There is also a second-order effect. When you say the thing in writing, you commit to the timeline. AP4 isn’t drifting in a Notion doc; it’s on the record, with a public deadline, and if it’s not shipped by Q3 I owe you an essay about why.
The honest summary
- What I have: blunt, deterministic, sub-millisecond trajectory contracts + batch-level fire-rate drift detection.
- What I don’t have yet: a per-agent learned distribution of normal trajectories with per-trace anomaly scores.
- What I’m building next: exactly that, from labels the current system produces.
- What’s already enough: everything in the incident corpus. Replit through Kiro, fourteen for fourteen, 0.01 ms.
If your failure mode is in the corpus, the blunt primitive catches it today. If your failure mode is subtler, tell me and I’ll prioritise.