
Evals that actually catch regressions before users do.

The eval suite most teams ship with is a confidence-builder, not a regression detector. Here's the structure we use to catch real failures earlier.

By SmartDuke Team·Apr 22, 2026·12 min


In brief

Most AI eval suites are confidence-builders, not regression detectors: they sample happy-path inputs and let real failures slip through. Production-grade evals combine three layers: unit checks on tool calls and structured outputs, frozen LLM-as-judge regression sets that block a deploy when scores drop, and continuous sampling of live traces so drift is caught before it shows up in user complaints.

Almost every team building an AI product writes evals. Almost none catch the regressions that matter. The eval scripts get written early, pass once, and then quietly fall behind the product — running on stale prompts, missing the edge cases that actually break in production, and giving teams a green light that means nothing.

The eval suites that actually move the needle look different. They're built around three layers, each catching a different class of failure.

Layer 1 — Unit evals on the boring stuff.

Tool calls, structured outputs, JSON shape, citation presence, refusal behavior on out-of-scope queries. These are the failure modes classic test suites can catch — and most teams skip them, treating LLM outputs as too fuzzy to assert on. They're not. The structured shape of an agent's output is testable; the natural-language content is what's fuzzy.
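As a rough illustration of what asserting on structure can look like, here is a minimal sketch. The output shape (`answer`, `citations`, `tool_calls`, `refused`) and the tool names in `ALLOWED_TOOLS` are assumptions made up for this example, not a format the essay prescribes; swap in whatever your agent actually emits.

```python
import json

# Hypothetical tool registry: tool name -> exact set of argument names it accepts.
ALLOWED_TOOLS = {"search_docs": {"query"}, "get_order": {"order_id"}}

# Assumed top-level fields of the agent's JSON output and their expected types.
REQUIRED_FIELDS = (
    ("answer", str),
    ("citations", list),
    ("tool_calls", list),
    ("refused", bool),
)


def check_agent_output(raw: str, *, expect_refusal: bool = False) -> list[str]:
    """Return the list of failed structural checks (an empty list means pass)."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(out, dict):
        return ["top-level value is not an object"]

    failures = [
        f"{key}: missing or not {typ.__name__}"
        for key, typ in REQUIRED_FIELDS
        if not isinstance(out.get(key), typ)
    ]
    if failures:
        return failures  # shape is wrong, so the later checks would be noise

    for call in out["tool_calls"]:
        name = call.get("name") if isinstance(call, dict) else None
        if name not in ALLOWED_TOOLS:
            failures.append(f"unknown tool: {name!r}")
            continue
        args = call.get("arguments")
        if not isinstance(args, dict) or set(args) != ALLOWED_TOOLS[name]:
            failures.append(f"{name}: bad arguments")

    if expect_refusal and not out["refused"]:
        failures.append("expected a refusal on an out-of-scope query")
    if not expect_refusal and not out["refused"] and not out["citations"]:
        failures.append("answer has no citations")
    return failures
```

None of these checks read the natural-language answer. They only assert on shape, tool usage, citation presence, and refusal behavior, which is why they can run in an ordinary test suite on every commit.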


Layer 2 — LLM-as-judge regression suites.

A frozen test set of representative prompts, scored against an explicit rubric by a model stronger than the one under test. Run it before every deploy. Block the deploy if scores drop on any high-stakes category. The trick is keeping the test set frozen and the rubric explicit: no vibes-based scoring, no moving target.
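The deploy gate itself can be very small once the judge has produced scores. The sketch below assumes each judged case is reduced to a `(category, score)` pair and that the scores from the last good release are stored as the baseline. The function names, the `tolerance` parameter, and the category labels are illustrative assumptions, and the judge call is deliberately left out.

```python
from statistics import mean


def category_means(results):
    """Average judge scores per category from (category, score) pairs."""
    by_category = {}
    for category, score in results:
        by_category.setdefault(category, []).append(score)
    return {c: mean(scores) for c, scores in by_category.items()}


def regressions(baseline, candidate, high_stakes, tolerance=0.0):
    """Return {category: (baseline_mean, candidate_mean)} for every
    high-stakes category whose mean score dropped by more than `tolerance`."""
    base = category_means(baseline)
    cand = category_means(candidate)
    dropped = {}
    for category in high_stakes:
        if category not in base or category not in cand:
            # A high-stakes category with no scores is a broken run, not a pass.
            raise KeyError(f"no scores for high-stakes category {category!r}")
        if base[category] - cand[category] > tolerance:
            dropped[category] = (base[category], cand[category])
    return dropped
```

In a CI job, a non-empty result from `regressions(...)` would translate to a non-zero exit code so the pipeline stops before the deploy step. A small `tolerance` is one way to absorb judge noise, at the cost of letting very small real drops through.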

Layer 3 — Continuous production sampling.

Sample 1–2% of live traces, score them with the same rubric, alert on degradation. This is where you catch drift, abuse, edge cases your test set didn't anticipate, and the slow performance creep that production loads cause but staging never reveals.
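One way to sketch the sampling and alerting side, under stated assumptions: traces carry a string ID, the sample is chosen by hashing that ID so the decision is reproducible, and an alert fires when the rolling mean of rubric scores falls more than a fixed amount below a baseline. `should_sample`, `DegradationAlert`, and all thresholds here are hypothetical names and values chosen for illustration.

```python
import hashlib
from collections import deque


def should_sample(trace_id: str, rate: float = 0.02) -> bool:
    """Deterministically pick roughly `rate` of traces by hashing the trace ID."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < rate


class DegradationAlert:
    """Flags when the rolling mean score drops too far below a baseline."""

    def __init__(self, baseline: float, max_drop: float, window: int = 200):
        self.baseline = baseline
        self.max_drop = max_drop
        self.scores = deque(maxlen=window)

    def add(self, score: float) -> bool:
        """Record one judged trace; return True if the alert should fire."""
        self.scores.append(score)
        if len(self.scores) < self.scores.maxlen:
            return False  # not enough data yet to judge a trend
        rolling = sum(self.scores) / len(self.scores)
        return rolling < self.baseline - self.max_drop
```

Hashing the ID instead of calling a random number generator means the same trace is always in or out of the sample, which makes a flagged trace easy to re-score later with the same rubric.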

If your eval suite hasn't caught a regression in the last 60 days, it's not detecting failures — it's just running.


What to ship in week one.

Don't try to build all three layers before launch. Ship layer one (structured assertions) and a 50-prompt frozen test set with a simple rubric in your first two weeks. Add production sampling once real users are on the system. Expand the regression suite as failures teach you what to score.
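A frozen test set only helps if it stays frozen. A minimal sketch of one way to enforce that, assuming the cases are plain dictionaries loaded from a file and that a checksum is pinned in the repository next to them; the field names and the `expected_size` default of 50 are assumptions drawn from the numbers above, not a required layout.

```python
import hashlib
import json


def fingerprint(cases) -> str:
    """SHA-256 over a canonical JSON encoding of the cases.
    Dict key order does not matter; the order of the cases does."""
    canonical = json.dumps(
        cases, sort_keys=True, ensure_ascii=False, separators=(",", ":")
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def assert_frozen(cases, pinned: str, expected_size: int = 50) -> None:
    """Raise ValueError if the test set no longer matches the pinned checksum."""
    if len(cases) != expected_size:
        raise ValueError(f"expected {expected_size} cases, found {len(cases)}")
    actual = fingerprint(cases)
    if actual != pinned:
        raise ValueError(f"test set changed: {actual} != pinned {pinned}")
```

Growing the suite then becomes an explicit act: add cases, update the pinned checksum in the same commit, and record a fresh baseline so old and new scores are never compared across different test sets.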

Evals aren't a destination — they're the discipline that lets you keep changing the product without breaking it.

