Topic·02 essays

Evals.

How we design eval suites that catch real regressions before users do — unit evals on tool calls, frozen LLM-as-judge regression sets, and continuous production sampling.

Engineering·10 min

How to write your first AI eval suite without a framework.

You don't need LangSmith, Braintrust, or any platform to ship your first eval suite. Most production-grade evals start as 100 prompts in a JSON file and a script. Here's the playbook.
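As a rough illustration of the "prompts in a JSON file and a script" idea, here is a minimal sketch. The case format (`id`, `prompt`, `must_contain`), the substring check, and the `fake_model` stand-in are all assumptions made for this example, not the essay's actual playbook; the cases are inlined as a string so the snippet runs on its own, where a real suite would read them from a file.

```python
import json
from typing import Callable

# Hypothetical case format: a prompt plus a substring the reply must contain.
# In practice this would live in a JSON file with many more entries.
CASES_JSON = """
[
  {"id": "refund-1", "prompt": "What is the refund window?", "must_contain": "30 days"},
  {"id": "greet-1", "prompt": "Say hello.", "must_contain": "hello"}
]
"""


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call (assumption); swap in your own client."""
    canned = {
        "What is the refund window?": "Refunds are accepted within 30 days.",
        "Say hello.": "Hi there!",
    }
    return canned.get(prompt, "")


def run_suite(cases: list[dict], model: Callable[[str], str]) -> dict:
    """Run every case through the model and collect the ids of failing cases."""
    failures = []
    for case in cases:
        reply = model(case["prompt"])
        # Case-insensitive substring match: crude, but easy to audit by hand.
        if case["must_contain"].lower() not in reply.lower():
            failures.append(case["id"])
    return {
        "total": len(cases),
        "passed": len(cases) - len(failures),
        "failures": failures,
    }


if __name__ == "__main__":
    report = run_suite(json.loads(CASES_JSON), fake_model)
    print(f"{report['passed']}/{report['total']} passed; failures: {report['failures']}")
```

A substring check is only one possible grader; the point of the sketch is that the cases are plain data and the runner is a short loop, so either can be replaced without a platform.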

Engineering·12 min

Evals that actually catch regressions before users do.

The eval suite most teams ship with is a confidence-builder, not a regression detector. Here's the structure we use to catch real failures earlier.

Keep reading.

Agents & patterns

Guardrails

Production engineering

GEO + AEO