Topic · 02 essays
Evals.
How we design eval suites that catch real regressions before users do — unit evals on tool calls, frozen LLM-as-judge regression sets, and continuous production sampling.
Engineering · 10 min
How to write your first AI eval suite without a framework.
You don't need LangSmith, Braintrust, or any platform to ship your first eval suite. Most production-grade evals start as 100 prompts in a JSON file and a script. Here's the playbook.
Engineering · 12 min
Evals that actually catch regressions before users do.
The eval suite most teams ship with is a confidence-builder, not a regression detector. Here's the structure we use to catch real failures earlier.