Evals Are the New Unit Tests

"It seems better" is not a release criterion. Evals turn vibes into a green checkmark — and catch the regression your demo missed.

📊
Part two, A.I. side. We opened the agent up. Now: how do you know a change made it better and not just different?

Software has a beautiful invention: the failing test that turns red before you ship the bug. AI development threw that away and replaced it with "looks good to me." Evals bring it back.

If you can't measure it, you can't ship it — you can only hope.

Vibes don't scale

The prompt tweak that fixed Tuesday's bug quietly broke Monday's feature. Without a suite, you'll find out from a user. With one, you find out in CI.
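
Here is a minimal sketch of how a suite surfaces that trade, assuming each eval run is stored as a mapping from case ID to pass/fail. The case names and the `diff_runs` helper are hypothetical, not taken from any particular framework.

```python
def diff_runs(before: dict[str, bool], after: dict[str, bool]) -> dict[str, list[str]]:
    """Compare two eval runs case by case.

    Each run maps a case ID to True (passed) or False (failed).
    Only cases present in both runs are compared.
    """
    shared = sorted(before.keys() & after.keys())
    return {
        "fixed": [c for c in shared if not before[c] and after[c]],
        "regressed": [c for c in shared if before[c] and not after[c]],
    }


# Hypothetical results before and after a prompt tweak.
before = {"monday_feature": True, "tuesday_bug": False, "refund_policy": True}
after = {"monday_feature": False, "tuesday_bug": True, "refund_policy": True}

report = diff_runs(before, after)
print(report)  # {'fixed': ['tuesday_bug'], 'regressed': ['monday_feature']}
```

Both runs pass two of three cases, so the aggregate score alone would hide the swap. The per-case diff is what catches it.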

An eval is just a unit test that tolerates being 92% right instead of 100%.
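
One way to picture that, as a rough sketch: score a golden set by exact match and compare the pass rate to a threshold instead of demanding every case pass. The cases, the `fake_model` stand-in, and the helper names are all assumptions for illustration. A real suite would call the actual model.

```python
from typing import Callable

# Hypothetical golden set: (input, expected output) pairs.
GOLDEN = [
    ("2 + 2", "4"),
    ("capital of France", "Paris"),
    ("opposite of hot", "cold"),
    ("3 * 3", "9"),
]


def fake_model(prompt: str) -> str:
    """Stand-in for a real model call; it gets one of the four cases wrong."""
    canned = {
        "2 + 2": "4",
        "capital of France": "Paris",
        "opposite of hot": "warm",
        "3 * 3": "9",
    }
    return canned[prompt]


def pass_rate(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose output exactly matches the expected answer."""
    hits = sum(1 for prompt, expected in cases if model(prompt).strip() == expected)
    return hits / len(cases)


def meets_threshold(model: Callable[[str], str], cases: list[tuple[str, str]], threshold: float) -> bool:
    """The eval passes when the pass rate reaches the threshold."""
    return pass_rate(model, cases) >= threshold


print(pass_rate(fake_model, GOLDEN))              # 0.75
print(meets_threshold(fake_model, GOLDEN, 0.92))  # False
```

A classic unit test is the special case where the threshold is 1.0.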

A starter ladder

  1. Golden set — 50–200 real inputs with known-good outputs. Start ugly; start today.
  2. Assertions — exact match where you can, schema/constraint checks where you can't.
  3. LLM-as-judge — for open-ended quality, with a rubric and spot-checks against humans.
  4. Gate the release — score must not drop. Now "it seems better" has a number.
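
Steps 2 through 4 of the ladder could be sketched as below. This is an illustration under stated assumptions, not a reference implementation: the JSON shape (`summary`, `priority`), the function names, and the `tolerance` parameter are all hypothetical. The judge itself is not called here; the sketch only compares verdicts a judge already produced against human labels.

```python
import json


def check_constraints(raw_output: str) -> bool:
    """Step 2: constraint check for outputs with no single right answer.

    Assumes the model must return a JSON object with a non-empty string
    "summary" and an integer "priority" from 1 to 5 (a made-up schema).
    """
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    summary = data.get("summary")
    priority = data.get("priority")
    return (
        isinstance(summary, str)
        and summary.strip() != ""
        and isinstance(priority, int)
        and not isinstance(priority, bool)  # bool is a subclass of int in Python
        and 1 <= priority <= 5
    )


def judge_agreement(judge_labels: list[bool], human_labels: list[bool]) -> float:
    """Step 3 spot-check: share of sampled cases where the judge matches a human."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("need two non-empty label lists of equal length")
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(human_labels)


def gate_release(candidate_score: float, baseline_score: float, tolerance: float = 0.0) -> bool:
    """Step 4: allow the release only if the score has not dropped past the tolerance."""
    return candidate_score >= baseline_score - tolerance


# Hypothetical model outputs: one valid, one violating constraints, one not JSON.
outputs = [
    '{"summary": "Login fails on retry", "priority": 2}',
    '{"summary": "", "priority": 9}',
    "not json",
]
valid_count = sum(check_constraints(o) for o in outputs)
print(valid_count)                 # 1
print(gate_release(0.91, 0.92))    # False
```

If the judge and the human spot-checks disagree often, fix the rubric before trusting the gate.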
🔭
Series finale: the most visible agent of all — the coding agent, and the end of the blank file.

Read more at Meddler A.I.