Evals are NOT unit tests. They are natural-language ML training runs.

February 15, 2026

AI Evals

"The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts."

We are going in exactly the opposite direction. Evals are NOT unit tests.
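For reference, the workflow the quoted passage describes (collect examples, train a system to imitate them, score it on IID held-out examples) can be sketched in a few lines. This is a minimal illustration, not anyone's actual pipeline: the toy task (label is 1 when x > 0.5), the threshold "model", and the names `make_dataset`, `split`, `train`, and `accuracy` are all assumptions made up for the example.

```python
import random

def make_dataset(n, seed=0):
    """Hypothetical toy task: the correct label is 1 when x > 0.5."""
    rng = random.Random(seed)
    xs = [rng.random() for _ in range(n)]
    return [(x, int(x > 0.5)) for x in xs]

def split(data, held_out_fraction=0.2, seed=0):
    """Shuffle, then carve off a held-out set from the same distribution."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - held_out_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(threshold, examples):
    """Fraction of examples where the thresholded prediction matches the label."""
    correct = sum(int(x > threshold) == y for x, y in examples)
    return correct / len(examples)

def train(examples):
    """'Imitate' the training labels: pick the threshold that fits them best."""
    candidates = [x for x, _ in examples]
    return max(candidates, key=lambda t: accuracy(t, examples))

data = make_dataset(200)
train_set, held_out = split(data)
threshold = train(train_set)

# The score is an aggregate over unseen examples, not a single pass/fail assertion.
held_out_accuracy = accuracy(threshold, held_out)
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

The point of the sketch is the shape of the loop: the held-out examples never influence `train`, and the result is a rate measured over a sample rather than an exact expected value.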