Evals are NOT unit tests. They are ML training done in natural language (NL).
"The dominant approach to creating ML systems is to collect a dataset of training examples demonstrating correct behavior for a desired task, train a system to imitate these behaviors, and then test its performance on independent and identically distributed (IID) held-out examples. This has served well to make progress on narrow experts."
We are going in exactly the opposite direction. Evals are NOT unit tests.