Lesson 10 · 12 min

Evaluating prompts (the part nobody does)

You're not done when it works once. You're done when it works on a held-out test set.

The discipline most teams skip

Writing a prompt that works on one example is easy. Writing one that holds up on 50 inputs you didn't think of is the actual job.

Minimum viable prompt eval, in order:

Build a dataset of 20–50 inputs — including edge cases, adversarial examples, and the failures you've already seen in the wild.
Define a scoring function — exact match, JSON-validity, regex, LLM-as-judge, or a rubric.
Run the prompt across the dataset. Record pass/fail per case.
Iterate. Change one variable at a time. Re-run.
Track regressions — fixing one case shouldn't break three others.