7 min read · curriculum · opinion

Eval discipline: the cheapest skill that gets engineers hired

Most candidates can write a working prompt. Few can prove it stays working. The 50-case eval set is the leverage point.

The interview question

A senior staff engineer asked me last month: "What's the question that separates a good GenAI candidate from a bad one in 30 seconds?"

His answer: "How did you know your prompt was good?"

Bad answers (rejected):

  • "I tried it on a few examples and it worked."
  • "GPT-5 is great, so I trusted the output."
  • "We had a meeting and decided to ship."

Good answers (next round):

  • "I built a 50-case evaluation set with edge cases and known production failures, scored on JSON-validity, ran a regression diff against the prior version, and tracked failures per category."
  • "I used LLM-as-judge against a rubric for the open-ended cases, with a human-validated subset of 20 cases to keep the judge honest."

The gap between the two is eval discipline. It's the cheapest, most learnable skill in the AI engineering stack. And almost nobody teaches it.

The minimum viable eval

You don't need a vendor. You don't need a benchmark suite. You need:

  1. A dataset of 20–50 inputs. Real production cases when you have them, synthetic + adversarial when you don't.
  2. A scoring function. Exact-match, JSON-validity, regex match, LLM-as-judge with a rubric, or a human-graded subset. Pick what fits your task.
  3. A runner. 50 lines of Python. Loop over cases, score, log failures (sketch below).
  4. A diff. Compare today's failures to yesterday's. Did fixing case #12 break case #34?

That's it. Most teams that ship reliable GenAI in production are doing exactly this — no more, no less.
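
Here is a minimal sketch of that runner for a JSON-output task, scoring on JSON-validity plus field match. Everything in it is hypothetical: the stubbed `run_prompt`, the case format, and the `intent` field all stand in for your own task.

```python
import json

def run_prompt(input_text: str) -> str:
    """Stub for your actual model call; returns a canned response here."""
    return '{"intent": "refund"}'

def score(output: str, expected: dict) -> bool:
    """A case passes if the output parses as JSON and matches every expected field."""
    try:
        parsed = json.loads(output)
    except json.JSONDecodeError:
        return False  # fails the JSON-validity check outright
    return all(parsed.get(key) == value for key, value in expected.items())

# In practice, keep cases in a cases.json file; inlined here to stay runnable.
cases = [
    {"id": 1, "input": "I want my money back", "expected": {"intent": "refund"},
     "category": "refunds"},
    {"id": 2, "input": "Where is my order?", "expected": {"intent": "shipping"},
     "category": "status-queries"},
]

results = {}
for case in cases:
    passed = score(run_prompt(case["input"]), case["expected"])
    results[str(case["id"])] = passed
    if not passed:
        print(f"FAIL case {case['id']} ({case['category']})")

print(f"{sum(results.values())}/{len(results)} passed")

# Persist per-case results so tomorrow's run has something to diff against.
with open("results_today.json", "w") as f:
    json.dump(results, f)
```

Run it after every prompt change. The saved results file is what makes item 4, the diff, possible.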

What an eval catches that humans miss

Three failure modes that only show up in systematic eval:

1. Silent regressions

You change a prompt to fix one bad case. It now fails three different cases that used to work. Aggregate accuracy is unchanged. Without case-level diffing, you ship the regression.
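
Catching this takes a set comparison, not a framework. A sketch, assuming each run saves its per-case pass/fail results the way the runner sketch above does:

```python
import json

# Each run of the eval saves {"case_id": true/false}; diff two of those files.
with open("results_yesterday.json") as f:
    before = json.load(f)
with open("results_today.json") as f:
    after = json.load(f)

newly_broken = sorted(c for c in after if before.get(c) and not after[c])
newly_fixed = sorted(c for c in after if after[c] and not before.get(c, True))

print("newly broken:", newly_broken or "none")  # the silent regressions
print("newly fixed:", newly_fixed or "none")
```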

2. Distribution drift

The model is the same. Your inputs aren't. The cases that worked in March break in October because a customer started sending PDFs instead of Markdown. Without daily probe-set runs, you find out from a support ticket.

3. Edge-case cliffs

The model scores 95% on your eval set: five failures out of a hundred cases. Three of those failures are the same kind of input, a quirk you didn't notice. If that category has five cases in the set, it's running at 40% accuracy while the headline number reads 95%. Without per-category metrics, you ship a feature that's broken for that 5% of users.
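
Per-category numbers fall out of the same results file once every case carries a category label. A sketch, assuming the cases live in a cases.json file in the hypothetical format used by the runner above:

```python
import json
from collections import defaultdict

with open("cases.json") as f:
    cases = json.load(f)
with open("results_today.json") as f:
    results = json.load(f)  # {"case_id": true/false}, as saved by the runner

passed = defaultdict(int)
total = defaultdict(int)
for case in cases:
    cat = case.get("category", "uncategorized")
    total[cat] += 1
    passed[cat] += results[str(case["id"])]  # JSON keys are strings; True adds 1

for cat in sorted(total):
    print(f"{cat}: {passed[cat]}/{total[cat]} ({passed[cat] / total[cat]:.0%})")
```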

What we built into the curriculum

Every path has at least one eval lesson:

  • Prompt Engineering, lesson 10 — eval set design, exact-match vs LLM-as-judge, regression discipline.
  • RAG, lesson 8 — precision@k, recall@k, and faithfulness, with runnable JS that computes them.
  • Fine-tuning, lesson 6 — held-out validation, side-by-side eval, capability regression checks.
  • Deployment & MLOps, lesson 7 — daily probe-set runs, monitoring drift over time.

Each one ends with a code-run beat where you compute metrics by hand on a tiny dataset. Reading is forgetting; computing your own precision@5 is remembering.
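
For a taste of that beat: the RAG lesson's version is in JS, but the same arithmetic in Python fits in a dozen lines (the doc IDs and relevance labels here are made up):

```python
# Top-5 documents the retriever returned, and the ground-truth relevant set.
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d8"}

hits = [d for d in retrieved if d in relevant]  # ["d3", "d1"]

precision_at_5 = len(hits) / len(retrieved)  # 2/5 = 0.40
recall_at_5 = len(hits) / len(relevant)      # 2/3 ≈ 0.67

print(f"precision@5 = {precision_at_5:.2f}  recall@5 = {recall_at_5:.2f}")
```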

How to start tomorrow

Pick a prompt or a small AI feature you ship. Today, before lunch, do the following:

  1. Write down 10 inputs. Include 2–3 you've seen fail.
  2. For each, write down what the correct output looks like.
  3. Write a Python script that runs the prompt on all 10, compares output to expected, prints pass/fail. The runner sketch above is a starting point.
  4. Run it. You'll fix at least one bug in your prompt.
  5. Tomorrow, before changing anything, run it again.

You now have an eval. Add cases as you find new failures. In two weeks you'll have 30 cases and a real signal of whether your changes are improvements or regressions.

This is the move that turns "I tried it and it worked" into "I built a 50-case evaluation set..." in your next interview.

The first lesson on building one is at [/app/lesson/pe-10](https://nextgenailearn.com/app/lesson/pe-10). It takes 12 minutes.

Try it.

No signup needed.