Lesson 3 · 11 min
Eval design — the most discriminating round
"How would you know your prompt was good?" is the question that filters out the most candidates. The full answer, in 3 minutes.
The question, the answer
Interviewer: "How would you know your prompt was good?"
Weak answer: "I'd test it on a few examples and check the output looks right."
Strong answer (the full version):
"I'd build a 30-50 case eval set mixing real production traces with synthetic edge cases and known historical failures. The scoring depends on the task — for extraction, I'd use exact-match on JSON-validity plus per-field accuracy; for open-ended generation, an LLM-as-judge against a written rubric, validated against a 20-case human-graded subset to keep the judge honest. I'd run case-level diffs on every prompt change so I catch silent regressions where aggregate accuracy stays flat but three cases got worse and three got better. Per-segment metrics if my data has axes that matter (language, expertise level). And a CI gate so the eval runs on every PR that touches the prompt."
That's the full answer. Memorize the structure; you don't need to deliver it word-for-word.
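If the interviewer digs into mechanics, it helps to have seen the moving parts in code. Below is a minimal sketch of three pieces from the answer above — extraction scoring (JSON validity plus per-field exact match), the case-level diff, and the CI gate — assuming a JSON-extraction task. The `run_prompt` stub, the field list, and the file names are placeholders you'd swap for your own, not a real library API.

```python
"""Minimal eval-harness sketch for a JSON-extraction prompt (illustrative only)."""
import json
import sys


def run_prompt(case_input: str) -> str:
    """Placeholder for the model call with the prompt under test."""
    raise NotImplementedError("wire this to your model and prompt")


def score_case(raw_output: str, expected: dict, fields: list[str]) -> dict:
    """JSON validity plus per-field exact match."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return {"valid_json": False, "field_accuracy": 0.0}
    hits = sum(1 for f in fields if parsed.get(f) == expected.get(f))
    return {"valid_json": True, "field_accuracy": hits / len(fields)}


def run_eval(cases: list[dict], fields: list[str]) -> dict:
    """Return per-case scores keyed by case id."""
    return {
        case["id"]: score_case(run_prompt(case["input"]), case["expected"], fields)
        for case in cases
    }


if __name__ == "__main__":
    fields = ["name", "date", "amount"]                   # assumed schema
    cases = json.load(open("eval_cases.json"))            # 30-50 mixed cases
    baseline = json.load(open("baseline_scores.json"))    # scores from the last accepted prompt

    current = run_eval(cases, fields)

    # Case-level diff: aggregate accuracy can stay flat while individual
    # cases swap between pass and fail, so compare case by case.
    regressed = [
        cid for cid, s in current.items()
        if s["field_accuracy"] < baseline.get(cid, {}).get("field_accuracy", 0.0)
    ]
    improved = [
        cid for cid, s in current.items()
        if s["field_accuracy"] > baseline.get(cid, {}).get("field_accuracy", 0.0)
    ]
    print(f"{len(improved)} cases improved, {len(regressed)} regressed: {regressed}")

    # CI gate: fail the build on any regression so silent drift can't merge.
    sys.exit(1 if regressed else 0)
```

In an interview you wouldn't write this out, but being able to describe each function — score, diff against a baseline, exit nonzero — is what makes the spoken answer sound like something you've actually shipped.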