Lesson 3 · 12 min

LLM-as-judge — when and how

Deterministic evals can't grade open-ended responses. LLM-as-judge fills that gap, but it introduces bias, cost, and hallucinated scores. This lesson covers how to use it correctly, how to calibrate it, and when to distrust its scores.

Why you need LLM-as-judge

For binary classification or structured output, a string-match eval is enough. But for open-ended generations — summaries, explanations, code reviews, creative copy — you can't write a deterministic rule that captures 'is this good?'.
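For the structured case, a deterministic check really is this simple. A minimal sketch (function names are illustrative, not from any particular eval framework):

```python
def exact_match(expected: str, actual: str) -> bool:
    """Deterministic eval: grade a structured output by exact string match.

    Normalizes whitespace and case so trivial formatting
    differences don't count as failures.
    """
    return expected.strip().lower() == actual.strip().lower()


# Works well for binary classification labels...
print(exact_match("positive", " Positive "))   # matches after normalization
# ...but has no way to grade an open-ended response.
print(exact_match("A concise summary.", "A brief, accurate summary."))
```

The second call illustrates the limitation: two equally good summaries fail the check, which is exactly why open-ended outputs need a different grading mechanism.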

LLM-as-judge delegates that judgment to a model. It's expensive and imperfect, but for complex outputs it's the only scalable option.
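A minimal sketch of that delegation, assuming a judge prompted to return a structured JSON verdict. The `complete` callable stands in for whatever LLM client you use (it is a hypothetical injection point, not a real library API); validating the parsed score guards against out-of-range or hallucinated values:

```python
import json

# Hypothetical rubric prompt; in practice you would tune this per task.
JUDGE_PROMPT = (
    "You are grading a model's answer to a question.\n"
    "Score it 1 (unusable) to 5 (excellent) for accuracy and clarity.\n"
    'Respond with JSON only: {"score": <1-5>, "reason": "<one sentence>"}'
)


def judge(question: str, answer: str, complete) -> tuple[int, str]:
    """LLM-as-judge: ask a model to grade an open-ended answer.

    `complete` is any callable mapping a prompt string to the model's
    raw text response (inject your own LLM client here).
    """
    prompt = f"{JUDGE_PROMPT}\n\nQuestion: {question}\nAnswer: {answer}"
    raw = complete(prompt)
    parsed = json.loads(raw)          # raises if the judge didn't return JSON
    score = int(parsed["score"])
    if not 1 <= score <= 5:           # reject out-of-range (hallucinated) scores
        raise ValueError(f"judge returned invalid score: {score}")
    return score, parsed["reason"]


# Usage with a stub in place of a real model call:
stub = lambda prompt: '{"score": 4, "reason": "Accurate but slightly verbose."}'
score, reason = judge("What is an eval?", "A test of model output quality.", stub)
print(score, reason)
```

Keeping the model call behind a plain callable also makes the grading logic testable without network access, which matters once you start calibrating the judge against human labels.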