Lesson 7 · 11 min

Debugging exercises — broken prompts, broken evals, broken pipelines

"Here's a prompt and an eval set. The eval shows 60% accuracy. Find the bug." The triage approach that works for any AI debugging brief.

The triage order

When given a broken prompt + eval, work in this order:

Read 3 failing cases by hand. Don't aggregate; don't theorize. Just read. The bug is usually visible in 2 minutes.
Bucket the failures. Are they all the same kind? Different kinds? One bucket usually has 60-80% of the failures.
Diagnose the dominant bucket. Is it: schema violation? Missing-field handling? Prompt-injection success? Format drift? Length cap?
One targeted fix. Change one thing, re-run, re-bucket.
Per-case diff. Did the fix make any previously-passing cases now fail? If yes, balance the trade-off explicitly.

This is the pattern that scales from 60% → 90% accuracy reliably. The anti-pattern is rewriting the whole prompt and hoping.