Lesson 6 · 11 min
Evaluating a fine-tuned model
Train loss going down means *something* is happening. Whether it's the right thing is a separate question.
Three layers of evaluation
1. Loss / perplexity (cheap, dumb)
Track train and validation loss. A widening gap between them signals overfitting. Necessary but not sufficient: low loss doesn't guarantee good outputs.
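A minimal sketch of the two numbers this layer gives you: perplexity is just the exponential of mean cross-entropy loss, and the overfitting signal is the val-minus-train gap tracked across checkpoints (the checkpoint losses below are made up for illustration).

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp of mean cross-entropy loss (loss in nats)."""
    return math.exp(mean_ce_loss)

def overfit_gap(train_loss: float, val_loss: float) -> float:
    """val - train; a gap that grows across checkpoints signals overfitting."""
    return val_loss - train_loss

# Hypothetical (train_loss, val_loss) per checkpoint.
checkpoints = [(2.1, 2.2), (1.6, 1.8), (1.2, 1.7), (0.9, 1.8)]
gaps = [overfit_gap(t, v) for t, v in checkpoints]
# Train loss falls monotonically while the gap widens after checkpoint 2:
# the model is memorizing the training set, not getting better.
```

Note that train loss alone looks healthy in this run; only the gap exposes the problem.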
2. Task-specific automated metrics
For classification: accuracy and F1. For extraction: exact match on JSON keys. For format adherence: parse rate. For style: regex checks or an LLM-as-judge scored against a rubric.
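Two of these metrics are cheap to hand-roll. A sketch of parse rate (what fraction of outputs are valid JSON) and per-example exact match on gold JSON keys; the sample outputs and gold record are invented for illustration:

```python
import json

def parse_rate(outputs: list[str]) -> float:
    """Fraction of outputs that parse as valid JSON (format adherence)."""
    ok = 0
    for text in outputs:
        try:
            json.loads(text)
            ok += 1
        except json.JSONDecodeError:
            pass
    return ok / len(outputs)

def key_exact_match(output: str, gold: dict) -> float:
    """Fraction of gold keys whose value the model reproduced exactly."""
    try:
        pred = json.loads(output)
    except json.JSONDecodeError:
        return 0.0  # unparseable output scores zero
    hits = sum(1 for k, v in gold.items() if pred.get(k) == v)
    return hits / len(gold)

outputs = ['{"name": "Ada", "year": 1843}', 'not json']
gold = {"name": "Ada", "year": 1843}
# parse_rate(outputs) → 0.5
# key_exact_match(outputs[0], gold) → 1.0
```

Averaging `key_exact_match` over an eval set gives one number you can track between checkpoints.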
3. Side-by-side qualitative review
Generate from both the base model and the fine-tuned model on the same prompts. Read 50 outputs by hand. No automated metric replaces eyeballs for catching subtle regressions in tone, helpfulness, or refusal patterns.
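A sketch of the review harness, assuming `base_generate` and `ft_generate` are hypothetical prompt-to-completion callables for the two models. Shuffling the pair blinds the reviewer to which model produced which output, which keeps the comparison honest:

```python
import random

def sample_for_review(prompts, base_generate, ft_generate, n=50, seed=0):
    """Build blinded side-by-side pairs for manual review.

    base_generate / ft_generate: hypothetical callables, prompt -> completion.
    Returns rows with the two completions in randomized a/b order,
    labels kept so you can unblind after scoring.
    """
    rng = random.Random(seed)
    picked = rng.sample(prompts, min(n, len(prompts)))
    rows = []
    for p in picked:
        pair = [("base", base_generate(p)), ("ft", ft_generate(p))]
        rng.shuffle(pair)  # hide which model is which during review
        rows.append({"prompt": p, "a": pair[0], "b": pair[1]})
    return rows
```

Score each row a/b first, then unblind via the labels to count wins, losses, and ties per model.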