Lesson 7 · 9 min

Evaluation — precision, recall, faithfulness

Three metrics that tell you whether your retrieval pipeline is working, plus the labelling effort it realistically takes.

The three metrics

  • Precision@k. Of the k chunks retrieved, what fraction are actually relevant? The metric for hallucination prevention: irrelevant chunks in the context confuse the LLM. (See the sketch after this list.)
  • Recall@k. Of all the relevant chunks in the corpus, what fraction did we retrieve in the top k? The metric for coverage: missing chunks mean wrong answers.
  • Faithfulness. Does the generated answer make only claims supported by the retrieved chunks? Computed by LLM-as-judge against those chunks; it measures whether the model is grounding its answer properly. (A judge sketch follows below.)
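A minimal sketch of the first two metrics, assuming you already have per-query labels. The names are illustrative, not from the lesson: `relevant_ids` is the hand-labelled set of relevant chunk IDs for a query, and `retrieved_ids` is the ranked list your retriever returned.

```python
def precision_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of the top-k retrieved chunks that are labelled relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    hits = sum(1 for chunk_id in top_k if chunk_id in relevant_ids)
    return hits / len(top_k)


def recall_at_k(retrieved_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of all labelled-relevant chunks that appear in the top-k."""
    if not relevant_ids:
        return 0.0
    top_k = set(retrieved_ids[:k])
    hits = len(top_k & relevant_ids)
    return hits / len(relevant_ids)


# Example: 2 of the top-5 retrieved chunks are relevant, out of 3 labelled
# relevant chunks in total for this query.
retrieved = ["c12", "c7", "c3", "c99", "c41"]
relevant = {"c7", "c3", "c55"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.4
print(recall_at_k(retrieved, relevant, k=5))     # 0.666...
```

Report the average of each metric across the whole eval set, not per-query numbers in isolation.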
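Faithfulness is scored by prompting a judge model with the retrieved chunks and the generated answer. A rough sketch, with assumptions: `call_llm` is a hypothetical stand-in for whatever chat-completion client you use, and the prompt wording is illustrative rather than the lesson's.

```python
FAITHFULNESS_PROMPT = """\
You are grading a RAG answer for faithfulness.

Context chunks:
{chunks}

Answer:
{answer}

Does the answer make any claim that is not supported by the context chunks?
Reply with exactly one word: "faithful" or "unfaithful".
"""


def faithfulness(answer: str, retrieved_chunks: list[str], call_llm) -> bool:
    """Return True if the judge finds every claim grounded in the chunks.

    `call_llm` is assumed to take a prompt string and return the model's text.
    """
    prompt = FAITHFULNESS_PROMPT.format(
        chunks="\n---\n".join(retrieved_chunks),
        answer=answer,
    )
    verdict = call_llm(prompt).strip().lower()
    return verdict.startswith("faithful")
```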

A typical RAG eval set is 30-50 queries, each with 3-5 hand-labelled relevant chunks. Yes, that's hours of labelling. It's worth it: everything else flows from that labelled set.
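One way to store that eval set is a JSONL file with one record per query: the query text plus the hand-labelled relevant chunk IDs. The field names and example values below are assumptions for illustration, not a prescribed schema.

```python
import json

# Hypothetical eval-set record; field names and IDs are illustrative.
record = {
    "query": "What is our refund policy for annual plans?",
    "relevant_chunk_ids": ["policies_v3:chunk_12", "policies_v3:chunk_13", "faq:chunk_4"],
}

# Append one record per labelled query to the eval-set file.
with open("rag_eval_set.jsonl", "a") as f:
    f.write(json.dumps(record) + "\n")
```

Each record's labelled IDs feed precision@k and recall@k above, and the averages across all 30-50 queries become the pipeline's scorecard.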