Lesson 8 · 12 min
Evaluating a RAG pipeline
Measure retrieval and generation separately: a single end-to-end score can't tell you which stage is failing.
Two evals, not one
Retrieval eval
- Recall@k: of the chunks that contain the answer, how many are in the top-k? The most important metric: if the right chunk is never retrieved, the generator cannot recover.
- Precision@k: of the top-k, how many are actually relevant?
- MRR (mean reciprocal rank): 1/rank of the first relevant chunk, averaged over all queries.
You need a labeled set of (query, relevant chunk IDs) pairs; 50 queries is a reasonable starting point.
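A minimal sketch of all three metrics, assuming a `retrieve(query, k)` function that returns a ranked list of chunk IDs and a labeled set in the format above (both names are placeholders for whatever your pipeline uses):

```python
def evaluate_retrieval(labeled_set, retrieve, k=5):
    """labeled_set: list of (query, set_of_relevant_chunk_ids) pairs."""
    recalls, precisions, rrs = [], [], []
    for query, relevant in labeled_set:
        top_k = retrieve(query, k)  # ranked list of chunk IDs
        hits = [cid for cid in top_k if cid in relevant]
        recalls.append(len(set(hits)) / len(relevant))  # Recall@k
        precisions.append(len(hits) / k)                # Precision@k
        # Reciprocal rank: 1/position of the first relevant chunk, 0 if none
        rr = 0.0
        for rank, cid in enumerate(top_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    n = len(labeled_set)
    return {
        "recall@k": sum(recalls) / n,
        "precision@k": sum(precisions) / n,
        "mrr": sum(rrs) / n,
    }
```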
Generation eval
- Faithfulness: does the answer use only the provided context? (LLM-as-judge: check each claim against the retrieved chunks; see the sketch after this list.)
- Answer relevance: does it actually answer the question?
- Groundedness: are cited claims traceable to specific chunks?
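A minimal sketch of the LLM-as-judge faithfulness check, assuming a `complete(prompt)` helper that calls whatever model you use as the judge (the helper and the prompt wording are illustrative, not a fixed recipe):

```python
def judge_faithfulness(answer: str, chunks: list[str], complete) -> float:
    """LLM-as-judge: split the answer into claims, then verify each claim
    against the retrieved chunks. Returns the fraction of supported claims."""
    raw = complete(
        "Split the following answer into independent factual claims, "
        f"one per line:\n\n{answer}"
    )
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    context = "\n\n".join(chunks)
    supported = 0
    for claim in claims:
        verdict = complete(
            f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Is this claim fully supported by the context? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(claims) if claims else 0.0
```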
Libraries like RAGAS and TruLens automate these generation metrics.
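For example, a RAGAS run might look like the sketch below (API as of ragas 0.1.x; it expects an OpenAI key by default, and the interface has changed across versions, so check the current docs):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per test query: the question, your pipeline's answer,
# and the retrieved chunks that were shown to the generator.
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # dict-like mapping of metric name -> score
```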