Lesson 8 · 12 min

Evaluating a RAG pipeline

Measure retrieval and generation separately. A single end-to-end score hides which stage is actually failing.

Two evals, not one

Retrieval eval

  • Recall@k: of the chunks that contain the answer, how many are in the top-k? Most important metric.
  • Precision@k: of the top-k, how many are actually relevant?
  • MRR (mean reciprocal rank): 1/rank of the first relevant chunk, averaged.

You need a labeled set of (query, relevant chunk IDs) pairs; around 50 queries is a workable starting point, and the metrics get more reliable as the set grows.
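
Once you have that labeled set, all three metrics are a few lines of code. Here is a minimal sketch; the chunk IDs and example queries are made up purely for illustration.

```python
# Minimal retrieval-eval sketch. For each query you need:
#   - the chunk IDs your retriever returned, in rank order
#   - the set of chunk IDs labeled as relevant for that query

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the relevant chunks that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)

def precision_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k == 0:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / k

def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1/rank of the first relevant chunk; 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# eval_set: (retrieved chunk IDs in rank order, labeled relevant IDs) per query
eval_set = [
    (["c12", "c07", "c31", "c02", "c19"], {"c07", "c19"}),
    (["c44", "c03", "c55", "c21", "c09"], {"c88"}),
]

k = 5
n = len(eval_set)
print("recall@5:   ", sum(recall_at_k(r, rel, k) for r, rel in eval_set) / n)
print("precision@5:", sum(precision_at_k(r, rel, k) for r, rel in eval_set) / n)
print("MRR:        ", sum(reciprocal_rank(r, rel) for r, rel in eval_set) / n)
```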

Generation eval

  • Faithfulness: does the answer use only the provided context? (LLM-as-judge: check claims against chunks.)
  • Answer relevance: does it actually answer the question?
  • Groundedness: are cited claims traceable to specific chunks?
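
One way to make the faithfulness check concrete is an LLM-as-judge pass over the answer's claims. The prompt wording and the `call_llm` function below are placeholders for whatever judge model and client you use, not a fixed recipe.

```python
# Sketch of LLM-as-judge faithfulness scoring. `call_llm` is a stand-in for
# your own LLM client; the prompt and output format are illustrative only.

FAITHFULNESS_PROMPT = """\
You are grading a RAG answer. Context chunks:
{context}

Answer:
{answer}

List each factual claim in the answer and mark it SUPPORTED if it is stated
in the context chunks, otherwise UNSUPPORTED. End with a line exactly like:
score: <supported_claims>/<total_claims>
"""

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def faithfulness_score(answer: str, chunks: list[str]) -> float:
    """Fraction of the answer's claims the judge marks as supported by the context."""
    verdict = call_llm(
        FAITHFULNESS_PROMPT.format(context="\n\n".join(chunks), answer=answer)
    )
    supported, total = verdict.rsplit("score:", 1)[1].strip().split("/")
    return int(supported) / int(total)
```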

Libraries like RAGAS and TruLens automate this.
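
As a rough sketch, a RAGAS run over a small eval set looks something like the following. The import paths and column names here follow the pre-1.0 RAGAS API and may differ in your installed version, so treat this as an outline rather than copy-paste code.

```python
# Rough shape of a RAGAS evaluation (pre-1.0 API; check your installed version).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per (question, generated answer, retrieved context chunks).
ds = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer":   ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

result = evaluate(ds, metrics=[faithfulness, answer_relevancy])
print(result)  # per-metric averages across the eval set
```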