Lesson 8 · 12 min
Evaluating a RAG pipeline
Measure retrieval and generation separately: a single end-to-end score can't tell you which stage is failing.
Two evals, not one
Retrieval eval
- Recall@k: of the chunks that contain the answer, how many are in the top-k? The most important metric: if the right chunk is never retrieved, the generator cannot recover.
- Precision@k: of the top-k, how many are actually relevant?
- MRR (mean reciprocal rank): 1/rank of the first relevant chunk, averaged over all queries.
You need a labeled set of (query, relevant chunk IDs) pairs; 50 queries is a reasonable starting point.
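A minimal sketch of all three metrics, assuming a `retrieve(query, k)` function that returns a ranked list of chunk IDs and a labeled set in the format above (both names are placeholders for whatever your pipeline uses):

```python
def evaluate_retrieval(labeled_set, retrieve, k=5):
    """labeled_set: list of (query, set_of_relevant_chunk_ids) pairs."""
    recalls, precisions, rrs = [], [], []
    for query, relevant in labeled_set:
        top_k = retrieve(query, k)  # ranked list of chunk IDs
        hits = [cid for cid in top_k if cid in relevant]
        recalls.append(len(set(hits)) / len(relevant))  # Recall@k
        precisions.append(len(hits) / k)                # Precision@k
        # Reciprocal rank: 1/position of the first relevant chunk, 0 if none
        rr = 0.0
        for rank, cid in enumerate(top_k, start=1):
            if cid in relevant:
                rr = 1.0 / rank
                break
        rrs.append(rr)
    n = len(labeled_set)
    return {
        "recall@k": sum(recalls) / n,
        "precision@k": sum(precisions) / n,
        "mrr": sum(rrs) / n,
    }
```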
Generation eval
- Faithfulness: does the answer use only the provided context? (LLM-as-judge: check each claim against the retrieved chunks; see the sketch after this list.)
- Answer relevance: does it actually answer the question?
- Groundedness: are cited claims traceable to specific chunks?
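A minimal sketch of the LLM-as-judge faithfulness check, assuming a `complete(prompt)` helper that calls whatever model you use as the judge (the helper and the prompt wording are illustrative, not a fixed recipe):

```python
def judge_faithfulness(answer: str, chunks: list[str], complete) -> float:
    """LLM-as-judge: split the answer into claims, then verify each claim
    against the retrieved chunks. Returns the fraction of supported claims."""
    raw = complete(
        "Split the following answer into independent factual claims, "
        f"one per line:\n\n{answer}"
    )
    claims = [c.strip() for c in raw.splitlines() if c.strip()]
    context = "\n\n".join(chunks)
    supported = 0
    for claim in claims:
        verdict = complete(
            f"Context:\n{context}\n\nClaim: {claim}\n\n"
            "Is this claim fully supported by the context? Answer YES or NO."
        )
        if verdict.strip().upper().startswith("YES"):
            supported += 1
    return supported / len(claims) if claims else 0.0
```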
Libraries like RAGAS and TruLens automate these generation metrics.
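For example, a RAGAS run might look like the sketch below (API as of ragas 0.1.x; it expects an OpenAI key by default, and the interface has changed across versions, so check the current docs):

```python
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

# One row per test query: the question, your pipeline's answer,
# and the retrieved chunks that were shown to the generator.
eval_data = Dataset.from_dict({
    "question": ["What is the refund window?"],
    "answer": ["Refunds are accepted within 30 days of purchase."],
    "contexts": [["Our policy allows refunds within 30 days of purchase."]],
})

result = evaluate(eval_data, metrics=[faithfulness, answer_relevancy])
print(result)  # dict-like mapping of metric name -> score
```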