Evaluation and Grounding

Shipping RAG without metrics is shipping blind. You need separate signals for retrieval and generation, plus end-to-end checks.

Retrieval metrics

On a labeled set of (question, relevant chunk IDs):

  • Recall@k — what fraction of the labeled relevant chunks appear in the top-k results? (The binary "did any relevant chunk appear" variant is hit rate.)
  • MRR (mean reciprocal rank) — how high was the first relevant chunk ranked, averaged across queries?

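Both metrics are a few lines of code over your labeled set. A minimal sketch, where `relevant` is the labeled set of relevant chunk IDs for a question and `retrieved` is the ranked list your retriever returned (both names are illustrative):

```python
def recall_at_k(relevant: set, retrieved: list, k: int) -> float:
    """Fraction of the labeled relevant chunks found in the top-k results."""
    hits = len(relevant & set(retrieved[:k]))
    return hits / len(relevant) if relevant else 0.0

def reciprocal_rank(relevant: set, retrieved: list) -> float:
    """1/rank of the first relevant chunk; 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Example: relevant chunk "c2" is ranked second, "c7" is missed entirely.
print(recall_at_k({"c2", "c7"}, ["c1", "c2", "c3"], k=3))  # 0.5
print(reciprocal_rank({"c2", "c7"}, ["c1", "c2", "c3"]))   # 0.5
```

Average `reciprocal_rank` across all queries to get MRR for the run.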
If recall is low, fix chunking, embeddings, or hybrid search before blaming the LLM.

Answer metrics

With golden answers or rubric-based scoring:

  • Exact match / token F1 for short factual QA.
  • LLM-as-judge for open-ended help — flaky but useful for regression triage when human labels are expensive.
  • Citation accuracy — does the answer only claim what sources support?

Always spot-check automated scores; models can game proxies.
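For short factual QA, exact match and token F1 are cheap to score directly. A minimal sketch; the normalization here (lowercase, whitespace split) is a simplifying assumption, and production scorers typically also strip punctuation and articles:

```python
from collections import Counter

def normalize(text: str) -> list:
    # Simplified normalization: lowercase and split on whitespace.
    return text.lower().split()

def exact_match(pred: str, gold: str) -> bool:
    return normalize(pred) == normalize(gold)

def token_f1(pred: str, gold: str) -> float:
    """Harmonic mean of token precision and recall against the gold answer."""
    p, g = Counter(normalize(pred)), Counter(normalize(gold))
    overlap = sum((p & g).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(p.values())
    recall = overlap / sum(g.values())
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))       # True
print(token_f1("Paris France", "Paris"))   # ≈ 0.667
```

Score against multiple gold answers by taking the max, as most QA benchmarks do.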

Grounding behaviors

Require the model to quote or cite retrieved passages. Reject answers that introduce facts not present in context (enforce with structured output checks when possible).
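One cheap structured check, assuming the model is prompted to quote retrieved passages verbatim in double quotes: flag any quoted span that does not appear in the concatenated context. A real checker would also verify citation IDs and tolerate minor paraphrase with fuzzy matching; this sketch is exact-match only:

```python
import re

def ungrounded_quotes(answer: str, context: str) -> list:
    """Return quoted spans in the answer that are not found in the context."""
    quotes = re.findall(r'"([^"]+)"', answer)
    return [q for q in quotes if q not in context]

context = "The API rate limit is 100 requests per minute."
good = 'The docs say "rate limit is 100 requests per minute."'
bad = 'The docs say "the limit is 500 requests".'

print(ungrounded_quotes(good, context))  # []
print(ungrounded_quotes(bad, context))   # ['the limit is 500 requests']
```

Reject or regenerate any answer where this list is non-empty.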

Regression sets

Freeze a benchmark of real user questions with expected behaviors. Run it on every retrieval or prompt change — same idea as software unit tests.
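A frozen regression set can be as simple as a list of questions with a required phrase each. A minimal sketch: `answer_fn` stands in for your full RAG pipeline, and the cases and predicate (substring check) are illustrative assumptions; real expected behaviors are usually richer:

```python
# Frozen benchmark: each case pairs a real user question with an
# expected behavior, here reduced to a must-contain phrase.
REGRESSION_SET = [
    {"question": "How do I reset my password?", "must_contain": "reset link"},
    {"question": "What is the refund window?", "must_contain": "30 days"},
]

def run_regressions(answer_fn) -> list:
    """Run every frozen case through the pipeline; return failing questions."""
    failures = []
    for case in REGRESSION_SET:
        answer = answer_fn(case["question"])
        if case["must_contain"].lower() not in answer.lower():
            failures.append(case["question"])
    return failures

# With a stub pipeline that only knows about password resets, one case fails:
stub = lambda q: "We email a reset link within minutes."
print(run_regressions(stub))  # ['What is the refund window?']
```

Wire this into CI so any retrieval or prompt change that regresses a case fails the build, exactly like a unit test.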

Key takeaways

  • Split retrieval vs generation debugging with targeted metrics.
  • Ground answers in retrieved text and test for drift as data and models change.
  • Invest in a durable eval set — it pays off faster than manual ad-hoc testing.