Retrieval-Augmented Generation (RAG)

RAG answers questions or completes tasks by retrieving relevant documents first, then generating with the model conditioned on those passages. It grounds answers in your data without baking everything into weights.

The basic pipeline

  1. Ingest documents from files, a CMS, tickets, or PDFs.
  2. Chunk text into pieces that fit the embedding and context window budgets.
  3. Embed each chunk; store vectors with metadata (source ID, title, URL, permissions).
  4. Query — embed the user question; retrieve top-k neighbors.
  5. Generate — pass the question plus retrieved chunks to the LLM with citations or strict answer boundaries.
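The five steps above can be sketched end to end. This is a minimal illustration, not a production implementation: it uses a toy bag-of-words similarity in place of a trained embedding model, and all function and field names are assumptions for the example.

```python
# Toy RAG pipeline: chunk -> embed -> retrieve top-k -> build a grounded prompt.
from collections import Counter
import math

def embed(text):
    # Stand-in "embedding": lowercase word counts. Real systems use a trained model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def ingest(docs, chunk_words=50):
    # Chunk each document and store vectors alongside source metadata.
    index = []
    for doc_id, text in docs.items():
        words = text.split()
        for i in range(0, len(words), chunk_words):
            chunk = " ".join(words[i:i + chunk_words])
            index.append({"source": doc_id, "text": chunk, "vec": embed(chunk)})
    return index

def retrieve(index, question, k=2):
    qv = embed(question)
    return sorted(index, key=lambda c: cosine(qv, c["vec"]), reverse=True)[:k]

def build_prompt(question, chunks):
    # Pass retrieved passages with source IDs so the model can cite them.
    context = "\n".join(f"[{c['source']}] {c['text']}" for c in chunks)
    return f"Answer using only these passages, citing sources:\n{context}\n\nQ: {question}"

docs = {"faq.md": "Refunds are processed within five business days.",
        "policy.md": "Shipping is free on orders over fifty dollars."}
index = ingest(docs)
top = retrieve(index, "How long do refunds take?")
prompt = build_prompt("How long do refunds take?", top)
```

The prompt that reaches the LLM contains only the retrieved chunks plus the question, which is what keeps the answer grounded in your data rather than in model weights.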

Why RAG instead of fine-tuning

  Approach      Best when
  RAG           Facts change often; you need citations; cold start is fast
  Fine-tuning   Behavior or format is stable; you have clean, sizable labeled data

Most products start with RAG plus prompt engineering; fine-tune later if the base model resists your format.

Failure modes

  • Wrong chunk granularity — too small loses context; too large dilutes relevance.
  • Stale index — outdated documents keep being retrieved until re-ingestion runs.
  • Permission leaks — if retrieval ignores ACLs, users see others' data. Filter by tenant and role at query time.
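The permission point deserves a concrete shape. A minimal sketch of query-time filtering, assuming chunks were tagged with tenant and role metadata at ingest time (the field names here are illustrative, not a standard schema):

```python
# Chunks tagged with access metadata at ingest time (illustrative fields).
index = [
    {"text": "Acme Q3 revenue draft", "tenant": "acme",   "allowed_roles": {"finance"}},
    {"text": "Acme onboarding guide", "tenant": "acme",   "allowed_roles": {"finance", "support"}},
    {"text": "Globex roadmap",        "tenant": "globex", "allowed_roles": {"pm"}},
]

def authorized_candidates(index, user):
    # Filter by tenant and role BEFORE ranking, so unauthorized chunks
    # never enter the candidate set and never reach the prompt.
    return [c for c in index
            if c["tenant"] == user["tenant"]
            and user["roles"] & c["allowed_roles"]]

user = {"tenant": "acme", "roles": {"support"}}
candidates = authorized_candidates(index, user)
# Only the onboarding guide survives: wrong role for the revenue draft,
# wrong tenant for the roadmap.
```

Filtering before similarity ranking matters: filtering afterwards can silently drop all top-k results, and filtering only in the UI still leaks data through the model's answer.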

Key takeaways

  • RAG separates memory (your index) from reasoning (the LLM).
  • Ingestion, chunking, and metadata are as important as model choice.
  • Always enforce access control in the retrieval layer, not only in the UI.