Chunking and Retrieval Quality

Bad chunking makes RAG look stupid: the right answer exists in your corpus but never surfaces in the top results.

Chunk size and overlap

There is no universal token count:

  • Smaller chunks — precise hits for keyword-like queries, but risk losing surrounding context.
  • Larger chunks — more local context, but noisier similarity scores, since one embedding averages over more loosely related text.

Overlap (repeating the last n tokens between consecutive chunks) reduces boundary cuts mid-paragraph. Typical starting points: a few hundred tokens per chunk with modest overlap, then measure.
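A sliding-window chunker with overlap can be sketched in a few lines. This assumes the text is already tokenized into a list; the chunk size and overlap values are illustrative starting points, not recommendations.

```python
def chunk_tokens(tokens, chunk_size=300, overlap=50):
    """Split a token list into fixed-size chunks, repeating the last
    `overlap` tokens of each chunk at the start of the next one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # final chunk reached the end of the input
    return chunks
```

Because consecutive chunks share `overlap` tokens, a sentence that straddles a boundary still appears whole in at least one chunk.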

Structure-aware splitting

Prefer splitting on headings, markdown sections, or HTML structure over raw character counts when content is hierarchical. Code may need different rules than prose (preserve function boundaries).
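For markdown, a minimal structure-aware splitter just breaks at heading lines and keeps each heading attached to its body. This is a sketch for ATX-style headings only; real documents may also need handling for code fences and nested lists.

```python
import re

def split_markdown_sections(text):
    """Split markdown into sections at heading lines (#, ##, ...),
    keeping each heading together with the body that follows it."""
    sections, current = [], []
    for line in text.splitlines():
        # A new heading closes the previous section (if any).
        if re.match(r"#{1,6}\s", line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [s for s in sections if s]
```

Each returned section is a semantically coherent unit, which tends to embed better than an arbitrary 300-token window cut mid-paragraph.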

Metadata for filtering

Store fields you will filter on at query time:

  • project_id, workspace_id, user_id for multi-tenant isolation.
  • doc_type, version, language.
  • updated_at for freshness boosts or decay.
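A minimal sketch of query-time metadata filtering, assuming each chunk is stored as a dict with a `meta` sub-dict (an illustrative schema, not a specific vector store's API):

```python
def filter_chunks(chunks, **required):
    """Keep only chunks whose metadata matches every required field.
    In production this filter runs inside the vector store, not in
    application code, so unauthorized chunks never leave the index."""
    return [
        c for c in chunks
        if all(c["meta"].get(k) == v for k, v in required.items())
    ]
```

Applying tenant filters before (or inside) the similarity search matters: filtering after retrieval can silently empty the top-k for small tenants.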

Hybrid search

Combine vector similarity with keyword (BM25) search. Many systems merge scores or run the two queries in parallel — vectors catch paraphrases; keywords catch exact SKUs and error codes.
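One common way to merge the two ranked lists is reciprocal rank fusion (RRF), which needs only ranks, not comparable scores. A minimal sketch; `k=60` is the constant from the original RRF paper, and the input lists stand in for BM25 and vector results:

```python
def reciprocal_rank_fusion(result_lists, k=60):
    """Merge ranked lists of doc ids by summing 1/(k + rank) per doc.
    Docs ranked highly in either list float to the top; docs present
    in both lists get an additional boost."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

RRF sidesteps the awkward problem of normalizing BM25 scores against cosine similarities before adding them.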

Re-ranking

Retrieve a larger candidate set (say 50–200), then re-rank with a cross-encoder or lightweight reranker API to improve top-k precision before generation.
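The retrieve-then-rerank pattern can be sketched as a two-stage function. Here `rerank_score(query, doc)` is a hypothetical stand-in for a cross-encoder or reranker API call, not a specific library:

```python
def retrieve_then_rerank(query, candidates, rerank_score, top_k=5, pool=50):
    """Two-stage retrieval: take a large candidate pool (assumed already
    sorted by cheap first-stage similarity), re-sort it with a slower,
    more accurate scorer, and keep only the top_k for generation."""
    pool_docs = candidates[:pool]
    reranked = sorted(
        pool_docs,
        key=lambda doc: rerank_score(query, doc),  # expensive per-pair score
        reverse=True,
    )
    return reranked[:top_k]
```

The design point: the cross-encoder is too slow to score the whole corpus, so the cheap retriever narrows to a pool it can afford, trading a little recall for much better top-k precision.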

Key takeaways

  • Chunking is a product decision — tune with evals, not intuition alone.
  • Metadata and ACL filters belong in the retrieval path.
  • Hybrid retrieval plus reranking fixes many "almost found it" issues.