3. Chunking strategies

Chunking is the highest-leverage cheap optimization in RAG. Most "our RAG sucks" customer complaints trace back to bad chunks. Reason: the chunk is the unit of retrieval. If your chunks are wrong-shaped, no embedding model, reranker, or LLM can save you.

The fundamental tension

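Two forces pull in opposite directions: smaller chunks embed more precisely but lose surrounding context, while larger chunks keep context but dilute the embedding and drag irrelevant text into the prompt.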

There is no universal right answer. Right chunk size depends on document type and query type.

Strategy 1: Fixed-size chunking (the naive baseline)
Strategy 2: Recursive character splitting
Strategy 3: Structure-aware (semantic-boundary) chunking
Strategy 4: Semantic chunking
Strategy 5: Late chunking / contextual chunking
Strategy 6: Contextual retrieval (Anthropic's variant, also adopted broadly)
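
Of the strategies listed above, Strategy 2 is compact enough to sketch in a few lines. The version below is a minimal illustration, not any library's implementation: the function name recursive_split, the separator order, and the use of a character budget (max_chars) instead of a token budget are all assumptions made for the example.

```python
def recursive_split(text, max_chars=200, separators=("\n\n", "\n", ". ", " ")):
    """Split on the coarsest separator first, recursing to finer ones
    only for pieces that are still longer than max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard fixed-size cut.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur in the text; try the next finer one.
        return recursive_split(text, max_chars, finer)

    chunks, current = [], ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate          # keep packing pieces into this chunk
            continue
        if current:
            chunks.append(current)       # flush the chunk packed so far
        if len(piece) <= max_chars:
            current = piece              # piece starts a new chunk
        else:
            # Piece alone is too long: split it with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks
```

The point of the recursion is that paragraph breaks are preferred over line breaks, line breaks over sentence breaks, and so on, so a chunk boundary lands on the most natural break that still fits the budget.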

Overlap — the parameter everyone gets wrong

Overlap means chunks share some tokens at the boundary, so an idea that straddles a chunk break appears in both chunks.

Too little overlap → boundary information lost
Too much overlap → near-duplicate chunks waste storage, and retrieval returns redundant context

Typical: 10–20% of chunk size. 50–100 tokens for a 512-token chunk. Tune empirically.
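
A sliding window makes the overlap arithmetic concrete. This is a rough sketch under stated assumptions: chunk_with_overlap is a hypothetical helper, tokens is any pre-tokenized list (the tokenizer itself is out of scope), and the default of 64 tokens is just one value inside the 10–20% range for a 512-token chunk.

```python
def chunk_with_overlap(tokens, chunk_size=512, overlap=64):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap,
    so consecutive chunks share `overlap` tokens at the boundary."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks
```

With chunk_size=512 and overlap=64 the window advances 448 tokens per step, so each chunk after the first re-indexes 64 tokens already stored in its predecessor. That ratio (12.5% here) is the storage and redundancy cost being traded against lost boundary information.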

Document-type rules of thumb

Document type	Chunk strategy	Size
Prose articles, books	Recursive split	500–800 tokens
Technical docs with headings	Structure-aware (by section)	section-bounded
API reference, schemas	Per-endpoint or per-entity	natural unit
Legal contracts	Structure-aware + larger chunks	800–1500 tokens
Tables / spreadsheets	Per-row or per-logical-group	row-bounded
Code	Per-function or per-class (AST-based)	function-bounded
Chat / conversations	Per-turn or per-thread	turn-bounded
Slack / ticket logs	Per-thread, time-windowed	thread-bounded
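
As an illustration of the "Code" row, the sketch below chunks Python source per top-level function or class using the standard-library ast module. The helper name chunk_python_source and the dict layout are assumptions for the example; other languages would need their own parser.

```python
import ast


def chunk_python_source(source):
    """Return one chunk per top-level function or class in `source`."""
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:  # top-level statements only
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text spanned by this node.
                "text": ast.get_source_segment(source, node),
            })
    return chunks
```

Module-level statements outside functions and classes (imports, constants) are dropped in this sketch; a fuller version would likely collect them into their own chunk so they stay retrievable.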

Eval metrics that matter (preview of the eval section later):

Hit@k / Recall@k — does the right chunk appear in top-k? If no, chunking or embedding is the bottleneck.
Context Precision — of retrieved chunks, what fraction were actually relevant?
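
Both metrics reduce to a few lines once each query has a labeled set of relevant chunk IDs. The sketch below assumes such labels exist; the function names and the ID-based representation are illustrative, and context_precision follows the plain fraction stated above (some evaluation frameworks use a rank-weighted variant instead).

```python
def hit_at_k(retrieved_ids, relevant_ids, k):
    """True if at least one relevant chunk appears in the top-k results."""
    relevant = set(relevant_ids)
    return any(cid in relevant for cid in retrieved_ids[:k])


def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of all relevant chunks that appear in the top-k results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved_ids[:k])) / len(relevant)


def context_precision(retrieved_ids, relevant_ids):
    """Fraction of retrieved chunks that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    return sum(cid in relevant for cid in retrieved_ids) / len(retrieved_ids)
```

For example, with retrieved chunks ["c7", "c2", "c9", "c4"] and relevant chunks {"c2", "c4", "c5"}, Hit@1 is False, Hit@2 is True, Recall@4 is 2/3, and context precision is 0.5.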