Phase 1: Indexing (offline, batch, runs when docs change)
Documents → Parse → Chunk → Embed → Store in vector DB (+ metadata)
Phase 2: Querying (online, per-request, runs on every user question)
User query → Embed query → Vector search → Retrieve top-k chunks →
Build prompt (query + chunks) → LLM → Answer (+ citations)
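
To make the two phases concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: `embed` is a toy bag-of-words counter standing in for a real embedding model, the "vector DB" is an in-memory list, the file names and helper names are hypothetical, and the LLM call is left out, so the sketch stops at the assembled prompt.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(text: str, max_words: int = 12) -> list[str]:
    """Naive chunking: fixed-size word windows, no overlap."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


# Phase 1: indexing (offline, rerun when docs change)
def build_index(docs: dict[str, str]) -> list[dict]:
    index = []
    for source, text in docs.items():
        for n, piece in enumerate(chunk(text)):
            # Store the vector next to the metadata needed for citations.
            index.append({"source": source, "chunk_id": n,
                          "text": piece, "vector": embed(piece)})
    return index


# Phase 2: querying (online, once per user question)
def retrieve(index: list[dict], query: str, k: int = 2) -> list[dict]:
    query_vector = embed(query)
    ranked = sorted(index, key=lambda row: cosine(query_vector, row["vector"]),
                    reverse=True)
    return ranked[:k]


def build_prompt(query: str, hits: list[dict]) -> str:
    context = "\n".join(f"[{h['source']}#{h['chunk_id']}] {h['text']}" for h in hits)
    return ("Answer using only the context below and cite sources.\n\n"
            f"{context}\n\nQuestion: {query}")


# Hypothetical documents, made up for this example.
docs = {
    "refunds.md": "Annual plans can be refunded within 30 days of purchase.",
    "shipping.md": "Orders ship within two business days from our warehouse.",
}
index = build_index(docs)
question = "Can I get a refund on an annual plan?"
hits = retrieve(index, question, k=1)
prompt = build_prompt(question, hits)  # this string would go to the LLM
```

In a real system the toy pieces would be replaced by an embedding model, a vector store, and an LLM client, but the shape stays the same: index once, then embed, search, and assemble a prompt on every request.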
Everything I just described is naive RAG. It works for a demo. It fails in production for predictable reasons:
- Vector search alone misses exact-keyword matches (acronyms, product codes, names)
- Top-k by similarity is not the same as top-k by usefulness
- Chunks lose context when split (the sentence "It is not refundable" is useless without knowing what "it" is)
- One-shot retrieval can't handle multi-hop questions ("What did the CEO say about the product mentioned in last quarter's earnings?")
- No eval loop = no idea if it's working
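
The lost-context problem is easy to reproduce. The sketch below is an illustration, not a recommended chunker: it splits a made-up two-sentence policy into one chunk per sentence, and the second chunk ends up with no mention of what "it" refers to. The `split_sentences` helper and the policy text are hypothetical.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive chunker: one chunk per sentence, split after ., ! or ?"""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


# Made-up policy text for the example.
policy = "The booking deposit is charged at checkout. It is not refundable."
chunks = split_sentences(policy)

# The chunk that answers "Is the deposit refundable?" never says "deposit",
# so a retriever has little to match the question against.
orphan = chunks[1]
mentions_deposit = "deposit" in orphan.lower()
```

Here `orphan` holds only the second sentence and `mentions_deposit` is `False`: the answer is in the index, but the chunk that contains it no longer says what it is about.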
We'll address each in Section 4, Advanced RAG toolkit.