The core problem: an LLM's knowledge is baked in at training time. That makes it four things at once that customers can't accept:
- Stale — training cutoff means recent events, policy changes, product updates are invisible
- Generic — it knows the public internet, not your customer's contracts, internal wiki, support tickets, or proprietary docs
- Lossy — even on topics it "knows," it compresses information badly. Specific numbers, exact policies, edge-case product details get smeared
- Untraceable — when it answers, it can't cite a source. For regulated industries (healthcare, finance, legal) that's a non-starter
The three options to solve this
Option A: Fine-tune the model on the customer's docs.
- Expensive per training run, slow iteration (days to weeks per cycle)
- Goes stale the moment the corpus changes
- Still can't cite which document an answer came from — fine-tuning blends knowledge into weights
- Hard to remove a single document (right-to-be-forgotten, GDPR, deletion requests)
Option B: Stuff everything into a long context window.
- Token cost scales linearly with corpus size on every single call
- "Lost in the middle" — models degrade on facts buried in long contexts
- Latency spikes with context size
- 1M-token windows help, but you can't put a 10M-document corpus in context
- No selectivity — the model wastes attention budget on irrelevant material
Option C: RAG — retrieve the relevant chunks at query time, give those to the LLM.
- Fresh: re-index when docs change, no retraining
- Cheap iteration: chunk strategy, retriever, prompt are all swappable independently