"I have a million pages of internal documents. I need an LLM-powered Q&A that answers from THOSE documents, with citations, and updates when the documents do."
5. Failure Modes and the Eval Loop
Clarify → receive answers → name the diagnosis → propose fix → discuss trade-offs
Customers without an eval are guessing. Your first job is always to remove the guessing. A 30-query eval set takes 2–4 hours to build and changes every downstream conversation. Always include "build the eval" as step 1 unless the customer explicitly says they already have one.
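To make "build the eval" concrete, here is a minimal sketch of a golden-set harness. Everything in it is an assumption for illustration: the `GoldenCase` fields, the `pipeline` signature (query in, answer text plus retrieved document ids out), the stub pipeline, and the substring check used as a stand-in for real answer grading. A real set would hold roughly 30 cases drawn from actual customer questions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class GoldenCase:
    query: str
    expected_doc_id: str   # document that should be retrieved (hypothetical id scheme)
    expected_phrase: str   # phrase a correct answer must contain (crude grading proxy)


# Hypothetical pipeline signature: query -> (answer text, retrieved doc ids)
Pipeline = Callable[[str], Tuple[str, List[str]]]


def run_eval(cases: List[GoldenCase], pipeline: Pipeline) -> Dict[str, float]:
    """Score retrieval and answers separately over the golden set."""
    if not cases:
        raise ValueError("golden set is empty")
    retrieval_hits = 0
    answer_hits = 0
    for case in cases:
        answer, doc_ids = pipeline(case.query)
        if case.expected_doc_id in doc_ids:
            retrieval_hits += 1
        if case.expected_phrase.lower() in answer.lower():
            answer_hits += 1
    n = len(cases)
    return {
        "n": n,
        "retrieval_hit_rate": retrieval_hits / n,
        "answer_accuracy": answer_hits / n,
    }


def stub_pipeline(query: str) -> Tuple[str, List[str]]:
    """Stand-in for the real system; the data here is made up."""
    if "vacation" in query:
        return "Employees accrue 20 days of vacation per year.", ["hr-policy-7"]
    return "I don't know.", ["misc-1"]


cases = [
    GoldenCase("How many vacation days do we get?", "hr-policy-7", "20 days"),
    GoldenCase("What is the travel expense limit?", "finance-3", "$500"),
]
report = run_eval(cases, stub_pipeline)
print(report)
```

Reporting the retrieval hit rate separately from answer accuracy is a deliberate choice: it lets each later fix be attributed to the stage it actually changed.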
The line for the interview:
"If they don't have an eval, the first deliverable is the eval, not because of process, but because every fix from here is a coin flip without it. A few hours of golden-set construction saves weeks of guessed fixes."
Every "answer is wrong" investigation should fork on this question:
Was the right info in the retrieved chunks?
  YES → Generation problem (prompt, model, lost-in-the-middle, hallucination)
  NO → Retrieval problem (parsing, chunking, embedding, ranking, filtering)
One question. Halves the search space. Always ask it.
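The fork can be sketched as a small triage function. This is an illustration under stated assumptions, not a definitive implementation: `diagnose` is a hypothetical helper, and case-insensitive substring matching stands in for whatever relevance judgment (human label or model grader) a real eval would use.

```python
from typing import List


def diagnose(expected_fact: str, retrieved_chunks: List[str], answer: str) -> str:
    """Classify a wrong answer as a retrieval or a generation problem.

    Assumption: the needed information can be detected with a
    case-insensitive substring match, which is only a crude proxy.
    """
    fact = expected_fact.lower()
    if fact in answer.lower():
        return "ok"  # the answer already contains the expected fact
    in_context = any(fact in chunk.lower() for chunk in retrieved_chunks)
    # Right info was retrieved but not used -> generation side.
    # Right info never reached the model -> retrieval side.
    return "generation" if in_context else "retrieval"


# Made-up examples for illustration only.
case_a = diagnose(
    "30 days",
    ["Refunds are accepted within 30 days of purchase."],
    "Refunds are accepted within 14 days.",
)
case_b = diagnose(
    "30 days",
    ["Standard shipping takes about a week."],
    "Refunds are accepted within 14 days.",
)
print(case_a, case_b)
```

Running this over every failed eval case gives a count per branch, which indicates whether to spend time on the prompt and model or on parsing, chunking, and ranking.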