The 7 Categories of RAG Clarifying Questions
Mental model: you're asking questions across users, data, retrieval needs, generation needs, operations, compliance, and constraints. Ask them in roughly that order: start with users (most important, most often skipped) and end with constraints (budget and timeline are real limits, but they should not drive the architecture).
1. Users & Use Case (the "why")
This is the single most important category. If you don't know who's using it and for what, every other decision is a guess.
- Who is the user? Internal employees, external customers, both? Technical users or non-technical? One persona or many?
- What task are they trying to accomplish? Looking up specific facts? Synthesizing across documents? Drafting content? Decision support? Each has a different RAG shape.
- What does success look like? "Faster answers" vs "fewer support tickets" vs "deflect X% of queries". A concrete success metric drives the evaluation strategy.
- What's the current process they'd be replacing? "We currently search SharePoint and ask Karen in compliance." This tells you both the bar to beat and the failure mode they'll be most sensitive to.
- What happens when the answer is wrong? A wrong answer in legal contract analysis is catastrophic. A wrong answer in "summarize this meeting" is annoying. This single question determines how much you invest in evals, guardrails, and citations.
- Volume? 10 queries a day or 10,000 per second? Burst patterns?
The user-and-stakes questions are the ones interviewers grade hardest because asking them signals consulting judgment.
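The volume question above becomes a quick back-of-envelope calculation once you have a number. The sketch below is illustrative only: the function names, the 8-hour active window, and the 3x burst factor are assumptions, not figures from any real workload.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def average_qps(queries_per_day: float) -> float:
    """Average queries per second if traffic were spread evenly over 24 hours."""
    return queries_per_day / SECONDS_PER_DAY


def peak_qps(
    queries_per_day: float,
    active_hours: float = 8.0,   # assumption: traffic is concentrated in a working day
    burst_factor: float = 3.0,   # assumption: bursts reach 3x the active-hours average
) -> float:
    """Rough peak load: the active-hours average multiplied by a burst factor."""
    if not 0 < active_hours <= 24:
        raise ValueError("active_hours must be in (0, 24]")
    if burst_factor < 1:
        raise ValueError("burst_factor must be >= 1")
    active_seconds = active_hours * 3600
    return queries_per_day / active_seconds * burst_factor


if __name__ == "__main__":
    for label, per_day in [
        ("10 queries a day", 10),
        ("10,000 per second", 10_000 * SECONDS_PER_DAY),
    ]:
        print(f"{label}: avg {average_qps(per_day):.6f} QPS, "
              f"rough peak {peak_qps(per_day):.6f} QPS")
```

The two extremes from the question differ by about eight orders of magnitude in daily volume, which is why the answer changes the serving design and not just the bill.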
2. Data (the corpus)
This is where most RAG systems actually live or die. The hardest part of production RAG is rarely the LLM; it's the data.
- What documents? Format types — PDFs, Word docs, HTML, Confluence, Slack, code, transcripts, scanned images? Each has different parsing needs.
- How many? Hundreds, thousands, millions? Total token count is what matters for cost and index sizing.
- How is it structured? Mostly prose, mostly tables, mixed? Lots of figures and charts that matter? Hierarchical (sections, subsections) or flat?
- Quality? Clean and well-maintained, or noisy and outdated? Are there duplicates, multiple versions of the same doc, drafts mixed with finals?
- Where does it live? GCS, SharePoint, Confluence, S3, on-prem file share, a customer data warehouse, behind an API? This drives the ingestion architecture and likely accounts for around 30% of the engineering effort.
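The "how many?" question above notes that total token count is what matters for cost and index sizing. A minimal sketch of that arithmetic follows, assuming fixed-size chunks with overlap and float32 embeddings. The function name and the defaults (512-token chunks, 64-token overlap, 1024-dimensional vectors) are hypothetical placeholders; substitute the values for your own chunker and embedding model.

```python
import math


def estimate_index(
    num_docs: int,
    avg_tokens_per_doc: int,
    chunk_tokens: int = 512,     # assumption: fixed-size chunks
    overlap_tokens: int = 64,    # assumption: overlap between adjacent chunks
    embedding_dim: int = 1024,   # assumption: depends on the embedding model
    bytes_per_value: int = 4,    # float32
) -> dict:
    """Estimate total tokens, chunk count, and raw vector storage for a corpus."""
    if overlap_tokens >= chunk_tokens:
        raise ValueError("overlap_tokens must be smaller than chunk_tokens")
    stride = chunk_tokens - overlap_tokens
    # Sliding windows needed to cover one average document (at least one).
    chunks_per_doc = max(1, math.ceil((avg_tokens_per_doc - overlap_tokens) / stride))
    num_chunks = num_docs * chunks_per_doc
    return {
        "total_tokens": num_docs * avg_tokens_per_doc,
        "num_chunks": num_chunks,
        # Raw vectors only; index structures and metadata add overhead on top.
        "raw_vector_bytes": num_chunks * embedding_dim * bytes_per_value,
    }


if __name__ == "__main__":
    est = estimate_index(num_docs=1_000, avg_tokens_per_doc=2_000)
    print(est)
```

With 1,000 documents of 2,000 tokens each, these defaults give 2,000,000 tokens, 5,000 chunks, and about 20 MB of raw vectors. Embedding cost scales with the token count and storage scales with the chunk count, so it helps to ask for both the document count and a typical document length.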