Stage 5 — Embed the query

Use the same embedding model as the corpus. This is critical: different models = different vector spaces = garbage retrieval.
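One way to make the "same model" rule hard to break is to record the model name next to the index and check it at query time. A minimal sketch, assuming a hypothetical `IndexInfo` record and a caller-supplied `embed_fn` (neither comes from a real library):

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class IndexInfo:
    # Hypothetical metadata saved when the corpus was embedded.
    embedding_model: str
    dimensions: int


def embed_query(
    query: str,
    model_name: str,
    index: IndexInfo,
    embed_fn: Callable[[str], List[float]],
) -> List[float]:
    """Embed the query, refusing to proceed if the model differs from the corpus model."""
    if model_name != index.embedding_model:
        raise ValueError(
            f"query model {model_name!r} != corpus model {index.embedding_model!r}"
        )
    vector = embed_fn(query)
    # A dimension mismatch is a second, cheaper signal that something is off.
    if len(vector) != index.dimensions:
        raise ValueError(
            f"expected {index.dimensions} dimensions, got {len(vector)}"
        )
    return vector
```

Failing loudly here is cheaper than debugging silently bad retrieval later.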

Stage 6 — Vector search

Find the top-k chunks whose vectors are closest to the query vector. "Closest" is usually cosine similarity or dot product (the two give the same ranking when vectors are normalized to unit length).
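To make "closest" concrete, here is a brute-force sketch of top-k by cosine similarity in plain Python. Real systems typically use an approximate nearest-neighbor index instead of scanning every vector; the function names and the `(chunk_id, vector)` layout are assumptions for illustration.

```python
import heapq
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query_vec: Sequence[float],
    chunks: Sequence[Tuple[str, Sequence[float]]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k (chunk_id, score) pairs with the highest similarity, best first."""
    scored = (
        (cosine_similarity(query_vec, vec), chunk_id) for chunk_id, vec in chunks
    )
    return [(chunk_id, score) for score, chunk_id in heapq.nlargest(k, scored)]
```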

Typical k: 5–20. Bigger k = more recall but more tokens into the LLM = more cost and more "lost in the middle" risk.

Add metadata filters here: tenant ID, date range, document type, user permissions. This is how you enforce ACLs — filter at retrieval, not at generation. (An interviewer will love hearing this — "I never let an LLM be the security boundary.")
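A sketch of filtering at retrieval: candidates are narrowed by metadata before any similarity scoring, so chunks the user may not see never reach the prompt. The `ChunkMeta` fields and `visible_chunks` helper are hypothetical; most vector databases express the same idea as a filter clause on the query.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional, Sequence


@dataclass(frozen=True)
class ChunkMeta:
    # Hypothetical metadata stored with each chunk.
    chunk_id: str
    tenant_id: str
    doc_type: str
    published: date
    allowed_groups: FrozenSet[str]


def visible_chunks(
    chunks: Sequence[ChunkMeta],
    tenant_id: str,
    user_groups: FrozenSet[str],
    doc_type: Optional[str] = None,
    since: Optional[date] = None,
) -> List[ChunkMeta]:
    """Keep only chunks this user is allowed to see and that match optional filters."""
    result = []
    for chunk in chunks:
        if chunk.tenant_id != tenant_id:
            continue  # hard tenant isolation
        if not (chunk.allowed_groups & user_groups):
            continue  # ACL: user shares no group with the chunk
        if doc_type is not None and chunk.doc_type != doc_type:
            continue
        if since is not None and chunk.published < since:
            continue
        result.append(chunk)
    return result
```

Only the chunks that survive this filter should be passed to the similarity search.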

Stage 7 — Build the prompt

Stuff the retrieved chunks into a prompt template.
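The template itself is not reproduced in these notes. As an illustration only, a template along these lines might look like the sketch below; the exact wording, the `[S1]`-style markers, and the `build_prompt` helper are all assumptions.

```python
from typing import List, Tuple

# Hypothetical template text; adjust the wording to your own use case.
PROMPT_TEMPLATE = """You are a helpful assistant. Use ONLY the context below to answer.
If the answer is not in the context, say you don't know.
Cite sources inline using their markers, e.g. [S1].

Context:
{context}

Question: {question}
Answer:"""


def build_prompt(question: str, chunks: List[Tuple[str, str]]) -> str:
    """Build the prompt from (source_title, chunk_text) pairs, numbering sources from 1."""
    context = "\n\n".join(
        f"[S{i}] ({title})\n{text}"
        for i, (title, text) in enumerate(chunks, start=1)
    )
    return PROMPT_TEMPLATE.format(context=context, question=question)
```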

Three things to notice:

  1. Source markers inside the prompt — this is how the LLM produces citations. You're teaching it the format.
  2. "Use ONLY the context" — anti-hallucination instruction. Doesn't fully work, but reduces it.
  3. "If not in context, say so" — explicit permission to abstain. Critical for trust.

Stage 8 — LLM call

Gemini 2.5 Pro for hard reasoning, Flash for cost/latency, Flash-Lite for cheap high-volume traffic. Sometimes you route based on query difficulty (a pattern worth naming for the perf/cost RRK day).
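Routing by difficulty can be sketched as a small function that picks a model tier per query. The keyword heuristic and word-count threshold below are purely illustrative assumptions (a real router might use a classifier), and the model ID strings are assumed to match the current Gemini naming.

```python
# Assumed model identifiers; verify against the current model list before use.
PRO = "gemini-2.5-pro"
FLASH = "gemini-2.5-flash"
FLASH_LITE = "gemini-2.5-flash-lite"

# Hypothetical signals that a query needs multi-step reasoning.
REASONING_HINTS = ("why", "compare", "explain", "trade-off", "step by step")


def pick_model(query: str, high_volume: bool = False) -> str:
    """Choose a model tier: Flash-Lite for bulk traffic, Pro for hard queries, else Flash."""
    if high_volume:
        return FLASH_LITE
    text = query.lower()
    looks_hard = len(text.split()) > 40 or any(hint in text for hint in REASONING_HINTS)
    return PRO if looks_hard else FLASH
```

The point to name in an interview is the pattern (cheap default, escalate on difficulty), not this particular heuristic.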

Stage 9 — Return answer + citations

Parse the LLM's output and attach the source links and metadata so the user sees citations. This is the bit that makes RAG defensible to a regulated customer.
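A minimal sketch of the parsing step, assuming the prompt asked the model to cite with `[S1]`-style markers and that `sources` maps each marker number to its metadata (both are assumptions, as is the `extract_citations` name). Markers that do not correspond to a retrieved source are dropped rather than shown to the user.

```python
import re
from typing import Dict, List

CITATION_RE = re.compile(r"\[S(\d+)\]")


def extract_citations(
    answer: str, sources: Dict[int, Dict[str, str]]
) -> List[Dict[str, str]]:
    """Return metadata for each distinct, valid source marker, in order of first use."""
    seen: List[int] = []
    for match in CITATION_RE.finditer(answer):
        number = int(match.group(1))
        # Ignore markers the model invented and skip repeats.
        if number in sources and number not in seen:
            seen.append(number)
    return [{"marker": f"[S{n}]", **sources[n]} for n in seen]
```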