Embed the query with the same embedding model as the corpus. This is non-negotiable: different models = different vector spaces = garbage retrieval.
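One cheap way to make the "same model" rule hard to break is to record the model name next to the index and check it at query time. A minimal sketch, assuming a hypothetical `VectorIndex` wrapper and a placeholder model name (`embed-model-v1`); neither comes from a real library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorIndex:
    """Hypothetical wrapper that remembers which model built the index."""
    embedding_model: str
    dimension: int


def check_query_model(index: VectorIndex, query_model: str, query_dim: int) -> None:
    """Fail loudly instead of silently returning garbage neighbours."""
    if query_model != index.embedding_model:
        raise ValueError(
            f"index built with {index.embedding_model!r}, query embedded with {query_model!r}"
        )
    if query_dim != index.dimension:
        raise ValueError(
            f"index dimension is {index.dimension}, query vector has {query_dim}"
        )


# Placeholder model name and dimension, for illustration only.
index = VectorIndex(embedding_model="embed-model-v1", dimension=768)
check_query_model(index, "embed-model-v1", 768)  # same model: passes silently
```

Note that the dimension check alone is not enough: two different models can share a dimension and still live in unrelated vector spaces, which is why the name is compared first.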
Find the top-k chunks whose vectors are closest to the query vector. "Closest" is usually cosine similarity or dot product.
Typical k: 5–20. Bigger k = more recall but more tokens into the LLM = more cost and more "lost in the middle" risk.
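The top-k lookup described above can be sketched as a brute-force scan. This is an illustration of the idea in plain Python, assuming small in-memory lists of non-zero vectors; the function names `cosine` and `top_k` are made up for the example:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, corpus_vecs, k=5):
    """Return (chunk_index, score) pairs for the k most similar chunks."""
    scored = [(i, cosine(query_vec, v)) for i, v in enumerate(corpus_vecs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # highest similarity first
    return scored[:k]


# Toy 2-d "embeddings" so the ranking is easy to check by eye.
corpus = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
query = [1.0, 0.1]
hits = top_k(query, corpus, k=2)
print([i for i, _ in hits])  # chunk 0 ranks first, then chunk 2
```

If the vectors are already normalized to unit length, cosine similarity and dot product give the same ranking, so the division can be skipped.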
Add metadata filters here: tenant ID, date range, document type, user permissions. This is how you enforce ACLs — filter at retrieval, not at generation. (An interviewer will love hearing this — "I never let an LLM be the security boundary.")
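A sketch of "filter at retrieval, not at generation": the permission check runs before any chunk can reach the ranking step or the prompt. The `Chunk` and `User` shapes and every field name here are assumptions for illustration, not a specific vector store's schema:

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class Chunk:
    text: str
    tenant_id: str
    doc_type: str
    published: date
    allowed_groups: frozenset


@dataclass
class User:
    tenant_id: str
    groups: frozenset


def visible_chunks(chunks, user, doc_types=None, since=None):
    """Keep only chunks this user may see; run this BEFORE similarity ranking."""
    kept = []
    for c in chunks:
        if c.tenant_id != user.tenant_id:
            continue  # hard tenant isolation
        if not (c.allowed_groups & user.groups):
            continue  # user shares no group with the chunk's ACL
        if doc_types is not None and c.doc_type not in doc_types:
            continue
        if since is not None and c.published < since:
            continue
        kept.append(c)
    return kept


chunks = [
    Chunk("Q3 payroll summary", "acme", "report", date(2024, 10, 1), frozenset({"finance"})),
    Chunk("Onboarding guide", "acme", "wiki", date(2023, 5, 2), frozenset({"everyone"})),
    Chunk("Other tenant's roadmap", "globex", "wiki", date(2024, 1, 9), frozenset({"everyone"})),
]
alice = User(tenant_id="acme", groups=frozenset({"everyone"}))
print([c.text for c in visible_chunks(chunks, alice)])
```

The payroll chunk and the other tenant's chunk never enter the candidate set, so the LLM cannot leak them no matter how the prompt is phrased.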
Stuff the retrieved chunks into a prompt template.
Three things to notice:
Pick the model by workload: Gemini 2.5 Pro for hard reasoning, Flash when cost/latency matters, Flash-Lite for cheap, high-volume work. Sometimes you route based on query difficulty (a pattern worth naming for the perf/cost RRK day).
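Routing on query difficulty can be as simple as a function that returns a model name. The sketch below uses a deliberately crude, hypothetical heuristic (query length plus a few keyword markers); a real router might use a small classifier instead. The model ID strings are assumed to match the tiers named above and should be checked against the current model list:

```python
# Assumed model identifiers for the three tiers; verify before use.
PRO = "gemini-2.5-pro"
FLASH = "gemini-2.5-flash"
FLASH_LITE = "gemini-2.5-flash-lite"

# Hypothetical markers that suggest multi-step reasoning.
HARD_MARKERS = ("why", "compare", "prove", "trade-off", "step by step")


def pick_model(query: str, high_volume: bool = False) -> str:
    """Return the cheapest tier that is likely good enough for this query."""
    if high_volume:
        return FLASH_LITE  # bulk jobs: optimize for unit cost
    lowered = query.lower()
    looks_hard = len(lowered.split()) > 40 or any(m in lowered for m in HARD_MARKERS)
    return PRO if looks_hard else FLASH


print(pick_model("What is our refund window?"))
print(pick_model("Compare the 2023 and 2024 retention policies and explain why they differ"))
print(pick_model("Tag this support ticket", high_volume=True))
```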
Parse the LLM's output and attach the source links/metadata so you can surface citations to the user. This is the bit that makes RAG defensible to a regulated customer.
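One way to make citations parseable is to number the chunks in the prompt and ask the model to cite them as `[n]`, then map those markers back to source metadata. A minimal sketch under that assumption; the prompt wording, the `[n]` convention, the dict keys, and the example URLs are all illustrative choices, and the model answer below is hard-coded rather than produced by an LLM call:

```python
import re


def build_prompt(question, chunks):
    """Stuff numbered chunks into a simple template."""
    sources = "\n".join(f"[{i}] {c['text']}" for i, c in enumerate(chunks, start=1))
    return (
        "Answer using only the sources below. Cite sources as [n].\n\n"
        f"Sources:\n{sources}\n\n"
        f"Question: {question}\nAnswer:"
    )


def attach_citations(answer, chunks):
    """Map [n] markers in the answer back to chunk metadata."""
    cited = sorted({int(n) for n in re.findall(r"\[(\d+)\]", answer)})
    citations = [
        {"marker": n, "title": chunks[n - 1]["title"], "url": chunks[n - 1]["url"]}
        for n in cited
        if 1 <= n <= len(chunks)  # drop markers that point at no real source
    ]
    return {"answer": answer, "citations": citations}


chunks = [
    {"text": "Shipping takes 3-5 business days.", "title": "Shipping FAQ", "url": "https://example.com/shipping"},
    {"text": "Refunds are accepted within 30 days.", "title": "Refund policy", "url": "https://example.com/refunds"},
]
prompt = build_prompt("How long do I have to request a refund?", chunks)

# Stand-in for the model's reply; [7] simulates a citation with no matching source.
fake_answer = "You have 30 days to request a refund [2]. See also [7]."
result = attach_citations(fake_answer, chunks)
print(result["citations"])
```

Dropping out-of-range markers matters for the regulated-customer case: every citation shown to the user should resolve to a chunk that was actually retrieved.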