A RAG call isn't one operation. It's a pipeline of sequential stages, each eating time. Here's the anatomy:

embed query → retrieve → rerank → LLM TTFT → streaming completion
   ~50ms        ~100ms    ~200ms    ~500ms+       varies

Why this matters in an interview: When a customer says "our RAG is slow," you don't say "use a faster model." You ask where in the pipeline the time is going. Each stage has a different fix.
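To answer "where is the time going," you need per-stage timings rather than one end-to-end number. The sketch below is one minimal way to collect them. The names `timed`, `timings_ms`, and `slowest_stage` are made up for illustration, and `time.sleep` stands in for the real embed, retrieve, and rerank calls.

```python
import time
from contextlib import contextmanager

# Wall-clock duration per stage, in milliseconds.
timings_ms = {}

@contextmanager
def timed(stage):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0

def slowest_stage(timings):
    """Return the name of the stage with the largest recorded duration."""
    return max(timings, key=timings.get)

# Stand-in stages: time.sleep simulates work; swap in the real calls.
with timed("embed"):
    time.sleep(0.005)
with timed("retrieve"):
    time.sleep(0.010)
with timed("rerank"):
    time.sleep(0.020)

print(timings_ms, "slowest:", slowest_stage(timings_ms))
```

In a real service you would log these per request and look at percentiles, since a stage that is fine at the median can still dominate the tail.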

Stage-by-stage breakdown

Embed query (~50ms): This converts the user's query text into a vector. It's usually fast, but the cost adds up if you re-embed the same queries on every call instead of caching the result.
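A minimal sketch of query-embedding caching, assuming repeated queries are common enough to be worth it. `embed_uncached` is a hypothetical stand-in for whatever embedding call you actually make; it is not a real library function. The normalization step is also an assumption, and how aggressive it should be depends on your data.

```python
from functools import lru_cache

calls = {"count": 0}  # counts how often the underlying model is actually hit

def embed_uncached(text):
    """Hypothetical stand-in for a real embedding call (normally a network hop)."""
    calls["count"] += 1
    return tuple(float(ord(c) % 7) for c in text[:8])

def normalize(query):
    """Collapse case and whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())

@lru_cache(maxsize=10_000)
def _embed_normalized(normalized_query):
    # Returns a tuple so the cached value cannot be mutated by callers.
    return embed_uncached(normalized_query)

def embed_query(query):
    return _embed_normalized(normalize(query))

embed_query("How do I reset my password?")
embed_query("how do I  reset my password?")  # same key after normalization
print("model calls:", calls["count"])
```

An in-process cache like this only helps within one worker. A shared cache would be needed to get hits across processes.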

Retrieve (~100ms): This is your vector search: finding the top-k chunks closest to the query vector.
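To make "top-k closest chunks" concrete, here is a brute-force sketch using cosine similarity over an in-memory list. This is only an illustration of the operation. It scans every chunk, so its cost grows with corpus size, and real vector stores typically use an approximate nearest-neighbor index instead. The `chunks` data and the function names are invented for the example.

```python
import heapq
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, chunks, k=3):
    """chunks: list of (chunk_id, vector). Returns the k most similar ids, best first."""
    scored = ((cosine(query_vec, vec), chunk_id) for chunk_id, vec in chunks)
    return [chunk_id for _, chunk_id in heapq.nlargest(k, scored)]

# Toy 2-dimensional "embeddings" for four chunks.
chunks = [
    ("a", [1.0, 0.0]),
    ("b", [0.7, 0.7]),
    ("c", [0.0, 1.0]),
    ("d", [-1.0, 0.0]),
]
print(top_k([1.0, 0.1], chunks, k=2))
```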

Rerank (~200ms): A second model pass that re-scores retrieved chunks for relevance. This is often the hidden latency villain, since it's a separate model call.
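The shape of a rerank step can be sketched as "score each retrieved chunk against the query, then keep the best few." In the sketch below, `overlap_score` is a toy word-overlap scorer standing in for the real reranker model, and `rerank` and `keep` are illustrative names. Because every candidate is scored, the number of chunks you pass in is one lever on this stage's latency.

```python
def rerank(query, candidates, score_fn, keep=3):
    """Re-score candidates against the query and keep the best `keep`, highest first."""
    scored = [(score_fn(query, text), text) for text in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:keep]]

def overlap_score(query, text):
    """Toy scorer: fraction of query words that appear in the chunk."""
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / len(q) if q else 0.0

retrieved = [
    "Billing cycles run monthly",
    "Reset your password from the login page",
    "Password rules require twelve characters",
]
print(rerank("reset password", retrieved, overlap_score, keep=2))
```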

LLM Time-to-First-Token / TTFT (~500ms+): The biggest single stage. This is how long it takes before the model emits its first streamed token. Dominated by:

Streaming completion: Output token generation. Speed here is measured in tokens/second, which is mostly determined on the model side. You can't optimize it much, but you can cap max_output_tokens to prevent runaway generation.
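TTFT and tokens/second can both be measured from the client by timestamping tokens as they arrive. The sketch below assumes only that the response is exposed as a Python iterator of tokens. `measure_stream` and `fake_stream` are invented names, and the cap here is applied client-side purely to show its effect. In practice you would pass the limit to the model API so generation stops on the server.

```python
import time

def measure_stream(token_iter, max_output_tokens=None):
    """Consume a token iterator; return (ttft_seconds, tokens_per_second, tokens)."""
    start = time.perf_counter()
    first_at = None
    tokens = []
    for tok in token_iter:
        if first_at is None:
            first_at = time.perf_counter()  # moment the first token arrived
        tokens.append(tok)
        if max_output_tokens is not None and len(tokens) >= max_output_tokens:
            break  # stop reading once the cap is reached
    end = time.perf_counter()
    ttft = (first_at - start) if first_at is not None else None
    gen_time = (end - first_at) if first_at is not None else 0.0
    # Rate over the tokens that arrived after the first one.
    rate = (len(tokens) - 1) / gen_time if gen_time > 0 and len(tokens) > 1 else None
    return ttft, rate, tokens

def fake_stream():
    """Simulated model: a delay before the first token, then steady output."""
    time.sleep(0.05)
    for word in "the quick brown fox jumps over".split():
        yield word
        time.sleep(0.01)

ttft, rate, tokens = measure_stream(fake_stream(), max_output_tokens=3)
print(f"TTFT={ttft * 1000:.0f}ms tokens={tokens}")
```

Separating the two numbers matters because they have different fixes: a slow first token points at the stages before generation starts, while a low token rate points at the generation itself.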