The Latency Budget

A RAG call isn't one operation. It's a pipeline of sequential stages, each eating time. Here's the anatomy:

embed query → retrieve → rerank → LLM TTFT → streaming completion
   ~50ms        ~100ms    ~200ms    ~500ms+       varies
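
Because the stages run one after another, the wait before the first token is simply their sum. A minimal sketch of that arithmetic, using the illustrative figures from the diagram above (they are rough estimates, not measurements, and the function names are made up for this example):

```python
# Illustrative stage estimates in milliseconds, taken from the diagram above.
# Real numbers vary by model, index, and network; measure your own.
STAGE_MS = {
    "embed_query": 50,
    "retrieve": 100,
    "rerank": 200,
    "llm_ttft": 500,
}


def time_to_first_token_ms(stages: dict[str, float]) -> float:
    """Stages run sequentially, so the user-perceived wait is their sum."""
    return sum(stages.values())


def stage_shares(stages: dict[str, float]) -> dict[str, float]:
    """Fraction of the pre-streaming wait spent in each stage."""
    total = time_to_first_token_ms(stages)
    return {name: ms / total for name, ms in stages.items()}


if __name__ == "__main__":
    total = time_to_first_token_ms(STAGE_MS)
    print(f"time to first token: ~{total:.0f}ms")
    for name, share in stage_shares(STAGE_MS).items():
        print(f"  {name:12s} {share:5.1%}")
```

With these numbers the user waits roughly 850ms before anything appears, and well over half of that is LLM TTFT.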

Why this matters in an interview: When a customer says "our RAG is slow," you don't say "use a faster model." You ask where in the pipeline the time is going. Each stage has a different fix.
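
To answer "where is the time going," you need per-stage timings. One way to get them is a small timing wrapper around each stage. This is a sketch only: `timed`, `timings_ms`, and `slowest_stage` are hypothetical helpers, and `time.sleep` stands in for the real embed, retrieve, and rerank calls.

```python
import time
from contextlib import contextmanager

# Collected wall-clock timings per stage, in milliseconds.
timings_ms: dict[str, float] = {}


@contextmanager
def timed(stage: str):
    """Record how long the wrapped block takes under the given stage name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings_ms[stage] = (time.perf_counter() - start) * 1000.0


def slowest_stage(timings: dict[str, float]) -> str:
    """Return the stage name with the largest recorded time."""
    return max(timings, key=timings.__getitem__)


if __name__ == "__main__":
    # time.sleep stands in for the real embed / retrieve / rerank calls.
    with timed("embed_query"):
        time.sleep(0.005)
    with timed("retrieve"):
        time.sleep(0.010)
    with timed("rerank"):
        time.sleep(0.020)
    for stage, ms in timings_ms.items():
        print(f"{stage:12s} {ms:6.1f}ms")
    print("slowest:", slowest_stage(timings_ms))
```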

Stage-by-stage breakdown

Embed query (~50ms) This converts the user's query text into a vector. It's usually fast but adds up if you're doing it on every call without caching.

Optimization: cache frequent queries (exact match). If "what is our return policy?" gets asked 100 times a day, embed it once.
Also: if you have batch workloads (offline ingestion), embed in large batches rather than one-by-one.
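
Both ideas fit in a few lines. The sketch below assumes an in-process exact-match cache (Python's `functools.lru_cache`) and a made-up `embed_uncached` function standing in for whatever embedding API you actually call. A shared cache such as Redis would follow the same pattern across instances.

```python
from functools import lru_cache
from typing import Iterator

# Counts calls to the (fake) embedding backend so the cache effect is visible.
backend_calls = {"count": 0}


def embed_uncached(text: str) -> tuple[float, ...]:
    """Hypothetical stand-in for a real embedding API call."""
    backend_calls["count"] += 1
    # Toy deterministic "vector"; a real model returns hundreds of floats.
    return tuple(float(ord(ch) % 7) for ch in text[:8])


@lru_cache(maxsize=10_000)
def embed_query(query: str) -> tuple[float, ...]:
    """Exact-match cache: the same query string is embedded only once."""
    return embed_uncached(query)


def batches(items: list[str], size: int) -> Iterator[list[str]]:
    """Yield fixed-size slices for offline ingestion (last one may be short)."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    for _ in range(100):
        embed_query("what is our return policy?")
    print("backend calls:", backend_calls["count"])
    print(list(batches(["a", "b", "c", "d", "e"], 2)))
```

The cache key is the raw query string, so "What is our return policy?" with a capital W is a miss. Whether to normalize case or whitespace before lookup is a judgment call that depends on your embedding model.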

Retrieve (~100ms) This is your vector search — finding the top-k chunks closest to the query vector.

Optimization: index choice (approximate nearest neighbor vs exact; HNSW vs IVF). ANN is faster but can miss a few of the true nearest neighbors (slightly lower recall), which is usually the right trade-off.
Region matters: if your index is in us-central1 and your Cloud Run service is in us-east1, every retrieval pays cross-region network latency (on the order of ~50ms) for no benefit. Co-locate.
Filter cost: adding metadata filters (e.g., tenant_id = X) can slow retrieval if the index isn't optimized for it. Name this trade-off.
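
One way the filter trade-off shows up is in when the filter runs relative to the search. The toy below uses exact brute-force search over a tiny in-memory list, not a real ANN index, and the chunk layout and function names are assumptions for illustration. Filtering after the search is cheap but can return fewer than k results. Filtering before the search always fills k but has to narrow the candidate set first.

```python
import heapq
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query: list[float], chunks: list[dict], k: int) -> list[dict]:
    """Exact (brute-force) top-k by cosine similarity."""
    return heapq.nlargest(k, chunks, key=lambda c: cosine(query, c["vector"]))


def post_filter(query, chunks, k, tenant_id):
    """Search everything, then drop other tenants. May return fewer than k."""
    return [c for c in top_k(query, chunks, k) if c["tenant_id"] == tenant_id]


def pre_filter(query, chunks, k, tenant_id):
    """Restrict to the tenant first, then search. Fills k when possible."""
    allowed = [c for c in chunks if c["tenant_id"] == tenant_id]
    return top_k(query, allowed, k)


CHUNKS = [
    {"id": "a1", "tenant_id": "A", "vector": [1.0, 0.0]},
    {"id": "a2", "tenant_id": "A", "vector": [0.9, 0.1]},
    {"id": "b1", "tenant_id": "B", "vector": [0.7, 0.7]},
    {"id": "b2", "tenant_id": "B", "vector": [0.0, 1.0]},
]

if __name__ == "__main__":
    query = [1.0, 0.0]
    print([c["id"] for c in post_filter(query, CHUNKS, 2, "B")])
    print([c["id"] for c in pre_filter(query, CHUNKS, 2, "B")])
```

Here the two globally closest chunks both belong to tenant A, so post-filtering for tenant B comes back empty while pre-filtering returns b1 and b2.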

Rerank (~200ms) A second model pass that re-scores retrieved chunks for relevance. This is often the hidden latency villain — it's a separate model call.

Optimization: only rerank when needed. Reranking 20 chunks to return top 3 is expensive. Consider retrieving fewer and skipping rerank for simple queries.
Or: rerank asynchronously and stream an initial answer, then refine (advanced pattern, worth mentioning).
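
"Only rerank when needed" needs some rule for deciding. The sketch below uses one possible heuristic, which is an assumption and not a standard: skip the reranker when retrieval scores already separate the top results clearly from the rest. The `min_margin` threshold is arbitrary and would need tuning against your own relevance data.

```python
def should_rerank(scores: list[float], top_n: int, min_margin: float = 0.15) -> bool:
    """Decide whether a rerank pass is worth its latency.

    scores: retrieval similarity scores for the candidate chunks.
    top_n:  how many chunks will be passed to the LLM.
    Returns False when there is nothing to cut, or when the gap between
    the last kept chunk and the first dropped chunk is already large.
    """
    if len(scores) <= top_n:
        return False  # every candidate is kept anyway
    ranked = sorted(scores, reverse=True)
    margin = ranked[top_n - 1] - ranked[top_n]
    return margin < min_margin


if __name__ == "__main__":
    clear_cut = [0.92, 0.90, 0.88, 0.40, 0.35]
    ambiguous = [0.80, 0.79, 0.78, 0.77, 0.76]
    print(should_rerank(clear_cut, top_n=3))
    print(should_rerank(ambiguous, top_n=3))
```

In the first case the top 3 stand well apart from the rest, so the ~200ms rerank call is skipped. In the second the scores are nearly tied, so the reranker earns its cost.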

LLM Time-to-First-Token / TTFT (~500ms+) Usually the biggest single stage before the user sees anything. This is the time from sending the request until the model emits its first token. Dominated by:

Input token count: the model must process every input token before generating any output, so TTFT grows with prompt length. A 4,000-token context will have a noticeably slower TTFT than a 400-token one.
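Model size: smaller models start responding sooner. Gemini Flash is roughly 3-5x faster to first token than Gemini Pro (a rule of thumb; measure on your own prompts). If you don't need Pro-level quality, don't pay for it.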
Streaming: always stream responses to the user. Don't wait for completion — TTFT is what the user perceives as "waiting."
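
When you stream, TTFT can be measured directly as the time until the first chunk arrives. A minimal sketch, assuming the model client hands you an iterator of text chunks. `fake_stream` is a made-up stand-in for that client, so the numbers only demonstrate the measurement and not real model behavior.

```python
import time
from typing import Iterable, Iterator, Optional


def fake_stream(first_token_delay_s: float, tokens: list[str],
                per_token_delay_s: float = 0.0) -> Iterator[str]:
    """Hypothetical stand-in for a streaming LLM response."""
    time.sleep(first_token_delay_s)  # simulates prompt processing (TTFT)
    for tok in tokens:
        yield tok
        time.sleep(per_token_delay_s)  # simulates per-token generation


def consume_with_ttft(stream: Iterable[str]) -> tuple[str, Optional[float], float]:
    """Drain a stream; return (text, ttft_ms, total_ms).

    ttft_ms is None if the stream produced no chunks at all.
    """
    start = time.perf_counter()
    ttft_ms: Optional[float] = None
    pieces: list[str] = []
    for tok in stream:
        if ttft_ms is None:
            ttft_ms = (time.perf_counter() - start) * 1000.0
        pieces.append(tok)  # in a real service, forward this to the user here
    total_ms = (time.perf_counter() - start) * 1000.0
    return "".join(pieces), ttft_ms, total_ms


if __name__ == "__main__":
    text, ttft_ms, total_ms = consume_with_ttft(
        fake_stream(0.05, ["Returns ", "are ", "accepted."], per_token_delay_s=0.01)
    )
    print(text)
    print(f"TTFT {ttft_ms:.0f}ms, total {total_ms:.0f}ms")
```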

Streaming completion Output token generation. Speed here is tokens/second, which is mostly model-side. You can't optimize much, but you can cap max_output_tokens to prevent runaway generation.
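
The effect of the cap is easy to put a number on: completion time is roughly output tokens divided by tokens per second, so a cap on output tokens is also a cap on worst-case duration. A back-of-the-envelope sketch, where the 100 tokens/second throughput is an assumed figure and not a measured one:

```python
def completion_time_s(output_tokens: int, tokens_per_second: float) -> float:
    """Approximate generation time for a given number of output tokens."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return output_tokens / tokens_per_second


def worst_case_total_s(ttft_s: float, max_output_tokens: int,
                       tokens_per_second: float) -> float:
    """Upper bound on response time when output is capped at max_output_tokens."""
    return ttft_s + completion_time_s(max_output_tokens, tokens_per_second)


if __name__ == "__main__":
    # Assumed figures: 0.5s TTFT, 100 tokens/second, cap of 512 output tokens.
    bound = worst_case_total_s(ttft_s=0.5, max_output_tokens=512,
                               tokens_per_second=100.0)
    print(f"worst case: ~{bound:.2f}s")
```

Under those assumptions the response finishes within about 5.6 seconds no matter how verbose the model gets.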