Cost Decomposition

The formula is simple and you should know it cold:

cost = (input_tokens × $/M_in) + (output_tokens × $/M_out)

The key insight: in RAG, input tokens dominate. Why? Because every retrieved chunk you stuff into the context adds to your input token count. If you retrieve 10 chunks × 500 tokens each, that's 5,000 tokens of context before you even write your system prompt or the user's question.

Output tokens matter more for tasks like summarization or code generation where the answer is long. For Q&A RAG, the answer is usually short — so input dominates.

The 5 cost levers (know all five)

1. Prompt caching If your system prompt is 2,000 tokens and it's the same on every call, the model re-processes it every time. Prompt caching lets you pay once and reuse. Gemini supports this. Cost reduction can be dramatic on high-volume applications.

2. Semantic caching Exact match caching handles "what is your return policy?" being asked identically. But users rephrase. Semantic caching embeds incoming queries and checks if a similar query was already answered (cosine similarity > threshold). If yes, return cached answer without hitting the LLM. Tools: Redis + embedding similarity layer.

3. Model routing Not every query needs your most expensive model. A three-tier pattern:

Flash → 80% of queries (simple lookups, factual Q&A)
Pro → 15% (complex reasoning, multi-step)
Human escalation → 5% (out-of-scope, high-stakes)

A lightweight classifier or even a short prompt to Flash itself can do the routing. The cost savings are multiplicative.

4. Context pruning Fewer retrieved chunks = fewer input tokens = lower cost + faster TTFT. Strategies:

Retrieve top-3 instead of top-10 if precision is high
Truncate chunks (first N sentences often carry the key info)
Use a reranker to filter aggressively before passing to LLM

5. Smaller embeddings Embedding model choice affects ingestion cost and storage. Smaller models (e.g., text-embedding-004 vs a larger variant) can be nearly as good at retrieval for structured domains while being cheaper to run at scale. Name this as a lever even if you don't always use it.