The Latency Budget

Cost Decomposition

Cost Optimization

"Cheapest Test First"

When a customer reports their RAG is slow or expensive, you don't immediately swap models. You work cheapest-to-most-expensive:

1. Check sampling params (temperature, top-p) — are they causing extra generation?
2. Check prompt length — system prompt bloat? Unnecessary examples?
3. Check retrieval count — retrieving 20 chunks when 5 would do?
4. THEN consider a model swap, but only if the steps above don't move the needle

The list is ordered by cost to diagnose, which maps directly to the troubleshooting framework you already know. Same principle: don't swap the engine before checking whether someone left the handbrake on.
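
The checklist above can be sketched as a small ordered-check runner. This is only an illustration: the `Check` type, the config keys, and every threshold are hypothetical placeholders, not recommended values. What it shows is the ordering: checks run from cheapest to most expensive to diagnose, and the model swap is reached only when nothing cheaper is flagged.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Check:
    name: str
    diagnose_cost: int  # relative effort to investigate (1 = cheapest)
    is_suspect: Callable[[dict], bool]


# All thresholds are illustrative assumptions; tune them per workload.
CHECKS = [
    Check("model swap", 4, lambda cfg: True),  # last resort, always available
    Check("retrieval count", 3, lambda cfg: cfg["retrieved_chunks"] > 5),
    Check("prompt length", 2, lambda cfg: cfg["system_prompt_tokens"] > 1500),
    # top_p could be checked the same way as temperature.
    Check("sampling params", 1, lambda cfg: cfg["temperature"] > 1.0),
]


def first_suspect(cfg: dict) -> str:
    """Return the cheapest-to-diagnose check that flags the config."""
    for check in sorted(CHECKS, key=lambda c: c.diagnose_cost):
        if check.is_suspect(cfg):
            return check.name
    return "no suspect found"


bloated = {"temperature": 0.2, "system_prompt_tokens": 3000, "retrieved_chunks": 20}
clean = {"temperature": 0.2, "system_prompt_tokens": 800, "retrieved_chunks": 5}
print(first_suspect(bloated))
print(first_suspect(clean))
```

Note that the bloated config also retrieves 20 chunks, but prompt length is reported first because it is cheaper to investigate.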

The Interview Angle

When a customer says "our inference costs are too high," your answer structure should be:

  1. Decompose first: "Let me understand where the cost is coming from — is this input token volume, output token volume, or call frequency?"
  2. Apply the right lever: context pruning for input bloat, semantic cache for frequency, model routing for quality/cost balance
  3. Quantify the win: "If you're at 10K calls/day with 4K avg input tokens, dropping to 1K tokens via pruning cuts input cost by 75%"
  4. Name the trade-off: pruning = faster + cheaper, but risks losing relevant context → eval your retrieval quality before and after
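
To make steps 1 and 3 concrete, here is a minimal sketch of the decomposition and the pruning arithmetic. The per-token prices and the 500-token average output are made-up assumptions for illustration; only the 10K calls/day and the 4K to 1K input-token figures come from the example above.

```python
def daily_cost(calls_per_day, avg_input_tokens, avg_output_tokens,
               input_price_per_mtok, output_price_per_mtok):
    """Split daily spend (USD) into input and output components."""
    input_cost = calls_per_day * avg_input_tokens * input_price_per_mtok / 1_000_000
    output_cost = calls_per_day * avg_output_tokens * output_price_per_mtok / 1_000_000
    return {"input": input_cost, "output": output_cost,
            "total": input_cost + output_cost}


# Hypothetical prices in USD per million tokens (assumptions, not real rates).
INPUT_PRICE = 3.00
OUTPUT_PRICE = 15.00

before = daily_cost(10_000, 4_000, 500, INPUT_PRICE, OUTPUT_PRICE)
after = daily_cost(10_000, 1_000, 500, INPUT_PRICE, OUTPUT_PRICE)

# Percentage saved on the input component and on the whole bill.
input_savings_pct = 100 * (before["input"] - after["input"]) / before["input"]
total_savings_pct = 100 * (before["total"] - after["total"]) / before["total"]

print(f"input cost cut: {input_savings_pct:.0f}%")   # 75%
print(f"total cost cut: {total_savings_pct:.0f}%")
```

The input component drops by 75% regardless of price, since the reduction is a ratio of token counts. The total bill drops by less because output tokens are untouched, which is why the decomposition comes before choosing a lever.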