When a customer reports their RAG is slow or expensive, you don't immediately swap models. You work cheapest-to-most-expensive:
1. Check sampling params (temperature, top-p) — are they causing extra generation?
2. Check prompt length — system prompt bloat? Unnecessary examples?
3. Check retrieval count — retrieving 20 chunks when 5 would do?
4. THEN consider model swap — only if above don't move the needle
This is cost-to-diagnose ordered, which maps directly to the troubleshooting framework you already know. The same principle: don't swap the engine before checking if someone left the handbrake on.
When a customer says "our inference costs are too high," your answer structure should be: