Prompt caching
Both Gemini and Anthropic support it. If the example block is identical across all 2M requests, you can cache it and pay a fraction of the input cost on cache hits (often ~10–25% of normal input price). This can drop the bill 60–80% with zero accuracy change and one config flag.
dynamic prompting
detect the complexity of the incoming query, and then you decide which technique to use in your prompt itself.
The technique ladder, cheapest → most expensive, looks roughly like this:
Each rung up costs more in latency, dollars, complexity, and maintenance.
1. zero-shot → few-shot → few-shot + CoT → RAG (retrieve definitions/examples)
→ fine-tune → train your own
2. prompt caching → smaller model → retrieval few-shot (fewer examples per call)
→ fine-tune
zero-shot direct → 1x cost
few-shot → ~5x
CoT (1 path) → ~20x cost (longer output)
self-consistency N=10 → ~200x cost
ReAct → ~Nx
Reflexion (3 tries) → ~3x of whatever the base trajectory costs
ToT (50 nodes,eval ea) → ~2000–5000x cost
thinking → variable
zero-shot — no scaffolding, trust the model
few-shot — show pattern by example
CoT — show your reasoning
self-consistency — N paths, majority vote (fixes noise)
ToT — tree search with evaluation (rarely right in production)
ReAct — reason + act + observe (the dominant agent pattern)
Reflexion — self-critique + retry (fixes bias, needs evaluator)