Fine-tune when:
- You need a style, tone, or format the base model can't reliably hit (legal brief voice, medical SOAP notes, JSON schema adherence on a custom format)
- You need a domain vocabulary baked in (medical coding, legal citations, biotech jargon) where retrieval alone keeps producing awkward generations
- You need latency or cost reduction — a fine-tuned small model can match a prompted large one for a specific task at 1/10th the cost
- The behavior is stable — you're not chasing a moving target. Fine-tuning a quarterly-changing policy is a treadmill
- Classification / extraction / routing tasks with a fixed taxonomy
Rule of thumb: fine-tune for HOW the model behaves, not WHAT it knows.
Long context when:
- The corpus genuinely fits (a single contract, a codebase under ~500K tokens, one earnings report)
- The task needs whole-document reasoning — cross-references, "compare section 3 to section 17," summarize-the-whole-thing — where chunking would shatter the structure
- One-shot or low-volume queries where the per-call cost is acceptable
- You're prototyping and want to skip retrieval infrastructure for v0
Rule of thumb: long context for narrow, deep, single-document reasoning. Not for corpus-scale Q&A.
RAG when:
- Corpus is large, growing, or changing (anything bigger than what fits in context, anything that updates more than monthly)
- You need citations / provenance
- You need access control — different users see different documents
- You need multi-tenancy — different customers, isolated indices
- The default for enterprise knowledge Q&A, support agents, internal copilots
Hybrid is common and worth naming aloud: fine-tune a small model for the style of answers + RAG for the content. Or: long-context for the active document the user is editing + RAG for the rest of the knowledge base.