RAG vs Fine-Tuning vs Long Context

Fine-tune when:

You need a style, tone, or format the base model can't reliably hit (legal brief voice, medical SOAP notes, JSON schema adherence on a custom format)
You need a domain vocabulary baked in (medical coding, legal citations, biotech jargon) where retrieval alone keeps producing awkward generations
You need lower latency or cost: for a specific task, a fine-tuned small model can often match a prompted large one at a fraction of the cost (on the order of a tenth, depending on the task and pricing)
The behavior is stable and you're not chasing a moving target. Fine-tuning on a policy that changes every quarter is a treadmill
The task is classification, extraction, or routing against a fixed taxonomy

Rule of thumb: fine-tune for HOW the model behaves, not WHAT it knows.
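
The cost argument for fine-tuning can be sanity-checked with back-of-envelope arithmetic. The sketch below is illustrative only: the ModelCost class, the prices, the token counts, and the training cost are all made-up placeholders, not figures from any provider. It compares per-call cost for a prompted large model (long prompt with instructions and examples) against a fine-tuned small model (short prompt), and estimates how many calls it takes to repay the one-off training cost.

```python
from dataclasses import dataclass


@dataclass
class ModelCost:
    """Per-call cost inputs. All prices here are hypothetical placeholders."""
    price_per_1k_input: float   # USD per 1,000 input tokens
    price_per_1k_output: float  # USD per 1,000 output tokens
    prompt_tokens: int          # instructions + examples + user input
    output_tokens: int

    def per_call(self) -> float:
        """Cost of a single call in USD."""
        return (self.prompt_tokens / 1000 * self.price_per_1k_input
                + self.output_tokens / 1000 * self.price_per_1k_output)


def break_even_calls(large: ModelCost, small: ModelCost, training_cost: float) -> float:
    """Number of calls before per-call savings repay the training cost."""
    saving = large.per_call() - small.per_call()
    if saving <= 0:
        return float("inf")  # the small model never pays for itself
    return training_cost / saving


# Hypothetical numbers: the large model needs a long few-shot prompt,
# the fine-tuned small model needs only the raw input.
large = ModelCost(0.01, 0.03, prompt_tokens=3000, output_tokens=200)
small = ModelCost(0.001, 0.002, prompt_tokens=500, output_tokens=200)

print(f"large: ${large.per_call():.4f} per call")
print(f"small: ${small.per_call():.4f} per call")
print(f"break-even after ~{break_even_calls(large, small, 500.0):,.0f} calls")
```

If the break-even count is far above your expected call volume, prompting the large model is the cheaper option and fine-tuning for cost alone is not worth it.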

Long context when:

The corpus genuinely fits (a single contract, a codebase under ~500K tokens, one earnings report)
The task needs whole-document reasoning — cross-references, "compare section 3 to section 17," summarize-the-whole-thing — where chunking would shatter the structure
One-shot or low-volume queries where the per-call cost is acceptable
You're prototyping and want to skip retrieval infrastructure for v0

Rule of thumb: long context for narrow, deep, single-document reasoning. Not for corpus-scale Q&A.
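
"Genuinely fits" is worth checking before committing to long context. A minimal sketch, assuming a rough heuristic of about 4 characters per token for English text; real counts depend on the model's tokenizer, so swap in the actual tokenizer where accuracy matters. The function names and the reserved-output default are hypothetical.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-chars-per-token ratio is an assumption."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(documents: list[str],
                    context_window: int,
                    reserved_for_output: int = 4000,
                    chars_per_token: float = 4.0) -> bool:
    """True if all documents fit in the window with room left for the answer."""
    budget = context_window - reserved_for_output
    needed = sum(estimate_tokens(d, chars_per_token) for d in documents)
    return needed <= budget


contract = "lorem ipsum " * 20_000  # stand-in for one long document
print(fits_in_context([contract], context_window=200_000))
```

A corpus that only just fits today will not fit after it grows, which is one more reason to treat long context as a single-document tool.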

RAG when:

Corpus is large, growing, or changing (anything bigger than what fits in context, anything that updates more often than monthly)
You need citations / provenance
You need access control — different users see different documents
You need multi-tenancy — different customers, isolated indices
You're building enterprise knowledge Q&A, a support agent, or an internal copilot, where RAG is the default
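
Access control and citations both fall out of the retrieval step. The sketch below is a toy under stated assumptions: the Chunk class, the group names, and the word-overlap scoring are hypothetical stand-ins for a real vector index and permission system. The point it illustrates is the ordering: filter by permissions before ranking, and carry document ids into the prompt so answers can cite them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    doc_id: str                # carried through so the answer can cite it
    text: str
    allowed_groups: frozenset  # groups permitted to see this chunk


def retrieve(query: str, chunks: list[Chunk], user_groups: set, k: int = 3) -> list[Chunk]:
    """Return the top-k visible chunks, ranked by naive word overlap."""
    query_terms = set(query.lower().split())
    # Filter BEFORE ranking so restricted text never reaches the prompt.
    visible = [c for c in chunks if c.allowed_groups & user_groups]
    scored = [(len(query_terms & set(c.text.lower().split())), c) for c in visible]
    scored = [(score, c) for score, c in scored if score > 0]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:k]]


def build_prompt(query: str, hits: list[Chunk]) -> str:
    """Assemble a prompt whose sources are tagged with citable ids."""
    sources = "\n".join(f"[{c.doc_id}] {c.text}" for c in hits)
    return ("Answer using only the sources below and cite their ids.\n\n"
            f"{sources}\n\nQuestion: {query}")


chunks = [
    Chunk("hr-1", "vacation policy allows 20 days", frozenset({"hr"})),
    Chunk("eng-1", "deploy policy requires review", frozenset({"eng"})),
]
hits = retrieve("vacation policy", chunks, user_groups={"eng"})
print(build_prompt("vacation policy", hits))
```

Multi-tenancy follows the same pattern one level up: pick the tenant's index first, then apply the per-user filter inside it.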

Hybrid is common and worth naming explicitly: fine-tune a small model for the style of answers and use RAG for the content. Or use long context for the active document the user is editing and RAG for the rest of the knowledge base.
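
The rules of thumb above can be condensed into a small decision helper. This is a deliberately simplified sketch, not a complete policy: the Requirements fields and the choose_approach function are hypothetical names, and real decisions also weigh latency, call volume, and cost. It treats knowledge (RAG vs long context) and behavior (fine-tuning) as separate questions, which is why hybrids come out naturally.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    corpus_tokens: int                  # estimated size of the knowledge source
    context_window: int                 # usable window of the target model
    corpus_changes_often: bool = False
    needs_citations: bool = False
    needs_access_control: bool = False
    needs_custom_style: bool = False    # tone, format, or fixed taxonomy


def choose_approach(req: Requirements) -> set[str]:
    """Pick a knowledge strategy, then add fine-tuning if behavior demands it."""
    picks = set()
    # WHAT the model knows: RAG if the corpus is too big, too volatile,
    # or needs provenance or per-user filtering; otherwise long context.
    needs_rag = (req.corpus_tokens > req.context_window
                 or req.corpus_changes_often
                 or req.needs_citations
                 or req.needs_access_control)
    picks.add("rag" if needs_rag else "long_context")
    # HOW the model behaves: an independent question.
    if req.needs_custom_style:
        picks.add("fine_tune")
    return picks


print(choose_approach(Requirements(50_000, 200_000)))
print(choose_approach(Requirements(5_000_000, 200_000, needs_custom_style=True)))
```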