3a. Full Fine-Tuning
Update every weight in the model. The "old way."
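- Memory cost: ~4× model size at training time (weights + gradients + two Adam optimizer states), before counting activations
- A 7B model in fp16 needs ~14 GB just for weights → ~56 GB to train at that 4× estimate. Mixed-precision setups that keep fp32 master weights and fp32 Adam states land closer to 16 bytes per parameter (~112 GB)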
- Risk: catastrophic forgetting — model loses general capabilities while specializing
- When to use: almost never, for FDE work. Rarely justified outside frontier labs.
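The memory arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not a profiler: the function name `training_memory_gb` and its parameters are made up for illustration, and it assumes every copy (weights, gradients, both Adam states) is stored at the same precision while ignoring activations entirely.

```python
def training_memory_gb(n_params, bytes_per_param=2, multiplier=4):
    """Rough full fine-tuning footprint in GB.

    multiplier=4 assumes weights + gradients + two Adam states,
    all at the same precision. Activations are NOT included.
    """
    weights_gb = n_params * bytes_per_param / 1e9
    return weights_gb, weights_gb * multiplier


weights_gb, train_gb = training_memory_gb(7e9)  # 7B params in fp16
print(f"weights: {weights_gb:.0f} GB, training: {train_gb:.0f} GB")
# prints "weights: 14 GB, training: 56 GB"
```

Real runs usually need more than this, since activation memory grows with batch size and sequence length.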
3b. PEFT — Parameter-Efficient Fine-Tuning (the umbrella)
The insight that changed everything: empirically, fine-tuning updates have low intrinsic rank. You don't need to touch every weight to specialize a model. Update a tiny subset, or add small new trainable parameters. Freeze the base.
Methods under PEFT:
- LoRA: by far the most common
- QLoRA: LoRA + 4-bit quantized base
- Adapters: small bottleneck layers inserted inside each transformer block, after the attention and/or feed-forward sublayers (older sibling of LoRA)
- Prefix tuning / P-tuning: train a "soft prompt" (continuous embeddings) prepended to the input, or in prefix tuning to each layer's keys and values (mostly historical now)
3c. LoRA — Low-Rank Adaptation (deep dive)
The mechanism:
Instead of learning a new weight matrix W', learn a low-rank update to the frozen W:
W_effective = W_frozen + ΔW
where ΔW = B × A
W : d × d (frozen, e.g., 4096 × 4096)
B : d × r (trained, e.g., 4096 × 8)
A : r × d (trained, e.g., 8 × 4096)
r : the rank (typically 8, 16, 32, 64)
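A minimal NumPy sketch of the mechanism above, using small illustrative sizes rather than 4096 × 8. It assumes the row-vector convention y = x Wᵀ (as in a typical linear layer) and the common LoRA initialization where B starts at zero so that ΔW = 0 before training. The helper name `lora_forward` is hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # small sizes for illustration only

W = rng.standard_normal((d, d))         # frozen base weight, d x d
B = np.zeros((d, r))                    # trained, d x r (zero init: update starts at 0)
A = rng.standard_normal((r, d)) * 0.01  # trained, r x d


def lora_forward(x, W, B, A):
    """Compute x @ (W + B A)^T without building the d x d update."""
    return x @ W.T + (x @ A.T) @ B.T


x = rng.standard_normal((2, d))

# With B = 0 the adapted layer matches the frozen base exactly.
assert np.allclose(lora_forward(x, W, B, A), x @ W.T)

# Stand-in for "after training": give B some nonzero values.
B = rng.standard_normal((d, r)) * 0.01
W_effective = W + B @ A  # merged weight, same shape as W
assert np.allclose(lora_forward(x, W, B, A), x @ W_effective.T)
```

The second assertion shows why LoRA adds no inference cost once merged: the trained B A can be folded into W, leaving a single matrix of the original shape.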
If d = 4096 and r = 8:
- Full matrix W: ~16.8M params (4096 × 4096 = 16,777,216, frozen)
- LoRA update BA: only ~65.5K trainable params (4096 × 8 + 8 × 4096 = 65,536), about 0.4% of W
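The parameter counts above follow directly from the shapes of W, B, and A. A short sketch for checking them at other ranks (the function name `lora_param_counts` is made up for illustration, and it assumes a square d × d weight as in the example):

```python
def lora_param_counts(d, r):
    """Return (frozen params in W, trainable params in B and A)."""
    full = d * d          # W is d x d
    lora = d * r + r * d  # B is d x r, A is r x d
    return full, lora


full, lora = lora_param_counts(4096, 8)
print(full, lora, f"{lora / full:.2%}")
# prints "16777216 65536 0.39%"

# Trainable params grow linearly with the rank.
for r in (8, 16, 32, 64):
    _, n = lora_param_counts(4096, r)
    print(f"r={r}: {n} trainable params")
```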