3. The Methods — Full FT, PEFT, LoRA, QLoRA

3a. Full Fine-Tuning

Update every weight in the model. The "old way."

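Memory cost: ~4× model size at training time (weights + gradients + two Adam optimizer states), before counting activations
A 7B model in fp16 needs ~14 GB just for weights → ~56 GB to train if everything stays in fp16. Typical mixed-precision training also keeps fp32 master weights and fp32 Adam states, roughly 16 bytes per parameter, or ~112 GB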
Risk: catastrophic forgetting — model loses general capabilities while specializing
When to use: almost never, for FDE work. Rarely justified outside frontier labs.
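
The memory arithmetic above can be sketched in a few lines. This is a rough back-of-the-envelope estimate, not a profiler: the function name full_ft_memory_gb and its parameters are made up for illustration, it assumes every tensor is stored at the same precision, and it ignores activations, buffers, and framework overhead.

```python
def full_ft_memory_gb(n_params, bytes_per_value=2, adam_states=2):
    """Rough full fine-tuning memory in GB (1 GB = 1e9 bytes).

    Assumes weights, gradients, and Adam states all use the same
    precision (bytes_per_value). Activations are NOT included.
    """
    weights = n_params * bytes_per_value
    gradients = n_params * bytes_per_value
    # Adam keeps two running statistics per parameter (first and second moment).
    optimizer = n_params * bytes_per_value * adam_states
    total = weights + gradients + optimizer
    return {
        "weights": weights / 1e9,
        "gradients": gradients / 1e9,
        "optimizer": optimizer / 1e9,
        "total": total / 1e9,
    }


estimate = full_ft_memory_gb(7e9)  # 7B parameters, fp16 everywhere
for name, gb in estimate.items():
    print(f"{name:>9}: {gb:.0f} GB")
```

For a 7B model this reproduces the 14 GB (weights) and 56 GB (total) figures. Passing bytes_per_value=4 gives the all-fp32 case.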

3b. PEFT — Parameter-Efficient Fine-Tuning (the umbrella)

The insight that changed everything: empirically, the weight changes learned during fine-tuning have low intrinsic rank, so a small number of parameters can capture them. You don't need to touch every weight to specialize a model. Update a tiny subset, or add small new trainable parameters, and freeze the base.

Methods under PEFT:

LoRA — by far the most common
QLoRA — LoRA + 4-bit quantized base
Adapters — small bottleneck layers inserted inside each transformer block, typically after the attention and feed-forward sublayers (older sibling of LoRA)
Prefix tuning / P-tuning — train a "soft prompt" (continuous embeddings) prepended to input (mostly historical now)

3c. LoRA — Low-Rank Adaptation (deep dive)

The mechanism:

Instead of learning a new weight matrix W', learn a low-rank update to the frozen W:

W_effective = W_frozen + ΔW
where ΔW = B × A

W : d × d        (frozen, e.g., 4096 × 4096)
B : d × r        (trained, e.g., 4096 × 8)
A : r × d        (trained, e.g., 8 × 4096)
r : the rank     (typically 8, 16, 32, 64)
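
To make the shapes concrete, here is a minimal pure-Python sketch of a LoRA forward pass on a tiny matrix (d = 6, r = 2 instead of 4096 and 8). The helper names (matmul, matadd, lora_forward, merge) are illustrative only. Two details are assumptions brought in from common LoRA practice rather than from the notes above: B starts at zero so that ΔW = 0 before training, and the usual alpha / r scaling factor is omitted for brevity. The "trained" B is just random numbers standing in for learned values.

```python
import random


def matmul(X, Y):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]


def matadd(X, Y):
    """Element-wise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def lora_forward(x, W, A, B):
    """y = W x + B (A x). W stays frozen; only A and B would be trained."""
    return matadd(matmul(W, x), matmul(B, matmul(A, x)))


def merge(W, A, B):
    """Fold the adapter into the base weight: W_effective = W + B A."""
    return matadd(W, matmul(B, A))


d, r = 6, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(d)] for _ in range(d)]     # d x d, frozen
A = [[rng.gauss(0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trained
B = [[0.0] * r for _ in range(d)]                               # d x r, zero init (assumption)
x = [[rng.gauss(0, 1)] for _ in range(d)]                       # d x 1 input column

# With B = 0 the adapter contributes nothing, so the output matches the base model.
y_start = lora_forward(x, W, A, B)

# Random stand-in for a trained B, to show that merging gives the same output.
B_trained = [[rng.gauss(0, 0.1) for _ in range(r)] for _ in range(d)]
y_adapter = lora_forward(x, W, A, B_trained)
y_merged = matmul(merge(W, A, B_trained), x)
```

Computing B (A x) rather than forming B A first keeps the extra work proportional to d × r. Merging trades that for a single d × d matrix, so a merged model has the same inference cost as the original.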

If d = 4096 and r = 8:

Full matrix W: 4096 × 4096 ≈ 16.8M params (frozen)
LoRA update BA: 2 × 4096 × 8 ≈ 65.5K trainable params (~0.4% of W)
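
The same count can be checked for the other typical ranks. This is a small sketch for a single square d × d matrix; the function name lora_param_counts is made up for illustration, and a real model applies LoRA to many matrices, not all of them square.

```python
def lora_param_counts(d, r):
    """Return (full, lora, fraction) for one d x d weight matrix."""
    full = d * d          # frozen W
    lora = d * r + r * d  # B (d x r) plus A (r x d)
    return full, lora, lora / full


for r in (8, 16, 32, 64):
    full, lora, frac = lora_param_counts(4096, r)
    print(f"r={r:>2}: full={full:,}  lora={lora:,}  ({frac:.2%} of full)")
```

Trainable parameters grow linearly with r, so doubling the rank doubles the adapter size while W stays untouched.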