5. Preference Tuning — RLHF and DPO

This is a different beast. SFT teaches the model to do a task by imitating example outputs. Preference tuning teaches the model which of two candidate outputs for the same prompt is better.
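The difference shows up first in the shape of the training data. A minimal sketch, assuming illustrative record types: the class names, the helper `to_preference_pair`, and the `chosen`/`rejected` field names are assumptions for this example (the latter is a common naming convention in open-source toolchains), not something these notes prescribe.

```python
from dataclasses import dataclass


@dataclass
class SFTExample:
    """One supervised example: a prompt and the single output to imitate."""
    prompt: str
    completion: str


@dataclass
class PreferencePair:
    """One preference example: two outputs for the same prompt, ranked."""
    prompt: str
    chosen: str    # the response the annotator preferred
    rejected: str  # the response the annotator ranked lower


def to_preference_pair(prompt: str, response_a: str, response_b: str,
                       label: str) -> PreferencePair:
    """Convert an (A, B, "A is better") style annotation into chosen/rejected form."""
    if label == "A":
        return PreferencePair(prompt, chosen=response_a, rejected=response_b)
    if label == "B":
        return PreferencePair(prompt, chosen=response_b, rejected=response_a)
    raise ValueError(f"label must be 'A' or 'B', got {label!r}")


if __name__ == "__main__":
    pair = to_preference_pair("Summarize the report.", "Three crisp bullets.",
                              "A rambling page of text.", label="A")
    print(pair)
```

An SFT record has exactly one target; a preference record never says what the ideal answer is, only which of two answers is better.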

5a. RLHF (Reinforcement Learning from Human Feedback)

The classic pipeline has four stages:

1. SFT model            (you have a base task-tuned model)
       ↓
2. Collect preferences  (prompt, response_A, response_B, "A is better")
       ↓
3. Train reward model   (scores a response to a prompt; fit to the human preferences)
       ↓
4. PPO optimization     (optimize SFT model to maximize reward,
                         with KL penalty to stay close to SFT)
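Steps 3 and 4 each reduce to a small formula. The sketch below is an illustration under stated assumptions, not a full training loop: it takes the reward model's scores and the sequence log-probabilities as plain floats, and the function names and the default `beta=0.1` are hypothetical choices for this example. The reward model is typically fit with a Bradley-Terry pairwise loss, and the PPO step typically maximizes the reward-model score minus a KL-style penalty against the frozen SFT model.

```python
import math


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Bradley-Terry pairwise loss: -log sigmoid(score_chosen - score_rejected).

    The loss is small when the reward model scores the preferred
    response well above the rejected one.
    """
    margin = score_chosen - score_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed in a numerically stable form.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def kl_penalized_reward(reward: float, logp_policy: float, logp_sft: float,
                        beta: float = 0.1) -> float:
    """Reward signal for the PPO step: reward-model score minus a KL penalty.

    logp_policy and logp_sft are the log-probabilities of the sampled
    response under the current policy and the frozen SFT model. Their
    difference is a single-sample estimate of the KL divergence, so the
    penalty grows as the policy drifts away from the SFT model.
    """
    return reward - beta * (logp_policy - logp_sft)


if __name__ == "__main__":
    # An undecided reward model (equal scores) gives loss log(2).
    print(reward_model_loss(0.0, 0.0))
    # A response the policy now favors more than the SFT model did is penalized.
    print(kl_penalized_reward(reward=1.0, logp_policy=-10.0, logp_sft=-12.0))
```

Raising `beta` keeps the tuned model closer to the SFT model; lowering it lets the policy chase reward harder, at the risk of exploiting flaws in the reward model.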

5b. DPO (Direct Preference Optimization)

The simpler modern alternative. Skip the reward model entirely.

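DPO derives a loss function from the Bradley-Terry preference model that lets you optimize the LLM directly on preference pairs. It uses the same input data as RLHF but needs only one supervised training stage: no separate reward model and no RL loop, just the model being trained plus a frozen reference copy (usually the SFT model).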

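More stable to train than PPO (a plain supervised loss, no sampling loop)
Easier to tune (fewer hyperparameters; the main one is β, the strength of the pull toward the reference model)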
Increasingly the default for "I want to align behavior" use cases
Vertex AI and most open-source toolchains (e.g. Hugging Face TRL) support it
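For a single preference pair, the DPO loss is -log sigmoid(β · [(log π(chosen) − log π_ref(chosen)) − (log π(rejected) − log π_ref(rejected))]), where π is the model being trained and π_ref is a frozen reference model. A minimal sketch, assuming the four sequence log-probabilities have already been computed and are passed in as floats; the function name, argument names, and the default `beta=0.1` are illustrative choices, not taken from any particular library.

```python
import math


def dpo_loss(policy_logp_chosen: float, policy_logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) preference pair."""
    # How much more (or less) likely each response has become relative
    # to the frozen reference model.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected

    # Implicit reward margin: positive when the policy has shifted
    # probability toward the chosen response and away from the rejected one.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable form.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))


if __name__ == "__main__":
    # Policy identical to the reference: loss is log(2), about 0.693.
    print(dpo_loss(-5.0, -5.0, -5.0, -5.0))
    # Policy has moved toward the chosen response: loss drops below log(2).
    print(dpo_loss(-4.0, -6.0, -5.0, -5.0))
```

The reference log-probabilities play the role that the KL penalty plays in PPO: the loss rewards changes relative to the reference model rather than raw likelihood, which is why no separate reward model is needed.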

5c. When SFT vs Preference Tuning?

Need	Tool
Model doesn't know how to do the task	SFT
Model does the task, but you want it to do it more like X, less like Y	Preference tuning
Refusal behavior (won't say X / will say Y)	Preference tuning
Style polish, tone calibration	Preference tuning (or SFT if you have one canonical style)