This is a different beast. SFT teaches the model to do a task. Preference tuning teaches the model which of two candidate outputs is better.
1. SFT model (you have a base task-tuned model)
↓
2. Collect preferences (prompt, response_A, response_B, "A is better")
↓
3. Train reward model (predicts human preference given a prompt and a response)
↓
4. PPO optimization (optimize the SFT model to maximize reward, with a KL penalty to stay close to SFT)
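
Steps 3 and 4 of the pipeline above can be sketched numerically. This is a minimal sketch, not a training loop: it assumes the reward model emits one scalar score per (prompt, response) pair and is trained with the pairwise Bradley-Terry loss, and that the quantity PPO maximizes is the reward minus a scaled single-sample KL estimate. The function names and the default `beta=0.1` are illustrative assumptions, not taken from any particular library.

```python
import math


def neg_log_sigmoid(x: float) -> float:
    """Numerically stable -log(sigmoid(x))."""
    if x >= 0:
        return math.log1p(math.exp(-x))
    return -x + math.log1p(math.exp(x))


def reward_model_loss(score_chosen: float, score_rejected: float) -> float:
    """Step 3: pairwise Bradley-Terry loss for one preference pair.

    The loss is small when the reward model scores the human-preferred
    response well above the rejected one.
    """
    return neg_log_sigmoid(score_chosen - score_rejected)


def kl_penalized_reward(
    reward: float,
    logp_policy: float,
    logp_sft: float,
    beta: float = 0.1,  # assumed penalty strength, tune per setup
) -> float:
    """Step 4: the per-sample quantity the policy is optimized to maximize.

    logp_policy and logp_sft are the summed log-probabilities of the sampled
    response under the current policy and the frozen SFT model. Their
    difference is a single-sample estimate of the KL divergence.
    """
    return reward - beta * (logp_policy - logp_sft)
```

With equal scores the reward model loss is `log(2)`, meaning it cannot tell the pair apart. The KL term is zero while the policy still matches the SFT model and grows as the policy drifts toward responses the SFT model considers unlikely.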
DPO (Direct Preference Optimization) is the simpler modern alternative: it skips the reward model entirely.
It derives a loss function mathematically (from the Bradley-Terry preference model) that lets you optimize the LLM directly on preference pairs. Same input data as RLHF, but a single training stage instead of reward modeling followed by PPO.
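
A minimal sketch of that loss for a single preference pair, in the form used by Direct Preference Optimization (DPO). It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference model (typically the SFT checkpoint). The function name, argument names, and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may move
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each log-ratio measures how much more likely the policy makes a response
    than the reference model does. The loss pushes the chosen log-ratio above
    the rejected one.
    """
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)

    # Numerically stable -log(sigmoid(logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

When the policy equals the reference model, both log-ratios are zero and the loss is `log(2)`. It falls below that once the policy raises the chosen response's likelihood relative to the rejected one. In practice this is computed over batches of tensors, averaged, and backpropagated through the policy only.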
| Need | Tool |
|---|---|
| Model doesn't know how to do the task | SFT |
| Model does the task, but you want it to do it more like X, less like Y | Preference tuning |
| Refusal behavior (won't say X / will say Y) | Preference tuning |
| Style polish, tone calibration | Preference tuning (or SFT if you have one canonical style) |