This is a different beast. SFT teaches the model to do a task. Preference tuning teaches the model which of two candidate outputs is better.

1. SFT model            (you have a base task-tuned model)
       ↓
2. Collect preferences  (prompt, response_A, response_B, "A is better")
       ↓
3. Train reward model   (predicts human preference given a prompt and a response)
       ↓
4. PPO optimization     (optimize SFT model to maximize reward,
                         with KL penalty to stay close to SFT)

5b. DPO (Direct Preference Optimization)

The simpler modern alternative. Skip the reward model entirely.

DPO derives a loss function (from the Bradley-Terry preference model) that lets you optimize the LLM directly on preference pairs. It uses the same input data as RLHF, but needs a single training stage instead of reward-model training followed by PPO.
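As a rough sketch of that loss for a single preference pair: the inputs are the summed token log-probabilities of the chosen and rejected responses under the model being trained and under a frozen reference model (typically the SFT model). The function name, argument names, and the default `beta=0.1` are assumptions made for this example, not a specific library's API.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log p(response | prompt) summed over response tokens.
    """
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_logp_chosen - ref_logp_chosen
    rejected_logratio = policy_logp_rejected - ref_logp_rejected
    # Bradley-Terry margin: these log-ratios act as implicit rewards.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)
```

At the start of training the policy equals the reference, so the margin is zero and the loss is log 2. The loss falls as the policy raises the chosen response's probability relative to the rejected one, measured against the reference model.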

5c. When SFT vs Preference Tuning?

Need                                                                    Tool
Model doesn't know how to do the task                                   SFT
Model does the task, but you want it to do it more like X, less like Y  Preference tuning
Refusal behavior (won't say X / will say Y)                             Preference tuning
Style polish, tone calibration                                          Preference tuning (or SFT if you have one canonical style)