Direct Preference Optimization
Direct Preference Optimization (DPO) is a stable, computationally efficient algorithm for aligning large language models with human preferences by directly optimising a policy from comparison data, without training a separate reward model or using reinforcement learning.
Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences that bypasses the complexity of reinforcement learning from human feedback (RLHF). Introduced by Rafailov et al. from Stanford University in a 2023 paper, DPO reformulates the preference alignment problem as a simple supervised classification objective. Rather than training a separate reward model and then running a policy-gradient reinforcement learning loop, DPO directly updates the language model's parameters using a dataset of pairwise comparisons between a preferred response and a dispreferred response for the same prompt. This makes alignment training substantially more stable and computationally efficient than RLHF, and the technique has been widely adopted in the development of open-weight instruction-following models.
Motivation: Limitations of RLHF
The dominant approach to alignment before DPO was RLHF, which involves three stages: supervised fine-tuning on demonstrations, training a reward model from human preference comparisons, and using proximal policy optimisation (PPO) or a similar RL algorithm to maximise the reward model's scores while staying close to the supervised reference policy. This pipeline is effective but brittle: reward model training introduces a separate large model with its own failure modes, PPO is sensitive to hyperparameters and prone to instability, and the overall process requires significant engineering effort. Reward hacking — where the model learns to maximise the reward model's score through behaviours the reward model did not anticipate — is a persistent challenge.
The DPO Objective
DPO derives its training objective by analytically solving the KL-constrained reward-maximisation problem that RLHF addresses. The optimal policy for that problem is the reference policy reweighted by the exponentiated reward, so the reward can be written, up to a constant that depends only on the prompt, as a scaled log-ratio of the policy's probabilities to the reference model's probabilities. Substituting this implicit reward into a Bradley-Terry preference model cancels the constant, and the training objective can then be written purely in terms of these log-probability ratios over chosen and rejected responses.
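For reference, the steps of this derivation can be summarised as below, following the presentation in Rafailov et al. (2023). The notation is introduced here and does not appear elsewhere in this article: \pi_\theta is the policy being trained, \pi_{\mathrm{ref}} the frozen reference policy, r the reward, \beta the strength of the KL constraint, Z(x) a normalising constant that depends only on the prompt x, y_w and y_l the preferred and rejected responses, and \sigma the logistic function.

```latex
\begin{align*}
% (1) KL-constrained reward maximisation targeted by RLHF
&\max_{\pi}\; \mathbb{E}_{x \sim \mathcal{D}}\Big[
    \mathbb{E}_{y \sim \pi(\cdot \mid x)}\big[ r(x, y) \big]
    - \beta\, \mathrm{KL}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
  \Big] \\
% (2) Closed-form optimum of (1)
&\pi^{*}(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
    \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big) \\
% (3) Rearranging (2): the reward implied by a policy
&r(x, y) = \beta \log \frac{\pi^{*}(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x) \\
% (4) Substituting (3) into a Bradley-Terry model cancels Z(x)
&\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\Big[
    \log \sigma\Big(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \Big)
  \Big]
\end{align*}
```

The term \beta \log Z(x) is the same for both responses to a prompt, which is why it drops out of the difference inside \sigma in the final line.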
In practice, the DPO training loop is straightforward. Each training example consists of a prompt, a preferred response, and a rejected response. The algorithm scores both responses under the current policy model and under a frozen reference model (typically the supervised fine-tuned model before alignment), then applies a binary cross-entropy loss to the difference between the two policy-to-reference log-probability ratios. Minimising this loss widens the margin by which the policy favours the preferred response over the rejected one, measured relative to the reference model. No RL algorithm, value function, or separate reward model is required.
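The per-example loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a training implementation: it assumes the four inputs are summed token log-probabilities of each response given the prompt, already computed by the policy and reference models, and the names `dpo_loss`, `log_sigmoid`, and the default `beta=0.1` are choices made for this example.

```python
import math


def log_sigmoid(z: float) -> float:
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    Each argument is log pi(response | prompt), summed over response tokens.
    """
    # Implicit rewards (up to a shared constant): policy-to-reference log-ratios.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Binary cross-entropy with the label "chosen beats rejected".
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# Policy identical to the reference: margin is 0, so the loss is log(2).
loss_at_init = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has shifted mass toward the chosen response: the loss is lower.
loss_after_shift = dpo_loss(-10.0, -18.0, -12.0, -15.0)
```

Because only the difference of log-ratios enters the loss, the reference model's log-probabilities can be computed once per example and cached, provided the reference stays frozen.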
Performance and Adoption
DPO has demonstrated alignment performance comparable to RLHF on instruction-following, summarisation, and dialogue benchmarks, while being significantly easier to implement and less prone to training instability. Models including Zephyr-7B, Tulu, and numerous fine-tunes of Llama, Mistral, and Phi have used DPO or its variants as their primary alignment step. The technique has become standard in the open-source community, with implementations available in Hugging Face's TRL (Transformer Reinforcement Learning) library and in most major fine-tuning frameworks.
Variants
Several variants address limitations of the original DPO formulation. Identity Preference Optimisation (IPO) modifies the loss to limit overfitting when preference labels are nearly deterministic, regressing the log-ratio margin toward a fixed target instead of pushing it without bound. Kahneman-Tversky Optimisation (KTO) extends DPO to non-paired data, allowing alignment from binary good/bad labels rather than pairwise comparisons. Contrastive Preference Optimisation (CPO) removes the need for a reference model entirely. Robust DPO (rDPO) introduces noise tolerance for imperfect preference annotations. Group Relative Policy Optimisation (GRPO), popularised by DeepSeek's training methodology, is often mentioned alongside these methods but is an online RL algorithm rather than a DPO variant: it samples a group of responses per prompt and computes each response's advantage relative to the group instead of using a learned value function. It has shown strong results on reasoning tasks.
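The difference between the DPO and IPO losses can be illustrated numerically. The sketch below assumes the commonly stated squared-error form of the IPO loss, with target margin 1/(2·tau). The function names, the argument `logratio_gap` (the chosen log-ratio minus the rejected log-ratio), and the default values of `beta` and `tau` are all introduced here for illustration.

```python
import math


def dpo_pair_loss(logratio_gap: float, beta: float = 0.1) -> float:
    """-log(sigmoid(beta * gap)), computed stably as softplus(-beta * gap)."""
    z = beta * logratio_gap
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))


def ipo_pair_loss(logratio_gap: float, tau: float = 0.1) -> float:
    """Squared distance of the gap from the fixed target 1 / (2 * tau)."""
    return (logratio_gap - 1.0 / (2.0 * tau)) ** 2


# Compare the two losses at the IPO target (gap = 5) and far beyond it (gap = 50).
dpo_at_target, dpo_far = dpo_pair_loss(5.0), dpo_pair_loss(50.0)
ipo_at_target, ipo_far = ipo_pair_loss(5.0), ipo_pair_loss(50.0)
```

Under these assumptions the DPO loss keeps falling as the gap grows, so it always rewards a larger margin, while the IPO loss is smallest at the target and penalises overshooting it.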
Comparison with RLHF
DPO and RLHF make different trade-offs. DPO is simpler, faster, and more stable, making it the preferred choice for most practical alignment applications where a paired preference dataset is available. RLHF remains relevant for online learning scenarios where the preference signal is generated dynamically as the model improves, or where the reward function needs to evolve based on model outputs. Hybrid approaches that combine DPO for initial alignment with online RL refinement are an active area of research.
References
- Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS) 2023. arXiv:2305.18290.
- Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., & Wolf, T. (2023). Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944.
- DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
- Toloka AI. (2024). Direct Preference Optimization (DPO): A Lightweight Counterpart to RLHF. Toloka Blog. https://toloka.ai/blog/direct-preference-optimization/