What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Direct Preference Optimization

Direct Preference Optimization (DPO) is a stable, computationally efficient algorithm for aligning large language models with human preferences by directly optimising a policy from comparison data, without training a separate reward model or using reinforcement learning.

6 min read | Last updated June 2026 | Foundations

Direct Preference Optimization (DPO) is a method for aligning large language models (LLMs) with human preferences that bypasses the complexity of reinforcement learning from human feedback (RLHF). Introduced by Rafailov et al. from Stanford University in a 2023 paper, DPO reformulates the preference alignment problem as a simple supervised classification objective. Rather than training a separate reward model and then running a policy-gradient reinforcement learning loop, DPO directly updates the language model's parameters using a dataset of pairwise comparisons between a preferred response and a dispreferred response for the same prompt. This makes alignment training substantially more stable and computationally efficient than RLHF, and the technique has been widely adopted in the development of open-weight instruction-following models.

Motivation: Limitations of RLHF

The dominant approach to alignment before DPO was RLHF, which involves three stages: supervised fine-tuning on demonstrations, training a reward model from human preference comparisons, and using proximal policy optimisation (PPO) or a similar RL algorithm to maximise the reward model's scores while staying close to the supervised reference policy. This pipeline is effective but brittle: reward model training introduces a separate large model with its own failure modes, PPO is sensitive to hyperparameters and prone to instability, and the overall process requires significant engineering effort. Reward hacking — where the model learns to maximise the reward model's score through behaviours the reward model did not anticipate — is a persistent challenge.

The DPO Objective

DPO derives its training objective by analytically solving the KL-constrained reward maximisation problem that RLHF addresses. The optimal policy for that problem has a closed form: the reference model's output distribution reweighted by the exponentiated reward. Inverting this relationship shows that the reward is implicitly defined, up to a prompt-dependent constant, by the log-ratio of the policy model's probabilities to the reference model's probabilities. Because that constant cancels when two responses to the same prompt are compared, the training objective can be written purely in terms of these log-probability ratios over chosen and rejected responses.
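For reference, the steps described above can be written out in the notation of Rafailov et al. (2023). The symbols are not defined elsewhere in this article and are introduced here only for illustration: $\pi_\theta$ is the policy being trained, $\pi_{\mathrm{ref}}$ the frozen reference model, $r$ the reward, $\beta$ the strength of the KL constraint, $Z(x)$ a prompt-dependent normalising constant, $\sigma$ the logistic function, and $(x, y_w, y_l)$ a prompt with its preferred and rejected responses. This is a summary of the paper's formulation, not a full derivation.

```latex
\begin{align}
% 1. KL-constrained reward maximisation targeted by RLHF
&\max_{\pi_\theta}\;
  \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}
  \big[ r(x, y) \big]
  - \beta\, \mathbb{D}_{\mathrm{KL}}\big[ \pi_\theta(y \mid x) \,\|\, \pi_{\mathrm{ref}}(y \mid x) \big]
\\[4pt]
% 2. Closed-form optimum of (1)
&\pi_r(y \mid x) = \frac{1}{Z(x)}\, \pi_{\mathrm{ref}}(y \mid x)\,
  \exp\!\Big( \tfrac{1}{\beta}\, r(x, y) \Big)
\\[4pt]
% 3. Rearranging (2): the reward expressed through the policy
&r(x, y) = \beta \log \frac{\pi_r(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)} + \beta \log Z(x)
\\[4pt]
% 4. Substituting (3) into a Bradley-Terry preference model; Z(x) cancels
&\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \Big[ \log \sigma\Big(
    \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \Big) \Big]
\end{align}
```

The term $\beta \log Z(x)$ depends only on the prompt, so it drops out of the difference between the two responses in step 4. This is why no explicit reward model needs to be fitted.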

In practice, the DPO training loop is straightforward: for each training example consisting of a prompt, a preferred response, and a rejected response, the algorithm scores both responses with the current policy model and with a frozen reference model (typically the supervised fine-tuned model before alignment). It then applies a binary cross-entropy loss to the difference between the two responses' policy-to-reference log-probability ratios. Minimising this loss raises the policy's probability of the preferred response relative to the reference model and lowers it for the rejected response. No RL algorithm, value function, or separate reward model is required.
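The per-example loss described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used by any particular library: the function names `sequence_logprob` and `dpo_loss`, the default `beta=0.1`, and the assumption that each model has already produced per-token log-probabilities for each response are all hypothetical choices made for this example.

```python
import math


def sequence_logprob(token_logprobs):
    """Log-probability of a whole response: the sum of its per-token log-probs."""
    return sum(token_logprobs)


def dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one (prompt, chosen, rejected) example.

    Each argument is the summed log-probability of a response under the
    trainable policy or the frozen reference model. `beta` (an assumed
    default) controls how far the policy may drift from the reference.
    """
    # Implicit rewards: scaled log-ratio of policy to reference.
    chosen_reward = beta * (policy_chosen - ref_chosen)
    rejected_reward = beta * (policy_rejected - ref_rejected)
    margin = chosen_reward - rejected_reward

    # Binary cross-entropy with target "chosen wins": -log(sigmoid(margin)),
    # computed as softplus(-margin) in a numerically stable form.
    loss = max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return loss, chosen_reward, rejected_reward
```

At the start of training the policy equals the reference model, so the margin is zero and the loss is log 2 for every example. In a real training loop the same arithmetic is applied to batched tensors so that gradients flow into the policy only; the reference log-probabilities are treated as constants.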

Performance and Adoption

DPO has demonstrated alignment performance comparable to RLHF on instruction-following, summarisation, and dialogue benchmarks, while being significantly easier to implement and less prone to training instability. Models including Zephyr-7B, Tulu, and numerous fine-tunes of Llama, Mistral, and Phi have used DPO or its variants as their primary alignment step. The technique has become standard in the open-source community, with implementations available in Hugging Face's TRL (Transformer Reinforcement Learning) library and in most major fine-tuning frameworks.

Variants

Several variants address limitations of the original DPO formulation. Identity Preference Optimisation (IPO) replaces the log-sigmoid loss with a squared-error objective on the log-ratio margin, which curbs the overfitting DPO can show when preferences in the data are nearly deterministic. Kahneman-Tversky Optimisation (KTO) extends DPO to non-paired data, allowing alignment from binary good/bad labels rather than pairwise comparisons. Contrastive Preference Optimisation (CPO) removes the need for a reference model entirely. Robust DPO (rDPO) introduces noise tolerance for imperfect preference annotations. Group Relative Policy Optimisation (GRPO), popularised by DeepSeek's training methodology, is often discussed alongside these methods but is an online reinforcement learning algorithm rather than a DPO variant: it samples a group of responses for each prompt, scores them, and uses each response's reward relative to the rest of the group as its advantage, which removes the value model that PPO requires. It has shown strong results on reasoning tasks.
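Two of the ideas above are compact enough to sketch in plain Python: the squared-error loss used by IPO, and the group-relative scoring at the heart of GRPO. Both functions are illustrative assumptions rather than reference implementations. The names `ipo_loss` and `group_relative_advantages`, the default `tau=0.1`, the small `eps` constant, and the use of the population standard deviation are choices made for this example only.

```python
import statistics


def ipo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, tau=0.1):
    """IPO loss for one preference pair.

    Inputs are summed response log-probabilities under the policy and the
    reference model. Instead of pushing the margin towards infinity, as a
    log-sigmoid loss can, IPO regresses it onto the finite target 1 / (2 * tau).
    """
    margin = (policy_chosen - ref_chosen) - (policy_rejected - ref_rejected)
    return (margin - 1.0 / (2.0 * tau)) ** 2


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages for a group of responses to one prompt.

    Each reward is standardised against the group's mean and standard
    deviation, so no learned value function is needed as a baseline.
    `eps` (an assumed constant) guards against division by zero when
    every response in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

With `tau=0.1`, `ipo_loss` reaches zero once the margin equals 5 and grows again if the policy overshoots, which is the regularising effect IPO is designed to provide. `group_relative_advantages` gives above-average responses a positive advantage and below-average ones a negative advantage; a group whose responses all receive the same reward contributes no learning signal.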

Comparison with RLHF

DPO and RLHF represent complementary trade-offs. DPO is simpler, faster, and more stable, making it the preferred choice for most practical alignment applications where a paired preference dataset is available. RLHF remains relevant for online learning scenarios where the preference signal is generated dynamically as the model improves, or where the reward function needs to evolve based on model outputs. Hybrid approaches that combine DPO for initial alignment with online RL refinement are an active area of research.

Malaysian Context — LLM Alignment for Local Language and Regulatory Contexts

The adoption of DPO in Malaysian AI development is primarily relevant to organisations fine-tuning open-weight models for local use cases. Malaysian financial institutions such as Maybank and CIMB, and telecommunications providers such as Maxis and CelcomDigi, have explored building customer-facing AI assistants that must be aligned with both company policy and local regulatory expectations. DPO provides a practical and computationally efficient route to align general-purpose base models with Malaysian-specific behavioural requirements without the engineering complexity of full RLHF pipelines.

Alignment is also relevant to the Malaysian AI Governance Framework published by MDEC and the broader MyDIGITAL blueprint, both of which emphasise responsible and trustworthy AI deployment. Guidance from Bank Negara Malaysia (BNM) on the use of AI in financial services, including requirements around transparency, fairness, and the avoidance of harmful outputs, creates a regulatory environment in which alignment techniques like DPO are directly applicable tools for compliance.

Malaysian AI researchers at Universiti Malaya and other institutions who fine-tune multilingual models for Bahasa Malaysia, potentially to build aligned models for government chatbots or educational tools, benefit from DPO's accessibility: for smaller models it can be run on a single consumer-grade GPU, enabling experimentation without large GPU cluster budgets. HRD Corp-funded AI training programmes have begun incorporating alignment techniques into advanced NLP curricula, reflecting demand from Malaysian technology companies looking to deploy aligned LLMs responsibly.

Regionally, AI Singapore's SEA-LION project and similar regional model initiatives face the same challenge of aligning models to Southeast Asian cultural and linguistic norms, and DPO-based approaches have been part of the alignment methodology explored for these models.

References

Rafailov, R., Sharma, A., Mitchell, E., Manning, C. D., Ermon, S., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems (NeurIPS) 2023. arXiv:2305.18290.
Tunstall, L., Beeching, E., Lambert, N., Rajani, N., Rasul, K., Belkada, Y., Huang, S., von Werra, L., Fourrier, C., Habib, N., Sarrazin, N., Sanseviero, O., Rush, A. M., & Wolf, T. (2023). Zephyr: Direct Distillation of LM Alignment. arXiv:2310.16944.
DeepSeek-AI. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. arXiv:2501.12948.
Toloka AI. (2024). Direct Preference Optimization (DPO): A Lightweight Counterpart to RLHF. Toloka Blog. https://toloka.ai/blog/direct-preference-optimization/

Tags: alignment training preference-learning fine-tuning

Abbreviation	DPO
Type	Alignment training algorithm
Proposed by	Rafailov et al., Stanford University, 2023
Alternative to	Reinforcement Learning from Human Feedback (RLHF)
Key advantage	No separate reward model; single supervised training loop
Related	RLHF, Fine-tuning, Constitutional AI, AI Alignment
