Reinforcement Learning from Human Feedback

A machine learning technique that trains a reward model from human preference data and uses it to align large language models with human values, safety requirements, and intended behaviour through reinforcement learning.

7 min read · Last updated May 2026 · Foundations

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that incorporates human evaluator preferences into the learning process of a machine learning model. Rather than optimising against a fixed, programmatically defined reward signal, RLHF trains a reward model to represent human preferences, typically collected by asking annotators to rank or compare model outputs, and then uses that reward model as the optimisation target for a reinforcement learning algorithm. The technique has become the dominant method for aligning large language models (LLMs) with intended human values, producing models that are more helpful, less harmful, and better at following instructions than those trained on text prediction alone.[^1]

Motivation

A language model pre-trained on a large text corpus learns to predict the next token in a sequence based on statistical patterns in that corpus. This objective does not inherently align with human goals: a model optimised purely for next-token prediction may generate text that is grammatically fluent and topically coherent but factually wrong, harmful, or inconsistent with a user's actual intent.

Early alignment efforts relied on manually curated datasets and supervised fine-tuning on examples of ideal behaviour, but this approach does not scale to the full diversity of human preferences, values, and situations. RLHF addresses this by learning a continuous reward signal from human comparisons, enabling the model to generalise alignment preferences to unseen inputs.

How RLHF Works

RLHF involves three sequential stages.

Supervised Fine-Tuning (SFT) is the first stage. A base pre-trained language model is fine-tuned on a curated dataset of prompts paired with high-quality human-written responses. This establishes a sensible starting policy before reinforcement learning is applied.
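As a rough illustration of the SFT objective, the sketch below computes the average negative log-likelihood over response tokens only. Masking out the prompt tokens is a common convention assumed here, not something the description above specifies, and the function name and inputs are hypothetical.

```python
def sft_loss(token_logprobs, is_response_token):
    """Mean negative log-likelihood over response tokens.

    token_logprobs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the human-written
    response, False where it belongs to the prompt (assumed to be masked out).
    """
    if len(token_logprobs) != len(is_response_token):
        raise ValueError("inputs must have the same length")
    # Keep only the response positions and flip the sign to get a loss.
    response_nll = [-lp for lp, keep in zip(token_logprobs, is_response_token) if keep]
    if not response_nll:
        raise ValueError("no response tokens to train on")
    return sum(response_nll) / len(response_nll)
```

Lowering this value means the model assigns higher probability to the human-written response, which is what gives the later RL stage a sensible starting policy.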

Reward Model Training is the second stage. Human annotators are shown multiple model-generated responses to the same prompt and asked to rank or compare them according to their preference — for helpfulness, honesty, harmlessness, or task completion. These preference pairs are used to train a separate reward model that learns to predict which responses humans prefer. The reward model is typically initialised from the SFT model with a regression head replacing the language modelling head.
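One common way to turn preference pairs into a training signal is a Bradley-Terry style pairwise loss, which is the form used in the InstructGPT paper. The sketch below assumes the reward model has already produced a scalar score for each response; the function names are illustrative only.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """-log(sigmoid(score_chosen - score_rejected)) for one preference pair."""
    margin = score_chosen - score_rejected
    # softplus(-margin) equals -log(sigmoid(margin)); this form avoids overflow
    # when the margin is large in either direction.
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def mean_preference_loss(pairs):
    """Average the pairwise loss over (score_chosen, score_rejected) tuples."""
    losses = [pairwise_preference_loss(c, r) for c, r in pairs]
    return sum(losses) / len(losses)
```

The loss is small when the reward model scores the human-preferred response well above the rejected one, and grows roughly linearly when it ranks them the wrong way round.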

Reinforcement Learning Optimisation is the third stage. The SFT model serves as the initial policy and generates responses to prompts. The reward model scores each generated response, and an RL algorithm, most commonly Proximal Policy Optimisation (PPO), updates the policy's parameters to produce responses that receive higher reward scores. A Kullback–Leibler (KL) divergence penalty against the original SFT model is typically applied so that the policy does not drift away from coherent language generation while chasing high reward scores, a failure mode known as reward hacking.[^2]
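The KL penalty can be sketched as a correction to the reward model's score. The version below is a minimal illustration that applies the penalty once per sequence; real implementations often apply it per token, and the coefficient `beta=0.1` is an arbitrary illustrative value, not one taken from any cited paper.

```python
def kl_shaped_reward(reward_score, policy_logprobs, reference_logprobs, beta=0.1):
    """Reward model score minus a KL-style penalty against the SFT reference.

    policy_logprobs / reference_logprobs: log-probabilities that the current
    policy and the frozen SFT model assign to each token of the sampled response.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("log-probability lists must have the same length")
    # Sum of per-token log-ratios: a single-sample estimate of the KL divergence.
    kl_estimate = sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))
    return reward_score - beta * kl_estimate
```

When the policy assigns the same probabilities as the SFT model the penalty is zero; the further the policy moves towards responses the SFT model considers unlikely, the more reward it gives up.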

InstructGPT and GPT-4

The technique gained widespread attention through OpenAI's 2022 paper on InstructGPT, which demonstrated that a 1.3 billion parameter model fine-tuned with RLHF was preferred by human evaluators over a 175 billion parameter GPT-3 model on instruction-following tasks. This result showed that alignment quality, not just scale, determines perceived model quality.[^3]

GPT-4 and subsequent OpenAI models, as well as Anthropic's Claude series, the chat-tuned models in Meta's Llama series, and Google's Gemini models, all incorporate RLHF or closely related alignment techniques as a core training stage.

Variants and Successors

Several variants and extensions of RLHF have been developed to address its computational expense, instability, and dependence on large volumes of human preference data.

Direct Preference Optimisation (DPO), introduced in 2023, reformulates the RLHF objective to train the policy directly on preference data without a separate reward model or online RL training loop. DPO is simpler to implement and more stable than PPO-based RLHF, and has been adopted widely in open-source model fine-tuning.[^4]
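The DPO loss for a single preference pair can be written in a few lines. This sketch assumes the summed log-probabilities of each response under the trainable policy and under a frozen reference model are already available; the argument names and the default `beta=0.1` are illustrative choices.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """-log(sigmoid(beta * (chosen log-ratio - rejected log-ratio)))."""
    # How much more likely each response is under the policy than the reference.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable softplus(-logits), equal to -log(sigmoid(logits)).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

No reward model appears anywhere in the computation: the log-ratio between policy and reference plays the role of an implicit reward, which is why DPO can be trained offline on a fixed preference dataset.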

Constitutional AI (CAI), developed by Anthropic, replaces human preference data in part with a set of principles (a "constitution") that the model uses to critique and revise its own outputs. A reward model is then trained on these AI-generated preference comparisons, reducing the volume of human annotation required.

RLAIF (Reinforcement Learning from AI Feedback) extends the principle of CAI by using a capable AI model to generate preference labels at scale, enabling alignment on a much larger pool of comparisons than human annotators can produce.
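The labelling step in RLAIF can be sketched as follows. Everything here is hypothetical: `judge_score` stands in for a call to a capable AI model, and `toy_judge` is a trivial placeholder used only so the example runs end to end.

```python
from typing import Callable, Dict, List

def label_preferences(prompt: str, responses: List[str],
                      judge_score: Callable[[str, str], float]) -> Dict[str, str]:
    """Build one preference record by letting an AI judge pick best and worst."""
    if len(responses) < 2:
        raise ValueError("need at least two responses to compare")
    # Highest-scoring response first, lowest-scoring last.
    ranked = sorted(responses, key=lambda r: judge_score(prompt, r), reverse=True)
    return {"prompt": prompt, "chosen": ranked[0], "rejected": ranked[-1]}

def toy_judge(prompt: str, response: str) -> float:
    """Placeholder judge: prefers answers that give a reason."""
    return 1.0 if "because" in response else 0.0

record = label_preferences(
    "Why is the sky blue?",
    ["It just is.", "It looks blue because air scatters short wavelengths more."],
    toy_judge,
)
```

The resulting record has the same chosen/rejected shape as a human-labelled comparison, so it can feed the same reward model or preference-optimisation training.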

| Method | Reward source | Training complexity | Adoption |
|--------|---------------|---------------------|----------|
| RLHF (PPO) | Human annotators | High | GPT-4, Gemini |
| DPO | Human preferences (offline) | Low | Llama 3, Mistral |
| Constitutional AI | AI + principles | Medium | Claude |
| RLAIF | AI feedback | Medium | Gemini |

Challenges

RLHF is computationally expensive, requiring multiple model copies (typically the policy, a frozen reference model, the reward model, and a value model) to be held in memory simultaneously during PPO training. Reward hacking, where the policy exploits weaknesses in the reward model rather than genuinely improving, remains a persistent challenge. Human annotator disagreement introduces noise in preference datasets, particularly on value-laden questions where preferences vary across cultures and individuals. Scaling human annotation to the volume required for frontier models is costly and logistically complex.
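Annotator disagreement can be quantified before a preference dataset is used for training. The sketch below computes a simple pairwise agreement rate; it is one of several possible measures, and the function name and label format are assumptions made for illustration.

```python
from itertools import combinations

def pairwise_agreement(labels_per_item):
    """Fraction of annotator pairs that chose the same response.

    labels_per_item: one list per comparison, holding each annotator's
    preferred option (for example "A" or "B").
    """
    agree = 0
    total = 0
    for labels in labels_per_item:
        # Compare every pair of annotators who labelled this item.
        for first, second in combinations(labels, 2):
            agree += first == second
            total += 1
    if total == 0:
        raise ValueError("need at least one item with two or more labels")
    return agree / total
```

A value well below 1.0 on a subset of prompts suggests those comparisons are value-laden or ambiguous, and that the reward model trained on them will inherit the noise.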

Malaysian Context — RLHF and AI Alignment Considerations

The principles underlying RLHF — that AI systems should behave in accordance with human preferences and societal values — are central to Malaysia's approach to AI governance. The Malaysia AI Governance Framework (2020, updated 2024) and the MyDigital Blueprint both emphasise responsible AI development, including requirements for AI systems deployed in public-facing services to be aligned, explainable, and auditable.

Bank Negara Malaysia's Risk Management in Technology (RMiT) guidelines require financial institutions to ensure that AI systems produce consistent and appropriate outputs. In practice, this has led Malaysian banks such as Maybank, CIMB, and Hong Leong Bank to adopt models fine-tuned with RLHF or DPO for customer service applications, prioritising helpfulness and regulatory compliance in the alignment process.

Malaysia's position as a significant consumer of frontier AI models rather than a developer of them means that RLHF's direct practice is concentrated in a small number of organisations. CelcomDigi, Telekom Malaysia, and AI-focused startups operating within the Malaysia Digital Hub at Cyberjaya have experimented with open-source models such as Llama and Mistral, applying DPO-based alignment techniques to customise model behaviour for Malay-language contexts and Malaysian regulatory environments.

The cultural alignment dimension of RLHF is particularly relevant in Malaysia's multilingual, multireligious society. Preference data collected from annotators in the United States or Europe may not reflect Malaysian cultural norms, religious sensitivities (including considerations related to Islam as the official religion), or linguistic preferences across Bahasa Malaysia, Mandarin Chinese, and Tamil. MDEC and NAIO have identified the development of culturally representative preference data and locally aligned AI models as a priority for Malaysia's national AI strategy.

The Universiti Malaya Centre of AI Technology (UMCAI) and UTM's AI research group have published work on adapting alignment techniques for low-resource Malay-language settings, contributing to the broader ASEAN AI research community.

References

[^1]: Lambert, N. (2024). Reinforcement Learning from Human Feedback. RLHF Book. https://rlhfbook.com/
[^2]: Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.
[^3]: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
[^4]: Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36.

Tags: RLHF, alignment, reinforcement learning, reward model, AI safety

Abbreviation	RLHF
Type	AI alignment and training technique
Popularised by	OpenAI (InstructGPT, 2022)
Key use	LLM alignment, helpfulness, safety, instruction following
Related	Constitutional AI, DPO, reward model, AI safety