
Reinforcement Learning from Human Feedback

A machine learning technique that trains a reward model from human preference data and uses it to align large language models with human values, safety requirements, and intended behaviour through reinforcement learning.

Last updated May 2026 · Foundations

Reinforcement Learning from Human Feedback (RLHF) is a training methodology that incorporates human evaluator preferences into the learning process of a machine learning model. Rather than optimising against a fixed, programmatically defined reward signal, RLHF trains a reward model to represent human preferences, typically collected by asking annotators to rank or compare model outputs, and then uses that reward model as the optimisation target for a reinforcement learning algorithm. The technique has become the dominant method for aligning large language models (LLMs) with intended human values, producing models that are more helpful, less harmful, and better at following instructions than those trained on text prediction alone.[^1]

Motivation

A language model pre-trained on a large text corpus learns to predict the next token in a sequence based on statistical patterns in that corpus. This objective does not inherently align with human goals: a model optimised purely for next-token prediction may generate text that is grammatically fluent and topically coherent but factually wrong, harmful, or inconsistent with a user's actual intent.

Early alignment efforts relied on manually curated datasets and supervised fine-tuning on examples of ideal behaviour, but this approach does not scale to the full diversity of human preferences, values, and situations. RLHF addresses this by learning a continuous reward signal from human comparisons, enabling the model to generalise alignment preferences to unseen inputs.

How RLHF Works

RLHF involves three sequential stages.

Supervised Fine-Tuning (SFT) is the first stage. A base pre-trained language model is fine-tuned on a curated dataset of prompts paired with high-quality human-written responses. This establishes a sensible starting policy before reinforcement learning is applied.
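As a rough illustration of the SFT objective, the sketch below computes a mean negative log-likelihood over the response tokens of one prompt-response pair. Masking out the prompt tokens is a common convention rather than something the description above specifies, and the function name `sft_loss` and the example numbers are hypothetical.

```python
import math

def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model assigned to each target token.
    is_response_token: True where the token belongs to the human-written
    response, False for prompt tokens (assumed to be excluded from the loss).
    """
    if len(token_log_probs) != len(is_response_token):
        raise ValueError("log-probs and mask must have the same length")
    picked = [lp for lp, keep in zip(token_log_probs, is_response_token) if keep]
    if not picked:
        raise ValueError("no response tokens to train on")
    return -sum(picked) / len(picked)

# Hypothetical per-token probabilities: two prompt tokens, two response tokens.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, False, True, True]
loss = sft_loss(log_probs, mask)
```

Only the last two tokens contribute here, so the loss depends on how well the model predicts the response, not the prompt.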

Reward Model Training is the second stage. Human annotators are shown multiple model-generated responses to the same prompt and asked to rank or compare them according to their preference — for helpfulness, honesty, harmlessness, or task completion. These preference pairs are used to train a separate reward model that learns to predict which responses humans prefer. The reward model is typically initialised from the SFT model with a regression head replacing the language modelling head.
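A common way to train on preference pairs is a Bradley-Terry style pairwise loss, which penalises the reward model whenever it scores the rejected response above the chosen one. The text above does not name a specific loss, so the following is a minimal sketch under that assumption; `pairwise_preference_loss` is a hypothetical name and the rewards are plain floats standing in for reward model outputs.

```python
import math

def pairwise_preference_loss(reward_chosen, reward_rejected):
    """Return -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the chosen response is scored well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Reward model agrees with the annotator (chosen scored higher).
agree = pairwise_preference_loss(2.0, 0.0)
# Reward model disagrees with the annotator (rejected scored higher).
disagree = pairwise_preference_loss(0.0, 2.0)
```

In practice this quantity would be averaged over a batch of comparisons and minimised by gradient descent on the reward model's parameters.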

Reinforcement Learning Optimisation is the third stage. The SFT model is used as the initial policy, and an RL algorithm, most commonly Proximal Policy Optimisation (PPO), updates it. The policy generates responses to prompts, the reward model scores each generated response, and the RL algorithm adjusts the policy's parameters so that it produces responses receiving higher reward scores. A Kullback–Leibler (KL) divergence penalty against the original SFT model is typically applied to keep the policy from drifting too far from coherent language generation in pursuit of high reward scores. Without this constraint the policy tends to exploit flaws in the reward model, a problem known as reward hacking.[^2]
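The KL penalty is often folded into the reward that the RL algorithm sees. The sketch below shows one such formulation, assuming a sequence-level penalty estimated from the log-probability ratio of the sampled response. The coefficient name `beta`, its default value, and the function name are illustrative assumptions, not values taken from the sources cited here.

```python
def kl_shaped_reward(reward_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward model score minus a KL-style penalty against the SFT reference.

    policy_log_probs / reference_log_probs: per-token log-probabilities of the
    same sampled response under the current policy and the frozen SFT model.
    beta: penalty strength (hypothetical default).
    """
    if len(policy_log_probs) != len(reference_log_probs):
        raise ValueError("both models must score the same tokens")
    # Summed log-ratio is a single-sample estimate of the KL divergence.
    log_ratio = sum(p - r for p, r in zip(policy_log_probs, reference_log_probs))
    return reward_score - beta * log_ratio

# The policy assigns higher probability to its own sample than the reference
# does, so part of the reward model score is taken back as a penalty.
shaped = kl_shaped_reward(2.0, [-0.5, -1.0], [-1.0, -1.5], beta=0.1)
```

A larger `beta` keeps the policy closer to the SFT model at the cost of slower reward improvement; a smaller one allows more drift.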

InstructGPT and GPT-4

The technique gained widespread attention through OpenAI's 2022 paper on InstructGPT, which reported that outputs from a 1.3 billion parameter model fine-tuned with RLHF were preferred by human evaluators over those of the 175 billion parameter GPT-3 model on instruction-following tasks. This result suggested that alignment quality, not scale alone, strongly shapes how useful a model is judged to be.[^3]

GPT-4 and subsequent OpenAI models, as well as Anthropic's Claude series, Meta's Llama series, and Google's Gemini models, all incorporate RLHF or closely related alignment techniques as a core training stage.

Variants and Successors

Several variants and extensions of RLHF have been developed to address its computational expense, instability, and dependence on large volumes of human preference data.

Direct Preference Optimisation (DPO), introduced in 2023, reformulates the RLHF objective so that the policy is trained directly on preference data, without a separate reward model or an online RL training loop. DPO is simpler to implement and is generally reported to be more stable than PPO-based RLHF, and it has been adopted widely in open-source model fine-tuning.[^4]
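For a single preference pair, the DPO loss from Rafailov et al. compares how much the policy has shifted probability towards the chosen response, relative to a frozen reference model, against the same shift for the rejected response. The following is a minimal sketch working on summed sequence log-probabilities; the function names and the default `beta` are assumptions made for illustration.

```python
import math

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Each argument is the summed log-probability of a full response under
    either the trainable policy or the frozen reference model.
    """
    chosen_shift = policy_chosen_lp - ref_chosen_lp
    rejected_shift = policy_rejected_lp - ref_rejected_lp
    return -log_sigmoid(beta * (chosen_shift - rejected_shift))

# Policy identical to the reference: no preference has been learned yet.
untrained = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Policy has moved probability towards the chosen response.
improved = dpo_loss(-8.0, -14.0, -10.0, -12.0)
```

The difference of log-ratios plays the role that an explicit reward model plays in PPO-based RLHF, which is why no separate reward model is trained.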

Constitutional AI (CAI), developed by Anthropic, replaces human preference data in part with a set of principles (a "constitution") that the model uses to critique and revise its own outputs. A reward model is then trained on these AI-generated preference comparisons, reducing the volume of human annotation required.

RLAIF (Reinforcement Learning from AI Feedback) extends the principle of CAI by using a capable AI model to generate preference labels at scale, enabling alignment on a much larger pool of comparisons than human annotators can produce.

| Method | Reward source | Training complexity | Adoption |
|--------|---------------|---------------------|----------|
| RLHF (PPO) | Human annotators | High | GPT-4, Gemini |
| DPO | Human preferences (offline) | Low | Llama 3, Mistral |
| Constitutional AI | AI + principles | Medium | Claude |
| RLAIF | AI feedback | Medium | Gemini |

Challenges

RLHF is computationally expensive: PPO training typically requires several models to be held in memory at once, namely the policy, the frozen reference model, the reward model and, usually, a value model. Reward hacking, where the policy exploits weaknesses in the reward model rather than genuinely improving, remains a persistent challenge. Human annotator disagreement introduces noise into preference datasets, particularly on value-laden questions where preferences vary across cultures and individuals. Scaling human annotation to the volume required for frontier models is costly and logistically complex.

References

  1. Lambert, N. (2024). Reinforcement Learning from Human Feedback. RLHF Book. https://rlhfbook.com/
  2. Ziegler, D. M., Stiennon, N., Wu, J., Brown, T. B., Radford, A., Amodei, D., Christiano, P., & Irving, G. (2019). Fine-Tuning Language Models from Human Preferences. arXiv:1909.08593.
  3. Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., Schulman, J., Hilton, J., Kelton, F., Miller, L., Simens, M., Askell, A., Welinder, P., Christiano, P., Leike, J., & Lowe, R. (2022). Training language models to follow instructions with human feedback. arXiv:2203.02155.
  4. Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems, 36.