Reward Modeling
Reward modeling is the process of training a neural network to predict human or AI preferences over model outputs, providing a scalable reward signal for reinforcement learning-based alignment of language models.
Reward modeling, also called preference modeling, is the training of a neural network to predict which of two or more candidate outputs a human (or AI judge) would prefer, given a particular input prompt. The resulting model, called a reward model, assigns a scalar score to any prompt-response pair, quantifying how well the response meets desired criteria such as helpfulness, harmlessness, honesty, or task-specific accuracy. Reward models are a central component of Reinforcement Learning from Human Feedback (RLHF), where they provide the scalar reward signal that guides policy optimisation. Because the policy produces text by sampling discrete tokens, this reward is not backpropagated through generation; a reinforcement learning algorithm is used to optimise against it instead.
Reward modeling emerged as a practical solution to the challenge of specifying complex human preferences as a formal objective function. Rather than hand-coding criteria for what constitutes a good response — a task that is extremely difficult for open-ended natural language generation — reward modeling learns these criteria from comparative human judgements, which are substantially easier to collect than absolute quality ratings.
How Reward Models Are Trained
Preference Data Collection
The foundation of reward model training is a dataset of preference pairs. For each item in this dataset, a prompt is shown alongside two or more candidate responses generated by a language model. Human annotators — or an AI feedback model in the case of RLAIF — select which response is better according to a rubric. The rubric typically includes dimensions such as factual accuracy, instruction following, safety (avoidance of harmful content), and overall helpfulness.
Preference annotation is typically conducted through a comparative interface rather than by asking annotators to rate responses on an absolute scale. Comparative judgement tends to be more reliable because annotators are generally more consistent when making relative comparisons ("A is better than B") than when assigning absolute scores ("A is 7.4/10").
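As an illustration of the preference pairs described above, the following is a minimal sketch of how one comparison might be stored. The class name, field names, and the `from_comparison` helper are hypothetical and do not correspond to any particular dataset's schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One comparison record (hypothetical schema, for illustration only)."""
    prompt: str
    chosen: str               # response the annotator preferred
    rejected: str             # response the annotator did not prefer
    annotator: str = "human"  # or "ai" for RLAIF-style labels


def from_comparison(prompt, response_a, response_b, a_is_better, annotator="human"):
    """Turn a raw A/B judgement into a (chosen, rejected) record."""
    if a_is_better:
        return PreferencePair(prompt, response_a, response_b, annotator)
    return PreferencePair(prompt, response_b, response_a, annotator)


pair = from_comparison(
    prompt="Explain what a reward model is.",
    response_a="A reward model scores responses to a prompt.",
    response_b="I don't know.",
    a_is_better=True,
)
```

Storing the outcome as chosen/rejected rather than as "A or B" keeps the record independent of the order in which responses were shown to the annotator.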
Model Architecture and Training Objective
A reward model is initialised from a pre-trained or instruction-tuned language model, with the language-modeling head (the layer that predicts the next token) replaced by a linear layer that outputs a single scalar value. Given a sequence of tokens representing a prompt concatenated with a response, the reward model produces one score for the whole sequence, interpreted as a prediction of the response's quality.
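The scalar head can be sketched in a few lines of plain Python. This is an illustrative sketch only: it assumes the backbone (not shown) has already produced one hidden vector per token, and it assumes the score is read from the final token's hidden state, which is one common pooling choice rather than something the description above prescribes. The function names are hypothetical.

```python
def scalar_head(hidden_vector, weights, bias=0.0):
    """Linear layer mapping one hidden vector to a single scalar score."""
    if len(hidden_vector) != len(weights):
        raise ValueError("hidden vector and weight vector must have the same length")
    return sum(h * w for h, w in zip(hidden_vector, weights)) + bias


def reward(hidden_states, weights, bias=0.0):
    """Score a prompt+response sequence from its final token's hidden state.

    hidden_states: list of per-token vectors from a (hypothetical) backbone.
    Using the last token is an assumption made for this sketch.
    """
    return scalar_head(hidden_states[-1], weights, bias)


# Toy two-token sequence with two-dimensional hidden states.
hidden_states = [[0.1, 0.2], [0.5, -1.0]]
weights = [2.0, 1.0]
score = reward(hidden_states, weights, bias=0.5)
```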
Training uses a pairwise ranking loss. For each preference pair (prompt, chosen response, rejected response), the loss function penalises the model when it scores the rejected response higher than the chosen response. The standard formulation uses the Bradley-Terry model of pairwise preferences, optimised with a log-sigmoid loss:
loss = -log(sigmoid(score(chosen) - score(rejected)))
Minimising this loss causes the reward model to assign systematically higher scores to preferred responses.
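The loss above can be written out directly. The following is a minimal sketch in plain Python, assuming the two scores have already been computed by the reward model; the function names are illustrative. Under the Bradley-Terry model, sigmoid(score(chosen) - score(rejected)) is the predicted probability that the chosen response is preferred, so the loss is the negative log-likelihood of the observed preference.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    # Branch on the sign so exp() is only ever called on a non-positive value.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_loss(score_chosen, score_rejected):
    """-log(sigmoid(score(chosen) - score(rejected))) for one preference pair."""
    return -log_sigmoid(score_chosen - score_rejected)


def batch_loss(score_pairs):
    """Mean pairwise loss over a list of (score_chosen, score_rejected) tuples."""
    return sum(pairwise_loss(c, r) for c, r in score_pairs) / len(score_pairs)
```

When the two scores are equal the loss is log(2), about 0.693; it shrinks towards zero as the chosen response's margin over the rejected one grows, and grows roughly linearly when the ranking is inverted.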
Scaling and Data Considerations
Reward model performance generally improves with the size of the base model and with the quantity and quality of preference data. Reward models trained on diverse prompts and responses tend to generalise better than those trained on narrow distributions. Annotation quality is critical: noisy or inconsistent preference labels push the model toward fitting annotator noise rather than the underlying preference, which hurts generalisation.
Ensemble reward models (multiple reward models trained with different random seeds or on different data subsets) can provide more stable and better-calibrated reward estimates than a single model. Ensembling also enables uncertainty estimation: high disagreement among ensemble members signals that the reward model is uncertain about a given output.
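A minimal sketch of ensemble scoring, assuming each ensemble member has already scored the same prompt-response pair. The standard deviation is used here as the disagreement measure and the threshold value is arbitrary; both are assumptions made for illustration.

```python
from statistics import mean, pstdev


def ensemble_reward(member_scores):
    """Return (mean score, disagreement) across ensemble members.

    Disagreement is the population standard deviation of the members' scores.
    """
    return mean(member_scores), pstdev(member_scores)


def is_uncertain(member_scores, threshold=0.5):
    """Flag an output whose ensemble disagreement exceeds a chosen threshold."""
    _, disagreement = ensemble_reward(member_scores)
    return disagreement > threshold


agreeing = [1.0, 1.1, 0.9]      # members roughly agree
disagreeing = [2.0, -1.0, 0.5]  # members disagree strongly
```

With these toy values, `is_uncertain(agreeing)` is false and `is_uncertain(disagreeing)` is true; outputs flagged this way could, for example, be routed to human review or have their reward down-weighted.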
Role in RLHF
In the full RLHF pipeline, the reward model is used as a surrogate for human judgement during policy optimisation. The policy model (the language model being aligned) generates responses to prompts, the reward model scores each response, and a reinforcement learning algorithm (typically Proximal Policy Optimisation, PPO) updates the policy to increase average reward.
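The generate-score-update loop can be illustrated with a deliberately tiny example. Note that this sketch does not implement PPO: it swaps in an exact policy-gradient step on a softmax policy over three canned responses, with fixed made-up numbers standing in for reward-model scores. It is meant only to show the direction of the update, in which the policy shifts probability toward higher-scoring responses.

```python
import math


def softmax(logits):
    """Convert logits to a probability distribution."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def policy_gradient_step(logits, rewards, lr=0.5):
    """One exact gradient-ascent step on the expected reward.

    For a softmax policy, d E[r] / d logit_j = p_j * (r_j - E[r]).
    """
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [l + lr * p * (r - expected) for l, p, r in zip(logits, probs, rewards)]


# Hypothetical reward-model scores for three candidate responses.
rewards = [0.2, 1.5, -0.7]
logits = [0.0, 0.0, 0.0]  # policy starts uniform

before = softmax(logits)
for _ in range(50):
    logits = policy_gradient_step(logits, rewards)
after = softmax(logits)
```

After the updates, the probability assigned to the second response (the one with the highest score) is larger than it was under the initial uniform policy. A real pipeline samples responses from a language model and estimates the gradient from those samples rather than enumerating every candidate.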
Because the reward model is a proxy rather than the true human preference function, there is a risk of reward hacking — the policy discovering high-scoring responses that satisfy the reward model's learned criteria without satisfying the underlying human intent. This is analogous to Goodhart's Law: when a measure becomes a target, it ceases to be a good measure. KL divergence penalties that keep the aligned policy close to the reference SFT model help mitigate reward hacking but do not eliminate it entirely.
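The KL penalty is often applied by subtracting a scaled log-probability ratio from the reward-model score. The sketch below assumes sequence-level log-probabilities of the sampled response under the policy and under the reference SFT model are available; the function name and the coefficient `beta=0.1` are illustrative choices, not values taken from any specific system.

```python
def kl_shaped_reward(rm_score, logprob_policy, logprob_reference, beta=0.1):
    """Reward-model score minus a KL penalty estimate for one sampled response.

    log pi(y|x) - log pi_ref(y|x) is a single-sample estimate of the KL
    divergence between the policy and the reference model.
    """
    kl_estimate = logprob_policy - logprob_reference
    return rm_score - beta * kl_estimate


# The policy finds this response much more likely than the reference does,
# so part of its reward-model score is taken back by the penalty.
shaped = kl_shaped_reward(rm_score=2.0, logprob_policy=-5.0, logprob_reference=-9.0)
```

A larger `beta` keeps the policy closer to the reference model at the cost of slower reward improvement; a response the reference model already finds equally likely incurs no penalty.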
Process and Outcome Reward Models
Outcome reward models (ORMs) score complete responses — judging the final answer to a question or the last turn of a dialogue without examining the reasoning process. ORMs are simpler to train but can be gamed by responses that happen to produce correct final answers through flawed reasoning.
Process reward models (PRMs) score individual steps in a chain-of-thought reasoning sequence, rewarding each intermediate step that is logically correct and penalising steps that contain errors. PRMs are harder to train because they require step-level annotations, but they tend to produce better results on tasks requiring multi-step reasoning, such as mathematics, coding, and scientific question answering. OpenAI's "Let's Verify Step by Step" paper (Lightman et al., 2023) found that a PRM trained with step-level feedback selected correct solutions to competition mathematics problems substantially more reliably than a model trained with outcome supervision alone.
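To make the contrast concrete, the sketch below shows how per-step PRM outputs might be combined into a solution-level score. It assumes the PRM emits, for each step, a probability that the step is correct; aggregating by product or by minimum are two common choices, and the function names and threshold are hypothetical.

```python
import math


def prm_solution_score(step_probs, how="product"):
    """Aggregate per-step correctness probabilities into one solution score."""
    if not step_probs:
        raise ValueError("need at least one step")
    if how == "product":
        return math.prod(step_probs)  # probability that every step is correct
    if how == "min":
        return min(step_probs)        # score of the weakest step
    raise ValueError(f"unknown aggregation: {how}")


def first_suspect_step(step_probs, threshold=0.5):
    """Index of the first step scored below the threshold, or None."""
    for index, prob in enumerate(step_probs):
        if prob < threshold:
            return index
    return None


# A four-step solution whose third step looks wrong to the (hypothetical) PRM.
steps = [0.9, 0.8, 0.2, 0.95]
```

Here the weak third step drags the solution score down under either aggregation, and `first_suspect_step(steps)` returns index 2. An ORM looking only at the final answer would have no way to localise the error in this way.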
Evaluation: RewardBench
RewardBench, released in 2024 by Lambert et al. and subsequently published at NAACL 2025, is a widely used benchmark for evaluating reward model quality. It tests reward models on a diverse set of prompt categories (chat, reasoning, safety) using curated preference pairs: a reward model is credited when it scores the chosen response above the rejected one. RewardBench 2, a follow-up released in 2025, is a more challenging successor built on new prompts and additional evaluation categories.
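The core metric for this style of evaluation is pairwise accuracy: the fraction of preference pairs in which the reward model scores the chosen response above the rejected one. The following is a minimal sketch of that idea, not the benchmark's actual scoring code; the record layout, the category labels, and the treatment of ties as incorrect are assumptions made for illustration.

```python
def pairwise_accuracy(scored_pairs):
    """Fraction of (chosen_score, rejected_score) pairs ranked correctly.

    A tie is counted as incorrect in this sketch.
    """
    if not scored_pairs:
        raise ValueError("need at least one pair")
    correct = sum(1 for chosen, rejected in scored_pairs if chosen > rejected)
    return correct / len(scored_pairs)


def accuracy_by_category(records):
    """Group (category, chosen_score, rejected_score) records and score each group."""
    grouped = {}
    for category, chosen, rejected in records:
        grouped.setdefault(category, []).append((chosen, rejected))
    return {category: pairwise_accuracy(pairs) for category, pairs in grouped.items()}


# Made-up reward-model scores for four preference pairs.
records = [
    ("chat", 1.2, 0.3),
    ("chat", 0.1, 0.4),
    ("safety", 2.0, -1.0),
    ("reasoning", 0.5, 0.5),
]
results = accuracy_by_category(records)
```

Reporting accuracy per category, rather than a single pooled number, makes it visible when a reward model is strong on chat but weak on safety or reasoning.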
References
- Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. NeurIPS 2022.
- Lightman, H., et al. (2023). Let's Verify Step by Step. arXiv:2305.20050. OpenAI.
- Lambert, N., et al. (2024). RewardBench: Evaluating Reward Models for Language Modeling. arXiv:2403.13787. Published at NAACL 2025.
- Bai, Y., et al. (2022). Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. arXiv:2204.05862. Anthropic.
- Zheng, L., et al. (2023). Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. NeurIPS 2023.