
Proximal Policy Optimization

A reinforcement learning algorithm developed by OpenAI that stabilises policy gradient training by constraining the size of policy updates, widely used for fine-tuning large language models through RLHF.

7 min read | Last updated June 2026 | Foundations

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed by John Schulman and colleagues at OpenAI and introduced in a 2017 paper. PPO belongs to the family of policy gradient methods, which directly optimise the policy — the function that maps observations to actions — rather than estimating a value function and deriving a policy from it. The central innovation of PPO is a clipped objective function that limits how much the policy can change in a single training update, providing stability without the computational overhead of earlier methods such as Trust Region Policy Optimization (TRPO).

PPO became the standard algorithm for fine-tuning large language models using Reinforcement Learning from Human Feedback (RLHF), and was central to the training of models including OpenAI's InstructGPT, ChatGPT, and early versions of GPT-4. Its combination of simplicity, stability, and effectiveness has made it one of the most widely deployed reinforcement learning algorithms across robotics, game AI, and language model alignment.

Background: Policy Gradient Methods

Reinforcement learning trains an agent to maximise cumulative reward through interaction with an environment. Policy gradient methods optimise a parameterised policy directly by computing the gradient of expected reward with respect to policy parameters and taking gradient ascent steps.
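In standard notation, the gradient ascent step described above can be sketched as follows. The symbols are assumptions introduced here for illustration, not taken from this article: \theta for the policy parameters, \pi_\theta for the policy, \hat{R}_t for the return observed after time step t, and \alpha for the step size.

```latex
\nabla_\theta J(\theta)
  = \mathbb{E}_{\tau \sim \pi_\theta}\!\left[ \sum_{t} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{R}_t \right],
\qquad
\theta \leftarrow \theta + \alpha \, \nabla_\theta J(\theta)
```

The expectation is over trajectories sampled from the current policy, which is why the choice of \alpha matters so much: a large step changes the distribution the next batch is drawn from.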

A fundamental challenge in policy gradient learning is the step size problem. If the policy is updated too aggressively in a single step, the new policy may perform substantially worse than the old one, and because the data used to compute the gradient came from the old policy, the agent can enter a destructive cycle of updates. If updates are too conservative, learning is unnecessarily slow.

TRPO, PPO's predecessor, addressed this by constraining updates within a trust region defined by the KL divergence between the old and new policies. While theoretically principled, TRPO relies on second-order information about the KL constraint (in practice, Fisher-vector products solved with conjugate gradient) followed by a line search to satisfy the constraint, making it computationally expensive and difficult to implement.
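Written out, the trust-region problem has roughly the following form. The notation is an assumption made here for illustration: r_t(\theta) is the ratio of new to old action probabilities, \hat{A}_t is an advantage estimate, and \delta is the trust-region size.

```latex
\max_{\theta} \;
  \mathbb{E}_t\!\left[ r_t(\theta)\, \hat{A}_t \right]
\quad \text{subject to} \quad
  \mathbb{E}_t\!\left[ \mathrm{KL}\!\left( \pi_{\theta_{\text{old}}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t) \right) \right] \le \delta,
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
```

The explicit constraint is what requires the specialised solver; an unconstrained objective can be handed to an ordinary first-order optimiser.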

The PPO Objective

PPO achieves a similar effect through a simpler mechanism: a clipped surrogate objective. Rather than enforcing a hard constraint on KL divergence, PPO clips the probability ratio between the new and old policy to the interval [1 - epsilon, 1 + epsilon], where epsilon is a hyperparameter (typically 0.1 to 0.2).

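The probability ratio is the probability of an action under the new policy divided by its probability under the old policy. While the ratio stays inside the clipping range, the standard policy gradient update applies. Once the ratio leaves the range in the direction that would increase the objective (above 1 + epsilon for an action that turned out better than expected, or below 1 - epsilon for one that turned out worse), the clipped term becomes constant and contributes no gradient, so there is no incentive to push the policy further in that direction. Because the objective takes the minimum of the clipped and unclipped terms, changes that make the objective worse are still penalised in full. This discourages large policy changes without requiring constrained optimisation.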
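The clipping rule can be illustrated with a few lines of NumPy. This is a minimal sketch rather than a reference implementation: the function name `clipped_surrogate` and its arguments are hypothetical, log-probabilities are assumed to be precomputed, and `advantages` stands for a per-action estimate of how much better or worse than expected each action was.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Mean clipped surrogate objective (a quantity to be maximised)."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # The elementwise minimum keeps the more pessimistic of the two terms.
    return np.minimum(unclipped, clipped).mean()

# Three actions whose probabilities were 0.5 under the old policy.
logp_old = np.log(np.array([0.5, 0.5, 0.5]))
logp_new = np.log(np.array([0.5, 0.75, 0.25]))   # ratios 1.0, 1.5, 0.5
advantages = np.array([1.0, 1.0, -1.0])

objective = clipped_surrogate(logp_new, logp_old, advantages)
```

In this toy batch the second action's ratio of 1.5 is capped at 1.2, and the third action's ratio of 0.5 is treated as 0.8, so neither contributes more than the clipping range allows.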

The clipped objective is combined with a value function loss (which trains the value estimates) and an entropy bonus (which encourages exploration) into a single loss function that can be optimised with standard first-order stochastic gradient methods such as Adam.
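One way the three terms might be combined is sketched below. The function name `ppo_loss` and the coefficient values `value_coef=0.5` and `entropy_coef=0.01` are illustrative assumptions, not values given in this article; the sign convention assumes a loss that is minimised, so the surrogate and the entropy bonus enter with a negative sign.

```python
import numpy as np

def ppo_loss(logp_new, logp_old, advantages, values, returns, entropy,
             epsilon=0.2, value_coef=0.5, entropy_coef=0.01):
    """Scalar loss to minimise: -surrogate + c1 * value_loss - c2 * entropy."""
    ratio = np.exp(logp_new - logp_old)
    surrogate = np.minimum(
        ratio * advantages,
        np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages,
    ).mean()
    # Squared error between the critic's predictions and the return targets.
    value_loss = np.mean((values - returns) ** 2)
    return -surrogate + value_coef * value_loss - entropy_coef * np.mean(entropy)

logp = np.log(np.array([0.5, 0.5]))
loss = ppo_loss(
    logp_new=logp, logp_old=logp,
    advantages=np.array([1.0, -1.0]),
    values=np.array([0.0, 0.0]),
    returns=np.array([1.0, 1.0]),
    entropy=np.array([1.0, 1.0]),
)
```

In a real implementation the same expression would be written in an automatic differentiation framework so that gradients flow to the network parameters; NumPy is used here only to show the arithmetic.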

Actor-Critic Architecture

PPO is implemented as an actor-critic method. The actor is the policy network that selects actions; the critic is a value network that estimates the expected return from a given state. The critic's value estimates are used to compute advantage estimates — measures of how much better or worse an action was compared to the average expected return from that state. Advantage estimates reduce the variance of the policy gradient update, improving training stability.

Generalised Advantage Estimation (GAE), introduced by Schulman and colleagues in 2015, two years before PPO, is the standard method for computing advantages in PPO implementations. GAE balances bias and variance by exponentially weighting temporal-difference errors across multiple time steps, with a single parameter (lambda) controlling the trade-off.
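The exponential weighting of temporal-difference errors can be computed with a single backward pass. The sketch below is a simplified illustration under stated assumptions: the name `compute_gae` is hypothetical, the input is one trajectory segment with no episode boundary inside it, and `last_value` is the critic's estimate for the state following the final step.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Return (advantages, value targets) for one trajectory segment."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    next_values = np.append(values[1:], last_value)
    # One-step temporal-difference errors.
    deltas = rewards + gamma * next_values - values
    advantages = np.zeros_like(rewards)
    running = 0.0
    # Accumulate discounted TD errors from the end of the segment backwards.
    for t in reversed(range(len(rewards))):
        running = deltas[t] + gamma * lam * running
        advantages[t] = running
    returns = advantages + values
    return advantages, returns

adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0,
                       gamma=1.0, lam=1.0)
adv_td, _ = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0,
                        gamma=1.0, lam=0.0)
```

Setting `lam=1.0` recovers the full discounted return minus the value baseline (low bias, high variance), while `lam=0.0` uses only the one-step TD error (higher bias, lower variance).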

Training Process

A typical PPO training loop:

  1. The current policy (actor) interacts with the environment for a fixed number of time steps, collecting a batch of transitions.
  2. The critic estimates the value of each state in the batch.
  3. Advantage estimates are computed using GAE.
  4. Multiple epochs of minibatch gradient updates are performed on the combined loss (clipped objective, value loss, and entropy bonus).
  5. The updated policy becomes the new data-collection policy, and the process repeats with fresh environment data.

The ability to perform multiple epochs of updates on each batch of collected data is a significant advantage over earlier policy gradient methods, which require a new batch of data for every gradient step. This makes PPO considerably more sample-efficient in practice.
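The reuse of each batch across several epochs can be sketched as an index generator. This is only an illustration of the batching pattern: the name `minibatch_indices` is hypothetical, and the gradient step that would consume each set of indices is omitted.

```python
import numpy as np

def minibatch_indices(batch_size, minibatch_size, n_epochs, rng):
    """Yield index arrays: every epoch reshuffles and covers the whole batch."""
    for _ in range(n_epochs):
        perm = rng.permutation(batch_size)
        for start in range(0, batch_size, minibatch_size):
            yield perm[start:start + minibatch_size]

rng = np.random.default_rng(0)
batches = list(minibatch_indices(batch_size=8, minibatch_size=4,
                                 n_epochs=3, rng=rng))
# Each element of `batches` would select the transitions, stored
# log-probabilities and advantages used for one gradient step.
```

With these toy settings one batch of 8 transitions yields 6 gradient steps instead of 1. The stored old-policy log-probabilities stay fixed throughout, which is what makes the clipped ratio meaningful as the policy drifts during these epochs.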

Applications in Language Model Training

PPO's most prominent application in recent years has been as the optimisation algorithm in Reinforcement Learning from Human Feedback (RLHF), the technique used to align large language models with human preferences. In RLHF:

  1. A language model generates text completions for prompts.
  2. Human raters rank the completions according to quality, helpfulness, or safety.
  3. A reward model is trained to predict these human preferences.
  4. PPO fine-tunes the language model to maximise rewards from the reward model, while a KL penalty prevents the policy from deviating too far from the supervised fine-tuned reference model.

This process, combined with supervised fine-tuning on human demonstrations, produced InstructGPT (2022) and the subsequent series of ChatGPT models. PPO's stability and ability to handle large neural network policies made it well-suited to this application.
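The KL-penalised reward in step 4 of the RLHF procedure can be illustrated as follows. This is a sketch of one common formulation, not a description of any specific system: the function name `kl_shaped_rewards`, the coefficient `beta=0.1`, and the choice to add the reward model's score only at the final token are assumptions made for the example.

```python
import numpy as np

def kl_shaped_rewards(logp_policy, logp_reference, reward_score, beta=0.1):
    """Per-token rewards: a KL penalty everywhere, plus the score at the end."""
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_reference = np.asarray(logp_reference, dtype=float)
    # Penalise tokens the policy makes more likely than the reference model does.
    rewards = -beta * (logp_policy - logp_reference)
    # The reward model scores the whole completion once, at the last token.
    rewards[-1] += reward_score
    return rewards

token_rewards = kl_shaped_rewards(
    logp_policy=[-1.0, -2.0],
    logp_reference=[-1.5, -2.0],
    reward_score=1.0,
)
```

Here the first token is penalised because the policy assigns it a higher log-probability than the reference model, while the second token carries only the reward model's score. These per-token rewards would then feed into the advantage computation.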

In 2024 and 2025, alternative algorithms including Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) gained popularity as simpler alternatives to PPO for preference-based fine-tuning. GRPO, developed by DeepSeek researchers, was notably used to train DeepSeek-R1's reasoning capabilities. However, PPO remains widely used for tasks requiring a separate reward model and online data collection.
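The group-relative idea behind GRPO can be sketched in a few lines: several completions are sampled for the same prompt, and each completion's reward is normalised against the others in its group, so no learned critic is needed. The function name `group_relative_advantages` and the small constant `eps` are assumptions for illustration; this reflects the commonly described form of the method rather than any particular codebase.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Normalise rewards within one group of completions for a single prompt."""
    rewards = np.asarray(rewards, dtype=float)
    # eps guards against division by zero when all rewards are equal.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Four sampled completions for one prompt, scored 1 (correct) or 0 (incorrect).
group_adv = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat the group average receive positive advantages and the rest receive negative ones, which replaces the critic-based advantage estimates that PPO uses.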

| Algorithm | Constraint Mechanism | Complexity | Sample Efficiency |
|---|---|---|---|
| REINFORCE | None | Low | Low |
| TRPO | KL constraint (hard) | High | Moderate |
| PPO | Clipped ratio (soft) | Low | Good |
| DPO | No RL loop | Very low | Offline only |
| GRPO | Group-relative reward | Moderate | Good |

References

  1. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
  2. Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of ICML 2015.
  3. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.
  4. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
  5. DigitalOcean. (2024). Proximal Policy Optimization: Implementation and Applications. DigitalOcean Community Tutorials.