What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Who maintains AIWiki Malaysia?

AIWiki Malaysia is maintained by AITG Sdn Bhd (Registration: 202601016521 (1678618-W)), an AI company headquartered in George Town, Penang, Malaysia. The editorial team continuously updates and expands the knowledge base.

What topics does AIWiki Malaysia cover?

AIWiki Malaysia covers a wide range of AI topics including large language models (LLMs), AI agents, machine learning fundamentals, prompt engineering, AI automation, generative AI tools, Malaysian AI regulations, local vendor landscape, and real-world AI use cases relevant to the Malaysian market.

How do I search for AI topics on AIWiki Malaysia?

You can use the search bar at the top of the site to find articles by keyword or topic. Articles are also organised by category, so you can browse by subject area such as Models, Tools, Concepts, or Use Cases.

Is AIWiki Malaysia available in Bahasa Malaysia?

Yes. AIWiki Malaysia publishes content in both English and Bahasa Malaysia to serve the full breadth of the Malaysian professional and student community. Language availability is indicated on each article page.

How can I submit a topic or suggest an article?

You can suggest topics or submit article ideas by contacting the AIWiki Malaysia team at admin@aiteragrid.com. AITG Sdn Bhd reviews all submissions and publishes content that meets editorial accuracy standards.

Proximal Policy Optimization

A reinforcement learning algorithm developed by OpenAI that stabilises policy gradient training by constraining the size of policy updates, widely used for fine-tuning large language models through RLHF.

7 min read · Last updated June 2026 · Foundations

Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed by John Schulman and colleagues at OpenAI and introduced in a 2017 paper. PPO belongs to the family of policy gradient methods, which directly optimise the policy — the function that maps observations to actions — rather than estimating a value function and deriving a policy from it. The central innovation of PPO is a clipped objective function that limits how much the policy can change in a single training update, providing stability without the computational overhead of earlier methods such as Trust Region Policy Optimization (TRPO).

PPO became the standard algorithm for fine-tuning large language models using Reinforcement Learning from Human Feedback (RLHF), and was central to the training of models including OpenAI's InstructGPT, ChatGPT, and early versions of GPT-4. Its combination of simplicity, stability, and effectiveness has made it one of the most widely deployed reinforcement learning algorithms across robotics, game AI, and language model alignment.

Background: Policy Gradient Methods

Reinforcement learning trains an agent to maximise cumulative reward through interaction with an environment. Policy gradient methods optimise a parameterised policy directly by computing the gradient of expected reward with respect to policy parameters and taking gradient ascent steps.
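In symbols, the gradient estimator that policy gradient methods ascend is commonly written as below. The notation is an assumption borrowed from the usual presentation in the literature, not something this article defines: θ denotes the policy parameters, π_θ the policy, and Â_t an estimate of the advantage of action a_t in state s_t.

```latex
\hat{g} = \hat{\mathbb{E}}_t\left[ \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, \hat{A}_t \right]
```

Actions with a positive advantage have their log-probability pushed up, and actions with a negative advantage have it pushed down.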

A fundamental challenge in policy gradient learning is the step size problem. If the policy is updated too aggressively in a single step, the new policy may perform substantially worse than the old one, and because the data used to compute the gradient came from the old policy, the agent can enter a destructive cycle of updates. If updates are too conservative, learning is unnecessarily slow.

TRPO, PPO's predecessor, addressed this by constraining updates within a trust region defined by the KL divergence between the old and new policies. While theoretically principled, TRPO requires computing second-order derivatives and involves a constrained optimisation step, making it computationally expensive and difficult to implement.
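For reference, the trust-region problem that TRPO solves is usually stated as the constrained maximisation below. The symbols are the conventional ones and are assumptions here: θ_old is the parameter vector before the update and δ is the trust-region size.

```latex
\begin{aligned}
\max_\theta \quad & \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}\, \hat{A}_t \right] \\
\text{subject to} \quad & \hat{\mathbb{E}}_t\left[ \mathrm{KL}\left[ \pi_{\theta_{\text{old}}}(\cdot \mid s_t),\ \pi_\theta(\cdot \mid s_t) \right] \right] \le \delta
\end{aligned}
```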

The PPO Objective

PPO achieves a similar constraint through a simpler mechanism: a clipped surrogate objective. Rather than enforcing a hard constraint on KL divergence, PPO clips the probability ratio between the new and old policy to a range around 1.0, controlled by a hyperparameter epsilon (typically 0.1 to 0.2).

The probability ratio is defined as the probability of an action under the new policy divided by its probability under the old policy. While this ratio stays within the clipping range, the standard policy gradient update applies. Once it leaves the range in the direction that would increase the objective (above 1 + epsilon for an action with positive advantage, or below 1 - epsilon for an action with negative advantage), the clipped term is constant, so its gradient is zero and there is no incentive to move the policy further in that direction. This discourages large policy changes without requiring constrained optimisation.
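The clipping behaviour can be illustrated with a minimal NumPy sketch. The function name `clipped_surrogate` and the example numbers are hypothetical, chosen only for illustration; a real implementation would compute this inside an automatic-differentiation framework so that gradients flow through the ratio.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_old, advantages, epsilon=0.2):
    """Per-sample PPO clipped surrogate objective (a quantity to be maximised).

    logp_new / logp_old: log-probabilities of the taken actions under the
    new and old policies. All names here are illustrative assumptions.
    """
    ratio = np.exp(np.asarray(logp_new) - np.asarray(logp_old))
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Taking the minimum makes the objective a pessimistic bound:
    # improvements beyond the clip range earn no extra credit.
    return np.minimum(unclipped, clipped)

# Positive advantage, ratios 1.0, 1.1 and 1.5: the last one is capped at 1.2.
logp_old = np.log([0.2, 0.2, 0.2])
logp_new = np.log([0.2, 0.22, 0.3])
objective = clipped_surrogate(logp_new, logp_old, np.array([1.0, 1.0, 1.0]))
print(objective)  # approximately [1.0, 1.1, 1.2]
```

Note that the clip only removes the incentive to move further in a favourable direction. A sample whose ratio has moved in an unfavourable direction (for example ratio 1.5 with a negative advantage) keeps its full, unclipped penalty.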

The clipped objective is combined with a value function loss and an entropy bonus (which encourages exploration) into a single loss function that can be optimised with standard stochastic gradient descent.
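Written out, the combined objective takes roughly the following form. This follows the notation commonly used for PPO and should be read as a sketch: the coefficients c_1 and c_2, the value target V_t^targ, and the entropy term S are conventional symbols that this article does not itself define.

```latex
\begin{aligned}
r_t(\theta) &= \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \\
L_t^{\text{CLIP}}(\theta) &= \min\left( r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t \right) \\
L_t^{\text{VF}}(\theta) &= \left( V_\theta(s_t) - V_t^{\text{targ}} \right)^2 \\
L^{\text{CLIP+VF+S}}(\theta) &= \hat{\mathbb{E}}_t\left[ L_t^{\text{CLIP}}(\theta) - c_1\, L_t^{\text{VF}}(\theta) + c_2\, S[\pi_\theta](s_t) \right]
\end{aligned}
```

The combined objective is maximised, so the value loss enters with a negative sign and the entropy bonus with a positive one. In practice the negated expression is minimised.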

Actor-Critic Architecture

PPO is implemented as an actor-critic method. The actor is the policy network that selects actions; the critic is a value network that estimates the expected return from a given state. The critic's value estimates are used to compute advantage estimates — measures of how much better or worse an action was compared to the average expected return from that state. Advantage estimates reduce the variance of the policy gradient update, improving training stability.

Generalised Advantage Estimation (GAE), introduced by Schulman and colleagues in 2015, two years before PPO, is the standard method for computing advantages in PPO implementations. GAE balances bias and variance by exponentially weighting temporal-difference errors across multiple time steps.
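A minimal sketch of GAE for a single trajectory segment is shown below. It assumes no episode terminates inside the segment; the function name `compute_gae`, its arguments, and the default values of `gamma` and `lam` are illustrative assumptions rather than a fixed API.

```python
import numpy as np

def compute_gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalised Advantage Estimation over one segment of length T.

    rewards[t], values[t]: reward and critic estimate V(s_t) at step t.
    last_value: critic estimate for the state after the final step,
    used to bootstrap. Assumes no episode boundary inside the segment.
    """
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    advantages = np.zeros(len(rewards))
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step temporal-difference error at time t.
        delta = rewards[t] + gamma * next_value - values[t]
        # Exponentially weighted sum of this and all later TD errors.
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    returns = advantages + values  # regression targets for the critic
    return advantages, returns
```

Setting `lam=0` reduces the estimate to the one-step TD error (low variance, more bias), while `lam=1` recovers the discounted Monte Carlo return minus the value baseline (less bias, higher variance).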

Training Process

A typical PPO training loop:

1. The current policy (actor) interacts with the environment for a fixed number of time steps, collecting a batch of transitions.
2. The critic estimates the value of each state in the batch.
3. Advantage estimates are computed using GAE.
4. Multiple epochs of minibatch gradient updates are performed on the clipped objective, value loss, and entropy bonus.
5. The updated policy replaces the old one, and the process repeats with new environment data.

The ability to perform multiple epochs of updates on each batch of collected data is a significant advantage over earlier policy gradient methods, which require a new batch of data for every gradient step. This makes PPO considerably more sample-efficient in practice.
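The loop structure, and in particular the reuse of one batch across several epochs, can be sketched on a toy problem. Everything in this example is an assumption made for illustration: the environment is a hypothetical three-armed bandit, the batch-mean reward stands in for the critic and GAE, the value loss and entropy bonus are omitted, and the gradient of the clipped objective is derived by hand for a softmax policy instead of using an autodiff library.

```python
import numpy as np

rng = np.random.default_rng(0)
true_rewards = np.array([0.0, 0.5, 1.0])  # hypothetical 3-armed bandit
theta = np.zeros(3)                        # policy logits (the "actor")
epsilon, lr = 0.2, 0.5                     # illustrative hyperparameters
batch_size, n_epochs, minibatch_size = 64, 4, 16

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

for iteration in range(50):
    # Step 1: collect a batch with the current (soon to be "old") policy.
    pi_old = softmax(theta)
    actions = rng.choice(3, size=batch_size, p=pi_old)
    rewards = true_rewards[actions]
    # Steps 2-3: a mean-reward baseline stands in for the critic and GAE.
    advantages = rewards - rewards.mean()
    # Step 4: several epochs of minibatch updates on the same batch.
    for epoch in range(n_epochs):
        order = rng.permutation(batch_size)
        for start in range(0, batch_size, minibatch_size):
            idx = order[start:start + minibatch_size]
            a, adv = actions[idx], advantages[idx]
            pi_new = softmax(theta)
            ratio = pi_new[a] / pi_old[a]
            clipped = np.clip(ratio, 1 - epsilon, 1 + epsilon)
            # Gradient flows only where the unclipped term is the minimum.
            active = ratio * adv <= clipped * adv
            # d log pi(a) / d theta for a softmax policy: onehot(a) - pi.
            grad_logp = np.eye(3)[a] - pi_new
            grad = ((active * ratio * adv)[:, None] * grad_logp).mean(axis=0)
            theta += lr * grad  # gradient ascent on the clipped objective
    # Step 5: the loop repeats, sampling from the updated policy.

final_policy = softmax(theta)
print(final_policy)  # probability mass should concentrate on the best arm
```

Because the ratio is measured against `pi_old`, which is frozen for the whole batch, the clip bounds how far the policy can drift during the repeated epochs before fresh data is collected.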

Applications in Language Model Training

PPO's most prominent application in recent years has been as the optimisation algorithm in Reinforcement Learning from Human Feedback (RLHF), the technique used to align large language models with human preferences. In RLHF:

1. A language model generates text completions for prompts.
2. Human raters rank the completions according to quality, helpfulness, or safety.
3. A reward model is trained to predict human ratings.
4. PPO fine-tunes the language model to maximise rewards from the reward model while a KL penalty prevents the policy from deviating too far from the original supervised baseline.
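One common way to combine the reward model score with the KL penalty in the final step is to give every generated token a penalty proportional to the log-probability gap between the policy and the frozen reference model, and to add the reward model's scalar score on the last token. The sketch below assumes that formulation; the function name `kl_shaped_rewards`, the coefficient `beta`, and the per-token layout are illustrative assumptions, and real RLHF pipelines differ in the details.

```python
import numpy as np

def kl_shaped_rewards(reward_model_score, logp_policy, logp_reference, beta=0.1):
    """Per-token rewards for one completion under a KL-penalised objective.

    logp_policy / logp_reference: log-probabilities of each generated token
    under the policy being trained and under the frozen reference model.
    reward_model_score: scalar score for the whole completion.
    """
    logp_policy = np.asarray(logp_policy, dtype=float)
    logp_reference = np.asarray(logp_reference, dtype=float)
    # Tokens the policy likes much more than the reference are penalised.
    rewards = -beta * (logp_policy - logp_reference)
    # The sequence-level score is credited at the final token.
    rewards[-1] += reward_model_score
    return rewards
```

These shaped rewards then play the role of environment rewards in the PPO loop, with each generated token treated as one action.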

This process, combined with supervised fine-tuning on human demonstrations, produced InstructGPT (2022) and the subsequent series of ChatGPT models. PPO's stability and ability to handle large neural network policies made it well-suited to this application.

In 2024 and 2025, alternative algorithms including Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) gained popularity as simpler alternatives to PPO for preference-based fine-tuning. GRPO, developed by DeepSeek researchers, was notably used to train DeepSeek-R1's reasoning capabilities. However, PPO remains widely used for tasks requiring a separate reward model and online data collection.
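To make the contrast with PPO concrete, the sketch below shows the core quantity of each alternative as it is usually described: GRPO replaces the critic with rewards normalised within a group of completions sampled for the same prompt, and DPO replaces the RL loop with a classification-style loss on preference pairs. Function names, argument names, and the default `beta` are hypothetical.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalise rewards within one prompt's group.

    No value network is needed; each completion is scored relative to
    the other completions sampled for the same prompt.
    """
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair (sequence-level log-probabilities).

    Equals -log(sigmoid(margin)), where the margin measures how much more
    the policy prefers the chosen completion than the reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return np.log1p(np.exp(-margin))
```

The GRPO advantages can be plugged into a clipped objective of the same shape as PPO's, whereas the DPO loss is minimised directly on a fixed dataset of preference pairs.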

Comparison with Related Algorithms

| Algorithm | Constraint Mechanism | Complexity | Sample Efficiency |
|---|---|---|---|
| REINFORCE | None | Low | Low |
| TRPO | KL constraint (hard) | High | Moderate |
| PPO | Clipped ratio (soft) | Low | Good |
| DPO | No RL loop | Very low | Offline only |
| GRPO | Group-relative reward | Moderate | Good |

Malaysian Context — Reinforcement Learning Research and Applications

Reinforcement learning research, including PPO, is active at several Malaysian universities. Universiti Malaya's Faculty of Computer Science and Information Technology and Universiti Teknologi Malaysia's School of Computing have published research applying PPO to robotics, autonomous navigation, and resource scheduling problems. The Malaysian Communications and Multimedia Commission (MCMC) and MDEC have both cited reinforcement learning as a critical capability area under the National AI Roadmap.

In the private sector, PPO is being applied by Malaysian autonomous systems companies working on warehouse robotics and automated guided vehicles (AGVs), a growth area in the Penang and Selangor manufacturing corridors. Companies supplying intelligent automation to semiconductor manufacturers and electronics assembly facilities are also exploring PPO-based control systems for robotic arms and quality inspection workflows.

The application of PPO to large language model training via RLHF is directly relevant to Malaysia's Sovereign AI ambitions. The ILMU project — Malaysia's national large language model developed in partnership with YTL and NVIDIA — involves fine-tuning models on Bahasa Malaysia and Malaysian context. RLHF using PPO is expected to be a component of aligning such models with Malaysian cultural values and linguistic norms, though the specific training methodologies used in ILMU have not been publicly disclosed.

Researchers at Universiti Sains Malaysia and Multimedia University (MMU) have explored reinforcement learning for optimising network resource allocation in 5G and beyond networks — an application that aligns with Malaysia's National Fiberisation and Connectivity Plan (NFCP) and the digital infrastructure priorities of the MyDigital Blueprint. PPO's stability advantages make it practical for these continuous control problems.

HRD Corp-registered AI training providers in Malaysia increasingly include reinforcement learning fundamentals — including policy gradient methods and PPO — in advanced machine learning certification programmes, reflecting growing industry demand for practitioners capable of working with state-of-the-art training algorithms.

References

Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
Schulman, J., Levine, S., Abbeel, P., Jordan, M., & Moritz, P. (2015). Trust region policy optimization. Proceedings of ICML 2015.
Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35.
Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-dimensional continuous control using generalized advantage estimation. arXiv:1506.02438.
DigitalOcean. (2024). Proximal Policy Optimization: Implementation and Applications. DigitalOcean Community Tutorials.

Tags: reinforcement learning, policy gradient, RLHF, fine-tuning, OpenAI

Abbreviation	PPO
Developed by	OpenAI
Published	2017
Type	Policy gradient reinforcement learning
Key use	RLHF for large language models, robotics, game AI
Related	Reinforcement learning, RLHF, Actor-critic methods
