Reinforcement Learning
A machine learning paradigm in which an agent learns to make sequential decisions by interacting with an environment and optimising for cumulative reward through trial and error.
Reinforcement learning (RL) is a branch of machine learning in which a computational agent learns to make decisions by interacting with an environment over a sequence of timesteps. Unlike supervised learning, which trains on labelled input-output pairs, or unsupervised learning, which discovers structure in unlabelled data, RL requires no explicit ground-truth labels. Instead, the agent receives scalar feedback — a reward signal — that indicates how favourable its most recent action was, and it adjusts its behaviour to maximise the expected cumulative reward over time.
The framework draws on ideas from behavioural psychology, optimal control theory, and dynamic programming. Its modern computational form was formalised by Richard Sutton and Andrew Barto, whose 1998 textbook remains the canonical reference for the field.
Core Concepts
The Agent-Environment Loop
An RL system is described by a loop between two entities: the agent (the learner or decision-maker) and the environment (everything the agent interacts with). At each timestep the agent observes the current state of the environment, selects an action, and receives a reward along with the next state. In episodic tasks this cycle repeats until a terminal condition is met; in continuing tasks it runs indefinitely.
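The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: `CorridorEnv`, `run_episode`, and `make_random_policy` are hypothetical names invented for this example, and the `reset`/`step` interface is an assumption loosely modelled on common RL toolkits.

```python
import random


class CorridorEnv:
    """Hypothetical toy environment: a corridor of cells 0..length-1.

    The agent starts in cell 0; the episode ends when it reaches the last cell.
    """

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action 1 moves right, anything else moves left; walls clip the move
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode; return (total_reward, number_of_steps)."""
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                   # agent selects an action
        state, reward, done = env.step(action)   # environment responds
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


def make_random_policy(seed=0):
    """Return a policy that ignores the state and acts uniformly at random."""
    rng = random.Random(seed)
    return lambda state: rng.choice([0, 1])
```

For instance, `run_episode(CorridorEnv(5), lambda s: 1)` runs a policy that always moves right, while `make_random_policy()` gives a baseline that has not learned anything.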
Formally, most RL problems are modelled as Markov decision processes (MDPs), characterised by a state space S, an action space A, a transition function describing how actions move the environment between states, and a reward function that maps state-action pairs to scalar values. The Markov property requires that the next state depends only on the current state and action, not on the history of past states.
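In symbols, the definition above can be written as follows. The notation is an assumption following common textbook conventions: P for the transition function, R for the reward function, and S_t, A_t for the state and action at timestep t.

```latex
\[
\mathcal{M} = (S, A, P, R), \qquad
P(s' \mid s, a) = \Pr\left(S_{t+1} = s' \mid S_t = s,\, A_t = a\right), \qquad
R : S \times A \to \mathbb{R}
\]
% Markov property: conditioning on the full history adds nothing
\[
\Pr\left(S_{t+1} = s' \mid S_t, A_t\right)
= \Pr\left(S_{t+1} = s' \mid S_0, A_0, \dots, S_t, A_t\right)
\]
```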
Policy and Value Functions
A policy is a mapping from states to actions (or distributions over actions) that defines the agent's behaviour. The agent's goal is to find an optimal policy that maximises the return — typically the discounted sum of future rewards, where rewards further in the future are down-weighted by a discount factor between 0 and 1.
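As a small worked example of the discounted return, the sketch below sums a finite list of rewards with geometric down-weighting. The function name `discounted_return` and the default discount of 0.99 are illustrative assumptions.

```python
def discounted_return(rewards, gamma=0.99):
    """Return sum over k of gamma**k * rewards[k], accumulated right to left."""
    g = 0.0
    for r in reversed(rewards):
        # the return at step t is the reward at t plus the discounted return at t+1
        g = r + gamma * g
    return g
```

With rewards `[1, 1, 1]` and a discount factor of 0.5 this gives 1 + 0.5 + 0.25 = 1.75, showing how later rewards contribute less.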
Value functions estimate the expected return from a given state (state value) or from a given state-action pair (action value, often called the Q-value). Learning good value function estimates underpins many RL algorithms.
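Written out, the two value functions can be expressed as below. The symbols are assumptions in line with standard usage: π is the policy, γ the discount factor, R_{t+k+1} the reward received k+1 steps after time t, and G_t the return from time t.

```latex
\[
G_t = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1}
\]
\[
V^{\pi}(s) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s \,\right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[\, G_t \mid S_t = s,\, A_t = a \,\right]
\]
% The state value is the policy-weighted average of the action values
\[
V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s)\, Q^{\pi}(s, a)
\]
```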
Major Algorithmic Families
Reinforcement learning algorithms fall into several broad categories.
Value-based methods learn a value function and derive a policy implicitly by selecting actions with the highest estimated value. Q-learning, introduced by Christopher Watkins in 1989, is the foundational value-based algorithm. Deep Q-Networks (DQN), published by DeepMind in 2015, combined Q-learning with deep neural networks to achieve human-level performance on many Atari games, marking a milestone for deep reinforcement learning.
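A tabular version of Q-learning can be sketched as follows. This is a minimal illustration under stated assumptions: the `step(state, action)` callback, the `corridor_step` toy dynamics, the epsilon-greedy exploration with random tie-breaking, and all hyperparameter values are choices made for this example rather than part of any particular published setup.

```python
import random
from collections import defaultdict


def q_learning(step, start_state, actions, episodes=500, alpha=0.1,
               gamma=0.9, epsilon=0.1, max_steps=100, seed=0):
    """Tabular Q-learning; returns a dict mapping (state, action) to a Q estimate."""
    rng = random.Random(seed)
    q = defaultdict(float)
    for _ in range(episodes):
        state = start_state
        for _ in range(max_steps):
            if rng.random() < epsilon:
                action = rng.choice(actions)          # explore
            else:
                best = max(q[(state, a)] for a in actions)
                # exploit, breaking ties at random so early episodes still wander
                action = rng.choice([a for a in actions if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # bootstrap from the best action value in the next state
            best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
            td_target = reward + gamma * best_next
            q[(state, action)] += alpha * (td_target - q[(state, action)])
            state = next_state
            if done:
                break
    return q


def corridor_step(state, action, length=5):
    """Hypothetical dynamics: action 1 moves right, 0 moves left; last cell is the goal."""
    move = 1 if action == 1 else -1
    next_state = min(max(state + move, 0), length - 1)
    done = next_state == length - 1
    return next_state, (1.0 if done else 0.0), done
```

Calling `q_learning(corridor_step, 0, [0, 1])` should yield estimates in which moving right has the higher value in every non-terminal cell, so the greedy policy walks straight to the goal.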
Policy gradient methods directly optimise the policy by computing gradients of expected return with respect to the policy parameters. REINFORCE, Trust Region Policy Optimisation (TRPO), and Proximal Policy Optimisation (PPO) are prominent examples. PPO in particular became widely used because it is comparatively simple to implement and stable to train.
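The idea of following the gradient of expected return can be illustrated with REINFORCE on a multi-armed bandit, where each episode is a single action. This is a sketch under assumptions: the Gaussian reward model, the softmax parameterisation, the absence of a baseline, and the names `softmax` and `reinforce_bandit` are all choices made for this example.

```python
import math
import random


def softmax(prefs):
    """Convert a list of preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_bandit(arm_means, steps=2000, lr=0.1, noise=1.0, seed=0):
    """REINFORCE with a softmax policy on a hypothetical Gaussian bandit.

    Returns the final action probabilities.
    """
    rng = random.Random(seed)
    theta = [0.0] * len(arm_means)   # policy parameters (action preferences)
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = rng.gauss(arm_means[action], noise)
        for i in range(len(theta)):
            # d/d theta[i] of log pi(action) for a softmax policy
            grad_log = (1.0 if i == action else 0.0) - probs[i]
            theta[i] += lr * reward * grad_log
    return softmax(theta)
```

With `reinforce_bandit([0.0, 1.0], noise=0.1)` the policy should shift most of its probability onto the second arm. Methods such as TRPO and PPO add constraints or clipping on top of this basic update, which the sketch does not attempt to show.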
Actor-critic methods combine value-based and policy-gradient approaches: an actor network proposes actions while a critic network estimates their value, reducing variance in gradient estimates. Advantage Actor-Critic (A2C) and Soft Actor-Critic (SAC) are well-known actor-critic algorithms.
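A single update step of a simple tabular actor-critic might look like the sketch below. It is a generic one-step variant written for illustration, not A2C or SAC: the function name, the dictionary-based tables, the softmax actor, and the use of the temporal-difference error as the advantage estimate are all assumptions.

```python
import math


def actor_critic_update(values, prefs, state, action, reward, next_state, done,
                        gamma=0.99, critic_lr=0.1, actor_lr=0.1):
    """One-step tabular actor-critic update; mutates values and prefs in place.

    values: dict mapping state -> estimated state value (the critic)
    prefs:  dict mapping state -> list of action preferences (the softmax actor)
    Returns the TD error, used here as the advantage estimate.
    """
    # Critic: move V(state) toward the one-step bootstrapped target
    target = reward + (0.0 if done else gamma * values.get(next_state, 0.0))
    td_error = target - values.get(state, 0.0)
    values[state] = values.get(state, 0.0) + critic_lr * td_error

    # Actor: policy-gradient step, weighted by the critic's TD error
    p = prefs[state]
    m = max(p)
    exps = [math.exp(x - m) for x in p]
    total = sum(exps)
    probs = [e / total for e in exps]
    for i in range(len(p)):
        grad_log = (1.0 if i == action else 0.0) - probs[i]
        p[i] += actor_lr * td_error * grad_log
    return td_error
```

Weighting the actor's step by the TD error rather than the raw return is what reduces variance: an action is reinforced only to the extent that it did better than the critic expected.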
Model-based methods learn an internal model of the environment's dynamics and use it to plan ahead. This can improve sample efficiency, but the resulting policy is only as good as the model: errors in the learned dynamics can compound during planning.
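A very small model-based pipeline can be sketched as two steps: record observed transitions in a table, then plan on that table with value iteration. The sketch assumes deterministic dynamics and a complete set of observed transitions; `learn_model` and `value_iteration` are hypothetical helper names, and value iteration is only one of several possible planners.

```python
def learn_model(transitions):
    """Build a deterministic tabular model from (s, a, r, s2, done) tuples."""
    return {(s, a): (s2, r, done) for s, a, r, s2, done in transitions}


def value_iteration(model, states, actions, gamma=0.9, tol=1e-8):
    """Plan on the learned model; returns a dict of state values.

    states should list the non-terminal states only; terminal states are
    never looked up because their transitions are flagged done.
    """
    v = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            # back up the best one-step outcome predicted by the model
            best = max(
                r + (0.0 if done else gamma * v[s2])
                for s2, r, done in (model[(s, a)] for a in actions)
            )
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            return v
```

Because planning happens entirely inside the table, no further environment interaction is needed once the transitions are collected. If the table were wrong, the planned values would be wrong in the same way.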
Deep Reinforcement Learning
The integration of deep neural networks with RL, termed deep reinforcement learning (deep RL), has dramatically expanded the scale of problems RL can address. Notable achievements include DeepMind's AlphaGo (2016), which defeated top professional Go players, AlphaZero (2017), which reached superhuman play in Go, chess, and shogi through self-play, and OpenAI Five, which defeated professional Dota 2 players in 2019.
Between 2024 and 2026, a significant trend emerged with Reinforcement Learning with Verifiable Rewards (RLVR), in which models are trained on tasks whose correctness can be checked automatically, such as mathematics, coding, and logical reasoning, without requiring human raters. OpenAI's o1 and o3 reasoning models, which were trained with large-scale reinforcement learning and showed large gains on mathematical and scientific benchmarks, are commonly cited examples of this trend.
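A verifiable reward can be as simple as a function that checks a final answer against a known reference. The sketch below is a hypothetical checker for numeric answers, not the method used by any particular model: the name `verifiable_reward`, the binary 0/1 reward, and the assumption that the final answer has already been extracted as a string are all illustrative.

```python
from fractions import Fraction


def verifiable_reward(model_answer: str, reference: str) -> float:
    """Return 1.0 if both strings parse to the same rational number, else 0.0."""
    try:
        same = Fraction(model_answer.strip()) == Fraction(reference.strip())
    except (ValueError, ZeroDivisionError):
        # unparseable or ill-formed answers earn no reward
        return 0.0
    return 1.0 if same else 0.0
```

Comparing exact rationals rather than raw strings means `"0.5"` and `"1/2"` are treated as the same answer, while free-form text that does not parse as a number scores zero.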
Reinforcement Learning from Human Feedback
Reinforcement learning from human feedback (RLHF) is a specialised variant in which the reward signal is derived from human preference judgements rather than a programmatic function. Typically, human comparisons between candidate outputs are used to train a reward model, and that model then supplies the reward for RL fine-tuning. RLHF has become a cornerstone technique for aligning large language models with human values and instructions, used in the training of models such as ChatGPT, Claude, and Gemini.
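One common way to turn pairwise human preferences into a training signal is a Bradley-Terry style loss on a learned reward model. The text above does not specify a loss, so this is an assumed formulation shown for illustration; `preference_loss` is a hypothetical name, and its inputs stand for the scalar scores a reward model assigns to the preferred and the rejected response.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the chosen response is preferred.

    Under a Bradley-Terry model, P(chosen beats rejected) = sigmoid(diff),
    where diff = reward_chosen - reward_rejected.
    """
    diff = reward_chosen - reward_rejected
    # -log(sigmoid(diff)) equals softplus(-diff); this form avoids overflow
    return max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
```

The loss is log 2 when the two scores are equal and shrinks toward zero as the reward model scores the human-preferred response increasingly higher than the rejected one.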
Applications
RL has found application across a wide range of domains. In robotics, RL enables robots to learn manipulation tasks and locomotion policies directly from interaction. In autonomous driving, RL helps vehicles learn lane-following, merging, and parking strategies in simulation. In resource management, RL has been deployed to optimise data centre cooling (Google, 2016), reducing the energy used for cooling by up to 40 percent. In healthcare, RL informs adaptive treatment strategies and dosing protocols for chronic diseases. In finance, RL-based agents are used for portfolio optimisation and algorithmic trading.
One industry report (DataRoot Labs, 2025) estimated the global reinforcement learning market at over USD 122 billion in 2025, which it attributes to accelerating enterprise adoption across these verticals.
Challenges
Despite its successes, RL faces several enduring challenges. Sample efficiency — the quantity of environment interaction required to learn a good policy — remains a limiting factor, particularly for real-world physical systems where data collection is expensive or dangerous. Reward hacking occurs when agents find unexpected ways to maximise reward that do not reflect the designer's intent, a manifestation of misalignment. Sparse rewards complicate training when positive feedback is rare. Generalisation across environments with different dynamics is an active research area.
References
- Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
- Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
- Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.
- Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
- DataRoot Labs. (2025). The State of Reinforcement Learning in 2025. datarootlabs.com.