AIWiki

Q-Learning

Q-learning is a model-free, off-policy reinforcement learning algorithm that learns the value of taking a given action in a given state, enabling an agent to derive an optimal policy through trial and error.

5 min read | Last updated June 2026 | Foundations

Q-learning is a value-based, model-free, and off-policy reinforcement learning algorithm that enables an agent to learn how to act optimally in an environment through trial and error. The letter Q stands for quality, representing how valuable it is to take a particular action in a particular state. By repeatedly interacting with its environment and observing rewards, a Q-learning agent gradually estimates these action values and uses them to choose actions that maximise long-term reward. The algorithm was introduced by Christopher Watkins in 1989 and remains a foundational method in reinforcement learning.

Core idea

Reinforcement learning problems are typically framed as a Markov decision process, in which an agent in a given state takes an action, receives a reward, and transitions to a new state. The goal is to find a policy, a mapping from states to actions, that maximises the cumulative discounted reward over time. Q-learning approaches this by learning a function that assigns a value to each state-action pair, called a Q-value.
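To make "cumulative discounted reward" concrete, the following is a minimal sketch of how a return could be computed for one finished episode. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Sum of rewards, each weighted by gamma raised to its time step."""
    total = 0.0
    # Work backwards so that each step applies G_t = r_t + gamma * G_{t+1}.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Three rewards of 1 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1.75
```

A smaller gamma makes the agent short-sighted, since later rewards contribute less to the total.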

The algorithm updates these estimates using a rule derived from the Bellman optimality equation. After taking an action a in a state s and observing the reward and the next state s', the agent moves its estimate of Q(s, a) a fraction of the way toward a target: the observed reward plus the discounted value of the best action available in the next state. Written informally, the target is reward + gamma * max_a' Q(s', a'), where gamma is a discount factor between zero and one that weights future rewards, and the size of each step is set by a learning rate, usually written alpha. In the tabular case, these updates converge to the optimal values provided every state-action pair continues to be visited and the learning rate is decayed appropriately.
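The update described above can be sketched as a single function over a dictionary-backed table. This is an illustrative sketch rather than a reference implementation: the name `q_update`, the `done` flag for terminal transitions, and the values chosen for the step size `alpha` and for `gamma` are all assumptions made for the example.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.9, done=False):
    """Apply one tabular Q-learning update in place and return the TD error."""
    # Value of the best action in the next state; zero if the episode ended.
    best_next = 0.0 if done else max(q[(next_state, a)] for a in actions)
    target = reward + gamma * best_next
    td_error = target - q[(state, action)]
    # Move the old estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * td_error
    return td_error


q = defaultdict(float)           # unseen state-action pairs start at 0.0
q[("s1", "left")] = 2.0
q[("s1", "right")] = 5.0
q_update(q, "s0", "right", reward=1.0, next_state="s1",
         actions=["left", "right"], alpha=0.5, gamma=0.9)
print(q[("s0", "right")])        # 0.5 * (1.0 + 0.9 * 5.0) = 2.75
```

Note that the target uses the larger of the two next-state values (5.0), regardless of which action the agent goes on to take from `s1`.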

Off-policy learning and exploration

Q-learning is described as off-policy because it learns the value of the optimal policy independently of the actions the agent actually takes while exploring. The update always assumes the best possible action is chosen in the next state, even if the agent's behaviour during training follows a different, more exploratory strategy. This separation between the behaviour policy used to gather experience and the target policy being learned is the defining feature of off-policy methods, and it allows experience to be reused efficiently.
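The separation between behaviour and target policy can be illustrated with a toy example. In the sketch below, the agent behaves uniformly at random and never consults its Q-values when acting, yet the greedy policy read off the learned table should still head for the goal. The five-state chain environment, the function names, and the hyperparameters are hypothetical choices made for this illustration.

```python
import random

N_STATES = 5          # states 0..4; state 4 is terminal (hypothetical toy chain)
LEFT, RIGHT = 0, 1


def step(state, action):
    """Deterministic chain: moving right from state 3 reaches the goal."""
    next_state = max(state - 1, 0) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def train(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Behaviour policy: uniformly random, never looks at q.
            action = rng.choice([LEFT, RIGHT])
            next_state, reward, done = step(state, action)
            # Target policy: greedy, expressed by the max over next-state values.
            best_next = 0.0 if done else max(q[next_state])
            q[state][action] += alpha * (reward + gamma * best_next - q[state][action])
            state = next_state
    return q


q = train()
# Greedy action in each non-terminal state, read from the learned table.
greedy = [max((LEFT, RIGHT), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

Because the environment is deterministic and every state-action pair is visited many times, the learned values for moving right should approach 0.9 raised to the number of remaining steps before the goal.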

To gather useful experience, Q-learning commonly uses an epsilon-greedy strategy: with probability 1 - epsilon the agent chooses the action with the highest estimated value, and with probability epsilon it selects an action uniformly at random. This balances exploitation of known good actions against exploration of alternatives that might prove better.
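A minimal sketch of epsilon-greedy action selection follows. The function name `epsilon_greedy`, the optional `rng` argument, and the example Q-values are assumptions for illustration; ties are broken in favour of the first action, which is one of several reasonable choices.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values for the current state."""
    if rng.random() < epsilon:
        # Explore: any action, uniformly at random.
        return rng.randrange(len(q_values))
    # Exploit: the action with the highest current estimate (first one on ties).
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
picks = [epsilon_greedy([0.1, 0.7, 0.3], epsilon=0.1, rng=rng) for _ in range(1000)]
# Action 1 has the highest value, so it should dominate the samples.
print(picks.count(1) / len(picks))
```

In practice epsilon is often started high and reduced over training, so the agent explores widely at first and exploits more as its estimates improve.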

| Property | Q-learning |
| --- | --- |
| Model required | No (model-free) |
| Policy type | Off-policy |
| Value learned | State-action (Q) values |
| Exploration | Typically epsilon-greedy |

From tables to deep Q-networks

In its classic form, Q-learning stores values in a table with one entry per state-action pair, which is practical only when the states and actions are few enough to enumerate. For large or continuous state spaces, such as raw images, this becomes infeasible. The deep Q-network (DQN), first presented by DeepMind in 2013 and published in Nature in 2015, replaced the table with a neural network that approximates Q-values, enabling agents to learn to play Atari games directly from pixels. This combination of Q-learning with deep learning helped launch the modern field of deep reinforcement learning.
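The step from a table to a learned function can be sketched without a deep learning library. The example below deliberately swaps in a linear approximator in place of a neural network, so it is not the DQN architecture; it only shows how the same Q-learning target can drive a weight update instead of a table write. The function names, feature vectors, and sizes are hypothetical.

```python
def q_value(weights, features, action):
    """Linear estimate: dot product of the action's weights and the state features."""
    return sum(w * x for w, x in zip(weights[action], features))


def semi_gradient_update(weights, features, action, reward, next_features, done,
                         alpha=0.1, gamma=0.9):
    """One Q-learning step applied to the weights instead of a table entry."""
    n_actions = len(weights)
    best_next = 0.0 if done else max(
        q_value(weights, next_features, a) for a in range(n_actions))
    td_error = reward + gamma * best_next - q_value(weights, features, action)
    # For a linear model, the gradient of Q with respect to the weights
    # of the chosen action is simply the feature vector.
    weights[action] = [w + alpha * td_error * x
                       for w, x in zip(weights[action], features)]
    return td_error


weights = [[0.0, 0.0], [0.0, 0.0]]   # two actions, two features (hypothetical sizes)
first = semi_gradient_update(weights, [1.0, 0.5], 1, reward=1.0,
                             next_features=[0.0, 1.0], done=False)
second = semi_gradient_update(weights, [1.0, 0.5], 1, reward=1.0,
                              next_features=[0.0, 1.0], done=False)
print(first, second)   # the error on the repeated transition shrinks
```

Because states share weights through their features, an update for one state also changes the estimates for similar states, which is what lets function approximation scale beyond what a table can store.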

References

  1. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
  2. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
  3. Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518.
  4. DataCamp. An Introduction to Q-Learning: A Tutorial For Beginners.