Markov Decision Process
A Markov decision process is a mathematical framework for modelling sequential decision-making in which outcomes are partly random and partly under the control of a decision-maker.
A Markov decision process (MDP) is a discrete-time stochastic control framework used to model sequential decision-making under uncertainty. It formalises an environment in which an agent observes a state, selects an action, receives a reward, and transitions to a new state according to a probability distribution that depends only on the current state and action — the defining Markov property.
Formal definition
A finite MDP is a five-tuple (S, A, P, R, gamma) where S is a finite set of states, A is a finite set of actions, P(s_next | s, a) is the transition probability from state s to state s_next when action a is taken, R(s, a) is the expected immediate reward, and gamma in [0, 1) is a discount factor that weights future rewards. The Markov assumption states that the transition probability depends only on the present state and action, not on the history that led to that state.
A policy pi maps states to actions (deterministic policy) or to probability distributions over actions (stochastic policy). The value of a state under a policy is the expected discounted sum of future rewards, while the action-value function Q(s, a) is the expected return from taking action a in state s and following the policy thereafter.
Bellman equations
The Bellman expectation equation expresses the value of a state recursively in terms of the values of successor states. The Bellman optimality equation characterises the optimal value function, which satisfies a fixed-point relationship. Solving for the optimal policy reduces to solving these equations, either exactly through dynamic programming or approximately through reinforcement learning.
Solution methods
When the transition and reward functions are fully known, dynamic programming methods such as value iteration and policy iteration compute the optimal policy in polynomial time in the number of states and actions. Value iteration repeatedly applies the Bellman optimality operator until convergence, while policy iteration alternates policy evaluation and policy improvement steps.
When the dynamics are unknown, reinforcement learning algorithms estimate value functions from sampled experience. Model-free methods such as Q-learning and SARSA learn value functions directly, while model-based methods learn an approximation of the transition and reward functions and then plan within it. Modern deep reinforcement learning combines neural network function approximation with MDP-based update rules, as in Deep Q-Networks, proximal policy optimisation, and AlphaGo's Monte Carlo tree search.
Extensions
Several generalisations relax MDP assumptions. Partially observable MDPs (POMDPs) model situations where the agent observes noisy or incomplete information about the state. Constrained MDPs add safety or budget constraints on policies. Continuous MDPs replace finite state and action sets with continuous spaces and require function approximation. Multi-agent MDPs and Markov games extend the framework to multiple interacting decision-makers.
Applications
MDPs underpin a broad range of practical systems including robotic control, autonomous vehicles, recommendation systems, dynamic pricing, supply chain optimisation, clinical decision support, and game-playing agents. In large language model training, reinforcement learning from human feedback (RLHF) frames the response-generation process as an MDP in which states are partial outputs, actions are next tokens, and rewards come from a learned preference model.
References
- Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics.
- Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
- Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press.
- Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533.