Markov Decision Process

A Markov decision process is a mathematical framework for modelling sequential decision-making in which outcomes are partly random and partly under the control of a decision-maker.

Last updated May 2026 | Foundations

A Markov decision process (MDP) is a discrete-time stochastic control framework used to model sequential decision-making under uncertainty. It formalises an environment in which an agent observes a state, selects an action, receives a reward, and transitions to a new state. The new state is drawn from a probability distribution that depends only on the current state and action; this is the defining Markov property.

Formal definition

A finite MDP is a five-tuple (S, A, P, R, gamma) where S is a finite set of states, A is a finite set of actions, P(s_next | s, a) is the transition probability from state s to state s_next when action a is taken, R(s, a) is the expected immediate reward, and gamma in [0, 1) is a discount factor that weights future rewards. The Markov assumption states that the transition probability depends only on the present state and action, not on the history that led to that state.
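The five-tuple can be written down directly as a small data structure. The following is a minimal sketch rather than a reference implementation: the class name FiniteMDP, the step method, and the two-state example TOY (including all of its numbers) are illustrative assumptions, and states and actions are assumed to be indexed by integers starting from 0.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FiniteMDP:
    """Finite MDP (S, A, P, R, gamma); states and actions are integer indices."""
    P: tuple      # P[s][a][s_next] = P(s_next | s, a)
    R: tuple      # R[s][a] = expected immediate reward
    gamma: float  # discount factor in [0, 1)

    def __post_init__(self):
        # Check the discount range and that every P[s][a] is a distribution.
        if not 0.0 <= self.gamma < 1.0:
            raise ValueError("gamma must lie in [0, 1)")
        for s, rows in enumerate(self.P):
            for a, dist in enumerate(rows):
                if len(dist) != len(self.P) or abs(sum(dist) - 1.0) > 1e-9:
                    raise ValueError(f"P[{s}][{a}] is not a distribution over states")

    def step(self, s, a, rng):
        """Sample s_next from P(. | s, a) and return (s_next, reward)."""
        s_next = rng.choices(range(len(self.P)), weights=self.P[s][a])[0]
        return s_next, self.R[s][a]


# Hypothetical two-state, two-action example (numbers chosen for illustration).
TOY = FiniteMDP(
    P=(((0.9, 0.1), (0.2, 0.8)),   # transitions from state 0 under actions 0, 1
       ((0.0, 1.0), (0.7, 0.3))),  # transitions from state 1 under actions 0, 1
    R=((0.0, 1.0), (2.0, 0.0)),
    gamma=0.9,
)
```

Because step reads only the current state and action, the Markov assumption is built into the interface: no history is passed in, and none is needed to sample the next state.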

A policy pi maps states to actions (deterministic policy) or to probability distributions over actions (stochastic policy). The value of a state under a policy is the expected discounted sum of future rewards, while the action-value function Q(s, a) is the expected return from taking action a in state s and following the policy thereafter.
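The two kinds of policy and the discounted sum can be illustrated in a few lines. This is a sketch under stated assumptions: the function names discounted_return and sample_action, the tuple encoding of policies, and the example numbers are hypothetical and chosen only for illustration. The value of a state is the expectation of such discounted returns over trajectories generated by the policy; the code computes the return of one finite reward sequence.

```python
import random


def discounted_return(rewards, gamma):
    """Return r_0 + gamma * r_1 + gamma^2 * r_2 + ... for a finite reward list."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Deterministic policy: one action index per state.
deterministic_pi = (1, 0)

# Stochastic policy: one probability distribution over actions per state.
stochastic_pi = ((0.5, 0.5), (1.0, 0.0))


def sample_action(pi, s, rng):
    """Pick an action in state s under either kind of policy."""
    choice = pi[s]
    if isinstance(choice, int):
        return choice  # deterministic: the action is fixed
    return rng.choices(range(len(choice)), weights=choice)[0]


example = discounted_return([1.0, 0.0, 2.0], gamma=0.5)  # 1 + 0.5*0 + 0.25*2 = 1.5
```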

Bellman equations

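The Bellman expectation equation expresses the value of a state under a fixed policy recursively, as the expected immediate reward plus the discounted value of the successor state. The Bellman optimality equation replaces the expectation over the policy's actions with a maximum over actions; the optimal value function is its unique fixed point when gamma is less than 1. Once the optimal value function is known, an optimal policy is obtained by acting greedily with respect to it, so finding an optimal policy reduces to solving these equations, either exactly through dynamic programming or approximately through reinforcement learning.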
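Written out with the symbols from the formal definition, the equations take the following standard form. Here s' stands for s_next, pi(a | s) is the probability that the policy selects action a in state s, and the superscripts pi and * (notation assumed here, not used elsewhere in this article) mark the value functions of a fixed policy and of an optimal policy.

```latex
\begin{align}
V^{\pi}(s) &= \sum_{a \in A} \pi(a \mid s) \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{\pi}(s') \Big] \\
V^{*}(s) &= \max_{a \in A} \Big[ R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, V^{*}(s') \Big] \\
Q^{*}(s, a) &= R(s, a) + \gamma \sum_{s' \in S} P(s' \mid s, a)\, \max_{a' \in A} Q^{*}(s', a') \\
\pi^{*}(s) &\in \arg\max_{a \in A} Q^{*}(s, a)
\end{align}
```

The first line is the expectation equation, the second and third are the optimality equation for state values and action values, and the last line states that any policy that is greedy with respect to the optimal action values is optimal.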

Solution methods

When the transition and reward functions are fully known, dynamic programming methods such as value iteration and policy iteration compute an optimal policy. For a fixed discount factor, their running time is polynomial in the number of states and actions. Value iteration repeatedly applies the Bellman optimality operator until successive value estimates differ by less than a chosen tolerance, while policy iteration alternates policy evaluation and policy improvement steps until the policy stops changing.
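Value iteration is short enough to sketch in full. The version below is a minimal illustration, assuming the model is given as nested sequences P[s][a][s_next] and R[s][a]; the function name, the tolerance, the sweep limit, and the two-state example are hypothetical choices rather than part of any standard API.

```python
def value_iteration(P, R, gamma, tol=1e-10, max_sweeps=10_000):
    """Apply the Bellman optimality operator until the values stop changing."""
    n_states, n_actions = len(P), len(P[0])
    V = [0.0] * n_states

    def q_value(s, a, values):
        # Expected immediate reward plus discounted expected successor value.
        return R[s][a] + gamma * sum(p * v for p, v in zip(P[s][a], values))

    for _ in range(max_sweeps):
        V_new = [max(q_value(s, a, V) for a in range(n_actions))
                 for s in range(n_states)]
        delta = max(abs(new - old) for new, old in zip(V_new, V))
        V = V_new
        if delta < tol:
            break

    # Read off a greedy policy from the converged values.
    policy = [max(range(n_actions), key=lambda a: q_value(s, a, V))
              for s in range(n_states)]
    return V, policy


# Hypothetical two-state, two-action model (numbers chosen for illustration).
P = (((0.9, 0.1), (0.2, 0.8)),
     ((0.0, 1.0), (0.7, 0.3)))
R = ((0.0, 1.0), (2.0, 0.0))

V, policy = value_iteration(P, R, gamma=0.9)
```

In this example state 1 can collect a reward of 2 forever by choosing action 0, so its optimal value is 2 / (1 - 0.9) = 20, and the best choice in state 0 is the action that moves towards state 1. Each sweep costs on the order of |S|^2 |A| operations, since every state-action pair sums over all successor states.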

When the dynamics are unknown, reinforcement learning algorithms estimate value functions from sampled experience. Model-free methods such as Q-learning and SARSA learn value functions directly, while model-based methods learn an approximation of the transition and reward functions and then plan within it. Modern deep reinforcement learning combines neural network function approximation with MDP-based update rules, as in Deep Q-Networks and proximal policy optimisation. AlphaGo combined learned policy and value networks with Monte Carlo tree search, a planning method.
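Tabular Q-learning shows how a model-free method uses only sampled transitions. The sketch below makes several assumptions for illustration: a constant step size alpha, epsilon-greedy exploration, a continuing task with no terminal states, and a hypothetical two-state environment exposed through a toy_step function. The learner never reads P or R directly; it only sees the samples that toy_step returns.

```python
import random


def q_learning(sample_step, n_states, n_actions, gamma,
               steps=50_000, alpha=0.1, epsilon=0.2, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration."""
    rng = random.Random(seed)
    Q = [[0.0] * n_actions for _ in range(n_states)]
    s = 0
    for _ in range(steps):
        # Explore with probability epsilon, otherwise act greedily.
        if rng.random() < epsilon:
            a = rng.randrange(n_actions)
        else:
            a = max(range(n_actions), key=lambda b: Q[s][b])
        s_next, r = sample_step(s, a, rng)
        # Move Q(s, a) towards the sampled Bellman optimality target.
        td_target = r + gamma * max(Q[s_next])
        Q[s][a] += alpha * (td_target - Q[s][a])
        s = s_next
    return Q


# Hypothetical environment; its model is hidden from the learner.
P = (((0.9, 0.1), (0.2, 0.8)),
     ((0.0, 1.0), (0.7, 0.3)))
R = ((0.0, 1.0), (2.0, 0.0))


def toy_step(s, a, rng):
    s_next = rng.choices(range(len(P)), weights=P[s][a])[0]
    return s_next, R[s][a]


Q = q_learning(toy_step, n_states=2, n_actions=2, gamma=0.9)
greedy_policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(2)]
```

SARSA differs only in the target: it uses the value of the action actually taken in s_next instead of the maximum, which makes it an on-policy method.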

Extensions

Several generalisations relax MDP assumptions. Partially observable MDPs (POMDPs) model situations where the agent observes noisy or incomplete information about the state and must act on a belief, that is, a probability distribution over states. Constrained MDPs add safety or budget constraints on policies. Continuous MDPs replace finite state and action sets with continuous spaces and typically require function approximation or discretisation. Multi-agent MDPs and Markov games extend the framework to multiple interacting decision-makers.
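In a POMDP the agent typically tracks a probability distribution over states and revises it after each action and observation using Bayes' rule. The sketch below assumes an observation model O[s_next][o], the probability of observing o after arriving in s_next, which is not part of the MDP tuple defined earlier; the function name belief_update and all numbers are hypothetical.

```python
def belief_update(belief, a, o, P, O):
    """Bayes filter: predict with the transition model, correct with the observation."""
    n = len(belief)
    # Prediction step: distribution over s_next before seeing the observation.
    predicted = [sum(belief[s] * P[s][a][s_next] for s in range(n))
                 for s_next in range(n)]
    # Correction step: weight each state by the likelihood of the observation.
    unnormalised = [O[s_next][o] * predicted[s_next] for s_next in range(n)]
    total = sum(unnormalised)
    if total == 0.0:
        raise ValueError("observation has zero probability under the model")
    return [u / total for u in unnormalised]


# Hypothetical two-state model; the sensor reports the true state 80% of the time.
P = (((0.9, 0.1), (0.2, 0.8)),
     ((0.0, 1.0), (0.7, 0.3)))
O = ((0.8, 0.2),
     (0.2, 0.8))

new_belief = belief_update([0.5, 0.5], a=0, o=1, P=P, O=O)
```

Starting from a uniform belief, taking action 0 and then observing 1 shifts most of the probability mass to state 1, because both the transition model and the sensor reading point that way.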

Applications

MDPs underpin a broad range of practical systems including robotic control, autonomous vehicles, recommendation systems, dynamic pricing, supply chain optimisation, clinical decision support, and game-playing agents. In large language model training, reinforcement learning from human feedback (RLHF) frames response generation as an MDP in which the state is the prompt together with the tokens generated so far, actions are next tokens, and rewards come from a reward model learned from human preference data.

Malaysian context: MDP applications in industry

MDPs and reinforcement learning techniques have been applied in several Malaysian contexts. Bank Negara Malaysia-regulated institutions including Maybank, CIMB, and Public Bank use MDP-based models for portfolio rebalancing and dynamic credit limit management within the constraints set out by BNM's Risk Management in Technology policy. Tenaga Nasional Berhad has piloted reinforcement learning for grid optimisation as part of its Energy Transition initiative.

In transport, Grab Malaysia uses MDP-derived policies for dispatch and dynamic pricing across its ride-hailing and delivery network. Prasarana Malaysia and KTM Berhad have explored MDP-based scheduling for rail operations. Petronas applies reinforcement learning models for upstream operational decisions in oil and gas extraction, integrated with predictive maintenance.

Academic research on MDPs is active at Universiti Malaya, Universiti Putra Malaysia, and Universiti Teknologi PETRONAS. The Centre of Applied Data Science (CADS) and the Malaysia Board of Technologists offer professional certifications that include reinforcement learning fundamentals. HRD Corp Claimable Courses subsidise industry training in reinforcement learning, with vendor-led programmes delivered through MDEC's Premier Digital Tech Institutions.

References

Bellman, R. (1957). A Markovian Decision Process. Journal of Mathematics and Mechanics.
Puterman, M. L. (1994). Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley.
Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction, 2nd ed. MIT Press.
Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature 518, 529–533.

Tags: MDP, reinforcement learning, probability, Bellman equation, optimal control

Type: Discrete-time stochastic control process
Introduced: 1957, Richard Bellman
Key components: States, actions, transitions, rewards
Key equation: Bellman equation
Used in: Reinforcement learning, operations research