What is AIWiki Malaysia?

AIWiki Malaysia is a free, open AI knowledge base covering artificial intelligence concepts, tools, models, and use cases — written specifically for Malaysian professionals and students. It is maintained by AITG Sdn Bhd, an AI company based in Penang.

Reinforcement Learning

A machine learning paradigm in which an agent learns to make sequential decisions by interacting with an environment and optimising for cumulative reward through trial and error.

7 min read · Last updated June 2026 · Foundations

Reinforcement learning (RL) is a branch of machine learning in which a computational agent learns to make decisions by interacting with an environment over a sequence of timesteps. Unlike supervised learning, which trains on labelled input-output pairs, or unsupervised learning, which discovers structure in unlabelled data, RL requires no explicit ground-truth labels. Instead, the agent receives scalar feedback — a reward signal — that indicates how favourable its most recent action was, and it adjusts its behaviour to maximise the expected cumulative reward over time.

The framework draws on ideas from behavioural psychology, optimal control theory, and dynamic programming. Its modern computational form was formalised by Richard Sutton and Andrew Barto, whose 1998 textbook remains the canonical reference for the field.

Core Concepts

The Agent-Environment Loop

An RL system is described by a loop between two entities: the agent (the learner or decision-maker) and the environment (everything the agent interacts with). At each timestep the agent observes the current state of the environment, selects an action, and receives a reward along with the next state. The cycle repeats until a terminal condition is met (episodic tasks) or indefinitely (continuing tasks).
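The loop described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the `Corridor` environment, the `run_episode` helper, and the random policy are all hypothetical names invented for this example.

```python
import random


class Corridor:
    """Hypothetical toy environment: positions 0..4, start at 0, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # action is -1 (left) or +1 (right); the walls clip the move.
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Run one episode and return (total_reward, steps_taken)."""
    state = env.reset()
    total_reward = 0.0
    for t in range(1, max_steps + 1):
        action = policy(state)                  # agent selects an action
        state, reward, done = env.step(action)  # environment returns reward and next state
        total_reward += reward
        if done:                                # terminal condition met
            return total_reward, t
    return total_reward, max_steps


rng = random.Random(0)


def random_policy(state):
    # Ignores the state and moves left or right with equal probability.
    return rng.choice([-1, 1])


total, steps = run_episode(Corridor(), random_policy)
print(total, steps)
```

A policy that always moves right reaches the goal in four steps; the random policy usually needs more, which is the gap that learning is meant to close.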

Formally, most RL problems are modelled as Markov decision processes (MDPs), characterised by a state space S, an action space A, a transition function giving the probability of each next state for a given state and action, and a reward function that maps state-action pairs to scalar values. The Markov property requires that the next state depends only on the current state and action, not on the earlier history of states and actions.
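In symbols, the definition above can be written as follows. The letters P (transition function) and R (reward function), and the random variables S_t and A_t for the state and action at timestep t, are notation assumed here for illustration; the article itself names only S and A.

```latex
\mathcal{M} = (S, A, P, R), \qquad
P(s' \mid s, a) = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a), \qquad
R : S \times A \to \mathbb{R}

% Markov property: conditioning on the full history adds nothing
\Pr(S_{t+1} = s' \mid S_t, A_t, S_{t-1}, A_{t-1}, \dots, S_0, A_0)
  = \Pr(S_{t+1} = s' \mid S_t, A_t)
```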

Policy and Value Functions

A policy is a mapping from states to actions (or distributions over actions) that defines the agent's behaviour. The agent's goal is to find an optimal policy that maximises the return — typically the discounted sum of future rewards, where rewards further in the future are down-weighted by a discount factor between 0 and 1.
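As a worked example of the discounted return, the short function below sums a finite reward sequence with each reward weighted by a power of the discount factor. The function name and the sample rewards are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def discounted_return(rewards, gamma):
    """Return rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + ..."""
    g = 0.0
    # Work backwards so each step applies G = r + gamma * G_next.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Three rewards of 1.0 with gamma = 0.5: 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], 0.5))  # 1.75
```

With a discount factor close to 1 the agent weighs distant rewards almost as heavily as immediate ones; with a factor close to 0 it becomes short-sighted.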

Value functions estimate the expected return from a given state (state value) or from a given state-action pair (action value, often called the Q-value). Learning good value function estimates underpins many RL algorithms.
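Written out, with notation assumed here for illustration (γ for the discount factor, π for the policy, r_{t+1} for the reward received after the action taken at timestep t, and G_t for the return from timestep t), the two value functions and the relation between them are:

```latex
G_t = \sum_{k=0}^{\infty} \gamma^{k} \, r_{t+k+1}

V^{\pi}(s) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s \right], \qquad
Q^{\pi}(s, a) = \mathbb{E}_{\pi}\left[ G_t \mid S_t = s, A_t = a \right]

V^{\pi}(s) = \sum_{a \in A} \pi(a \mid s) \, Q^{\pi}(s, a)
```

The last line says that the value of a state is the policy-weighted average of the values of the actions available in it.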

Major Algorithmic Families

Reinforcement learning algorithms fall into several broad categories.

Value-based methods learn a value function and derive a policy implicitly by selecting actions with the highest estimated value. Q-learning, introduced by Christopher Watkins in 1989, is the foundational value-based algorithm. Deep Q-Networks (DQN), which DeepMind published in Nature in 2015 (Mnih et al., 2015), combined Q-learning with deep neural networks to reach human-level performance on many Atari 2600 games, marking a milestone for deep reinforcement learning.
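A tabular version of Q-learning fits in a short script. The sketch below is illustrative only: the five-state corridor, the hyperparameters, and all function names are assumptions made for this example, and it uses a lookup table rather than the neural network that DQN employs.

```python
import random

N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # left, right


def step(state, action):
    """Hypothetical corridor dynamics: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def q_learning(episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Epsilon-greedy action selection with random tie-breaking.
            if rng.random() < epsilon:
                action = rng.choice(ACTIONS)
            else:
                best = max(q[(state, a)] for a in ACTIONS)
                action = rng.choice([a for a in ACTIONS if q[(state, a)] == best])
            next_state, reward, done = step(state, action)
            # Q-learning target: reward plus discounted best next-state value.
            future = 0.0 if done else gamma * max(q[(next_state, a)] for a in ACTIONS)
            q[(state, action)] += alpha * (reward + future - q[(state, action)])
            state = next_state
    return q


q = q_learning()
# Derive the policy implicitly: pick the highest-valued action in each state.
greedy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(greedy)
```

The policy is never stored explicitly; it is read off the learned table by taking the highest-valued action in each state, which is what "derive a policy implicitly" means in practice.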

Policy gradient methods directly optimise the policy by computing gradients of expected return with respect to the policy parameters. REINFORCE, Trust Region Policy Optimisation (TRPO), and Proximal Policy Optimisation (PPO) are prominent examples. PPO in particular became widely used because it trains stably while being considerably simpler to implement than TRPO (Schulman et al., 2017).
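The idea of following the gradient of expected return can be illustrated with REINFORCE on the simplest possible problem, a two-armed bandit with a softmax policy. Everything specific here is an assumption made for the sketch: the payouts in `ARM_REWARDS`, the learning rate, and the function names. No baseline is used, so this is the plain form of the estimator rather than PPO or TRPO.

```python
import math
import random

ARM_REWARDS = (0.2, 1.0)  # hypothetical deterministic payouts for arms 0 and 1


def softmax(prefs):
    m = max(prefs)  # subtract the max for numerical stability
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_bandit(steps=2000, lr=0.1, seed=0):
    rng = random.Random(seed)
    theta = [0.0, 0.0]  # policy parameters: one preference per arm
    for _ in range(steps):
        probs = softmax(theta)
        action = rng.choices(range(len(theta)), weights=probs)[0]
        reward = ARM_REWARDS[action]
        for k in range(len(theta)):
            # For a softmax policy, d log pi(action) / d theta[k]
            # equals 1[k == action] - pi(k).
            grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
            theta[k] += lr * reward * grad_log_pi
    return softmax(theta)


probs = reinforce_bandit()
print(probs)
```

Each update nudges the parameters in the direction that makes the sampled action more likely, scaled by the reward it earned, so probability mass drifts towards the better arm.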

Actor-critic methods combine value-based and policy-gradient approaches: an actor network proposes actions while a critic network estimates their value, reducing variance in gradient estimates. Advantage Actor-Critic (A2C) and Soft Actor-Critic (SAC) are well-known actor-critic algorithms.
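A tabular one-step actor-critic shows how the two components interact. This is a simplified sketch under stated assumptions: the corridor environment, the step sizes, the per-episode step cap, and all names are hypothetical, tables stand in for the actor and critic networks, and the discount weighting that some textbook presentations apply to the actor update is omitted.

```python
import math
import random

N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = (-1, +1)  # left, right


def step(state, action):
    """Hypothetical corridor dynamics: reward 1.0 only on reaching the goal."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def action_probs(prefs):
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic(episodes=2000, alpha_actor=0.1, alpha_critic=0.1,
                 max_steps=200, seed=0):
    rng = random.Random(seed)
    prefs = [[0.0, 0.0] for _ in range(N_STATES)]  # actor: action preferences
    values = [0.0] * N_STATES                      # critic: state-value estimates
    for _ in range(episodes):
        state = 0
        for _ in range(max_steps):
            probs = action_probs(prefs[state])
            idx = rng.choices((0, 1), weights=probs)[0]
            next_state, reward, done = step(state, ACTIONS[idx])
            # TD error: how much better the step was than the critic expected.
            bootstrap = 0.0 if done else GAMMA * values[next_state]
            td_error = reward + bootstrap - values[state]
            values[state] += alpha_critic * td_error       # critic update
            for k in range(2):                              # actor update
                grad_log_pi = (1.0 if k == idx else 0.0) - probs[k]
                prefs[state][k] += alpha_actor * td_error * grad_log_pi
            if done:
                break
            state = next_state
    return prefs, values


prefs, values = actor_critic()
print([round(action_probs(p)[1], 3) for p in prefs[:GOAL]])
```

The actor update has the same form as a policy-gradient step, but the raw return is replaced by the critic's TD error, which is the variance-reduction mechanism the paragraph above refers to.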

Model-based methods learn an internal model of the environment's dynamics and use it to plan ahead. This improves sample efficiency, but the resulting policy is only as good as the learned model: errors in the model can mislead the planner.
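The two stages of a model-based method, learning a model and then planning with it, can be sketched as follows. The example rests on several assumptions: a deterministic toy corridor, the ability to sample any state-action pair directly, and value iteration as the planner. All names are hypothetical.

```python
import random

N_STATES, GOAL, GAMMA = 5, 4, 0.9
ACTIONS = (-1, +1)  # left, right


def true_step(state, action):
    """The real environment; the planner never calls this directly."""
    next_state = min(max(state + action, 0), N_STATES - 1)
    done = next_state == GOAL
    return next_state, (1.0 if done else 0.0), done


def learn_model(samples=200, seed=0):
    """Record observed outcomes of randomly sampled state-action pairs."""
    rng = random.Random(seed)
    model = {}
    for _ in range(samples):
        s = rng.randrange(GOAL)  # any non-terminal state
        a = rng.choice(ACTIONS)
        model[(s, a)] = true_step(s, a)
    return model


def plan(model, sweeps=50):
    """Value iteration using only the learned model."""
    values = [0.0] * N_STATES
    for _ in range(sweeps):
        for s in range(GOAL):
            backups = []
            for a in ACTIONS:
                if (s, a) in model:  # pairs never observed are skipped
                    s2, r, done = model[(s, a)]
                    backups.append(r + (0.0 if done else GAMMA * values[s2]))
            if backups:
                values[s] = max(backups)
    return values


values = plan(learn_model())
print(values)
```

Once the model is learned, planning costs no further environment interaction, which is where the sample-efficiency gain comes from. If the recorded transitions were wrong or incomplete, the planned values would inherit those errors.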

Deep Reinforcement Learning

The integration of deep neural networks with RL, termed deep reinforcement learning (deep RL), has dramatically expanded the scale of problems RL can address. Notable achievements include DeepMind's AlphaGo (2016), which defeated top professional Go players; AlphaZero (2017), which reached superhuman play in Go, chess, and shogi through self-play; and OpenAI Five, which defeated professional Dota 2 players in 2019.

Between 2024 and 2026, reinforcement learning with verifiable rewards (RLVR) emerged as a significant trend. In RLVR, models are trained on tasks whose correctness can be checked automatically, such as mathematics, coding, and logical reasoning, without requiring human raters. OpenAI's o1 and o3 reasoning models, which OpenAI describes as trained with large-scale reinforcement learning, are commonly cited examples of the gains this style of training can deliver on mathematical and scientific benchmarks.
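A verifiable reward is simply a function that checks an output against a known answer. The sketch below is a hypothetical checker for numeric answers. The `Answer:` line convention and the function name are assumptions for illustration and do not describe any particular lab's training setup.

```python
from fractions import Fraction


def verifiable_reward(model_output: str, reference: str) -> float:
    """Return 1.0 if the last line of the output equals the reference number."""
    lines = model_output.strip().splitlines()
    if not lines:
        return 0.0
    # Assumed convention: the final line looks like "Answer: <number>".
    final = lines[-1].strip().removeprefix("Answer:").strip()
    try:
        # Fractions compare exactly, so "1/2" and "0.5" count as equal.
        return 1.0 if Fraction(final) == Fraction(reference) else 0.0
    except (ValueError, ZeroDivisionError):
        return 0.0  # unparseable answers earn no reward


print(verifiable_reward("2 + 2 = 4\nAnswer: 4", "4"))
print(verifiable_reward("Answer: five", "5"))
```

Because the check is automatic, the reward can be computed for millions of attempts without a human in the loop. It is also sparse and binary, so the same challenges of sparse rewards and reward hacking apply.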

Reinforcement Learning from Human Feedback

Reinforcement learning from human feedback (RLHF) is a specialised variant in which the reward signal is derived from human preference judgements rather than a programmatic function. In practice, a reward model is usually trained on human comparisons between model outputs, and the policy is then optimised against that learned reward. RLHF has become a cornerstone technique for aligning large language models with human values and instructions, used in the training of models such as ChatGPT, Claude, and Gemini.
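One common way to turn human preferences into a reward signal is to fit a reward model to pairwise comparisons with a Bradley-Terry style loss. The article does not specify a loss, so the choice here is an assumption, and the function name and scalar inputs are hypothetical; a real reward model would produce these scores from a neural network.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# The loss is low when the reward model scores the human-preferred
# response higher, and high when it ranks the pair the wrong way round.
print(preference_loss(2.0, 0.0))
print(preference_loss(0.0, 2.0))
```

Minimising this loss over many labelled pairs pushes the reward model to score preferred responses above rejected ones; the trained model then supplies the reward for a policy-optimisation step.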

Applications

RL has found application across a wide range of domains. In robotics, RL enables robots to learn manipulation tasks and locomotion policies directly from interaction. In autonomous driving, RL helps vehicles learn lane-following, merging, and parking strategies in simulation. In resource management, RL has been deployed to optimise data centre cooling (Google, 2016), reducing the energy used for cooling by up to 40 percent. In healthcare, RL informs adaptive treatment strategies and dosing protocols for chronic diseases. In finance, RL-based agents are used for portfolio optimisation and algorithmic trading.

The global reinforcement learning market was estimated at over USD 122 billion in 2025, reflecting accelerating enterprise adoption across these verticals.

Challenges

Despite its successes, RL faces several enduring challenges. Sample efficiency — the quantity of environment interaction required to learn a good policy — remains a limiting factor, particularly for real-world physical systems where data collection is expensive or dangerous. Reward hacking occurs when agents find unexpected ways to maximise reward that do not reflect the designer's intent, a manifestation of misalignment. Sparse rewards complicate training when positive feedback is rare. Generalisation across environments with different dynamics is an active research area.

Malaysian Context — Reinforcement Learning Adoption

Reinforcement learning is increasingly relevant to Malaysia's manufacturing and logistics sectors, both of which are priorities under the Malaysia Digital Economy Blueprint (MyDigital). Semiconductor and electronics manufacturers in Penang, Selangor, and Johor — including facilities operated by Intel, Infineon, and Bosch — are exploring RL-based process control to reduce defect rates and optimise throughput on production lines.

Petronas, Malaysia's national oil company, has investigated RL for drilling optimisation and predictive maintenance scheduling in offshore operations, where the cost of unplanned downtime is substantial. Similarly, Tenaga Nasional Berhad (TNB) has explored RL for grid load balancing as Malaysia integrates more renewable energy sources into its electricity network under the National Energy Transition Roadmap.

In logistics, Grab Malaysia and Pos Malaysia have applied RL-influenced approaches to route optimisation and last-mile delivery scheduling, domains where the agent-environment framing maps naturally onto vehicle dispatch decisions. Malaysia Airports Holdings Berhad (MAHB) has looked at RL for gate assignment and passenger flow management at KLIA.

From a talent and research perspective, Universiti Malaya (UM), Universiti Teknologi Malaysia (UTM), and Universiti Sains Malaysia (USM) have active research groups publishing in deep RL. MDEC has included RL as part of its AI upskilling curriculum under the Digital Skills for All initiative, and HRD Corp-accredited providers offer reinforcement learning modules within their machine learning professional development programmes.

Malaysia's proximity to Singapore, where research institutions such as A*STAR actively publish on RL for robotics, creates opportunities for cross-border collaboration within the ASEAN AI ecosystem. As Malaysia's National AI Office develops its implementation roadmap, RL is positioned as a priority technology for autonomous systems and industrial optimisation use cases.

References

Sutton, R. S., & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press.
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
Silver, D., et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.
Schulman, J., et al. (2017). Proximal policy optimization algorithms. arXiv:1707.06347.
DataRoot Labs. (2025). The State of Reinforcement Learning in 2025. datarootlabs.com.

Tags: reinforcement learning, RL, reward, agent, policy

Type	Machine learning paradigm
Key researchers	Richard Sutton, Andrew Barto, David Silver
Seminal work	Sutton & Barto, Reinforcement Learning: An Introduction (1998)
Key use	Game playing, robotics, autonomous vehicles, resource management
Related	Deep learning, RLHF, Q-learning, policy gradient methods
