Q-Learning

Q-learning is a model-free, off-policy reinforcement learning algorithm that learns the value of taking a given action in a given state, enabling an agent to derive an optimal policy through trial and error.

5 min read | Last updated June 2026 | Foundations

Q-learning is a value-based, model-free, and off-policy reinforcement learning algorithm that enables an agent to learn how to act optimally in an environment through trial and error. The letter Q stands for quality, representing how valuable it is to take a particular action in a particular state. By repeatedly interacting with its environment and observing rewards, a Q-learning agent gradually estimates these action values and uses them to choose actions that maximise long-term reward. The algorithm was introduced by Christopher Watkins in 1989 and remains a foundational method in reinforcement learning.

Core idea

Reinforcement learning problems are typically framed as a Markov decision process, in which an agent in a given state takes an action, receives a reward, and transitions to a new state. The goal is to find a policy, a mapping from states to actions, that maximises the cumulative discounted reward over time. Q-learning approaches this by learning a function that assigns a value to each state-action pair, called a Q-value.
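As a small illustration of the cumulative discounted reward mentioned above, the sketch below sums a list of rewards with each later reward weighted by a further factor of gamma. The function name `discounted_return` and the sample rewards are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma):
    """Return r0 + gamma*r1 + gamma^2*r2 + ... for a finite reward sequence."""
    total = 0.0
    # Work backwards so each step only needs one multiply and one add.
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical three-step episode with gamma = 0.5:
# 1.0 + 0.5 * 0.0 + 0.25 * 2.0 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
print(example_return)  # 1.5
```

A gamma close to zero makes the agent short-sighted, while a gamma close to one makes distant rewards count almost as much as immediate ones.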

The algorithm updates these estimates using a rule derived from the Bellman optimality equation. After taking an action in a state and observing the reward and next state, the agent moves its estimate of that state-action value part of the way toward a target: the observed reward plus the discounted value of the best action available in the next state. Written informally, Q(s, a) is updated to Q(s, a) + alpha * (reward + gamma * max_a' Q(s', a') - Q(s, a)), where alpha is a learning rate between zero and one that sets the step size and gamma is a discount factor between zero and one that weights future rewards. In the tabular setting, these updates converge to the optimal values provided every state-action pair continues to be visited and the learning rate decays appropriately.
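The update described above can be sketched as a single function over a dictionary of Q-values. This is a minimal sketch, not a full training loop: the function name `q_update`, the state and action labels, and the default values of the learning rate `alpha` (the step size) and `gamma` are assumptions chosen for illustration, and terminal states are not handled.

```python
from collections import defaultdict


def q_update(q, state, action, reward, next_state, actions, alpha=0.5, gamma=0.9):
    """Apply one tabular Q-learning update in place and return the new estimate."""
    # Value of the best action available in the next state.
    best_next = max(q[(next_state, a)] for a in actions)
    # Target: observed reward plus discounted best next value.
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha of the way toward the target.
    q[(state, action)] += alpha * (target - q[(state, action)])
    return q[(state, action)]


q = defaultdict(float)            # unseen state-action pairs start at 0.0
actions = ["left", "right"]       # hypothetical action set
q[("s1", "right")] = 2.0          # pretend this value was learned earlier

# Target is 1.0 + 0.9 * 2.0 = 2.8, so the estimate moves from 0.0 to about 1.4.
new_value = q_update(q, "s0", "right", reward=1.0, next_state="s1", actions=actions)
print(new_value)
```

Repeating this update over many transitions is what gradually fills in the table of Q-values.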

Off-policy learning and exploration

Q-learning is described as off-policy because it learns the value of the optimal policy independently of the actions the agent actually takes while exploring. The update always assumes the best possible action is chosen in the next state, even if the agent's behaviour during training follows a different, more exploratory strategy. This separation between the behaviour policy used to gather experience and the target policy being learned is the defining feature of off-policy methods, and it allows experience to be reused efficiently.
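One way to see the off-policy property is to compare the Q-learning target with the target used by SARSA, an on-policy method described in Sutton and Barto that is brought in here purely for contrast. In this sketch the function names, states, and values are illustrative assumptions.

```python
def q_learning_target(q, reward, next_state, actions, gamma):
    """Off-policy target: assumes the best next action, whatever the agent does."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the behaviour policy actually took."""
    return reward + gamma * q[(next_state, next_action)]


q = {("s1", "left"): 0.0, ("s1", "right"): 4.0}
actions = ["left", "right"]

# Suppose exploration made the agent pick "left" in s1, the worse action.
off_policy = q_learning_target(q, reward=1.0, next_state="s1", actions=actions, gamma=0.5)
on_policy = sarsa_target(q, reward=1.0, next_state="s1", next_action="left", gamma=0.5)

print(off_policy)  # 3.0, unaffected by the exploratory choice
print(on_policy)   # 1.0, reflects the exploratory choice
```

Because the Q-learning target ignores which action was actually taken next, the same stored transition stays valid however the behaviour policy changes.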

To gather useful experience, Q-learning commonly uses an epsilon-greedy strategy, which chooses the action with the highest estimated value most of the time but occasionally selects a random action. This balances exploitation of known good actions against exploration of alternatives that might prove better.
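A minimal sketch of epsilon-greedy selection follows, assuming Q-values are stored in a dictionary keyed by (state, action) and that missing entries count as zero. The function name `epsilon_greedy` and the sample values are hypothetical.

```python
import random


def epsilon_greedy(q, state, actions, epsilon, rng=random):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        # Explore: ignore the estimates and sample uniformly.
        return rng.choice(actions)
    # Exploit: choose the action with the highest estimated value.
    return max(actions, key=lambda a: q.get((state, a), 0.0))


q = {("s0", "left"): 0.2, ("s0", "right"): 0.7}
actions = ["left", "right"]

greedy_choice = epsilon_greedy(q, "s0", actions, epsilon=0.0)
print(greedy_choice)  # right, because epsilon = 0 never explores

rng = random.Random(0)  # seeded generator so runs are repeatable
explored = [epsilon_greedy(q, "s0", actions, epsilon=1.0, rng=rng) for _ in range(5)]
print(explored)  # five uniformly random picks from the action list
```

In practice epsilon is often started high and reduced over training, so the agent explores widely at first and relies more on its estimates later.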

| Property | Q-learning |
| --- | --- |
| Model required | No (model-free) |
| Policy type | Off-policy |
| Value learned | State-action (Q) values |
| Exploration | Typically epsilon-greedy |

From tables to deep Q-networks

In its classic form, Q-learning stores values in a table with one entry per state-action pair, which works only when the number of states and actions is small. For large or continuous state spaces, such as raw images, this becomes infeasible. The deep Q-network (DQN), described by DeepMind in a 2013 preprint and published in Nature in 2015, replaced the table with a neural network that approximates Q-values, enabling agents to learn to play Atari games directly from pixels. This combination of Q-learning with deep learning helped launch the modern field of deep reinforcement learning.
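The move from a table to a function approximator can be sketched without a neural network. The example below swaps in a linear approximator, with one weight vector per action, as a deliberately simplified stand-in for the network used in a deep Q-network. It also omits the experience replay and target network that DQN relies on for stability. All names, feature vectors, and hyperparameters are illustrative assumptions.

```python
def q_value(weights, features, action):
    """Approximate Q(s, a) as a dot product of per-action weights and state features."""
    return sum(w * x for w, x in zip(weights[action], features))


def approx_q_update(weights, features, action, reward, next_features,
                    alpha=0.1, gamma=0.9):
    """One semi-gradient Q-learning step for a linear approximator."""
    # Same target as the tabular rule, but values come from the approximator.
    best_next = max(q_value(weights, next_features, a) for a in weights)
    td_error = reward + gamma * best_next - q_value(weights, features, action)
    # For a linear model, the gradient with respect to the weights is the feature vector.
    weights[action] = [w + alpha * td_error * x
                       for w, x in zip(weights[action], features)]
    return td_error


weights = {"left": [0.0, 0.0], "right": [0.0, 0.0]}  # hypothetical two-feature state
td = approx_q_update(weights, [1.0, 0.5], "right", reward=1.0,
                     next_features=[0.0, 1.0])
print(td)                # 1.0, since all estimates start at zero
print(weights["right"])  # weights nudged in the direction of the features
```

Because the weights are shared across states, one update changes the estimate for every state with similar features, which is what lets this approach scale beyond a table.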

Malaysian Context — Reinforcement Learning Research and Applications

Q-learning and reinforcement learning more broadly are research and teaching topics at Malaysian universities, including Universiti Malaya, Universiti Sains Malaysia, Universiti Teknologi Malaysia, and Universiti Teknologi PETRONAS, where they appear in robotics, control systems, and intelligent systems courses. MIMOS, the national applied research institute, also explores adaptive and autonomous systems relevant to these methods.

Applied use of reinforcement learning in Malaysia is concentrated in areas such as robotics and automation in manufacturing, energy management, and logistics optimisation. The electronics and semiconductor plants in Penang and the Klang Valley use automation that can benefit from learning-based control, while logistics and ride-hailing operators such as Grab, which was founded in Malaysia and is now headquartered in Singapore, work on routing and resource allocation problems where reinforcement learning techniques are relevant. Petronas and energy utilities explore optimisation of industrial processes and grid operations.

Practical deployment in Malaysia faces the same constraints seen globally: reinforcement learning is data-hungry and often requires accurate simulators. Expanding local data centre capacity in Johor and Cyberjaya and improving access to GPU compute are therefore important enablers. Talent development through the Human Resource Development Corporation (HRD Corp) and MDEC supports the growth of expertise in these advanced methods.

For most Malaysian organisations, reinforcement learning remains a specialised tool applied to well-defined optimisation and control problems rather than a general-purpose technology, but its importance grows as autonomous systems and industrial automation expand.

References

Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). MIT Press.
Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
DataCamp. An Introduction to Q-Learning: A Tutorial For Beginners.

Tags: Q-learning, reinforcement learning, off-policy, Bellman equation, value-based

Type	Reinforcement learning algorithm
Category	Value-based, model-free, off-policy
Proposed by	Christopher Watkins
Year	1989
Foundation	Bellman optimality equation
Related	Markov decision process, DQN