Active Learning

Active learning is a machine learning paradigm in which the algorithm selectively queries a human annotator for labels on the most informative data points, minimising labelling effort while maximising model performance.

6 min read | Last updated June 2026 | Infrastructure

Active learning is a machine learning paradigm in which the learning algorithm can interactively query a human annotator, referred to as an oracle, to obtain labels for selected data points. Rather than training on a pre-labelled dataset of arbitrary composition, an active learning system selects the specific examples it most needs labelled, typically those about which the current model is most uncertain or those expected to yield the greatest improvement in model performance. The central premise is that a model trained on a strategically selected subset of labelled data can match the performance of a model trained on a much larger labelled set chosen at random, thereby reducing the human labelling effort required to reach a given performance level.

Motivation

Obtaining labelled training data is frequently the most expensive and time-consuming part of deploying a supervised machine learning system. In domains such as medical imaging, legal document analysis, and scientific research, labelling requires expert annotators who are scarce and costly. Active learning addresses this bottleneck by ensuring that human annotation effort is concentrated on the examples that most advance model learning.

Active learning contrasts with passive learning, in which training examples are drawn at random from the available pool without regard to their informativeness. Active learning has been reported to reduce substantially the number of labelled examples needed to reach a given performance threshold, in some tasks by an order of magnitude or more, although the size of the gain depends on the task, the model, and the query strategy.

Query Strategies

The core algorithmic challenge in active learning is defining a query strategy — the criterion by which the model selects which unlabelled examples to present to the annotator.

Uncertainty sampling is the most commonly used strategy. The model selects examples for which it has the lowest confidence in its prediction. For binary classifiers, this means selecting examples whose predicted probability is closest to 0.5. For multi-class classifiers, uncertainty can be measured using the entropy of the predicted class distribution, the margin between the top two predicted class probabilities, or the probability of the most likely class (least confidence).
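The three multi-class uncertainty measures described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than a reference implementation: the function names and the `pool_probs` values are hypothetical, and a real system would obtain the probabilities from its own classifier.

```python
import math

def least_confidence(probs):
    """1 minus the probability of the most likely class (higher = more uncertain)."""
    return 1.0 - max(probs)

def margin(probs):
    """Gap between the two largest class probabilities (smaller = more uncertain)."""
    top, second = sorted(probs, reverse=True)[:2]
    return top - second

def entropy(probs):
    """Shannon entropy in nats of the predicted distribution (higher = more uncertain)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def most_uncertain(pool_probs, k, score=entropy):
    """Return the indices of the k pool items with the highest score."""
    ranked = sorted(range(len(pool_probs)),
                    key=lambda i: score(pool_probs[i]), reverse=True)
    return ranked[:k]

# Hypothetical predicted distributions for three unlabelled examples.
pool_probs = [
    [0.98, 0.01, 0.01],  # confident prediction
    [0.40, 0.35, 0.25],  # probability mass spread across classes
    [0.70, 0.20, 0.10],  # in between
]

picked_by_entropy = most_uncertain(pool_probs, k=1)
# Margin is "smaller = more uncertain", so it is negated before ranking.
picked_by_margin = most_uncertain(pool_probs, k=1, score=lambda p: -margin(p))
```

In this toy pool, both rankings select the second example, whose predicted distribution is the flattest.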

Query by committee involves maintaining an ensemble of models trained on the same labelled set, and selecting examples on which the models disagree most. The disagreement can be measured by vote entropy or Kullback-Leibler divergence between models' predictive distributions. This strategy is less sensitive to miscalibrated uncertainty estimates from individual models.
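As a rough sketch of the two disagreement measures just mentioned, the following assumes a committee whose members each output either a hard vote or a class-probability vector for a single example. The function names are illustrative only, and the KL variant shown here averages each member's divergence from the committee's mean distribution, which is one common formulation among several.

```python
import math
from collections import Counter

def vote_entropy(votes):
    """Entropy (in nats) of the committee's hard votes for one example."""
    n = len(votes)
    return -sum((c / n) * math.log(c / n) for c in Counter(votes).values())

def mean_kl_to_consensus(member_probs):
    """Average KL divergence from each member's distribution to the committee mean."""
    n_members = len(member_probs)
    n_classes = len(member_probs[0])
    consensus = [sum(p[c] for p in member_probs) / n_members
                 for c in range(n_classes)]
    total = 0.0
    for p in member_probs:
        # Terms with p[c] == 0 contribute nothing to the divergence.
        total += sum(p[c] * math.log(p[c] / consensus[c])
                     for c in range(n_classes) if p[c] > 0)
    return total / n_members

# Hypothetical committee outputs for two unlabelled examples.
agreeing_votes = ["spam", "spam", "spam"]
split_votes = ["spam", "ham", "spam"]
agreeing_probs = [[0.7, 0.3], [0.7, 0.3]]
split_probs = [[0.9, 0.1], [0.2, 0.8]]
```

An example on which the committee splits scores higher on both measures and would be queried first.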

Diversity-based strategies aim to select examples that are representative of the unlabelled data distribution, preventing the queried set from being dominated by a narrow region of the input space. Core-set selection, for example, seeks to find a small labelled set such that the maximum distance from any unlabelled point to its nearest labelled point is minimised.
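Finding the optimal core-set is computationally hard in general, so a greedy k-center heuristic is a common approximation. The sketch below assumes Euclidean distance over small feature tuples and at least one already-labelled point; the function name, the points, and the budget are hypothetical.

```python
import math

def k_center_greedy(points, labelled_idx, budget):
    """Greedily pick `budget` indices, each time taking the point that is
    farthest from its nearest labelled or already-chosen point."""
    # Distance from every point to its nearest labelled point.
    min_dist = [min(math.dist(p, points[j]) for j in labelled_idx)
                for p in points]
    chosen = []
    for _ in range(budget):
        far = max(range(len(points)), key=lambda i: min_dist[i])
        chosen.append(far)
        # The new centre may now be the nearest one for some points.
        for i, p in enumerate(points):
            min_dist[i] = min(min_dist[i], math.dist(p, points[far]))
    return chosen

# Hypothetical 2-D feature vectors; only index 0 is labelled so far.
points = [(0, 0), (1, 0), (10, 0), (11, 0), (5, 5)]
to_label = k_center_greedy(points, labelled_idx=[0], budget=2)
```

With these values the heuristic first picks the far cluster at (11, 0) and then the isolated point at (5, 5), spreading the labelling budget across the input space instead of concentrating it near the existing label.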

Expected model change and expected error reduction strategies select examples expected to cause the largest update to the model parameters or the largest reduction in generalisation error, respectively, though these are computationally expensive to evaluate directly.
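One tractable instance of expected model change is expected gradient length: the norm of the loss gradient that labelling an example would produce, averaged over the model's own belief about the label. The sketch below assumes a binary logistic regression model without a bias term, for which the log-loss gradient with respect to the weights is (p - y) x; the function names and input values are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def expected_gradient_length(weights, x):
    """Expected norm of the log-loss gradient for x under the model's own
    predicted label distribution (binary logistic regression, no bias)."""
    p = sigmoid(sum(w * xi for w, xi in zip(weights, x)))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    # The gradient for label y is (p - y) * x, so its norm is |p - y| * ||x||.
    norm_if_positive = abs(p - 1.0) * norm_x
    norm_if_negative = abs(p - 0.0) * norm_x
    return p * norm_if_positive + (1.0 - p) * norm_if_negative

# Hypothetical weights and candidate examples.
weights = [0.0, 0.0]
score_far = expected_gradient_length(weights, [3.0, 4.0])
score_near = expected_gradient_length(weights, [0.3, 0.4])
```

For this model the score simplifies to 2p(1 - p)‖x‖, so it favours examples that are both uncertain and large in norm. For deep networks the same quantity requires a gradient computation per candidate label, which is the cost noted above.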

Sampling Settings

Active learning can be structured in several ways depending on data availability. In pool-based active learning, the algorithm has access to a large pool of unlabelled data and selects the most informative subset to query. This is the most common setting in practice. In stream-based active learning, data arrives as a stream and the algorithm must decide in real time whether to query the label for each incoming instance. In membership query synthesis, the model may generate entirely new hypothetical instances to query, though this is less common in practice due to the challenge of generating realistic inputs.
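The stream-based setting differs from the pool-based one mainly in that each instance is seen once and the query decision is made immediately. A minimal sketch, assuming a binary classifier that emits a positive-class probability for each arriving item, a fixed uncertainty threshold, and a fixed labelling budget (all hypothetical parameters):

```python
def stream_queries(prob_stream, threshold=0.2, budget=2):
    """Return the stream positions whose labels should be requested.

    An item is queried when its predicted positive-class probability lies
    within `threshold` of 0.5, until the labelling budget is spent.
    """
    asked = []
    for position, p_positive in enumerate(prob_stream):
        if len(asked) >= budget:
            break
        if abs(p_positive - 0.5) < threshold:
            asked.append(position)
    return asked

# Hypothetical probabilities for five items arriving in order.
incoming = [0.95, 0.55, 0.10, 0.48, 0.52]
queried = stream_queries(incoming)
```

In practice the model would usually be updated as labels arrive, so later probabilities would depend on earlier queries; that retraining step is omitted here for brevity.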

Human-in-the-Loop Integration

Active learning is closely related to human-in-the-loop (HITL) machine learning, in which human judgement is incorporated into the model training or deployment pipeline. Annotation platforms such as Label Studio, Prodigy, and Scale AI's data engine support iterative workflows in which annotators label model-selected batches, a new model is trained on the enlarged labelled set, and the cycle repeats until a target performance level is reached.
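The select, label, retrain cycle can be illustrated with a deliberately small toy problem. The sketch assumes one-dimensional integer inputs, a hidden decision boundary known only to the oracle, and a "model" that is nothing more than an estimated threshold; none of these names correspond to the API of any annotation tool.

```python
def fit_threshold(labelled):
    """Toy model: midpoint between the largest labelled negative
    and the smallest labelled positive."""
    neg = max(x for x, y in labelled if y == 0)
    pos = min(x for x, y in labelled if y == 1)
    return (neg + pos) / 2

def active_learning_loop(pool, oracle, seed, rounds, batch_size=1):
    labelled = list(seed)
    remaining = list(pool)
    for _ in range(rounds):
        if not remaining:
            break
        boundary = fit_threshold(labelled)  # retrain on all labels so far
        # Most uncertain items are those closest to the current boundary.
        remaining.sort(key=lambda x: abs(x - boundary))
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        # The oracle stands in for the human annotator.
        labelled.extend((x, oracle(x)) for x in batch)
    return fit_threshold(labelled), labelled

def hidden_oracle(x):
    """Hypothetical ground truth: positive from 60 upwards."""
    return 1 if x >= 60 else 0

boundary, labelled = active_learning_loop(
    pool=range(1, 100), oracle=hidden_oracle,
    seed=[(0, 0), (100, 1)], rounds=8)
```

With these toy values the loop settles on the boundary between 59 and 60 after requesting 8 labels from a pool of 99, because each query roughly halves the region in which the boundary could lie. Real tasks are rarely this clean, but the control flow is the same.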

Deep active learning extends these approaches to neural networks, where uncertainty can be estimated through methods such as Monte Carlo dropout (using dropout at inference time to obtain a distribution over predictions) or ensemble methods. Bayesian deep learning frameworks provide principled uncertainty estimates that can be used directly as acquisition functions.
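Given several stochastic forward passes for one example, for instance with dropout left active at inference time, two commonly used acquisition scores are the predictive entropy of the averaged prediction and the mutual information between the prediction and the model parameters (often called BALD). The sketch below assumes the passes have already been collected as class-probability vectors; the network itself is not shown and the values are hypothetical.

```python
import math

def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def mc_dropout_scores(passes):
    """passes: list of T class-probability vectors from T stochastic passes.

    Returns (predictive_entropy, mutual_information).
    """
    t = len(passes)
    n_classes = len(passes[0])
    mean = [sum(p[c] for p in passes) / t for c in range(n_classes)]
    predictive_entropy = entropy(mean)
    expected_entropy = sum(entropy(p) for p in passes) / t
    return predictive_entropy, predictive_entropy - expected_entropy

# Passes that agree on an ambiguous prediction: high entropy, no disagreement.
ambiguous = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
# Passes that are individually confident but contradict each other.
conflicting = [[1.0, 0.0], [0.0, 1.0]]

ambiguous_scores = mc_dropout_scores(ambiguous)
conflicting_scores = mc_dropout_scores(conflicting)
```

Both examples have the same predictive entropy, but only the second has high mutual information. This is why the two scores can rank a pool differently: mutual information targets disagreement between passes instead of ambiguity that every pass shares.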

Applications

Active learning has been applied to text classification, named entity recognition, relation extraction, image classification, object detection, medical image segmentation, speech recognition, and anomaly detection. In medical imaging, where expert radiologist time is scarce, active learning has enabled the development of diagnostic models using substantially fewer labelled scans than would be required with passive random sampling. In autonomous vehicle development, active learning selects the most informative driving scenarios for human labelling from the vast quantities of data recorded during fleet operations.

Malaysian Context — Active Learning in Malaysian AI Development

Active learning is particularly relevant to Malaysia's AI development context because of the scarcity of labelled training data in the Malay language and domain-specific Malaysian corpora. The high cost of expert annotation has motivated Malaysian AI practitioners to adopt active learning approaches for tasks including Bahasa Malaysia sentiment analysis, Malaysian court document classification, and medical record coding.

The Malaysia Digital Economy Corporation (MDEC) and the Ministry of Digital have identified data quality and data annotation as key enablers for AI adoption. Programmes under the National AI Office have encouraged the use of intelligent annotation tools incorporating active learning to build Malaysian language datasets more efficiently.

In the healthcare sector, the Ministry of Health Malaysia and hospital systems such as Hospital Kuala Lumpur have explored active learning for medical imaging annotation, where radiologist time is at a premium. By prioritising uncertain or diagnostically complex cases for expert review, active learning systems can reduce the number of scans requiring manual annotation to achieve a given diagnostic accuracy threshold.

Malaysian fintech companies and Bank Negara Malaysia (BNM)-regulated financial institutions have applied active learning to fraud detection model development, where labelled fraud examples are rare relative to the volume of legitimate transactions. Selectively querying labels for transactions near decision boundaries accelerates construction of effective fraud classifiers without requiring exhaustive manual review of transaction logs.

HRD Corp-funded AI training programmes offered by Malaysian technology providers have begun including active learning as a module within MLOps and data engineering curricula, recognising it as a practical tool for reducing the cost of AI deployment in the Malaysian enterprise context.

Tags: active-learning, data-labelling, annotation, machine-learning, mlops

Type	Semi-supervised machine learning paradigm
Key benefit	Reduced annotation cost
Common strategies	Uncertainty sampling, query by committee, diversity sampling
Related concepts	Data labelling, semi-supervised learning, human-in-the-loop
Applications	NLP, computer vision, medical imaging, fraud detection