Semi-Supervised Learning

Type: Machine learning paradigm
Uses: Small labelled set + large unlabelled set
Key methods: Pseudo-labelling, consistency regularisation, self-training
Motivation: Labelling data is costly; unlabelled data is abundant
Related: Supervised learning, self-supervised learning, active learning

Semi-supervised learning is a category of machine learning that sits between supervised learning, which relies entirely on labelled examples, and unsupervised learning, which uses none. A semi-supervised algorithm learns from a dataset in which only a small fraction of the examples carry labels while the majority remain unlabelled. The goal is to exploit the structure contained in the unlabelled data to build a more accurate model than the labelled examples alone would allow.

Why It Matters

In many real-world settings, collecting raw data is inexpensive but annotating it is slow, costly, or requires scarce expertise. Medical images must be labelled by clinicians, legal documents by lawyers, and speech recordings by trained transcribers. A hospital may hold millions of scans but only a few thousand annotated by radiologists. Semi-supervised learning addresses this imbalance by allowing a model to extract useful signal from the abundant unlabelled portion, reducing the number of labels needed to reach a target level of accuracy.

Underlying Assumptions

Semi-supervised methods work only when the unlabelled data carries information about the labelling task. Three assumptions are commonly invoked. The smoothness assumption holds that points close together in feature space are likely to share a label. The cluster assumption holds that data tends to form distinct clusters and that points in the same cluster usually belong to the same class, so decision boundaries should pass through low-density regions. The manifold assumption holds that high-dimensional data lies on a lower-dimensional manifold, and modelling that manifold simplifies classification. When these assumptions fail, unlabelled data can degrade rather than improve performance.

Common Techniques

Several families of methods dominate practice. Self-training, one of the oldest, trains an initial model on the labelled data, uses it to predict labels for unlabelled examples, and adds the most confident predictions, called pseudo-labels, back into the training set before retraining; the cycle repeats until no further confident predictions remain or a round limit is reached. Co-training extends this idea by training two models on different views of the data, meaning distinct feature sets that each suffice for prediction, and letting each model supply confident pseudo-labels to the other.
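
The self-training loop can be sketched in a few lines. The example below is a minimal illustration, not a reference implementation: the nearest-centroid classifier, the softmax-over-distances confidence score, the function names, the 0.9 threshold, and the toy 2-D points are all assumptions chosen to keep the code self-contained.

```python
import math

def centroids(points, labels):
    """Fit a nearest-centroid classifier: one mean point per class."""
    sums, counts = {}, {}
    for (x, y), c in zip(points, labels):
        sx, sy = sums.get(c, (0.0, 0.0))
        sums[c] = (sx + x, sy + y)
        counts[c] = counts.get(c, 0) + 1
    return {c: (sx / counts[c], sy / counts[c]) for c, (sx, sy) in sums.items()}

def predict_proba(model, point):
    """Class probabilities as a softmax over negative centroid distances."""
    scores = {c: -math.dist(point, centre) for c, centre in model.items()}
    top = max(scores.values())
    exps = {c: math.exp(s - top) for c, s in scores.items()}
    total = sum(exps.values())
    return {c: e / total for c, e in exps.items()}

def self_train(labelled, labels, unlabelled, threshold=0.9, max_rounds=10):
    """Repeatedly pseudo-label confident unlabelled points and refit."""
    points, targets = list(labelled), list(labels)
    pool = list(unlabelled)
    for _ in range(max_rounds):
        model = centroids(points, targets)
        remaining, added = [], 0
        for p in pool:
            probs = predict_proba(model, p)
            best = max(probs, key=probs.get)
            if probs[best] >= threshold:
                # Confident prediction: accept it as a pseudo-label.
                points.append(p)
                targets.append(best)
                added += 1
            else:
                remaining.append(p)
        pool = remaining
        if added == 0:
            break  # nothing new was accepted, so retraining would not change
    return centroids(points, targets), pool

# Toy data (assumed for illustration): two labelled points, five unlabelled.
labelled = [(0.0, 0.0), (4.0, 4.0)]
labels = ["a", "b"]
unlabelled = [(0.5, 0.2), (3.8, 4.1), (2.0, 2.0), (0.1, 0.6), (4.2, 3.7)]
model, leftover = self_train(labelled, labels, unlabelled)
```

In this toy run the four points near a labelled example are absorbed in the first round, while the point midway between the classes never reaches the threshold and stays in `leftover`. That is the intended behaviour: ambiguous examples are left out rather than given a guessed label.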

Consistency regularisation, central to many modern approaches, encourages a model to produce the same output when an unlabelled input is perturbed slightly, for example by augmentation or noise. Mean Teacher applies this idea by penalising disagreement between a student model and a teacher whose weights are an exponential moving average of the student's weights. FixMatch combines consistency with pseudo-labelling: a confident prediction on a weakly augmented input becomes the training target for a strongly augmented version of the same input. Both methods report strong results with very few labels. Graph-based methods represent examples as nodes and propagate labels across edges connecting similar points. Generative approaches model the joint distribution of inputs and labels so that unlabelled data helps estimate the input distribution.
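
To make the combination of pseudo-labelling and consistency concrete, the sketch below computes a FixMatch-style loss on unlabelled examples: the prediction on a weakly augmented view supplies a pseudo-label if it is confident enough, and the prediction on a strongly augmented view is scored against that label with cross-entropy. It is a simplified illustration under stated assumptions. The function names, the 0.95 threshold, and the hand-written logits are hypothetical, and the augmentation and model forward passes are assumed to have happened already.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fixmatch_unlabelled_loss(weak_logits, strong_logits, threshold=0.95):
    """Mean masked cross-entropy over a batch of unlabelled examples.

    weak_logits[i] and strong_logits[i] are model outputs for a weakly and
    a strongly augmented view of the same unlabelled example i.
    """
    if not weak_logits:
        return 0.0
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        weak_probs = softmax(weak)
        confidence = max(weak_probs)
        if confidence < threshold:
            continue  # low-confidence example is masked out (contributes 0)
        # The pseudo-label is treated as a fixed target, not a trainable output.
        pseudo_label = weak_probs.index(confidence)
        strong_probs = softmax(strong)
        total += -math.log(strong_probs[pseudo_label])
    # Average over the whole batch, including masked examples.
    return total / len(weak_logits)

# Hypothetical logits for two unlabelled examples and three classes.
weak = [[5.0, 0.0, 0.0], [0.2, 0.1, 0.0]]
strong = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]
loss = fixmatch_unlabelled_loss(weak, strong)
```

Only the first example passes the threshold here, so the second contributes nothing even though its strong-view prediction disagrees with its weak-view prediction. In a full training loop this term would be added to the ordinary supervised loss on the labelled batch.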

Relationship to Neighbouring Fields

Semi-supervised learning is closely related to but distinct from several neighbouring paradigms. Self-supervised learning creates its own supervisory signal from unlabelled data through pretext tasks and typically feeds into later fine-tuning. Active learning selects which unlabelled examples would be most valuable to label next and queries a human annotator. Transfer learning reuses representations learned on one task for another. In modern pipelines these techniques are often combined: a model may be pretrained self-supervised, adapted with semi-supervised objectives, and refined with actively chosen labels.

Limitations

The main risk is confirmation bias in self-training, where early mistakes are reinforced as the model repeatedly trains on its own erroneous pseudo-labels. Performance is also sensitive to the confidence threshold used to accept pseudo-labels (too low a threshold admits noisy labels, too high a threshold discards most of the unlabelled data) and to the quality of augmentation. Because gains depend on the alignment between the unlabelled distribution and the task, semi-supervised learning is not a universal substitute for labelled data but a way to use it more efficiently.
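
One way to examine threshold sensitivity is to take a held-out labelled set and measure, for each candidate threshold, what share of predictions would be accepted as pseudo-labels and what share of those would be correct. The sketch below assumes such a validation set exists; the function name and the confidence and correctness values are invented purely for illustration.

```python
def threshold_report(confidences, correct, thresholds):
    """For each threshold return (threshold, coverage, pseudo-label precision)."""
    rows = []
    for t in thresholds:
        kept = [ok for conf, ok in zip(confidences, correct) if conf >= t]
        coverage = len(kept) / len(confidences)
        # Precision is undefined when nothing passes the threshold.
        precision = sum(kept) / len(kept) if kept else None
        rows.append((t, coverage, precision))
    return rows

# Made-up model confidences and whether each prediction was right.
confidences = [0.99, 0.97, 0.93, 0.88, 0.81, 0.74, 0.66, 0.58]
correct = [True, True, True, False, True, False, True, False]
report = threshold_report(confidences, correct, [0.6, 0.8, 0.95])

for t, coverage, precision in report:
    print(f"threshold={t:.2f} coverage={coverage:.3f} precision={precision}")
```

On these toy values, raising the threshold increases precision while shrinking coverage. That trade-off is what has to be tuned in practice.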

Relevance to Malaysia

Semi-supervised learning is well suited to Malaysian organisations that possess large volumes of raw data but limited annotation budgets. Hospitals under the Ministry of Health hold extensive medical imaging and records, yet clinician time for labelling is scarce; semi-supervised techniques let diagnostic models learn from mostly unlabelled scans while requiring only a modest set of expert annotations. Similar dynamics apply in banking, where institutions supervised by Bank Negara Malaysia accumulate vast transaction logs but can label only a fraction as fraudulent.

Language technology is a particularly strong fit. Bahasa Malaysia, along with Manglish and code-switched text common in Malaysia, is under-resourced compared with English, so labelled corpora are limited. Research groups behind Malaysian large language models such as MaLLaM and ILMU, and institutions including MIMOS and Universiti Malaya, can use semi-supervised and self-supervised methods to make the most of abundant unlabelled local text and speech.

Talent and infrastructure programmes support this work. MDEC and HRD Corp fund data science and machine learning training, while local AI data centres provide the compute needed to run augmentation-heavy semi-supervised pipelines within national jurisdiction, aligning with PDPA expectations for sensitive datasets. For Malaysian firms weighing the cost of large labelling exercises, semi-supervised learning offers a practical path to competitive models at lower expense.

Chapelle, O., Schölkopf, B., and Zien, A. (2006). Semi-Supervised Learning. MIT Press.
Sohn, K., et al. (2020). FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence. arXiv:2001.07685.
Tarvainen, A., and Valpola, H. (2017). Mean Teachers Are Better Role Models. arXiv:1703.01780.
van Engelen, J., and Hoos, H. (2020). A Survey on Semi-Supervised Learning. Machine Learning, 109.