Knowledge Distillation

Knowledge distillation is a model compression technique in which a smaller student neural network is trained to replicate the behaviour of a larger, more capable teacher model, enabling deployment of efficient models that approximate teacher-level performance.

6 min read · Last updated May 2026 · Infrastructure

Knowledge distillation is a model compression and training technique in which a smaller, more efficient student model is trained to reproduce the behaviour of a larger, more accurate teacher model. Rather than training the student from scratch on raw labels, the student is trained to match the teacher's output distribution — including its confidence over incorrect classes — which conveys richer supervisory information than hard one-hot labels alone. The technique was introduced in its modern form by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in a 2015 paper, building on earlier work by Bucilua and colleagues from 2006.[^1]

Knowledge distillation has become essential in deploying large neural networks — including large language models, image classifiers, and speech recognition systems — to resource-constrained environments such as mobile phones, embedded systems, and IoT devices, where the full teacher model would be computationally or energetically infeasible.

Core Concept: Soft Targets and Dark Knowledge

The key insight of knowledge distillation is that the probability distribution produced by a softmax classifier over all possible output classes contains more information than the hard label for the correct class alone. Consider a teacher model classifying an image of a cat: the model might assign probability 0.85 to "cat," 0.08 to "tiger," and 0.04 to "lynx." These soft targets reflect the teacher's learned representation of inter-class similarities — the fact that cats resemble tigers more than they resemble chairs. Hinton termed this latent information dark knowledge.[^1]

When the student is trained to minimise the Kullback-Leibler divergence between its output distribution and the teacher's soft output distribution (in addition to the standard cross-entropy loss against true labels), it absorbs this inter-class relationship information, often achieving better generalisation than a student trained on hard labels alone — even when the student is much smaller.
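The combined objective described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: the weighting factor `alpha`, the function names, and the fourth "other" class (added so the example probabilities sum to 1) are all hypothetical and do not come from the original paper or this article.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(q, true_index, eps=1e-12):
    """Standard cross-entropy of distribution q against a hard (one-hot) label."""
    return -math.log(q[true_index] + eps)

def distillation_loss(student_probs, teacher_probs, true_index, alpha=0.5):
    """Weighted sum of the soft-target term and the hard-label term.

    alpha is an assumed mixing weight: 1.0 uses only the teacher's soft targets,
    0.0 uses only the true label.
    """
    soft = kl_divergence(teacher_probs, student_probs)
    hard = cross_entropy(student_probs, true_index)
    return alpha * soft + (1.0 - alpha) * hard

# Classes: cat, tiger, lynx, other. The 0.03 "other" mass is an assumption.
teacher = [0.85, 0.08, 0.04, 0.03]
student_a = [0.85, 0.08, 0.04, 0.03]   # matches the teacher's full distribution
student_b = [0.85, 0.01, 0.01, 0.13]   # same confidence in "cat", wrong similarities

print(distillation_loss(student_a, teacher, true_index=0))
print(distillation_loss(student_b, teacher, true_index=0))
```

Both hypothetical students assign the same probability to the correct class, so their hard-label loss is identical. Only the soft-target term separates them, which is the sense in which the teacher's distribution carries extra information.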

A temperature parameter T is applied to the softmax to soften the teacher's outputs:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the raw logits and higher values of T produce a softer, more uniform probability distribution, enriching the supervisory signal from the teacher.[^1]
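As a rough illustration of the formula above, the following sketch computes the temperature-scaled softmax in plain Python. The logit values and the function name are hypothetical, chosen only to show how the distribution flattens as T grows.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Compute p_i = exp(z_i / T) / sum_j exp(z_j / T) for a list of logits."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtracting the maximum leaves the result unchanged but avoids overflow.
    largest = max(scaled)
    exps = [math.exp(s - largest) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for four classes.
logits = [5.0, 2.6, 1.9, -1.0]

for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(f"T={T}: {[round(p, 3) for p in probs]}")
```

The ranking of classes is the same at every temperature. Only the spread changes, so the smaller probabilities become large enough to contribute to the student's training signal. Hinton et al. also note that the soft-target term is usually multiplied by T² so that its gradient magnitude stays comparable to the hard-label term when T changes.[^1]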

Variants and Extensions

Offline, Online, and Self-Distillation

In offline distillation, a teacher model is fully trained before the student training begins. In online distillation, teacher and student are trained simultaneously and mutually inform each other. In self-distillation, a model distils knowledge from its own earlier or deeper layers into shallower layers, improving performance without a separate teacher.

Feature-Based Distillation

Beyond output probabilities, distillation can align intermediate representations: the student is trained to match not only the teacher's final outputs but also the activations of specific intermediate layers. Methods such as FitNets[^2] and AT (Attention Transfer) use intermediate feature maps as additional supervisory signals, conveying richer structural information about how the teacher builds its representations.
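A feature-matching loss of this kind might look like the sketch below. It assumes the student's intermediate layer is narrower than the teacher's, so a linear projection (FitNets calls the learned mapping a regressor) first lifts the student features to the teacher's width. The weights here are fixed by hand for illustration; in practice they would be learned jointly with the student. All names and values are hypothetical.

```python
def project(features, weights):
    """Apply a linear map: each row of weights produces one output dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def hint_loss(student_features, teacher_features, weights):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_features, weights)
    if len(projected) != len(teacher_features):
        raise ValueError("projection output must match the teacher feature width")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_features)]
    return sum(squared) / len(squared)

# Hypothetical 2-dimensional student activation and 3-dimensional teacher activation.
student_act = [1.0, 2.0]
teacher_act = [1.0, 2.0, 5.0]
# Hand-picked 3x2 projection; a real setup would train these weights.
regressor = [[1.0, 0.0],
             [0.0, 1.0],
             [1.0, 1.0]]

print(hint_loss(student_act, teacher_act, regressor))
```

This term would normally be added to the output-level distillation loss rather than replacing it.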

Relation-Based Distillation

Relation-based methods transfer the relationships between data points as encoded by the teacher, rather than individual instance representations. The student is trained to preserve the pairwise or higher-order distances and angular similarities in the teacher's embedding space.
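One way to picture a distance-preserving objective is the sketch below, which compares pairwise Euclidean distances in the two embedding spaces after normalising each by its mean distance. The squared-difference penalty and all names are simplifying assumptions made for illustration; published relation-based methods differ in their exact loss.

```python
import math

def pairwise_distances(embeddings):
    """Matrix of Euclidean distances between every pair of embedding vectors."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

def normalise_by_mean(dist):
    """Divide by the mean off-diagonal distance so overall scale does not matter."""
    n = len(dist)
    off_diagonal = [dist[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diagonal) / len(off_diagonal)
    if mean == 0:
        return dist
    return [[d / mean for d in row] for row in dist]

def relation_loss(student_embeddings, teacher_embeddings):
    """Mean squared difference between normalised pairwise distances."""
    if len(student_embeddings) != len(teacher_embeddings) or len(student_embeddings) < 2:
        raise ValueError("need the same batch of at least two examples for both models")
    s = normalise_by_mean(pairwise_distances(student_embeddings))
    t = normalise_by_mean(pairwise_distances(teacher_embeddings))
    n = len(s)
    diffs = [(s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(i + 1, n)]
    return sum(diffs) / len(diffs)

# Hypothetical batch of three examples: 3-dimensional teacher, 2-dimensional student.
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 4.0, 0.0], [6.0, 8.0, 0.0]]
student_same_shape = [[0.0, 0.0], [0.3, 0.4], [0.6, 0.8]]   # same geometry, smaller scale
student_distorted = [[0.0, 0.0], [0.3, 0.4], [0.3, -0.4]]   # different geometry

print(relation_loss(student_same_shape, teacher_emb))
print(relation_loss(student_distorted, teacher_emb))
```

Because only distances between examples are compared, the student and teacher embeddings do not need to have the same dimensionality.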

Distillation for Large Language Models

Knowledge distillation has become a primary technique for compressing large language models. Distilling a large LLM into a smaller one has been applied at scale by Meta for the LLaMA model family and by Mistral AI. Notable examples include DistilBERT (2019), which distilled BERT into a model 40% smaller while retaining about 97% of its performance on the GLUE benchmark[^3], and Microsoft's Phi-1 and Phi-2 models, which achieve competitive performance by training on high-quality synthetic data generated by larger models, a distillation-adjacent approach.

Applications

Knowledge distillation makes it practical to deploy AI capabilities on edge devices and in other settings where compute, memory, or latency budgets are tight.

Mobile applications including keyboard autocorrect, on-device voice recognition, and camera scene recognition use distilled models small enough to run locally without network connectivity. Google Translate's offline translation feature employs distilled neural machine translation models.

Embedded systems and IoT benefit from extremely compact distilled models that fit within kilobytes of memory, enabling anomaly detection and predictive maintenance on industrial sensors.

Real-time inference in latency-sensitive applications — autonomous driving perception, real-time video analysis, live speech transcription — requires distilled models that can meet strict latency requirements on available hardware.

Cloud cost reduction: even in cloud settings, a distilled model serving millions of requests per day at half the computation of its teacher can represent substantial infrastructure savings.

Malaysian Context — Distillation for Edge AI and Device Manufacturing

Knowledge distillation is directly relevant to Malaysia's position as a leading global semiconductor and electronics manufacturing hub. Penang, Selangor, and Johor host facilities for multinational semiconductor companies including Intel, AMD, Infineon, STMicroelectronics, and NXP Semiconductors, whose chips are increasingly designed to run on-device AI inference. The trend toward AI-capable edge processors — including neural processing units (NPUs) — creates demand for distilled models small enough to run within tight power and memory budgets on these chips.

Malaysia's smart manufacturing initiatives, including the National Giga Initiative (which targets 5,000 factories adopting Industry 4.0 practices), rely on edge AI for real-time quality inspection, predictive maintenance, and process optimisation directly on the factory floor. Knowledge-distilled models enable manufacturers to deploy AI inference on embedded controllers and PLCs without depending on cloud connectivity — critical for air-gapped factory environments or rural manufacturing sites with limited bandwidth.

In the consumer electronics supply chain, Malaysia-based companies producing wearable devices, home routers, and smart home hardware (such as those manufactured for brands distributed regionally by companies like Sacofa, TM, and local OEM suppliers) increasingly embed distilled AI models for voice recognition, anomaly detection, and personalisation — capabilities that would be impractical without model compression.

Malaysian telecommunications operators Maxis, Celcom Axiata, and Telekom Malaysia (TM) have deployed network intelligence systems that use distilled models for real-time anomaly detection in network traffic and predictive maintenance of base station equipment, where inference latency requirements preclude cloud round-trips.

Universities including Universiti Teknikal Malaysia Melaka (UTeM), Universiti Malaysia Perlis (UniMAP), and Universiti Sains Malaysia (USM) — all with strong embedded systems and hardware engineering programmes — have incorporated knowledge distillation into research agendas aligned with Malaysia's National Technology and Innovation Sandbox (NTIS) priorities. HRD Corp-approved training providers include model compression techniques in their AI deployment and MLOps curriculum tracks as Malaysia trains the next generation of AI engineers capable of deploying models on the full spectrum from cloud to edge.

References

[^1]: Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531. Presented at NIPS 2014 Deep Learning Workshop.
[^2]: Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. ICLR 2015.
[^3]: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.

Tags: model compression, efficiency, deep learning, on-device AI, TinyML

Type	Model compression / transfer learning technique
Proposed by	Bucilua et al. (2006); formalised by Hinton et al. (2015)
Goal	Train a compact student model to match a larger teacher model
Key concept	Soft targets / dark knowledge from teacher output probabilities
Applications	On-device AI, real-time inference, edge deployment
Related	Quantisation, pruning, TinyML, model compression
