
Knowledge Distillation

Knowledge distillation is a model compression technique in which a smaller student neural network is trained to replicate the behaviour of a larger, more capable teacher model, enabling deployment of efficient models that approximate teacher-level performance.

6 min read · Last updated May 2026 · Infrastructure

Knowledge distillation is a model compression and training technique in which a smaller, more efficient student model is trained to reproduce the behaviour of a larger, more accurate teacher model. Rather than training the student from scratch on raw labels, the student is trained to match the teacher's output distribution, including its confidence over incorrect classes, which conveys richer supervisory information than hard one-hot labels alone. The technique was introduced in its modern form by Geoffrey Hinton, Oriol Vinyals, and Jeff Dean in a 2015 paper, building on earlier work by Buciluă and colleagues from 2006.[^1]

Knowledge distillation has become essential in deploying large neural networks — including large language models, image classifiers, and speech recognition systems — to resource-constrained environments such as mobile phones, embedded systems, and IoT devices, where the full teacher model would be computationally or energetically infeasible.

Core Concept: Soft Targets and Dark Knowledge

The key insight of knowledge distillation is that the probability distribution produced by a softmax classifier over all possible output classes contains more information than the hard label for the correct class alone. Consider a teacher model classifying an image of a cat: the model might assign probability 0.85 to "cat," 0.08 to "tiger," and 0.04 to "lynx." These soft targets reflect the teacher's learned representation of inter-class similarities — the fact that cats resemble tigers more than they resemble chairs. Hinton termed this latent information dark knowledge.[^1]

When the student is trained to minimise the Kullback-Leibler divergence between its output distribution and the teacher's soft output distribution (in addition to the standard cross-entropy loss against true labels), it absorbs this inter-class relationship information, often achieving better generalisation than a student trained on hard labels alone — even when the student is much smaller.
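
The combined objective described above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function names, the weighting factor `alpha`, and the example logits are assumptions introduced here, and a real training loop would compute the same quantities over batches in a deep learning framework.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def cross_entropy(q, true_index, eps=1e-12):
    """Cross-entropy of distribution q against a hard one-hot label."""
    return -math.log(q[true_index] + eps)

def distillation_loss(student_logits, teacher_logits, true_index, alpha=0.5):
    """Weighted sum of the soft-target term and the hard-label term.

    alpha is a hypothetical mixing weight: alpha=1.0 uses only the
    teacher's soft targets, alpha=0.0 uses only the true label.
    """
    student_probs = softmax(student_logits)
    teacher_probs = softmax(teacher_logits)
    soft_term = kl_divergence(teacher_probs, student_probs)
    hard_term = cross_entropy(student_probs, true_index)
    return alpha * soft_term + (1.0 - alpha) * hard_term

# Illustrative logits for the classes (cat, tiger, lynx, chair).
teacher_logits = [4.0, 1.6, 0.9, -2.0]
untrained_student = [0.0, 0.0, 0.0, 0.0]
print(distillation_loss(untrained_student, teacher_logits, true_index=0))
```

The soft-target term is zero only when the student reproduces the teacher's whole distribution, so it penalises a student that gets "cat" right but ignores the teacher's relative weighting of "tiger" and "lynx".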

A temperature parameter T is applied to the softmax to soften the teacher's outputs:

p_i = exp(z_i / T) / Σ_j exp(z_j / T)

where z_i are the raw logits and higher values of T produce a softer, more uniform probability distribution, enriching the supervisory signal from the teacher.[^1]
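
The effect of T can be checked numerically. The sketch below implements the formula above in plain Python and prints the distribution and its entropy at a few temperatures; the logits and the helper names are illustrative assumptions.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Softmax of logits divided by temperature T (T=1 is the usual softmax)."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats; higher means a more uniform distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

# Illustrative logits for the classes (cat, tiger, lynx, chair).
logits = [4.0, 1.6, 0.9, -2.0]
for T in (1.0, 2.0, 5.0):
    probs = softmax_with_temperature(logits, T)
    print(T, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

When the same temperature is applied to both teacher and student during training, Hinton and colleagues note that the gradients from the soft targets scale as 1/T², so the soft-target loss term is typically multiplied by T² to keep its contribution comparable to the hard-label term.[^1]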

Variants and Extensions

Offline, Online, and Self-Distillation

In offline distillation, a teacher model is fully trained before student training begins. In online distillation, teacher and student are trained simultaneously and inform each other. In self-distillation, a model distils knowledge from its own deeper layers into its shallower layers, or from an earlier snapshot of itself, improving performance without a separate teacher.
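
As a rough illustration of the online case, the sketch below computes losses for two models trained side by side, where each is pulled towards the true label and towards the other's current prediction. It is one possible formulation under stated assumptions, not a specific published recipe: the function names and the `weight` parameter are hypothetical, and in a real training loop the peer's distribution would normally be treated as a constant when computing each model's gradient.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

def mutual_losses(logits_a, logits_b, true_index, weight=1.0):
    """Return (loss_a, loss_b) for two peer models on one example.

    Each loss is cross-entropy against the true label plus a weighted
    KL term that pulls the model towards its peer's distribution.
    """
    probs_a, probs_b = softmax(logits_a), softmax(logits_b)
    ce_a = -math.log(probs_a[true_index] + 1e-12)
    ce_b = -math.log(probs_b[true_index] + 1e-12)
    loss_a = ce_a + weight * kl_divergence(probs_b, probs_a)
    loss_b = ce_b + weight * kl_divergence(probs_a, probs_b)
    return loss_a, loss_b

print(mutual_losses([2.0, 0.5, -1.0], [0.3, 1.2, 0.1], true_index=0))
```

Offline distillation corresponds to the special case where one set of logits comes from a frozen, already-trained teacher and only the other model's loss is minimised.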

Feature-Based Distillation

Beyond output probabilities, distillation can align intermediate representations: the student is trained to match not only the teacher's final outputs but also the activations of specific intermediate layers. Methods such as FitNets[^2] and Attention Transfer (AT) use intermediate feature maps as additional supervisory signals, conveying richer structural information about how the teacher builds its representations.
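
Two feature-level losses of this kind can be sketched in plain Python. The first follows the general idea of a hint loss, where a learned linear map projects the student's features into the teacher's feature space before comparing them; the second compares normalised spatial attention maps built from squared activations. The function names, the toy shapes, and the exact normalisation are assumptions for illustration and simplify what the published methods do.

```python
import math

def project(features, weights):
    """Apply a linear map; weights holds one row per teacher dimension."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def hint_loss(student_features, teacher_features, weights):
    """Half squared L2 distance between projected student and teacher features."""
    projected = project(student_features, weights)
    return 0.5 * sum((p - t) ** 2 for p, t in zip(projected, teacher_features))

def attention_map(feature_map):
    """Collapse channels into one L2-normalised spatial map.

    feature_map is a list of channels, each a flat list of activations
    over the same spatial positions.
    """
    positions = len(feature_map[0])
    raw = [sum(channel[i] ** 2 for channel in feature_map) for i in range(positions)]
    norm = math.sqrt(sum(r * r for r in raw)) or 1.0
    return [r / norm for r in raw]

def attention_loss(student_map, teacher_map):
    """L2 distance between the normalised attention maps."""
    a_s, a_t = attention_map(student_map), attention_map(teacher_map)
    return math.sqrt(sum((s - t) ** 2 for s, t in zip(a_s, a_t)))

# Toy example: a 2-dimensional student layer and a 3-dimensional teacher layer.
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(hint_loss([1.0, 2.0], [1.0, 2.0, 2.5], weights))
print(attention_loss([[1.0, 2.0], [3.0, 4.0]], [[0.5, 2.0], [1.0, 1.0], [2.0, 0.0]]))
```

Because the attention maps are collapsed over channels and normalised, the student and teacher layers may have different channel counts as long as their spatial sizes match.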

Relation-Based Distillation

Relation-based methods transfer the relationships between data points as encoded by the teacher, rather than individual instance representations. The student is trained to preserve the pairwise or higher-order distances and angular similarities in the teacher's embedding space.
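
A distance-preserving loss of this kind might look like the sketch below. It compares the pairwise Euclidean distances within a batch of student embeddings to those within the teacher's embeddings, after dividing each set by its mean so that the two spaces need not share a scale or dimensionality. The function names and the use of a plain mean squared error are assumptions; published relation-based methods differ in the exact distance, angle, and penalty functions they use.

```python
import math
from itertools import combinations

def pairwise_distances(embeddings):
    """Euclidean distances for every pair in the batch, divided by their mean."""
    dists = [math.dist(a, b) for a, b in combinations(embeddings, 2)]
    mean = sum(dists) / len(dists)
    return [d / mean for d in dists] if mean > 0 else dists

def relation_loss(student_embeddings, teacher_embeddings):
    """Mean squared difference between normalised pairwise distances."""
    ds = pairwise_distances(student_embeddings)
    dt = pairwise_distances(teacher_embeddings)
    return sum((s - t) ** 2 for s, t in zip(ds, dt)) / len(ds)

# The teacher embeds three examples in 3-D, the student in 2-D.
teacher = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0), (0.0, 3.0, 0.0)]
student_good = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]  # same shape, smaller scale
student_bad = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]   # different geometry
print(relation_loss(student_good, teacher))
print(relation_loss(student_bad, teacher))
```

The first student preserves the teacher's geometry up to scale and so incurs essentially no loss, even though its individual embeddings look nothing like the teacher's.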

Distillation for Large Language Models

Knowledge distillation has become a primary technique for compressing large language models. Distilling a large LLM into a smaller one has been applied at scale by Meta for models in the LLaMA family and by Mistral AI. Notable examples include DistilBERT (2019), which distilled BERT into a model 40% smaller while retaining about 97% of its performance on the GLUE benchmark[^3], and Microsoft's Phi-1 and Phi-2 models, which achieve competitive performance by training on high-quality synthetic data generated by larger models, a distillation-adjacent approach.
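
For language models the teacher produces a distribution over the vocabulary at every token position, so the soft-target loss is usually averaged over positions. The sketch below shows one way this could be written in plain Python, skipping padded positions via a mask. The function names, the default temperature, and the toy three-word vocabulary are assumptions, and real systems typically combine this term with other losses.

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled, numerically stable softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def token_level_distillation_loss(student_logits, teacher_logits, mask, T=2.0):
    """Average KL(teacher || student) over unmasked token positions.

    student_logits and teacher_logits are [seq_len][vocab_size] lists;
    mask is a [seq_len] list where 0 marks padding to be ignored.
    The result is scaled by T^2, as is common when soft targets are used.
    """
    total, count = 0.0, 0
    for s_row, t_row, keep in zip(student_logits, teacher_logits, mask):
        if not keep:
            continue
        p = softmax(t_row, T)
        q = softmax(s_row, T)
        total += sum(pi * math.log(pi / max(qi, 1e-12)) for pi, qi in zip(p, q) if pi > 0)
        count += 1
    return (T * T) * total / count if count else 0.0

teacher = [[2.0, 0.0, -1.0], [0.5, 0.5, 3.0]]
student = [[1.5, 0.2, -0.5], [0.0, 1.0, 2.0]]
print(token_level_distillation_loss(student, teacher, mask=[1, 1]))
```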

Applications

Knowledge distillation has made it far more practical to deploy AI capabilities on edge devices and in other settings constrained by memory, latency, or cost.

Mobile applications including keyboard autocorrect, on-device voice recognition, and camera scene recognition use distilled models small enough to run locally without network connectivity. Google Translate's offline translation feature employs distilled neural machine translation models.

Embedded systems and IoT benefit from extremely compact distilled models that fit within kilobytes of memory, enabling anomaly detection and predictive maintenance on industrial sensors.

Real-time inference in latency-sensitive applications — autonomous driving perception, real-time video analysis, live speech transcription — requires distilled models that can meet strict latency requirements on available hardware.

Cloud cost reduction: even in cloud settings, a distilled model serving millions of requests per day at half the computation of its teacher can represent substantial infrastructure savings.

References

  1. Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. arXiv:1503.02531. Presented at NIPS 2014 Deep Learning Workshop.
  2. Romero, A., Ballas, N., Kahou, S. E., Chassang, A., Gatta, C., & Bengio, Y. (2015). FitNets: Hints for Thin Deep Nets. ICLR 2015.
  3. Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv:1910.01108.