Zero-Shot Learning
Zero-shot learning is a machine learning paradigm in which a model makes predictions for categories it never saw during training by leveraging semantic descriptions or attribute representations.
Zero-shot learning (ZSL) is a machine learning setting in which a model is expected to correctly classify or process instances belonging to categories that it was never explicitly trained on. Rather than requiring labelled examples for every class, ZSL systems rely on an auxiliary source of knowledge (such as semantic attribute vectors, class descriptions in natural language, or knowledge graph embeddings) to bridge the gap between seen and unseen categories. The term "zero-shot" reflects the fact that zero labelled examples of the target classes are available during training; the model first encounters those classes at inference time.
Motivation
Standard supervised learning assumes that the categories present at test time are the same as those seen during training. This closed-world assumption breaks down in many real-world scenarios where new classes emerge after a model is deployed, collecting labelled training data is expensive or impossible, or a model must operate across a large number of categories where exhaustive labelling is impractical. Zero-shot learning addresses these situations by enabling generalisation to novel categories without additional training.
How It Works
The central idea of ZSL is to transfer knowledge from seen classes to unseen classes through a shared semantic embedding space. In a typical ZSL pipeline:
- Feature extraction: A neural network maps each input instance to a high-dimensional feature vector.
- Semantic space: Each class, both seen and unseen, is described by a semantic vector. This may be a hand-annotated attribute vector (e.g., "has stripes", "can fly"), a word embedding of the class name, or a sentence embedding of a natural language description.
- Compatibility function: A model learns, from seen classes only, to score how well an instance's features match a class's semantic representation. At inference time, an instance from an unseen class is assigned to the unseen class whose semantic representation scores highest.
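The three steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: the attribute table, the random `mixing` matrix that stands in for a feature extractor, and the ridge-regression compatibility function are all choices made for this example.

```python
import numpy as np

def l2_normalise(x, eps=1e-12):
    """Scale rows to unit length so that dot products are cosine similarities."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fit_linear_map(features, targets, reg=1.0):
    """Ridge regression from feature space to semantic space (seen classes only)."""
    d = features.shape[1]
    gram = features.T @ features + reg * np.eye(d)
    return np.linalg.solve(gram, features.T @ targets)

def predict(features, weight, candidate_vectors):
    """Assign each instance to the candidate class with the highest cosine score."""
    projected = l2_normalise(features @ weight)
    scores = projected @ l2_normalise(candidate_vectors).T
    return scores.argmax(axis=1)

# Hypothetical attributes: [has_stripes, can_fly, has_four_legs, has_hooves]
seen_vectors = np.array([[1, 0, 1, 0],    # tiger
                         [0, 1, 0, 0],    # eagle
                         [0, 0, 1, 1],    # horse
                         [1, 1, 0, 0]],   # bee
                        dtype=float)
unseen_names = ["zebra", "cat"]
unseen_vectors = np.array([[1, 0, 1, 1],
                           [0, 0, 1, 0]], dtype=float)

rng = np.random.default_rng(0)
mixing = rng.normal(size=(4, 16))  # assumption: stands in for a real feature extractor

def synth_features(class_vectors, labels, noise=0.05):
    """Generate toy 16-dimensional features for the given class labels."""
    clean = class_vectors[labels] @ mixing
    return clean + noise * rng.normal(size=clean.shape)

# Train the compatibility function on seen classes only.
seen_labels = np.repeat(np.arange(4), 20)
seen_features = synth_features(seen_vectors, seen_labels)
weight = fit_linear_map(seen_features, seen_vectors[seen_labels])

# Classify one instance of each unseen class among the unseen candidates.
unseen_features = synth_features(unseen_vectors, np.array([0, 1]))
predicted = predict(unseen_features, weight, unseen_vectors)
print([unseen_names[i] for i in predicted])
```

The toy works because the seen attribute vectors span the attribute space, so a map fitted on seen classes carries over to new attribute combinations. Real features are rarely this well behaved.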
A key challenge is the hubness problem: in high-dimensional spaces, a small number of "hub" points tend to appear as nearest neighbours for many query points, which degrades ZSL accuracy because instances are disproportionately assigned to the hub classes. Several methods address this through normalisation, calibration, or generative approaches.
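One common way to make hubness visible is to count how often each class prototype appears among the k nearest neighbours of the queries, and then look at how skewed those counts are. The sketch below assumes cosine similarity and uses synthetic Gaussian data; the function names are illustrative only.

```python
import numpy as np

def k_occurrence(queries, prototypes, k=1):
    """Count how often each prototype is among the k most similar to a query."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    sims = q @ p.T                               # (n_queries, n_prototypes)
    top_k = np.argsort(-sims, axis=1)[:, :k]     # k best prototypes per query
    return np.bincount(top_k.ravel(), minlength=len(prototypes))

def skewness(counts):
    """Third standardised moment; strongly positive values suggest a few hubs."""
    centred = counts - counts.mean()
    std = centred.std()
    return 0.0 if std == 0 else float((centred ** 3).mean() / std ** 3)

rng = np.random.default_rng(0)
queries = rng.normal(size=(500, 300))     # stand-in for projected instance features
prototypes = rng.normal(size=(50, 300))   # stand-in for class semantic vectors
counts = k_occurrence(queries, prototypes, k=5)
print(counts.max(), counts.min(), skewness(counts))
```

On real embeddings, a few prototypes with very large counts alongside many with counts near zero would indicate hubness. The synthetic data here only exercises the measurement and need not show a strong effect.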
Generalised Zero-Shot Learning
Practical deployments typically operate under Generalised Zero-Shot Learning (GZSL), where the model must classify instances from both seen and unseen classes simultaneously. GZSL is significantly harder than standard ZSL because models trained on seen classes tend to be biased towards predicting seen-class labels. Addressing this bias — through output calibration, generative data augmentation, or auxiliary classifiers — is an active research area.
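Output calibration can be as simple as subtracting a constant from the scores of seen classes before taking the argmax, an idea often called calibrated stacking. The sketch below assumes a score matrix is already available; the scores and the value of `gamma` are made up for illustration, and in practice `gamma` would be tuned on held-out data.

```python
import numpy as np

def calibrated_predict(scores, seen_mask, gamma):
    """Penalise seen-class scores by gamma, then pick the best class per row."""
    adjusted = scores - gamma * seen_mask.astype(scores.dtype)
    return adjusted.argmax(axis=1)

# Columns: two seen classes followed by one unseen class (hypothetical scores).
scores = np.array([[0.62, 0.10, 0.58],
                   [0.80, 0.05, 0.30]])
seen_mask = np.array([True, True, False])

uncalibrated = calibrated_predict(scores, seen_mask, gamma=0.0)
calibrated = calibrated_predict(scores, seen_mask, gamma=0.1)
print(uncalibrated, calibrated)
```

With `gamma=0.1`, the first row flips from the seen class to the unseen class because the margin between them was small. The second row keeps its confident seen-class prediction.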
Relationship to Large Language Models
The popularisation of large language models such as GPT-3, GPT-4, and their successors gave the term "zero-shot" an additional meaning in NLP. In the LLM context, "zero-shot prompting" refers to asking a model to perform a task — translation, classification, summarisation — by providing only an instruction and no examples. This is distinct from classical ZSL in computer vision, though both involve generalising without per-task training examples.
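The difference between zero-shot and few-shot prompting is easiest to see in how the prompt is assembled. The helper below is a hypothetical sketch: the `Input:`/`Output:` layout is one arbitrary convention, and no particular model or API is assumed.

```python
def build_prompt(instruction, text, examples=()):
    """Assemble a prompt; with no examples, this is a zero-shot prompt."""
    parts = [instruction.strip()]
    for example_input, example_output in examples:
        parts.append(f"Input: {example_input}\nOutput: {example_output}")
    parts.append(f"Input: {text}\nOutput:")
    return "\n\n".join(parts)

instruction = "Classify the sentiment of the input as positive or negative."
query = "The battery died after two days."

zero_shot = build_prompt(instruction, query)
few_shot = build_prompt(instruction, query,
                        examples=[("I love this phone.", "positive")])
print(zero_shot)
```

The zero-shot prompt contains only the instruction and the query, while the few-shot variant adds worked examples before the query.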
CLIP (Contrastive Language-Image Pre-training), released by OpenAI in 2021, bridged the two traditions by enabling zero-shot image classification purely via natural language descriptions, without any dedicated visual training for the target classes. A user can supply text descriptions of arbitrary categories, and CLIP scores how well each description matches a given image, even for categories absent from standard visual training benchmarks.
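The scoring step of CLIP-style zero-shot classification can be sketched without loading a model: each label is turned into a text prompt, the prompts and the image are embedded, and a softmax over scaled cosine similarities gives class probabilities. In this sketch the embeddings are hand-written stand-ins for the outputs of real text and image encoders, and the `logit_scale` default is an assumed value.

```python
import numpy as np

def make_prompts(labels, template="a photo of a {}."):
    """Wrap each class name in a natural language prompt."""
    return [template.format(label) for label in labels]

def zero_shot_probs(image_embedding, text_embeddings, logit_scale=100.0):
    """Softmax over scaled cosine similarities between one image and all prompts."""
    img = image_embedding / np.linalg.norm(image_embedding)
    txt = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    logits = logit_scale * (txt @ img)
    logits = logits - logits.max()           # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

labels = ["zebra", "okapi", "tapir"]
prompts = make_prompts(labels)

# Stand-in embeddings; a real system would obtain these from the encoders.
text_embeddings = np.array([[0.9, 0.1, 0.0],
                            [0.1, 0.9, 0.1],
                            [0.0, 0.2, 0.9]])
image_embedding = np.array([0.1, 0.8, 0.2])

probs = zero_shot_probs(image_embedding, text_embeddings)
best = labels[int(probs.argmax())]
print(prompts, best)
```

Adding a new category only requires adding a label to the list, which is what makes the approach zero-shot with respect to the target classes.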
Applications
Zero-shot learning has practical impact in settings where labelled data is scarce or expensive. In medical imaging, rare diseases can be classified from imaging data using clinical description text as the semantic anchor, without requiring enough cases to train a conventional classifier. In e-commerce, product images can be matched to catalogue categories that did not exist when the visual model was trained, using textual product metadata as the bridge. In cybersecurity, previously unseen malware families can be detected based on behavioural descriptions, without retraining the detection model.
In natural language processing, zero-shot cross-lingual transfer is especially valuable. LLMs trained primarily on English can often perform tasks in many other languages without language-specific fine-tuning, relying on shared multilingual representations to bridge the gap between languages.
Recent Advances
Research presented at ICLR 2025 introduced ZeroDiff, an approach that combines diffusion model-based data augmentation with supervised contrastive learning. It reportedly achieved 76.3% accuracy on standard ZSL benchmarks with 90% less training data than prior methods. These results underscore the continued relevance of zero-shot learning research even in an era dominated by foundation models.
References
- Larochelle, H., et al. (2008). Zero-data learning of new tasks. Proceedings of the 23rd AAAI Conference on Artificial Intelligence.
- Lampert, C. H., Nickisch, H., and Harmeling, S. (2009). Learning to detect unseen object classes by between-class attribute transfer. CVPR 2009.
- Xian, Y., et al. (2018). Zero-Shot Learning — A Comprehensive Evaluation of the Good, the Bad and the Ugly. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(9), 2251-2265.
- Radford, A., et al. (2021). Learning Transferable Visual Models From Natural Language Supervision (CLIP). OpenAI / arXiv:2103.00020.