Self-Supervised Learning
A machine learning training paradigm in which a model generates its own supervisory signal from unlabelled data by solving pretext tasks, learning rich representations without human-annotated labels.
Self-supervised learning (SSL) is a machine learning training paradigm in which a model learns useful data representations by generating its own supervisory signal from the structure of unlabelled data, rather than relying on human-provided labels. The model is trained to solve one or more pretext tasks — auxiliary objectives whose answers are derived automatically from the data itself — in the expectation that the learned representations will transfer effectively to downstream tasks.
Self-supervised learning occupies a conceptual position between supervised and unsupervised learning. It produces labelled training pairs automatically (making it technically a form of supervised training), but the labels are derived from the data rather than from human annotation. Yann LeCun has described SSL as the foundation of what he calls "world model" learning, positioning it as essential for achieving more general AI capabilities.
Motivation
Labelling data at scale is expensive and time-consuming. In many domains (medical imaging, satellite remote sensing, industrial sensor data, low-resource languages), large annotated datasets do not exist. Self-supervised pretraining allows a model to absorb information from vast unlabelled corpora and then be fine-tuned on a small labelled dataset for a specific task, which can substantially reduce labelling requirements and often improves downstream performance.
Language models such as GPT, BERT, and their successors are pretrained with self-supervised objectives: predicting the next token (autoregressive language modelling) or reconstructing masked tokens (masked language modelling). The insight that predicting withheld parts of the input is a powerful pretext task has driven the large language model revolution.
Pretext Tasks
A pretext task is a prediction problem constructed automatically from the data itself, whose solution requires the model to develop semantically meaningful internal representations.
In natural language processing, the dominant pretext tasks are next-token prediction (used in GPT-family models) and masked token prediction (used in BERT). Both force the model to learn syntactic and semantic structure from raw text.
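As an illustration of how both objectives derive their labels from raw text, the following minimal sketch builds training examples from a token list. The function names, the `[MASK]` symbol, and the whitespace tokenisation are assumptions made for this example; the 15 percent masking rate follows BERT, but BERT's additional token-replacement rules are omitted for brevity.

```python
import random

MASK = "[MASK]"  # placeholder mask symbol, assumed for this sketch


def next_token_pairs(tokens):
    """Return (context, target) pairs for autoregressive next-token prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def masked_token_example(tokens, mask_prob=0.15, seed=0):
    """Return (corrupted_tokens, targets) for masked token prediction.

    targets maps each masked position to the original token at that position.
    """
    rng = random.Random(seed)
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted[i] = MASK
            targets[i] = tok
    return corrupted, targets


tokens = "the cat sat on the mat".split()
pairs = next_token_pairs(tokens)            # e.g. (["the"], "cat"), (["the", "cat"], "sat"), ...
corrupted, targets = masked_token_example(tokens, mask_prob=0.3)
```

In both cases the supervision comes entirely from the text: no annotator is involved in producing the targets.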
In computer vision, early pretext tasks included predicting the rotation angle of an image, solving jigsaw puzzles on image patches, and colourising greyscale images. These approaches demonstrated that models could learn useful visual features without labels, but their representations lagged behind supervised counterparts.
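The rotation task can be sketched in a few lines of NumPy. This is an illustrative data-generation step only, assuming square images and a hypothetical function name; the classifier that would be trained to predict the label is not shown.

```python
import numpy as np


def rotation_pretext_batch(images, rng):
    """Rotate each image by a random multiple of 90 degrees.

    images: array of shape (N, H, W, C) with H == W.
    Returns (rotated, labels), where labels[i] in {0, 1, 2, 3} is the
    number of quarter turns applied to images[i].
    """
    labels = rng.integers(0, 4, size=len(images))
    rotated = np.stack(
        [np.rot90(img, k=int(k), axes=(0, 1)) for img, k in zip(images, labels)]
    )
    return rotated, labels


rng = np.random.default_rng(0)
images = rng.random((8, 32, 32, 3))       # stand-in for unlabelled images
rotated, labels = rotation_pretext_batch(images, rng)
```

The labels are free: they are a by-product of the transformation that the pipeline itself applied.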
Contrastive learning emerged as a more powerful framework for visual SSL. The core idea is to train an encoder so that different views (augmented versions) of the same image are mapped to nearby points in the representation space, while views from different images are pushed apart. SimCLR (Chen et al., 2020) achieved results competitive with supervised pretraining by using strong data augmentations and a contrastive loss. MoCo (He et al., 2020) reduced the need for very large batches by maintaining a queue of negative samples encoded by a slowly updated momentum encoder.
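A minimal NumPy sketch of the kind of contrastive loss used by SimCLR (the normalised temperature-scaled cross-entropy, often called NT-Xent) is given below. It computes the forward value only, with no gradients or encoder, and the function name and default temperature are assumptions chosen for illustration.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.5):
    """Contrastive loss over two batches of embeddings.

    z1, z2: arrays of shape (N, D); z1[i] and z2[i] are embeddings of two
    augmented views of the same image. All other rows act as negatives.
    """
    n = z1.shape[0]
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)    # cosine similarity
    sim = z @ z.T / temperature
    np.fill_diagonal(sim, -np.inf)                      # a view is not its own negative
    # Row i's positive is the other view of the same image.
    positives = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-softmax over each row.
    row_max = sim.max(axis=1, keepdims=True)
    log_prob = sim - row_max - np.log(np.exp(sim - row_max).sum(axis=1, keepdims=True))
    return float(-log_prob[np.arange(2 * n), positives].mean())


rng = np.random.default_rng(0)
z1 = rng.normal(size=(16, 8))
aligned = nt_xent_loss(z1, z1 + 0.01 * rng.normal(size=z1.shape))
unrelated = nt_xent_loss(z1, rng.normal(size=z1.shape))
```

When the two views of each image have nearly identical embeddings, the loss should be lower than when the second batch is unrelated to the first, which is the behaviour the objective rewards.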
Non-contrastive methods such as BYOL (Bootstrap Your Own Latent, Grill et al., 2020) and SimSiam (Chen and He, 2021) showed that competitive representations could be learned without explicit negative pairs, using only positive augmentation pairs and careful architectural choices to prevent representational collapse.
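One such architectural choice in BYOL is a target network whose weights are an exponential moving average of the online network's weights, combined with a loss that compares the normalised online prediction with the normalised target projection. The sketch below shows these two pieces in isolation, under the assumption that parameters are stored in a plain dictionary of arrays; the function names are hypothetical, and the stop-gradient on the target branch is only noted in a comment because NumPy does not track gradients.

```python
import numpy as np


def ema_update(target_params, online_params, tau=0.99):
    """Move each target parameter a small step towards its online counterpart."""
    return {
        name: tau * target_params[name] + (1.0 - tau) * online_params[name]
        for name in target_params
    }


def byol_style_loss(online_prediction, target_projection):
    """Mean of 2 - 2 * cos(p, z) over the batch.

    In a real training loop the target projection is treated as a constant
    (stop-gradient); only the online network receives gradient updates.
    """
    p = online_prediction / np.linalg.norm(online_prediction, axis=1, keepdims=True)
    z = target_projection / np.linalg.norm(target_projection, axis=1, keepdims=True)
    return float(np.mean(2.0 - 2.0 * np.sum(p * z, axis=1)))


online = {"w": np.ones((2, 2))}
target = {"w": np.zeros((2, 2))}
target = ema_update(target, online, tau=0.9)   # target drifts slowly towards online
```

Only positive pairs appear in the loss; no term pushes different images apart.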
Masked autoencoders (MAE, He et al., 2022) extended masked prediction to vision by randomly masking a high fraction (typically 75 percent) of image patches and training a Vision Transformer to reconstruct the missing pixels. MAE proved highly scalable and became a foundation for large-scale vision pretraining.
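The masking step can be sketched as follows. This is a simplified illustration rather than the reference implementation: the function names are hypothetical, and the 224 by 224 image with 16 by 16 patches is an assumed configuration chosen so that the arithmetic is easy to follow (196 patches, of which 49 stay visible and 147 are masked).

```python
import numpy as np


def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C).
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)


def random_patch_mask(num_patches, mask_ratio=0.75, rng=None):
    """Return (keep_idx, mask_idx): disjoint index sets covering all patches."""
    rng = np.random.default_rng() if rng is None else rng
    num_keep = int(round(num_patches * (1.0 - mask_ratio)))
    perm = rng.permutation(num_patches)
    return np.sort(perm[:num_keep]), np.sort(perm[num_keep:])


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, patch=16)
keep_idx, mask_idx = random_patch_mask(len(patches), mask_ratio=0.75, rng=rng)
visible = patches[keep_idx]    # input to the encoder
targets = patches[mask_idx]    # pixels the decoder must reconstruct
```

Because the encoder only processes the visible patches, a high mask ratio also reduces the compute spent per image.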
Self-Supervised Learning in Foundation Models
SSL is the engine behind virtually all large foundation models. Large language models such as GPT-4, Claude, Llama, and Gemini are pretrained primarily with autoregressive SSL on internet-scale corpora. Visual foundation models such as CLIP (Radford et al., 2021) use a contrastive objective across image-caption pairs to learn aligned visual and language representations. DINO (Caron et al., 2021) and DINOv2 (Oquab et al., 2023) achieve strong visual representations without labels by applying self-distillation with Vision Transformers.
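The contrastive objective across image-caption pairs can be sketched as a symmetric cross-entropy over a matrix of image-text similarities, where the matching caption for image i sits on the diagonal. The code below is a forward-only NumPy illustration with hypothetical function names; it assumes the embeddings have already been produced by separate image and text encoders, and it uses a fixed temperature, whereas CLIP learns this value during training.

```python
import numpy as np


def _log_softmax(x, axis):
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss; image_emb[i] is paired with text_emb[i]."""
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature           # (N, N) similarity matrix
    diag = np.arange(len(logits))
    loss_i2t = -_log_softmax(logits, axis=1)[diag, diag].mean()  # image -> caption
    loss_t2i = -_log_softmax(logits, axis=0)[diag, diag].mean()  # caption -> image
    return float((loss_i2t + loss_t2i) / 2.0)


rng = np.random.default_rng(0)
image_emb = rng.normal(size=(8, 16))
matched = clip_style_loss(image_emb, image_emb)                      # perfectly aligned pairs
mismatched = clip_style_loss(image_emb, np.roll(image_emb, 1, axis=0))  # captions shifted by one
```

Shifting the captions by one position breaks every pairing, so the loss for the mismatched batch should be much higher than for the matched one.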
The representations learned through SSL are then refined through fine-tuning, instruction tuning, and reinforcement learning from human feedback (RLHF) to produce task-specific behaviour. In this sense, SSL establishes the representational foundation upon which alignment and capability fine-tuning are built.
Recent Developments (2024-2026)
In 2025, hard negative mining was more widely incorporated into contrastive training pipelines, improving representation quality by focusing gradient updates on negatives that are difficult to distinguish from the positive. Cross-modal SSL, which learns representations that align signals across text, images, audio, and sensor modalities, became central to the development of multimodal foundation models. Researchers also explored applying contrastive SSL to graph-structured data (including molecular graphs) and time series, extending the paradigm beyond image and text domains.
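As a rough illustration of hard negative mining, the sketch below selects, for each anchor embedding, the k candidates most similar to it after excluding its known positive. This is one simple selection rule among many and is not drawn from any specific pipeline; the function name, the use of cosine similarity, and the value of k are all assumptions.

```python
import numpy as np


def hardest_negatives(anchors, candidates, positive_idx, k=5):
    """Return, per anchor, indices of the k most similar non-positive candidates.

    anchors:      (N, D) anchor embeddings.
    candidates:   (M, D) candidate embeddings, with M > k.
    positive_idx: (N,) index into candidates of each anchor's positive.
    """
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    c = candidates / np.linalg.norm(candidates, axis=1, keepdims=True)
    sim = a @ c.T
    sim[np.arange(len(a)), positive_idx] = -np.inf   # never pick the positive
    order = np.argsort(-sim, axis=1)                 # most similar first
    return order[:, :k]


rng = np.random.default_rng(0)
anchors = rng.normal(size=(4, 8))
candidates = rng.normal(size=(20, 8))
positive_idx = np.array([0, 1, 2, 3])
hard_idx = hardest_negatives(anchors, candidates, positive_idx, k=5)
```

The selected indices can then be used to restrict or reweight the negatives in a contrastive loss, so that easy negatives contribute less to each update.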
References
- Chen, T., et al. (2020). A simple framework for contrastive learning of visual representations. ICML 2020.
- Grill, J.-B., et al. (2020). Bootstrap your own latent: A new approach to self-supervised learning. NeurIPS 2020.
- He, K., et al. (2022). Masked autoencoders are scalable vision learners. CVPR 2022.
- Oquab, M., et al. (2023). DINOv2: Learning robust visual features without supervision. arXiv:2304.07193.
- Radford, A., et al. (2021). Learning transferable visual models from natural language supervision. ICML 2021.