Continual Learning
Continual learning is a machine learning paradigm in which models incrementally acquire knowledge from sequential tasks or data streams without forgetting previously learned information, addressing the stability-plasticity trade-off inherent in neural networks.
Continual learning is a subfield of machine learning concerned with training models that can acquire knowledge from a sequence of tasks or a non-stationary data stream, retaining previously learned capabilities while incorporating new information. The field directly addresses catastrophic forgetting, the well-documented tendency of neural networks to abruptly lose previously acquired knowledge when trained on new data. Continual learning is considered essential for building AI systems that can adapt to evolving environments without the expense and inefficiency of retraining from scratch on all accumulated data.
The Catastrophic Forgetting Problem
Standard neural network training assumes all training data is available simultaneously and drawn independently from a fixed distribution. When this assumption is violated, for example when a model is first trained on task A and then fine-tuned on task B, the gradient updates for task B overwrite the weight configurations that encoded task A's knowledge. This phenomenon, termed catastrophic forgetting (also called catastrophic interference), was first described by McCloskey and Cohen in 1989 and remains a central challenge in neural network research.
Catastrophic forgetting arises from the distributed nature of neural representations: the same weights that encode knowledge about task A are also modified to encode task B, creating interference between the two. The severity depends on the similarity between tasks, the architecture of the network, and the learning rate used during adaptation.
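The interference described above can be reproduced even in a toy setting. The following is a minimal sketch, assuming a two-parameter linear model and two synthetic regression tasks with opposing targets; the task definitions, learning rate, and step count are illustrative assumptions, not taken from any cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_task(true_w, n=200):
    """Synthetic regression task: y = X @ true_w (hypothetical data)."""
    X = rng.normal(size=(n, 2))
    return X, X @ true_w

def mse(w, X, y):
    return float(np.mean((X @ w - y) ** 2))

def train(w, X, y, lr=0.1, steps=200):
    """Plain gradient descent on mean squared error."""
    w = w.copy()
    for _ in range(steps):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

# Task A and task B demand opposite weights, so they share and fight over
# the same two parameters.
X_a, y_a = make_task(np.array([1.0, -1.0]))
X_b, y_b = make_task(np.array([-1.0, 1.0]))

w = train(np.zeros(2), X_a, y_a)
loss_a_before = mse(w, X_a, y_a)   # near zero: task A is learned

w = train(w, X_b, y_b)             # sequential training on task B only
loss_a_after = mse(w, X_a, y_a)    # large: task A has been overwritten

print(f"task A loss before B: {loss_a_before:.4f}, after B: {loss_a_after:.4f}")
```

Because the two tasks here are deliberately in conflict, forgetting is total; with more similar tasks the loss on task A would rise less sharply, consistent with the dependence on task similarity noted above.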
The Stability-Plasticity Dilemma
Continual learning systems must navigate a fundamental tension known as the stability-plasticity dilemma. A plastic system adapts quickly to new data but risks overwriting existing knowledge. A stable system preserves existing knowledge but resists learning new patterns efficiently. Biological neural systems resolve this dilemma through mechanisms such as synaptic consolidation, complementary memory systems (the hippocampus for rapid encoding and the cortex for slow consolidation), and sleep-based memory replay. Artificial systems must find computational analogues to these biological strategies.
Continual Learning Scenarios
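Three scenarios are commonly used to evaluate continual learning methods. They differ in whether the task identifier is available at test time and in whether the model must distinguish between classes from different tasks.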
Task-incremental learning (Task-IL) is the simplest scenario: the system is given a task identifier at test time and only needs to solve the identified task. The challenge is isolating task-specific knowledge so that learning a new task does not degrade the others.
Domain-incremental learning (Domain-IL) presents the same type of task across different data distributions (for example, classifying objects in different visual domains), without the task identifier at test time.
Class-incremental learning (Class-IL) is the most challenging scenario: the model must classify among all classes seen so far, without knowing which subset of classes the current input belongs to. New classes are added over time, and the model must not regress on earlier classes.
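The practical difference between the three scenarios shows up at prediction time. The sketch below assumes a hypothetical split of four classes into two tasks and a made-up logit vector; the function names are illustrative only.

```python
import numpy as np

# Hypothetical split: two tasks with two classes each (global class ids 0..3).
TASK_CLASSES = {0: [0, 1], 1: [2, 3]}

def predict_task_il(logits, task_id):
    """Task-IL: the given task identifier restricts the candidate classes."""
    classes = TASK_CLASSES[task_id]
    return classes[int(np.argmax(logits[classes]))]

def predict_domain_il(shared_logits):
    """Domain-IL: one output head shared by all domains, no task identifier."""
    return int(np.argmax(shared_logits))

def predict_class_il(logits):
    """Class-IL: choose among all classes seen so far, no task identifier."""
    return int(np.argmax(logits))

# Logits for an input from task 0, from a model biased toward the newest task.
logits = np.array([0.1, 0.3, 2.0, 0.5])

print(predict_task_il(logits, task_id=0))  # 1 (only classes 0 and 1 compete)
print(predict_class_il(logits))            # 2 (a class from the wrong task wins)
```

With the task identifier, the bias toward recently learned classes is masked out; without it, the same logits yield a class from the wrong task, which is one reason Class-IL is the hardest of the three.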
Approaches to Continual Learning
Regularisation-Based Methods
Regularisation approaches add penalty terms to the loss function that discourage changes to weights that were important for previous tasks. Elastic Weight Consolidation (EWC), proposed by Kirkpatrick et al. in 2017, estimates the importance of each weight using the diagonal of the Fisher information matrix and penalises deviations from the values learned on earlier tasks in proportion to that importance. Synaptic Intelligence (SI) is a related approach that tracks weight importance online during training rather than computing it post-hoc.
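As a rough illustration of an EWC-style penalty, the sketch below uses a diagonal Fisher estimate (the mean squared per-example gradient) and a quadratic penalty of the form (lambda / 2) * sum_i F_i * (theta_i - theta*_i)^2. The function names, the toy gradients, and the value of `lam` are assumptions made for the example.

```python
import numpy as np

def diagonal_fisher(per_example_grads):
    """Diagonal Fisher estimate: mean squared log-likelihood gradient
    per parameter, computed over examples from the previous task."""
    return np.mean(per_example_grads ** 2, axis=0)

def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic penalty, added to the new task's loss."""
    return 0.5 * lam * float(np.sum(fisher * (theta - theta_star) ** 2))

def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty with respect to theta."""
    return lam * fisher * (theta - theta_star)

# Toy values: the first parameter mattered for the old task, the second did not.
grads = np.array([[1.0, 0.0],
                  [3.0, 0.0]])
fisher = diagonal_fisher(grads)        # [5.0, 0.0]
theta_star = np.array([0.0, 0.0])      # weights after the old task
theta = np.array([1.0, 1.0])           # candidate weights during the new task

penalty = ewc_penalty(theta, theta_star, fisher, lam=2.0)      # 5.0
pull = ewc_penalty_grad(theta, theta_star, fisher, lam=2.0)    # [10.0, 0.0]
```

Only the first parameter is pulled back toward its old value; the second, with zero estimated importance, remains free to adapt to the new task.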
Rehearsal-Based Methods
Rehearsal methods maintain a memory buffer containing a subset of examples from previous tasks and mix these stored examples with new training data to prevent forgetting. Experience Replay directly replays stored samples. Generative Replay uses a generative model trained on past tasks to synthesise pseudo-samples, avoiding the need to store real data (relevant for privacy-sensitive applications). Dark Experience Replay (DER) stores the model's logits alongside the buffered samples and uses them as replay targets, preserving richer information than hard labels at a similar memory cost.
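A fixed-size memory buffer needs a rule for deciding which examples to keep. The sketch below assumes reservoir sampling, one common choice that keeps every example seen so far with equal probability; the class name, capacity, and stream contents are hypothetical.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity replay buffer filled by reservoir sampling."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0                      # total examples offered so far
        self._rng = random.Random(seed)

    def add(self, example):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(example)
        else:
            # Replace a random slot with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.items[j] = example

    def sample(self, k):
        """Draw up to k stored examples without replacement."""
        return self._rng.sample(self.items, min(k, len(self.items)))

buffer = ReservoirBuffer(capacity=5)
for i in range(100):                       # stream of (input, label) pairs
    buffer.add((i, i % 2))

new_batch = [(100, 0), (101, 1)]
mixed_batch = new_batch + buffer.sample(2) # new data mixed with replayed data
```

To store DER-style targets instead, each buffered tuple could carry the logits the model produced when the example was first seen.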
Architecture-Based Methods
Architectural approaches allocate different parameters for different tasks, avoiding interference by construction. Progressive Neural Networks add new network columns for each new task and freeze previously learned columns, preserving prior knowledge at the cost of growing model size. PackNet and HAT (Hard Attention to the Task) use masks to identify and protect task-specific subnetworks within a fixed-capacity architecture.
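The idea of protecting a task-specific subnetwork with a mask can be sketched as a gradient step that skips any weight already claimed by an earlier task. This is a simplified illustration with assumed names and values; how each method actually chooses its masks is omitted.

```python
import numpy as np

def masked_update(weights, grad, frozen_mask, lr=0.1):
    """SGD step applied only to weights not claimed by earlier tasks.

    frozen_mask is a boolean array: True marks a protected weight.
    """
    return weights - lr * grad * (~frozen_mask)

weights = np.array([1.0, 2.0, 3.0])
grad = np.array([1.0, 1.0, 1.0])
frozen = np.array([True, False, True])   # hypothetical mask from an earlier task

updated = masked_update(weights, grad, frozen)   # [1.0, 1.9, 3.0]
```

Because protected weights never change, earlier tasks are preserved exactly, but each new task can only use whatever free capacity remains.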
Prompt-Based Methods
Recent work has adapted continual learning for pre-trained transformer models by learning task-specific prompt vectors while freezing the backbone model. Methods such as L2P (Learning to Prompt), DualPrompt, and CODA-Prompt achieve strong continual learning performance on image classification benchmarks by concentrating task-specific information in small prompt parameters, leaving the large pre-trained backbone untouched.
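One way to decide which prompt to use without a task identifier is to match a feature of the input against learned keys. The sketch below assumes a small prompt pool with one key per prompt and cosine-similarity selection, loosely in the spirit of a prompt-pool design; the shapes, names, and values are illustrative assumptions rather than any method's actual implementation.

```python
import numpy as np

def select_prompts(query, keys, prompts, top_k=1):
    """Pick the top_k prompts whose keys are most similar to the query.

    query:   (dim,)  feature of the input from the frozen backbone
    keys:    (pool, dim)
    prompts: (pool, prompt_len, dim)
    """
    q = query / np.linalg.norm(query)
    k = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    sims = k @ q                                  # cosine similarities
    idx = np.argsort(-sims)[:top_k]
    return idx, prompts[idx]

def prepend_prompts(selected, token_embeddings):
    """Concatenate the selected prompt vectors in front of the input tokens."""
    flat = selected.reshape(-1, selected.shape[-1])
    return np.concatenate([flat, token_embeddings], axis=0)

keys = np.array([[1.0, 0.0], [0.0, 1.0]])
prompts = np.zeros((2, 3, 2))                     # 2 prompts, 3 vectors each
prompts[1] += 1.0
tokens = np.full((4, 2), 0.5)                     # 4 input token embeddings

idx, chosen = select_prompts(np.array([0.9, 0.1]), keys, prompts)
sequence = prepend_prompts(chosen, tokens)        # shape (7, 2)
```

Only `keys` and `prompts` would be trained in such a scheme; the backbone that produces the query and consumes the extended sequence stays frozen.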
| Approach | Key Idea | Memory Overhead | Scalability |
|---|---|---|---|
| EWC | Penalise important weight changes | Low | Moderate |
| Experience Replay | Store and replay past samples | Medium | Good |
| Generative Replay | Synthesise past data | Model size | Good |
| Progressive Networks | Separate columns per task | High (grows) | Limited |
| Prompt-based | Task-specific prompts, frozen backbone | Very low | High |
Continual Learning for Large Language Models
With the rise of large language models, continual learning has taken on new importance. Deployed LLMs become stale as the world changes: new entities emerge, facts change, and user needs evolve. Continual learning for LLMs aims to update model knowledge incrementally without full retraining, which can cost millions of dollars. Continual fine-tuning with parameter-efficient methods such as LoRA (low-rank adaptation), combined with regularisation or replay, has shown promise for domain adaptation without catastrophic forgetting of general capabilities.
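The low-rank update behind LoRA can be sketched for a single linear layer: the pre-trained weight matrix stays frozen and a trainable product of two small matrices is added to it. The dimensions, initialisation scale, and function name below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 6, 8, 2

W = rng.normal(size=(d_out, d_in))          # frozen pre-trained weights
A = rng.normal(size=(rank, d_in)) * 0.01    # trainable, small random init
B = np.zeros((d_out, rank))                 # trainable, zero init

def lora_forward(x, W, A, B, scale=1.0):
    """Linear layer with a low-rank update: x @ (W + scale * B @ A).T"""
    return x @ W.T + scale * (x @ A.T) @ B.T

x = rng.normal(size=(4, d_in))
y = lora_forward(x, W, A, B)

# With B initialised to zero the adapter starts as a no-op, so the model
# begins from exactly the pre-trained behaviour.
trainable = A.size + B.size                 # 28 parameters
frozen = W.size                             # 48 parameters
```

Since only `A` and `B` receive gradient updates, the original weights can always be recovered by dropping the adapter, which limits how far general capabilities can drift.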
References
- McCloskey, M., and Cohen, N. J. (1989). Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24, 109-165. Academic Press.
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., et al. (2017). Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13), 3521-3526.
- van de Ven, G. M., Tuytelaars, T., and Tolias, A. S. (2024). Continual Learning and Catastrophic Forgetting. arXiv:2403.05175.
- Wang, Z., Zhang, Z., Lee, C. Y., et al. (2022). Learning to Prompt for Continual Learning. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).