Multi-Task Learning

Multi-task learning is a machine learning approach in which a model is trained simultaneously on multiple related tasks, using shared representations to improve generalisation and data efficiency compared to training separate single-task models.

7 min read | Last updated June 2026 | Foundations

Multi-task learning (MTL) is a training paradigm in which a single model is trained to perform multiple related tasks simultaneously, sharing representations across tasks to improve learning efficiency and generalisation. First formalised by Rich Caruana in a 1997 paper, multi-task learning has become a foundational technique in modern deep learning and underlies many of the most capable models in natural language processing and computer vision. The approach contrasts with the default practice of training a separate model for each task, which ignores potentially useful information shared between related problems.

Core Intuition

The central hypothesis of multi-task learning is that related tasks share underlying structure — statistical regularities, useful features, or domain knowledge — that a jointly trained model can exploit. When a model is trained on multiple tasks simultaneously, the gradient signals from each task constrain the shared parameters toward representations that are broadly useful, acting as a form of implicit regularisation. A model that performs well at sentiment analysis, named entity recognition, and text classification simultaneously must develop features that capture general linguistic structure, rather than overfitting to the idiosyncrasies of a single task's training data.
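One common way to write the joint objective described above is sketched below. The notation is an assumption introduced here for illustration: $\theta_{\mathrm{sh}}$ denotes the shared parameters, $\theta_t$ the parameters specific to task $t$, $\mathcal{L}_t$ the loss of task $t$, and $w_t$ a per-task weight.

```latex
\mathcal{L}(\theta_{\mathrm{sh}}, \theta_1, \dots, \theta_T)
  = \sum_{t=1}^{T} w_t \, \mathcal{L}_t(\theta_{\mathrm{sh}}, \theta_t),
\qquad
\nabla_{\theta_{\mathrm{sh}}} \mathcal{L}
  = \sum_{t=1}^{T} w_t \, \nabla_{\theta_{\mathrm{sh}}} \mathcal{L}_t(\theta_{\mathrm{sh}}, \theta_t)
```

Under this formulation the shared parameters receive a gradient contribution from every task, whereas each $\theta_t$ is updated only by its own task's loss.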

Caruana's original framing described the training signals of related (auxiliary) tasks as an inductive bias that steers the model toward more generalisable representations. Tasks that share inductive biases (assumptions about what makes a good solution) tend to benefit from joint training.

Hard vs. Soft Parameter Sharing

Multi-task learning in deep neural networks is implemented through two main architectural patterns.

Hard parameter sharing is the most common approach. A shared backbone network processes the inputs for all tasks, and task-specific output heads branch off the final shared layers. Because the backbone weights must simultaneously satisfy the gradient demands of every task, this design reduces the risk of overfitting to any single task. It is also computationally efficient: the majority of parameters are shared, with only a small fraction dedicated to task-specific outputs.
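A minimal sketch of the hard-sharing layout follows, written in plain Python so that it runs without a deep learning library. The class name `HardSharedModel`, the task names, and the layer sizes are all hypothetical choices made for illustration, and only the forward pass is shown.

```python
import random

def linear(x, weights, bias):
    """Dense layer: one output value per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(v):
    return [max(0.0, a) for a in v]

def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases (illustrative initialisation)."""
    weights = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out

class HardSharedModel:
    """Shared backbone with one small output head per task."""

    def __init__(self, n_in, n_hidden, head_sizes, seed=0):
        rng = random.Random(seed)
        # One set of backbone weights, used by every task.
        self.backbone = init_layer(n_in, n_hidden, rng)
        # One task-specific head per task, each reading the shared features.
        self.heads = {task: init_layer(n_hidden, n_out, rng)
                      for task, n_out in head_sizes.items()}

    def forward(self, x):
        shared = relu(linear(x, *self.backbone))  # computed once per input
        return {task: linear(shared, *head)
                for task, head in self.heads.items()}

model = HardSharedModel(n_in=4, n_hidden=8,
                        head_sizes={"sentiment": 2, "topic": 5})
outputs = model.forward([0.1, -0.3, 0.7, 0.2])
print({task: len(logits) for task, logits in outputs.items()})
# prints {'sentiment': 2, 'topic': 5}
```

In training, the losses computed from each head would be combined, so the backbone weights receive gradients from every task while each head is updated only by its own task.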

Soft parameter sharing gives each task its own full network but adds regularisation losses that encourage the parameters of different task networks to remain similar to one another. Techniques such as L2 distance penalties between corresponding weights across task networks, or cross-task attention mechanisms that allow tasks to selectively borrow representations from one another, implement this pattern. Soft parameter sharing is more flexible but computationally more expensive.
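The L2 distance penalty mentioned above can be sketched as follows. This is an illustration under simplifying assumptions: each task network's weights are flattened into a list of floats, the two networks have identical shapes, and `strength` is a hypothetical hyperparameter controlling how strongly the networks are tied together.

```python
def l2_sharing_penalty(params_a, params_b):
    """Sum of squared differences between corresponding weights."""
    return sum((a - b) ** 2 for a, b in zip(params_a, params_b))

def soft_sharing_loss(loss_a, loss_b, params_a, params_b, strength=0.1):
    """Both task losses plus a penalty that pulls the two networks together.

    strength = 0 recovers two independent networks; larger values push the
    two parameter sets toward each other.
    """
    return loss_a + loss_b + strength * l2_sharing_penalty(params_a, params_b)

# Flattened weights of two task networks with identical shapes.
weights_task_a = [0.5, -1.0, 2.0]
weights_task_b = [0.0, -1.0, 1.0]

penalty = l2_sharing_penalty(weights_task_a, weights_task_b)
total = soft_sharing_loss(1.0, 2.0, weights_task_a, weights_task_b,
                          strength=0.5)
print(penalty, total)  # prints 1.25 3.625
```

Unlike hard sharing, nothing forces the two weight vectors to be equal. The penalty only makes disagreement costly, which is where the extra flexibility (and the extra parameters) comes from.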

Multi-Task Learning in Natural Language Processing

Multi-task learning has had a particularly strong impact in NLP. Early influential work by Collobert and Weston (2008) trained a single neural network jointly on part-of-speech (POS) tagging, chunking, named entity recognition, semantic role labelling, and language modelling, and reported that the shared representations improved generalisation over single-task training.

Modern large language models such as GPT and T5 are arguably multi-task learners at scale. Their pre-training objectives are simple (next-token prediction for GPT, span infilling for T5), but predicting text drawn from diverse sources implicitly requires skills such as translation, summarisation, and question answering, and the models are often subsequently fine-tuned on mixed-task datasets. Instruction tuning, where a model is fine-tuned on hundreds of diverse tasks phrased as natural-language instructions, is a form of multi-task learning that improves zero-shot generalisation to new tasks.

T5 (Text-to-Text Transfer Transformer) explicitly frames every NLP task as a text-to-text transformation, so a single model with a single training objective can be trained on a mixture of tasks, yielding a versatile model that transfers well to new tasks.
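The text-to-text framing can be illustrated with a small sketch that casts examples from several tasks into (input text, target text) pairs and draws a training mixture from them. The prefix strings, task names, and toy datasets below are assumptions chosen for illustration, not a specification of any particular model's input format.

```python
import random

# Task prefixes are illustrative; the exact strings are an assumption here.
TASK_PREFIXES = {
    "translation": "translate English to German: ",
    "summarisation": "summarize: ",
    "sentiment": "sentiment: ",
}

def to_text_to_text(task, source, target):
    """Cast one labelled example into an (input text, target text) pair."""
    return TASK_PREFIXES[task] + source, str(target)

def sample_mixture(datasets, n_examples, seed=0):
    """Draw examples from all tasks, proportionally to dataset size."""
    rng = random.Random(seed)
    pool = [(task, src, tgt)
            for task, rows in datasets.items()
            for src, tgt in rows]
    return [to_text_to_text(*rng.choice(pool)) for _ in range(n_examples)]

datasets = {
    "translation": [("That is good.", "Das ist gut.")],
    "summarisation": [("The storm closed roads across the region on Monday.",
                       "Storm closes roads.")],
    "sentiment": [("I loved this film.", "positive"),
                  ("Dull and far too long.", "negative")],
}

example = to_text_to_text("sentiment", "I loved this film.", "positive")
print(example)  # prints ('sentiment: I loved this film.', 'positive')

for model_input, model_target in sample_mixture(datasets, n_examples=3):
    print(repr(model_input), "->", repr(model_target))
```

Because every task shares the same input and output type (a string), no task-specific heads are needed: the prefix alone tells the model which task to perform.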

Multi-Task Learning in Computer Vision

In computer vision, multi-task learning is used to jointly train models for object detection, semantic segmentation, depth estimation, and surface normal estimation. Models such as HydraNet train a shared convolutional backbone with task-specific decoder heads for each perception task. This is particularly valuable in autonomous driving, where a single model must simultaneously perform many perception tasks in real time on constrained hardware.

Multi-task learning also enables cross-modal transfer: a model trained jointly on image classification and image captioning may learn richer visual representations than one trained on classification alone, because the captioning task forces the model to encode semantic content expressible in natural language.

Challenges

Despite its benefits, multi-task learning introduces challenges not present in single-task training. Task interference occurs when the gradient updates required by one task conflict with those required by another, leading to degraded performance compared to single-task training. This is especially likely when tasks are semantically dissimilar or when one task is far larger than the others. Negative transfer describes the overall degradation of performance on target tasks due to the influence of unrelated auxiliary tasks.

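Addressing task interference is an active research area. Techniques include gradient surgery (projecting each task's gradient to remove the components that conflict with other tasks' gradients, as in PCGrad; Yu et al., 2020), uncertainty-weighted losses (weighting each task loss by the inverse of a learned, task-specific variance to balance their contributions), gradient-magnitude balancing (GradNorm, which adjusts task weights during training so that tasks learn at similar rates), and task grouping (identifying which tasks benefit from being trained together before joint training; Fifty et al., 2021). PCGrad and GradNorm are among the most widely used of these methods in practice.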
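The projection step behind gradient surgery can be sketched in a few lines. The version below follows the general PCGrad recipe (project out the conflicting component whenever two task gradients have a negative dot product, then sum), but it is a simplified illustration: gradients are plain lists of floats for the shared parameters, and the function name `pcgrad` and the `seed` argument are choices made here.

```python
import random

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def pcgrad(task_grads, seed=0):
    """Combine per-task gradients on the shared parameters, PCGrad-style.

    For each task gradient, any component that conflicts with another task's
    original gradient (negative dot product) is projected out. The projected
    gradients are then summed.
    """
    rng = random.Random(seed)
    projected = []
    for i, grad in enumerate(task_grads):
        g = list(grad)
        others = [j for j in range(len(task_grads)) if j != i]
        rng.shuffle(others)  # the other tasks are visited in random order
        for j in others:
            other = task_grads[j]
            overlap = dot(g, other)
            if overlap < 0:  # conflict: remove the component along `other`
                scale = overlap / dot(other, other)
                g = [a - scale * b for a, b in zip(g, other)]
        projected.append(g)
    return [sum(components) for components in zip(*projected)]

g_task_a = [1.0, 0.0]
g_task_b = [-1.0, 1.0]  # conflicts with task A: the dot product is -1

combined = pcgrad([g_task_a, g_task_b])
print(combined)  # prints [0.5, 1.5]
```

A plain sum of the two gradients would give `[0.0, 1.0]`, cancelling task A's update entirely. After projection, the combined update no longer opposes either task's original gradient.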

| Challenge | Description | Mitigation |
|---|---|---|
| Task interference | Conflicting gradients degrade performance | Gradient surgery, PCGrad |
| Negative transfer | Auxiliary tasks hurt target task | Task similarity analysis, task grouping |
| Loss scaling | Different task losses on different scales | Uncertainty weighting, GradNorm |
| Task imbalance | Large tasks dominate training | Sampling strategies, loss normalisation |
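As an illustration of the loss-scaling mitigation, the sketch below implements a commonly used simplified form of uncertainty weighting, with constant factors omitted. Each task $t$ has a scalar $s_t = \log \sigma_t^2$, and the combined loss is the sum of $e^{-s_t} L_t + s_t$. In real training each $s_t$ would be a learned parameter; here they are plain floats, and the function name is hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_variances):
    """Sum of exp(-s_t) * L_t + s_t over tasks, with s_t = log(sigma_t^2).

    exp(-s_t) down-weights a task's loss, while the + s_t term stops the
    model from ignoring a task by making s_t arbitrarily large.
    """
    return sum(math.exp(-s) * loss + s
               for loss, s in zip(task_losses, log_variances))

losses = [4.0, 0.05]  # two task losses on very different scales

# With every s_t = 0 the result is just the plain sum of the losses.
equal = uncertainty_weighted_loss(losses, [0.0, 0.0])

# Raising s for the large-loss task shrinks its weighted contribution.
rebalanced = uncertainty_weighted_loss(losses, [math.log(4.0), 0.0])

print(equal, rebalanced)
```

For a fixed loss value $L_t$, the term $e^{-s_t} L_t + s_t$ is minimised at $s_t = \log L_t$, where the weighted loss $e^{-s_t} L_t$ equals 1. This is one way to see how the learned weights pull losses of different magnitudes onto a comparable scale.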

Relationship to Transfer Learning and Foundation Models

Multi-task learning is closely related to transfer learning: both exploit shared structure across tasks, but multi-task learning trains on all tasks simultaneously while transfer learning trains sequentially (pre-training, then fine-tuning). Foundation models such as GPT-4 and Claude can be viewed as products of implicit multi-task pre-training at scale, where exposure to diverse tasks during pre-training gives the model versatile capabilities that transfer to downstream tasks.

References

  1. Caruana, R. (1997). Multitask Learning. Machine Learning, 28(1), 41-75. Kluwer Academic Publishers.
  2. Ruder, S. (2017). An Overview of Multi-Task Learning in Deep Neural Networks. arXiv:1706.05098.
  3. Collobert, R., and Weston, J. (2008). A Unified Architecture for Natural Language Processing: Deep Neural Networks with Multitask Learning. Proceedings of the 25th International Conference on Machine Learning (ICML).
  4. Yu, T., Kumar, S., Gupta, A., Levine, S., Hausman, K., and Finn, C. (2020). Gradient Surgery for Multi-Task Learning. Advances in Neural Information Processing Systems (NeurIPS).
  5. Fifty, C., Amid, E., Zhao, Z., Yu, T., Anil, R., and Finn, C. (2021). Efficiently Identifying Task Groupings for Multi-Task Learning. NeurIPS.