Emergent Abilities
Capabilities that appear in large language models only once they reach a certain scale, and which are not present in smaller models, along with the debate over whether such abilities are real or measurement artifacts.
Emergent abilities are capabilities that large language models exhibit at sufficient scale but that are absent, or on which performance is near random, in smaller models of the same family. The term was popularised in 2022 to describe the observation that certain skills, such as multi-step arithmetic, answering questions in specialised domains, or following instructions, seemed to appear suddenly once a model exceeded a threshold in size or training compute, rather than improving smoothly as the model grew. The concept became influential in discussions of how model capabilities scale and in debates about how predictable and how safe increasingly large models are.
The original observation
Researchers plotting model performance against scale reported that on some tasks a model's accuracy remained close to chance across a wide range of sizes and then rose sharply beyond a certain point. Because the improvement looked like a phase transition rather than gradual progress, these abilities were called emergent, borrowing a term from the study of complex systems, where qualitatively new behaviour appears at scale. If genuine, emergence would have important implications: simply making models larger could unlock capabilities that were not visible in smaller versions, which would make capabilities hard to predict in advance.
The mirage critique
The claim was challenged in an influential 2023 paper, "Are Emergent Abilities of Large Language Models a Mirage?", by Rylan Schaeffer, Brando Miranda, and Sanmi Koyejo, which received a NeurIPS outstanding paper award. The authors argued that apparent emergence can be an artifact of the evaluation metric rather than a real change in model behaviour. Many benchmarks use harsh all-or-nothing scoring, for example marking a long arithmetic answer correct only if every digit is right. Under such a discontinuous metric, a model whose underlying competence is improving smoothly can appear to jump from zero to success once it crosses a hidden threshold. When the same tasks are scored with continuous or more forgiving metrics, the paper showed, the sharp jumps often smooth out into gradual, predictable improvement.
This critique reframed part of the debate as one about measurement. It did not claim that large models lack impressive capabilities, only that the specific appearance of sudden, unpredictable emergence can often be traced to how performance is measured.
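The metric effect described above can be illustrated with a small numerical sketch. It is a simplification, not the paper's own analysis: it assumes each token of an answer is correct independently with the same probability, and the per-token accuracies and the answer length are made-up values standing in for a family of increasingly large models. Under those assumptions, exact-match accuracy on an answer of length n is the per-token accuracy raised to the power n.

```python
# Illustrative sketch only: the per-token accuracies below are made-up values,
# not measurements from any real model.

def exact_match_accuracy(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token is right, assuming independent token errors."""
    return per_token_accuracy ** answer_length

# Hypothetical per-token accuracies for a family of increasingly large models.
per_token = [0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99]

# Assumed answer length, e.g. a 10-digit arithmetic result.
ANSWER_LENGTH = 10

# All-or-nothing score for each hypothetical model.
exact_match = [exact_match_accuracy(p, ANSWER_LENGTH) for p in per_token]

for p, em in zip(per_token, exact_match):
    print(f"per-token accuracy {p:.2f} -> exact-match accuracy {em:.3f}")
```

With these assumed numbers, the continuous metric (per-token accuracy) rises steadily from 0.50 to 0.99, while the all-or-nothing metric stays below 3% until per-token accuracy reaches 0.70 and then climbs to roughly 90%. The underlying competence is the same in both columns; only the scoring rule differs.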
Current understanding
The discussion has since become more nuanced. Analysts distinguish between the strong claim that capabilities appear unpredictably at scale and the weaker, well-supported observation that larger models are simply more capable across many tasks. Some researchers note that even if a metric is the proximate cause of a discontinuity, discontinuous metrics can still be the ones that matter in practice, since many real applications care about getting an answer entirely right rather than partially right. Work has also explored predicting future capabilities, either by fine-tuning or by extrapolating performance trends, with the aim of making capability forecasting more reliable. Emergent abilities remain a touchstone in conversations about neural scaling laws, evaluation design, and AI safety, where the predictability of capability gains bears directly on how carefully new models should be tested before release.
References
- Wei, J., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.
- Schaeffer, R., Miranda, B., and Koyejo, S. (2023). Are Emergent Abilities of Large Language Models a Mirage? NeurIPS.
- Center for Security and Emerging Technology. (2023). Emergent Abilities in Large Language Models: An Explainer. cset.georgetown.edu.