
Neural Scaling Laws

Neural scaling laws are empirical relationships describing how the performance of neural networks improves predictably as a function of model size, dataset size, and compute budget, enabling principled resource allocation for AI training.

7 min read · Last updated June 2026 · Foundations

Neural scaling laws are quantitative empirical relationships that describe how the performance of neural networks — typically measured as loss on a held-out evaluation set — improves as a function of three key variables: the number of model parameters, the size of the training dataset, and the total compute budget available for training. First systematically characterised for large language models by researchers at OpenAI in 2020, scaling laws have become a foundational tool for planning AI training runs, allocating resources efficiently, and extrapolating expected performance improvements from smaller experiments to larger ones.

The central finding is that loss follows smooth power-law relationships across many orders of magnitude of scale. If one doubles the number of model parameters while data and compute are not limiting, loss falls by a roughly constant fraction, whatever the starting size. This regularity, which holds from models with millions of parameters to those with hundreds of billions, was initially surprising and has profoundly influenced the strategic direction of AI development.
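As a rough illustration of what "predictable" means here: if loss follows L(N) = (N_c / N)^alpha, then doubling N multiplies loss by 2^(-alpha) regardless of the starting size. The sketch below assumes alpha = 0.076 and N_c = 8.8e13, values close to the parameter-count fit reported by Kaplan et al. (2020). Both constants and the function names are illustrative assumptions, not figures from this article.

```python
def power_law_loss(n_params: float, n_c: float = 8.8e13, alpha: float = 0.076) -> float:
    """Loss predicted by L(N) = (N_c / N) ** alpha (constants are assumed values)."""
    return (n_c / n_params) ** alpha


def doubling_factor(alpha: float = 0.076) -> float:
    """Multiplicative change in loss when N doubles: 2 ** (-alpha)."""
    return 2.0 ** (-alpha)


if __name__ == "__main__":
    # The ratio is the same at every scale, which is the defining
    # property of a power law.
    for n in (1e8, 1e9, 1e10):
        ratio = power_law_loss(2 * n) / power_law_loss(n)
        print(f"N = {n:.0e}: loss ratio after doubling = {ratio:.4f}")
    print(f"closed form 2 ** (-alpha) = {doubling_factor():.4f}")
```

Under these assumed constants each doubling of parameters trims loss by about 5%, whether the model starts at 100 million or 10 billion parameters.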

The Kaplan et al. Scaling Laws (2020)

The landmark scaling law study, published by Jared Kaplan, Sam McCandlish, and colleagues at OpenAI in January 2020, analysed language model cross-entropy loss on held-out text as a function of model size (N, measured in non-embedding parameters), dataset size (D, measured in tokens), and compute (C, measured in floating point operations). Their key findings were:

Loss falls as a power law in each of N, D, and C when the other two are not limiting. The fitted exponents depend only weakly on architectural details such as depth and width, although their exact values vary with the tokenisation scheme and the training dataset.

For a fixed compute budget C, there is an optimal allocation between model size and training data. The 2020 paper found that larger models are significantly more sample-efficient than smaller ones, suggesting that, for a given compute budget, one should prefer to train a larger model on fewer tokens rather than a smaller model on more tokens.

Performance is primarily limited by whichever of N, D, or C is the bottleneck. Increasing one resource while holding the others fixed produces diminishing returns once another resource becomes limiting.
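The diminishing-returns behaviour in the last finding can be sketched with a joint loss of the form L(N, D) = [(N_c / N)^(alpha_N / alpha_D) + D_c / D]^alpha_D, which is the kind of combined fit Kaplan et al. (2020) propose. The constants below are rounded, illustrative assumptions rather than values taken from this article, and `joint_loss` is a hypothetical helper name.

```python
# Illustrative constants (assumed, roughly the magnitude reported in the 2020 paper).
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13


def joint_loss(n_params: float, n_tokens: float) -> float:
    """Loss as a function of both model size N and dataset size D."""
    model_term = (N_C / n_params) ** (ALPHA_N / ALPHA_D)
    data_term = D_C / n_tokens
    return (model_term + data_term) ** ALPHA_D


def data_floor(n_tokens: float) -> float:
    """Loss reached with unlimited parameters at a fixed dataset size."""
    return (D_C / n_tokens) ** ALPHA_D


if __name__ == "__main__":
    tokens = 1e10  # dataset held fixed while the model grows
    for n in (1e8, 1e9, 1e10, 1e11, 1e12):
        print(f"N = {n:.0e}: loss = {joint_loss(n, tokens):.3f}")
    print(f"floor set by the data: {data_floor(tokens):.3f}")
```

With the dataset fixed, the first tenfold increase in parameters lowers loss noticeably, while later increases barely move it because the data term dominates.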

The Chinchilla Scaling Laws (2022)

A subsequent analysis by Jordan Hoffmann, Sebastian Borgeaud, and colleagues at DeepMind, published in March 2022 and known as the "Chinchilla paper" after the model it produced, refined and partially revised the 2020 findings. Using a more rigorous experimental design that varied model size and training tokens together across a range of compute budgets, the Chinchilla analysis found that the 2020 recommendation to prefer larger models was overstated.

The Chinchilla paper found that, for a given compute budget, the optimal allocation is to train on roughly 20 tokens for every model parameter, with model size and token count growing in roughly equal proportion as the budget increases. This is substantially more data than the ratio implied by the 2020 results. By this standard many existing large language models were significantly undertrained relative to their size, including GPT-3 (175 billion parameters trained on approximately 300 billion tokens, a ratio of about 1.7:1 rather than 20:1).
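The 20:1 rule can be turned into a simple budget calculator. The sketch below additionally assumes the common approximation that training costs about 6 FLOPs per parameter per token (C ≈ 6·N·D). That approximation is not stated in this article, and `compute_optimal_allocation` is a hypothetical helper name.

```python
import math


def compute_optimal_allocation(
    flops: float,
    tokens_per_param: float = 20.0,       # Chinchilla rule of thumb
    flops_per_param_token: float = 6.0,   # assumed C ~ 6 * N * D approximation
) -> tuple[float, float]:
    """Return (parameters, tokens) that spend `flops` at the given ratio.

    Solves C = k * N * D with D = r * N, so N = sqrt(C / (k * r)).
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens


if __name__ == "__main__":
    for budget in (1e21, 1e22, 5.88e23):
        n, d = compute_optimal_allocation(budget)
        print(f"C = {budget:.2e} FLOPs -> N = {n:.2e} params, D = {d:.2e} tokens")
```

With a budget of 5.88e23 FLOPs the sketch returns about 70 billion parameters and 1.4 trillion tokens, which matches the Chinchilla configuration described in this article.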

Chinchilla (70 billion parameters, trained on 1.4 trillion tokens) outperformed much larger models including Gopher (280B parameters, 300B tokens) and GPT-3 on a wide range of tasks, validating the revised scaling prescription. The Chinchilla results shifted industry practice significantly: subsequent models including LLaMA, Mistral, and Falcon adopted much longer training runs relative to model size.
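To make the comparison concrete, the sketch below computes tokens per parameter and an approximate training cost for the three models, using the parameter and token counts quoted in this article. The cost estimate assumes the C ≈ 6·N·D approximation, which is not from this article, so the FLOP figures are rough.

```python
# (parameters, training tokens) as quoted in the article text.
MODELS = {
    "GPT-3": (175e9, 300e9),
    "Gopher": (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
}


def tokens_per_param(n_params: float, n_tokens: float) -> float:
    """Training tokens seen per model parameter."""
    return n_tokens / n_params


def approx_train_flops(n_params: float, n_tokens: float) -> float:
    """Rough training cost under the assumed C ~ 6 * N * D approximation."""
    return 6.0 * n_params * n_tokens


if __name__ == "__main__":
    for name, (n, d) in MODELS.items():
        print(
            f"{name:<11} tokens/param = {tokens_per_param(n, d):5.2f}   "
            f"approx FLOPs = {approx_train_flops(n, d):.2e}"
        )
```

Under this approximation Chinchilla and Gopher land at roughly comparable training cost, so the difference in results reflects how the budget was split between parameters and tokens.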

Extensions and Nuances

Emergent Capabilities

While scaling laws predict smooth improvement in average loss, researchers have observed that certain capabilities appear abruptly at specific scales. These so-called emergent abilities seem absent below a threshold and present above it. Reported examples include few-shot arithmetic, chain-of-thought reasoning, and calibrated uncertainty. Later analyses have argued that much of this abruptness is an artefact of coarse, all-or-nothing evaluation metrics such as exact match: more sensitive measures often reveal smoother transitions. Nonetheless, the possibility of qualitative capability changes at scale has important implications for AI safety and planning.
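A toy calculation shows how a coarse metric can make smooth progress look abrupt. Suppose a task is scored by exact match on a 10-token answer and, as a simplifying assumption, each token is correct independently with probability p. The exact-match rate is then p raised to the answer length. The function name and the numbers are illustrative only.

```python
def exact_match_rate(per_token_accuracy: float, answer_length: int) -> float:
    """Probability that every token of the answer is right, assuming
    independent per-token errors (a simplifying assumption)."""
    return per_token_accuracy ** answer_length


if __name__ == "__main__":
    # Per-token accuracy improves steadily, yet exact match stays near
    # zero for most of the range and then climbs steeply.
    for p in (0.50, 0.60, 0.70, 0.80, 0.90, 0.95, 0.99):
        print(f"per-token accuracy {p:.2f} -> exact match {exact_match_rate(p, 10):.3f}")
```

Per-token accuracy rising from 0.5 to 0.9 moves exact match only from roughly 0.001 to about 0.35, so most of the visible gain is concentrated at the top of the range.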

Data Quality and Repetition

Scaling laws are usually fitted in a regime where each training token is seen only once. Repeating data yields diminishing returns: Muennighoff et al. (2023) found that a few epochs over the same data cost little relative to fresh data, but that the value of further repetition decays quickly. In practice, therefore, the supply of high-quality, non-repeated data is a real constraint. Research on synthetic data generation (using AI to produce additional training material) and data quality filtering has become increasingly important as web-scale corpora approach exhaustion.

Inference Compute

The original scaling laws concern training compute. More recent work extends the analysis to inference, studying how performance improves when more computation is spent at test time, for example through longer chain-of-thought reasoning, repeated sampling, or verifier-guided search. This has become especially relevant with reasoning models such as OpenAI o3 and DeepSeek-R1, which spend substantially more inference compute than earlier models.
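Repeated sampling gives the simplest picture of inference-time scaling. If a single sample solves a problem with probability p, and samples are assumed independent with a verifier that always recognises a correct answer, then the chance that at least one of k samples succeeds is 1 − (1 − p)^k. Both assumptions are idealisations, and `coverage` is a hypothetical helper name.

```python
def coverage(p_single: float, k: int) -> float:
    """Probability that at least one of k independent samples is correct."""
    return 1.0 - (1.0 - p_single) ** k


if __name__ == "__main__":
    p = 0.10  # illustrative single-sample success rate
    for k in (1, 2, 4, 8, 16, 32, 64):
        print(f"k = {k:>2} samples -> coverage {coverage(p, k):.3f}")
```

Each doubling of the sample count buys less than the one before it once coverage is high, which is the same diminishing-returns pattern seen with training compute.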

Implications for AI Development

Scaling laws provide AI laboratories with a rational basis for planning large training runs. By running small-scale experiments and fitting power laws to the results, researchers can extrapolate the expected performance of models that would take months and tens of millions of dollars to train, and so estimate the likely outcome before committing resources.
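The fit-then-extrapolate workflow can be sketched in a few lines. A power law y = a · x^(-b) is a straight line in log-log space, so an ordinary least-squares fit on the logarithms recovers a and b. The data below are synthetic, generated from a known power law purely for illustration. Real fits often also include an irreducible-loss constant, which this sketch omits.

```python
import math


def fit_power_law(xs: list[float], ys: list[float]) -> tuple[float, float]:
    """Least-squares fit of y = a * x ** (-b) in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    slope = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y)) / sum(
        (u - mean_x) ** 2 for u in log_x
    )
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


def predict(a: float, b: float, x: float) -> float:
    """Evaluate the fitted power law at x."""
    return a * x ** (-b)


# Synthetic "small runs": compute budgets in FLOPs and the losses they reach.
TRUE_A, TRUE_B = 30.0, 0.05  # made-up ground truth for the demonstration
small_budgets = [1e17, 1e18, 1e19, 1e20]
small_losses = [TRUE_A * c ** (-TRUE_B) for c in small_budgets]

a_hat, b_hat = fit_power_law(small_budgets, small_losses)

if __name__ == "__main__":
    target = 1e23  # a run 1000x larger than the biggest experiment
    print(f"fitted exponent b = {b_hat:.4f}")
    print(f"extrapolated loss at {target:.0e} FLOPs = {predict(a_hat, b_hat, target):.3f}")
```

On noiseless synthetic data the fit recovers the generating exponent. With real measurements the extrapolation carries uncertainty that grows with the distance from the fitted range.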

The predictability of scaling has driven a sustained increase in AI training expenditure: each doubling of compute yields a predictable performance improvement, making continued investment rational as long as performance remains economically valuable.

References

  1. Kaplan, J., et al. (2020). Scaling Laws for Neural Language Models. arXiv:2001.08361. OpenAI.
  2. Hoffmann, J., et al. (2022). Training Compute-Optimal Large Language Models. arXiv:2203.15556. DeepMind.
  3. Wei, J., et al. (2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.
  4. Muennighoff, N., et al. (2023). Scaling Data-Constrained Language Models. NeurIPS 2023.
  5. Anthropic. (2023). Scaling: The State of Play in AI. anthropic.com/research.