Foundation Model

A large-scale AI model pretrained on broad, diverse datasets and designed to be adapted to a wide range of downstream tasks through fine-tuning, prompting, or retrieval augmentation.

Last updated June 2026

A foundation model is a large-scale artificial intelligence model trained on broad, diverse datasets — typically at internet scale — using self-supervised or weakly supervised learning, and subsequently adapted to a wide variety of downstream tasks. The term was introduced by the Center for Research on Foundation Models (CRFM) at Stanford University in a landmark 2021 report, which defined foundation models by two key properties: they are trained on broad data at scale, and they are adaptable to a wide range of downstream tasks through processes such as fine-tuning, prompting, or retrieval augmentation.

The "foundation" metaphor is deliberate: these models serve as a common base upon which task-specific applications are built, so a separate specialised model need not be trained for each application. Practitioners reuse the general representations learned during pretraining and spend relatively little additional compute on task-specific adaptation.

Pretraining at Scale

Foundation models are distinguished above all by the scale of their pretraining. They are trained on datasets comprising hundreds of billions to trillions of tokens of text, billions of images, or combinations of modalities, using distributed training across thousands of specialised accelerators (GPUs or TPUs) over weeks or months.

The pretraining objective varies by modality. Language foundation models typically use next-token prediction (autoregressive language modelling) or masked token prediction (as in BERT). Vision foundation models use contrastive objectives, masked image modelling, or image-text alignment (as in CLIP). Multimodal models combine these approaches.
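The next-token objective can be illustrated with a minimal sketch in plain Python. It assumes the model has already produced a list of vocabulary scores (logits) for each position; the function names, the three-token vocabulary, and the toy logits are hypothetical and chosen only for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from position t.

    logits_per_position[t] holds the scores over the vocabulary that a
    (hypothetical) model produced after reading token_ids[: t + 1].
    """
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]          # the token that actually came next
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)

# Toy sequence over a 3-token vocabulary (illustrative values only).
tokens = [0, 2, 1]
uniform_logits = [[0.0, 0.0, 0.0]] * 3                             # model with no preference
confident_logits = [[0.0, 0.0, 10.0], [0.0, 10.0, 0.0], [0.0, 0.0, 0.0]]

uniform_loss = next_token_loss(uniform_logits, tokens)             # equals ln(3)
confident_loss = next_token_loss(confident_logits, tokens)         # much lower
```

A model that assigns equal probability to every token incurs a loss of ln(vocabulary size); pretraining pushes the loss below that baseline by making the observed next token more probable.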

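Scaling laws reported by Kaplan and colleagues at OpenAI (2020) and revised by Hoffmann and colleagues at DeepMind (Chinchilla, 2022) show that a language model's test loss falls predictably, following a power law, as model parameters, training data, and compute increase, provided none of the three becomes a bottleneck. The two studies differ on the best allocation of a fixed compute budget: Kaplan and colleagues favoured growing model size faster than data, whereas the Chinchilla analysis found that parameters and training tokens should grow in roughly equal proportion, at about 20 tokens per parameter. This regularity has guided the design of successive model generations.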
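The arithmetic behind compute budgeting can be sketched as follows. Two rules of thumb are assumed here and are approximations, not exact results: training compute of roughly 6 floating-point operations per parameter per token, and a compute-optimal budget of about 20 training tokens per parameter, a heuristic commonly associated with the Chinchilla analysis. The function names are hypothetical.

```python
def training_flops(n_params, n_tokens):
    """Approximate training compute with the common C ~ 6 * N * D rule of thumb."""
    return 6 * n_params * n_tokens

def compute_optimal_tokens(n_params, tokens_per_param=20):
    """Heuristic token budget; the default ratio of 20 is an approximation."""
    return n_params * tokens_per_param

n_params = 70_000_000_000                      # a 70-billion-parameter model
n_tokens = compute_optimal_tokens(n_params)    # 1.4 trillion tokens
flops = training_flops(n_params, n_tokens)     # about 5.88e23 FLOPs

print(f"tokens: {n_tokens:.2e}, training compute: {flops:.2e} FLOPs")
```

Under these assumptions, doubling the parameter count while keeping the tokens-per-parameter ratio fixed quadruples the estimated training compute.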

Architecture

The dominant architecture for foundation models is the Transformer, introduced by Vaswani and colleagues in 2017. Its self-attention mechanism lets every token attend directly to every other token in the context and parallelises well across accelerator arrays, although its compute and memory cost grows quadratically with context length. GPT-family models use a decoder-only Transformer; BERT uses an encoder-only architecture; T5 uses an encoder-decoder design. The Llama family, Mistral, and Qwen use decoder-only architectures with refinements including grouped-query attention, rotary positional embeddings (RoPE), and SwiGLU activation functions.
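As a rough illustration of the self-attention step in a decoder-only Transformer, the sketch below implements single-head scaled dot-product attention with a causal mask in plain Python. It omits the learned projection matrices, multiple heads, and positional embeddings used in real models, and the toy vectors are illustrative only.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def causal_self_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Each argument is a list of vectors, one per token. Position i attends
    only to positions 0..i, as in decoder-only models.
    """
    d = len(queries[0])
    outputs = []
    for i, q in enumerate(queries):
        # Similarity of this query to every key at or before position i.
        scores = [
            sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        out = [
            sum(w * values[j][k] for j, w in enumerate(weights))
            for k in range(len(values[0]))
        ]
        outputs.append(out)
    return outputs

# Toy input: three tokens with 2-dimensional vectors, reused as Q, K and V.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended = causal_self_attention(x, x, x)
```

The nested loop over query and key positions makes the number of scores grow with the square of the sequence length, which is why long contexts are costly for standard attention.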

Adaptation Methods

Foundation models are not typically deployed directly from pretraining weights. Several adaptation strategies are used.

Fine-tuning updates some or all of the model's parameters on a task-specific labelled dataset. Full fine-tuning is computationally expensive; parameter-efficient methods such as LoRA (Low-Rank Adaptation) and adapters instead train a small number of added parameters while keeping the pretrained weights frozen.
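The idea behind LoRA can be sketched in a few lines: the frozen weight matrix W is left untouched, and a low-rank product B·A, scaled by alpha / rank, is added to it. The helper names and the toy 4×4 matrix below are hypothetical; real implementations operate on tensors inside a deep-learning framework and train A and B by gradient descent.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w_frozen, lora_b, lora_a, alpha, rank):
    """Return W + (alpha / rank) * B @ A without modifying W."""
    delta = matmul(lora_b, lora_a)
    scale = alpha / rank
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w_frozen, delta)
    ]

def lora_param_count(d_out, d_in, rank):
    """Trainable parameters in B (d_out x rank) plus A (rank x d_in)."""
    return d_out * rank + rank * d_in

# Toy example: 4x4 frozen identity weight, rank-1 adapter.
d_out, d_in, rank = 4, 4, 1
w = [[1.0 if i == j else 0.0 for j in range(d_in)] for i in range(d_out)]
b = [[0.0] for _ in range(d_out)]   # B starts at zero, so the adapter is initially a no-op
a = [[0.1] * d_in]
w_eff = lora_effective_weight(w, b, a, alpha=2.0, rank=rank)

# Parameter savings for a hypothetical 4096 x 4096 layer at rank 8.
trainable = lora_param_count(4096, 4096, 8)   # 65,536
full = 4096 * 4096                            # 16,777,216
```

In this illustrative case the adapter trains about 0.4% of the parameters that full fine-tuning of the same layer would update.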

Prompting and in-context learning bypass weight updates entirely. A carefully constructed natural language prompt is prepended to the input, and the model generates the desired output either zero-shot (the prompt contains only instructions) or few-shot (the prompt also includes a handful of worked examples of the task).
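A few-shot prompt is ordinary text assembled before the model is called. The sketch below shows one possible layout; the function name, the "Input:"/"Output:" labels, and the sentiment examples are assumptions made for illustration, not a required format.

```python
def build_few_shot_prompt(instruction, examples, query):
    """Assemble an instruction, optional worked examples, and the new input.

    Passing an empty list of examples yields a zero-shot prompt.
    """
    lines = [instruction, ""]
    for text, label in examples:
        lines.append(f"Input: {text}")
        lines.append(f"Output: {label}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")   # left open for the model to complete
    return "\n".join(lines)

examples = [
    ("The film was wonderful.", "positive"),
    ("I want my money back.", "negative"),
]
few_shot = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    examples,
    "The service was slow but the food was great.",
)
zero_shot = build_few_shot_prompt(
    "Classify the sentiment of each input as positive or negative.",
    [],
    "The service was slow but the food was great.",
)
print(few_shot)
```

The resulting string is what would be sent to the model; no weights change between the zero-shot and few-shot variants.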

Retrieval-augmented generation (RAG) pairs a frozen foundation model with an external knowledge base: passages relevant to each query are retrieved and added to the prompt, enabling the model to draw on facts not present in its training data without retraining.
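The retrieve-then-prompt flow can be sketched with a toy retriever. Production systems typically use learned dense embeddings and a vector index; the bag-of-words cosine similarity here is a stand-in chosen so the example runs with the standard library alone, and the documents, function names, and prompt wording are hypothetical.

```python
import math
from collections import Counter

def bag_of_words(text):
    """Lower-cased word counts; a crude stand-in for a learned embedding."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine(q, bag_of_words(d)), reverse=True)
    return ranked[:k]

def build_rag_prompt(query, documents, k=1):
    """Insert the retrieved passages into the prompt ahead of the question."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}\nAnswer:"
    )

knowledge_base = [
    "LoRA trains low-rank adapter matrices while base weights stay frozen",
    "The Transformer architecture was introduced in 2017",
    "CLIP aligns images and text with a contrastive objective",
]
question = "When was the Transformer architecture introduced"
top = retrieve(question, knowledge_base)
prompt = build_rag_prompt(question, knowledge_base)
```

The model itself is unchanged; updating the knowledge base changes what it can answer without any retraining.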

Reinforcement learning from human feedback (RLHF) further aligns a pretrained model with human preferences and instructions, producing instruction-following models such as ChatGPT and Claude.
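One common ingredient of RLHF pipelines is a reward model trained on pairs of responses that human annotators have ranked. A widely used formulation, assumed here and not the only option, is a pairwise loss that is small when the preferred response receives the higher reward. The function name and reward values below are illustrative.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss -log(sigmoid(r_chosen - r_rejected)), computed stably."""
    diff = reward_chosen - reward_rejected
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    # Equivalent form that avoids overflow when diff is very negative.
    return -diff + math.log1p(math.exp(diff))

agrees = preference_loss(2.0, 0.0)      # reward model ranks the pair as humans did
undecided = preference_loss(0.0, 0.0)   # equals ln(2)
disagrees = preference_loss(0.0, 2.0)   # reward model ranks the pair the wrong way
```

Minimising this loss over many ranked pairs yields a scalar reward signal, which a reinforcement learning step can then use to adjust the pretrained model's outputs.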

Notable Examples

As of mid-2026, prominent foundation models include GPT-4 and GPT-5 (OpenAI), Claude 3 Opus and Claude Sonnet 4 (Anthropic), Gemini 2.5 Pro (Google DeepMind), Llama 3.x and Llama 4 (Meta), Mistral Large (Mistral AI), Qwen 2.5 and Qwen 3 (Alibaba), and DeepSeek-V3 (DeepSeek AI). These span text, vision, code, audio, and multimodal capabilities, and are deployed via cloud APIs, on-premises installations, and edge devices depending on model size.

Open-weight models — whose parameters are publicly released — such as Llama and Mistral have enabled a broad ecosystem of derivative models fine-tuned for specific languages, domains, and applications.

Governance and Concerns

Foundation models raise policy and ethical concerns. Their training data often contains copyrighted material, personal information, and biased content, raising questions about intellectual property, privacy, and fairness. The concentration of capability in a small number of large organisations, a consequence of the enormous capital requirements for training at scale, has prompted discussions about access, competition, and AI governance. The EU AI Act (2024) places transparency obligations on providers of general-purpose AI (GPAI) models and adds further safety obligations for models presumed to pose systemic risk, currently those trained with more than 10^25 floating-point operations of compute.

References

  1. Bommasani, R., et al. (2021). On the opportunities and risks of foundation models. arXiv:2108.07258. Stanford CRFM.
  2. Kaplan, J., et al. (2020). Scaling laws for neural language models. arXiv:2001.08361. OpenAI.
  3. Hoffmann, J., et al. (2022). Training compute-optimal large language models. arXiv:2203.15556. DeepMind.
  4. Vaswani, A., et al. (2017). Attention is all you need. NeurIPS 2017.
  5. AWS. (2025). What are foundation models? Amazon Web Services documentation. aws.amazon.com.