Small Language Models

Small language models (SLMs) are compact language models with fewer than around 10 billion parameters, designed for efficient deployment on edge devices, mobile hardware, and resource-constrained environments.

Last updated June 2026

Small language models (SLMs) are a category of language model characterised by a comparatively low parameter count, typically below 10 billion parameters, that still performs well on a broad range of tasks while remaining feasible to deploy on consumer hardware, mobile devices, and industrial edge systems without cloud connectivity. The category emerged as a response to the cost, latency, and privacy limitations of large cloud-hosted language models, and has grown rapidly since 2023 as advances in training methodology allowed smaller models to approach, and on some benchmarks match, the quality of much larger predecessors.

The boundary between "small" and "large" is not formally defined and has shifted over time. Models that were considered large when released, such as GPT-2 (2019) at 1.5 billion parameters, are now firmly in the small category. In 2025 and 2026, the practical threshold for SLMs is often placed somewhere between 7 and 10 billion parameters, although sub-billion-parameter models have demonstrated utility on highly focused tasks.

Why Small Models Matter

The dominance of large language models such as GPT-4, Claude, and Gemini Ultra has obscured an important practical reality: most enterprise and consumer AI tasks do not require frontier-scale reasoning. Summarising a customer support ticket, classifying product reviews, extracting structured data from a document, or answering domain-specific questions can often be performed with high accuracy by a well-trained model of 3 to 7 billion parameters.

Deploying a smaller model carries several advantages. Inference cost falls sharply: a 3.8B-parameter model has roughly 18 times fewer weights to read and multiply per token than a 70B model, so it can run on a single consumer GPU where the larger model needs a multi-GPU cloud cluster. The exact saving depends on hardware, batch size, and utilisation. Latency improves when the model runs locally, because there is no network round-trip and the weights fit entirely in local memory. Privacy is easier to preserve because sensitive data need not leave the device. The model also keeps working in environments without reliable internet access, including industrial plants, aircraft, remote field operations, and some healthcare facilities.

Key Models

Microsoft Phi Series

The Phi series, developed by Microsoft Research, demonstrated that careful curation of training data could yield models that punch well above their weight. Phi-1 (2023) reached 50.6% pass@1 on the HumanEval Python coding benchmark despite having only 1.3 billion parameters, a result competitive with much larger models at the time. It was trained on "textbook-quality" data: heavily filtered web code plus synthetic textbooks and exercises generated with GPT-3.5, rather than unfiltered web crawls.

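Phi-3 (2024) extended this approach. Its smallest variant, Phi-3-mini, has 3.8 billion parameters and was reported to perform comparably to GPT-3.5 on standard benchmarks such as MMLU and MT-bench. Quantised to 4 bits, Phi-3-mini occupies roughly 1.8GB, which fits within the memory of many consumer smartphones; at 16-bit precision the weights alone need about 7.6GB. Phi-4, a 14-billion-parameter text model announced in December 2024, pushed quality further but sits above the usual SLM threshold. In 2025 the family added Phi-4-mini (3.8B) and Phi-4-multimodal (5.6B), bringing longer context windows and image and audio input to models below 10 billion parameters.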

Google Gemma Series

Gemma, first released by Google DeepMind in February 2024, is a family of open-weight models built from the same research and technology as Gemini. Gemma 2 (2024) offered 2B, 9B, and 27B variants with strong reasoning and instruction-following capability for their size. Gemma 3 (2025) spans 1B to 27B parameters. Its 4B and larger variants accept image input and support a 128K-token context window, while the 1B variant is text-only with a 32K-token window.

Meta Llama 3.2

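Llama 3.2, released in September 2024, included 1B and 3B parameter text models designed for mobile and edge deployment, alongside larger 11B and 90B vision models. The small models were produced by pruning and distilling larger Llama 3.1 models, which lets them retain much of the larger models' quality on tasks such as summarisation and instruction following. Meta released them under the Llama 3.2 Community License, which permits commercial use subject to conditions, enabling wide deployment.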

Qwen and Other Chinese SLMs

Alibaba's Qwen series has produced competitive small models, including Qwen2-1.5B and Qwen2-7B, whose multilingual training data covers Chinese, Malay, Indonesian, and other Southeast Asian languages. This makes them particularly relevant for regional deployment.

Training Techniques for Small Models

Several techniques enable small models to achieve quality beyond what raw parameter count would suggest.

Synthetic data training uses AI-generated "textbook" content that is intended to be dense, factually accurate, and pedagogically structured. Microsoft's Phi series relies heavily on this approach, combined with aggressive filtering of web data, to focus learning capacity on signal-rich examples rather than low-quality web noise.

Distillation transfers knowledge from a large teacher model to a smaller student model by training the student to reproduce not just the correct output but the teacher's full probability distribution over outputs. Distillation often improves small-model quality for a modest additional training cost, although it requires running the teacher over the training data.
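
The distillation objective described above can be illustrated with a minimal sketch. It assumes the common formulation in which the student minimises the KL divergence from a temperature-softened teacher distribution, blended with ordinary cross-entropy on the true label. The function names, the temperature of 2.0, and the mixing weight alpha are illustrative assumptions, not taken from any specific model's training recipe.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert raw scores to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    Multiplying by temperature**2 is a common convention that keeps the
    gradient scale comparable across temperatures (an assumption here).
    """
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return kl * temperature ** 2

def combined_loss(student_logits, teacher_logits, target_index,
                  alpha=0.5, temperature=2.0):
    """Blend the soft (teacher) loss with hard cross-entropy on the label."""
    hard = -math.log(softmax(student_logits)[target_index])
    soft = distillation_loss(student_logits, teacher_logits, temperature)
    return alpha * soft + (1 - alpha) * hard

# Toy logits over a three-token vocabulary; the correct token is index 2.
teacher = [0.5, 1.0, 4.0]
good_student = [0.4, 1.1, 3.8]
poor_student = [3.0, 1.0, 0.2]
print(combined_loss(good_student, teacher, target_index=2))
print(combined_loss(poor_student, teacher, target_index=2))
```

A student whose logits track the teacher's receives a lower loss than one that ranks the tokens differently, which is the signal that carries the teacher's knowledge of relative token likelihoods into the smaller model.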

Quantisation reduces the numerical precision of model weights from 32-bit or 16-bit floating point to 8-bit integers or 4-bit representations. A 7B model quantised to 4-bit occupies roughly 3.5 to 4GB of memory, fitting on a single consumer GPU or a high-end smartphone. Runtimes such as llama.cpp (with its GGUF file format) and Apple's Core ML enable quantised SLM inference on a wide range of hardware.
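
The memory arithmetic and the basic idea behind quantisation can be sketched as follows. This is a simplified illustration that assumes symmetric per-tensor 8-bit quantisation; production formats typically quantise in small blocks with a scale per block, and the helper names here are hypothetical rather than part of llama.cpp or Core ML.

```python
def weight_memory_gb(num_params, bits_per_weight):
    """Approximate memory for the weights alone, in gigabytes (10^9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9

def quantize_int8(weights):
    """Symmetric per-tensor quantisation: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [q * scale for q in quantized]

# Weight memory for a 7B-parameter model at different precisions.
for bits in (16, 8, 4):
    print(f"7B weights at {bits}-bit: {weight_memory_gb(7e9, bits):.1f} GB")
# prints 14.0 GB, 7.0 GB and 3.5 GB respectively

# Round-trip a few example weights to see the rounding error.
weights = [0.42, -1.27, 0.05, 0.9]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
print(q, [round(r, 4) for r in restored])
```

The 3.5GB figure covers the 4-bit weights only. Stored scales, any tensors kept at higher precision, and the runtime's activation and cache memory push the practical footprint closer to 4GB or above.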

Instruction tuning and reinforcement learning from human feedback (RLHF) align small models with specific task formats and human preferences, improving usability without adding parameters.
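
As a rough illustration of how an instruction-tuning example might be prepared, the sketch below formats an instruction and response into one sequence and masks the prompt tokens so that only the response contributes to the training loss. The prompt template and the toy tokenizer are hypothetical stand-ins; real models define their own chat templates and tokenizers. The mask value of -100 follows the default `ignore_index` of PyTorch's cross-entropy loss, and masking the prompt is one common convention rather than the only approach.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's CrossEntropyLoss

def format_example(instruction, response):
    """Hypothetical prompt template; real models each define their own."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return prompt, response

def toy_tokenize(text):
    """Stand-in tokenizer: one non-negative integer ID per whitespace-separated word."""
    return [sum(ord(c) for c in word) % 50000 for word in text.split()]

def build_training_pair(instruction, response):
    """Return (input_ids, labels) with prompt positions masked out of the loss."""
    prompt, answer = format_example(instruction, response)
    prompt_ids = toy_tokenize(prompt)
    answer_ids = toy_tokenize(answer)
    input_ids = prompt_ids + answer_ids
    # Only the response tokens carry labels; prompt positions are ignored.
    labels = [IGNORE_INDEX] * len(prompt_ids) + answer_ids
    return input_ids, labels

input_ids, labels = build_training_pair(
    "Classify the sentiment: great phone", "positive"
)
print(len(input_ids), labels)
```

Because the labels line up position by position with the inputs, a training loop can skip every position marked with the ignore value and update the model only on the tokens it is expected to generate.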

Use Cases

On-device personal assistants, offline document processing, embedded quality control in manufacturing, diagnostic support in rural healthcare facilities, and real-time translation on consumer devices are among the most common SLM deployment scenarios. In enterprise settings, SLMs are frequently fine-tuned on domain-specific corpora (legal documents, engineering manuals, or medical records) to create specialised models that can outperform general-purpose large models on narrow tasks.

References

  1. Gunasekar, S., et al. (2023). Textbooks Are All You Need. arXiv:2306.11644.
  2. Abdin, M., et al. (2024). Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone. arXiv:2404.14219.
  3. Google DeepMind. (2025). Gemma 3 Model Card. ai.google.dev/gemma.
  4. Meta AI. (2024). Llama 3.2: Lightweight Models for Mobile and Edge. ai.meta.com.
  5. IBM. (2025). What Are Small Language Models?. ibm.com/think/topics/small-language-models.