Gemma
Gemma is a family of open-weight large language models developed by Google DeepMind, built on similar technology to the Gemini series and available for deployment on hardware ranging from laptops to cloud infrastructure.
Gemma is a family of open-weight large language models released by Google DeepMind beginning in February 2024. Built on the same research and infrastructure that underpins Google's Gemini series, Gemma models are designed to be lightweight enough for deployment on consumer hardware — including laptops and edge devices — while remaining competitive with larger proprietary systems on a range of benchmarks. The models are released with open weights under a custom Gemma Terms of Use licence that permits research and commercial use subject to certain restrictions.
Development and Release History
Google DeepMind introduced the original Gemma in February 2024 with two model sizes: a 2-billion-parameter variant and a 7-billion-parameter variant. Both were offered in pre-trained and instruction-tuned forms. The instruction-tuned versions follow conversational prompts and are suitable for direct use in chat applications, while the pre-trained versions are intended for further fine-tuning on specialised tasks.
Gemma 2 followed in June 2024, initially with 9B and 27B parameter variants, and expanded to include a 2B variant in July 2024. Google claimed that the 27B model outperformed substantially larger open models on several standard evaluation benchmarks.
Gemma 3 debuted in March 2025 with four parameter sizes: 1B, 4B, 12B, and 27B. At launch, Google asserted that Gemma 3 outperformed competing open-source models including DeepSeek-V3 and Llama 3 405B on a subset of reasoning and coding benchmarks.
Gemma 4, released in April 2026, marked a significant expansion of the family's capabilities. Gemma 4 models are natively multimodal, accepting both text and image input and generating text output. The release includes models in four configurations: an effective 2B (E2B) variant, an effective 4B (E4B) variant, a 26B Mixture of Experts (MoE) variant, and a 31B dense model. Gemma 4 models support a context window of up to 256,000 tokens and over 140 languages.
Architecture and Design
Gemma models are based on the transformer decoder architecture with modifications drawn from Google's internal research. Key design choices include the use of multi-query attention to reduce memory bandwidth requirements during inference, rotary positional embeddings (RoPE) for improved length generalisation, and GeGLU activations in the feed-forward layers. These choices collectively improve inference efficiency on consumer-grade GPUs and CPUs.
The Gemma vocabulary is shared with the Gemini model family, enabling straightforward transfer of tokenisation pipelines and embedding initialisations between the two product lines.
Model Variants
The Gemma family includes several specialised derivatives beyond the core instruction-tuned and pre-trained variants. CodeGemma is optimised for code completion and generation tasks and is available in 2B and 7B sizes. PaliGemma is a vision-language variant that combines Gemma language components with a SigLIP vision encoder, enabling image captioning, visual question answering, and object detection. RecurrentGemma experiments with linear recurrent architecture alternatives to full attention for long-context tasks.
Ecosystem and Deployment
Gemma models are supported across the major ML frameworks, including Hugging Face Transformers, JAX, PyTorch, and TensorFlow. Google provides optimised inference kernels for its own hardware (TPUs) as well as for NVIDIA GPUs. The models can be run locally using tools such as Ollama and LM Studio, making them accessible to individual developers without cloud API costs.
Performance and Benchmarks
A central claim for the Gemma series has been strong performance relative to parameter count — sometimes described as being competitive with models two to four times larger. Gemma 3 27B, for instance, was reported to score comparably to models in the 70B range on the MMLU (Massive Multitask Language Understanding) benchmark and outperform certain 70B models on mathematical reasoning evaluations. These results reflect both architectural refinements and the quality of Google DeepMind's training data curation and filtering processes.
References
- Google DeepMind. (2024). Gemma: Introducing new state-of-the-art open models. Google Blog.
- Google DeepMind. (2025). Gemma 3 model card. Google AI for Developers.
- Google DeepMind. (2026). Gemma 4: Byte for byte, the most capable open models. Google Blog.
- Team, G. et al. (2024). Gemma: Open Models Based on Gemini Research and Technology. Google DeepMind Technical Report.
- Wikipedia. (2026). Gemma (language model). Wikimedia Foundation.