Grouped-Query Attention (GQA)
An attention variant that partitions query heads into groups sharing a single set of keys and values, reducing memory bandwidth during inference while retaining most of the quality of full multi-head attention.
Grouped-Query Attention, abbreviated GQA, is a variant of the attention mechanism used in transformer models that reduces the memory and bandwidth cost of inference while preserving most of the accuracy of standard attention. It was introduced by Google researchers in the May 2023 paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints." GQA partitions the query heads of a transformer into several groups, with each group sharing a single set of key and value projections. This strikes a balance between two earlier designs: multi-head attention, in which every head has its own keys and values, and multi-query attention, in which all heads share one set.
The problem GQA solves
In standard multi-head attention, each attention head has its own query, key and value projections. During autoregressive generation, the keys and values for all previous tokens are stored in a structure called the key-value cache, or KV cache, so they do not have to be recomputed at each step. The size of this cache grows linearly with the number of layers, the number of key-value heads, the head dimension, the sequence length and the batch size. The whole cache must be read from memory at every generation step, which often becomes the dominant bottleneck for large models producing long outputs, because decoding is limited by memory bandwidth rather than raw computation.
Multi-query attention, proposed earlier by Shazeer (2019), addresses this by having all query heads share a single set of keys and values. This shrinks the KV cache by a factor equal to the number of heads and speeds up generation, but the aggressive sharing can reduce model quality and make training less stable.
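The cache size described above can be estimated with simple arithmetic. The following is a rough sketch, not a measurement of any particular model: the function name `kv_cache_bytes`, its parameters and the example dimensions are all illustrative assumptions, and the estimate also depends on the number of layers, the per-head dimension and the numeric precision.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, batch_size,
                   bytes_per_value=2):
    """Approximate KV cache size in bytes.

    Hypothetical helper: counts one key vector and one value vector per
    layer, per key-value head, per token, per sequence in the batch.
    """
    keys_and_values = 2  # both keys and values are cached
    return (keys_and_values * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Assumed example: 32 layers, 32 heads of dimension 128, 16-bit values,
# one sequence of 4096 tokens. In multi-head attention the number of
# key-value heads equals the number of query heads.
mha_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128,
                           seq_len=4096, batch_size=1)
print(f"{mha_bytes / 2**30:.1f} GiB")  # prints "2.0 GiB"
```

Every factor in the product is linear, so doubling the context length or the batch size doubles the amount of data that has to be read at each decoding step.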
The GQA approach
Grouped-query attention interpolates between the two extremes. Instead of one shared key-value set for every head, or a separate set per head, the query heads are divided into a small number of equal-sized groups, and each group shares one key-value set. If the number of groups equals the number of query heads, GQA reduces to ordinary multi-head attention; if there is only one group, it reduces to multi-query attention. A common configuration uses eight key-value groups, which captures most of the memory savings while keeping accuracy close to full multi-head attention.
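The grouping can be illustrated with a minimal sketch of one decoding step in plain Python. It assumes standard scaled dot-product attention and assigns consecutive query heads to the same group; the function names, argument layout and toy values are hypothetical, and a real implementation would use batched tensor operations rather than nested lists.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def gqa_decode_step(queries, cached_keys, cached_values):
    """One decoding step of grouped-query attention (illustrative sketch).

    queries:       [num_query_heads][head_dim]
    cached_keys:   [num_kv_groups][seq_len][head_dim]
    cached_values: [num_kv_groups][seq_len][head_dim]
    Returns one output vector per query head.
    """
    num_query_heads = len(queries)
    num_kv_groups = len(cached_keys)
    assert num_query_heads % num_kv_groups == 0
    heads_per_group = num_query_heads // num_kv_groups
    head_dim = len(queries[0])

    outputs = []
    for head, q in enumerate(queries):
        # Assumption: consecutive query heads share a key-value group.
        group = head // heads_per_group
        keys, values = cached_keys[group], cached_values[group]
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(head_dim)
                  for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, values))
                        for d in range(head_dim)])
    return outputs


# Toy example: 4 query heads, 2 key-value groups, 2 cached tokens.
queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
cached_keys = [[[1.0, 0.0], [0.0, 1.0]],
               [[0.5, 0.5], [0.5, 0.5]]]
cached_values = [[[1.0, 2.0], [3.0, 4.0]],
                 [[10.0, 20.0], [30.0, 40.0]]]
outputs = gqa_decode_step(queries, cached_keys, cached_values)
```

Passing one key-value group per query head gives multi-head attention, and passing a single group gives multi-query attention, so the same function covers all three schemes.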
The table summarises the three schemes.
| Scheme | Key-value sets | KV cache size | Quality |
| --- | --- | --- | --- |
| Multi-head attention | One per head | Largest | Highest |
| Grouped-query attention | One per group | Moderate | Near multi-head |
| Multi-query attention | One shared | Smallest | Slightly reduced |
An appealing practical detail is that a GQA model can be created by "uptraining" an existing multi-head checkpoint. The key and value projection matrices of the heads in each group are mean-pooled into a single key head and a single value head, and the model is then trained for a short additional period (about 5% of the original pre-training compute in the GQA paper) to recover quality, avoiding the cost of training from scratch.
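The averaging step can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the helper name `mean_pool_heads` is hypothetical, each head's projection is represented as a small nested list, and consecutive heads are assumed to form a group. The same pooling would be applied separately to the key projections and the value projections.

```python
def mean_pool_heads(head_projections, num_groups):
    """Average per-head projection matrices within each group.

    head_projections: [num_heads][rows][cols], one matrix per original head.
    Returns [num_groups][rows][cols], one pooled matrix per group.
    """
    num_heads = len(head_projections)
    assert num_heads % num_groups == 0
    group_size = num_heads // num_groups
    rows = len(head_projections[0])
    cols = len(head_projections[0][0])

    pooled = []
    for g in range(num_groups):
        # Assumption: heads g*group_size .. (g+1)*group_size - 1 form group g.
        members = head_projections[g * group_size:(g + 1) * group_size]
        pooled.append([[sum(m[r][c] for m in members) / group_size
                        for c in range(cols)]
                       for r in range(rows)])
    return pooled


# Toy example: four heads with 1x2 key projections, pooled into two groups.
key_heads = [[[1.0, 2.0]], [[3.0, 4.0]], [[5.0, 6.0]], [[7.0, 8.0]]]
grouped = mean_pool_heads(key_heads, num_groups=2)
print(grouped)  # [[[2.0, 3.0]], [[6.0, 7.0]]]
```

The pooled matrices only initialise the grouped model; the additional training described above is still needed to recover quality.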
Impact and adoption
GQA has become a standard component of large language model design. Meta adopted it for the larger Llama 2 models in July 2023 and retained it across the Llama 3 family released in 2024. Mistral, Qwen and many other open-weight models use GQA as well. Its widespread use reflects a broader trend in which architectural choices are increasingly driven by the economics of inference serving, since a smaller KV cache allows longer context windows, larger batch sizes and lower serving cost per request.
Because GQA reduces the memory footprint of each request, it interacts closely with serving systems that use paged KV cache management and continuous batching to maximise the number of concurrent users a single accelerator can support.
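The effect on serving capacity can be made concrete with a back-of-the-envelope calculation. The sketch below is illustrative only: the function name, the 40 GiB cache budget and the model dimensions are assumptions, and it ignores memory taken by weights and activations as well as allocator overhead.

```python
def max_concurrent_sequences(cache_budget_bytes, num_layers, num_kv_heads,
                             head_dim, seq_len, bytes_per_value=2):
    """How many full-length sequences fit in a given KV cache budget.

    Hypothetical helper: each token stores one key and one value vector
    per layer and per key-value head.
    """
    bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    bytes_per_sequence = bytes_per_token * seq_len
    return cache_budget_bytes // bytes_per_sequence


# Assumed setup: 80 layers, 64 query heads of dimension 128, 16-bit values,
# 4096-token sequences and 40 GiB reserved for the KV cache.
budget = 40 * 2**30
capacity = {}
for scheme, kv_heads in [("multi-head", 64),
                         ("grouped-query", 8),
                         ("multi-query", 1)]:
    capacity[scheme] = max_concurrent_sequences(
        budget, num_layers=80, num_kv_heads=kv_heads,
        head_dim=128, seq_len=4096)

print(capacity)
# {'multi-head': 4, 'grouped-query': 32, 'multi-query': 256}
```

Under these assumptions, moving from 64 key-value heads to 8 lets the same memory budget hold eight times as many sequences, which is the kind of headroom that batching-oriented serving systems can exploit.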
References
- Ainslie, J. et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. arXiv:2305.13245.
- Shazeer, N. (2019). Fast Transformer Decoding: One Write-Head is All You Need. arXiv:1911.02150.
- IBM. (2024). What is grouped query attention (GQA)? IBM Think Topics.