Mamba (Structured State Space Model)
Mamba is a selective state space model architecture that achieves linear-time sequence modelling, offering a computationally efficient alternative to the Transformer for long-context tasks.
Mamba is a deep learning architecture based on selective state space models (SSMs) that processes sequential data with linear computational complexity relative to sequence length. Introduced in December 2023 by Albert Gu and Tri Dao in the paper Mamba: Linear-Time Sequence Modeling with Selective State Spaces, it emerged as one of the most significant architectural alternatives to the Transformer since the original attention-based model was proposed in 2017. Unlike the quadratic attention mechanism of Transformers, Mamba scales efficiently to very long sequences, making it attractive for tasks involving genomics, audio, and long-document language modelling.
Background: State Space Models
State space models originate from control theory and signal processing, where they describe how a system evolves over time given a sequence of inputs. In the context of deep learning, a structured SSM maps an input sequence $x(t)$ to an output sequence $y(t)$ through a latent hidden state $h(t)$. The core equations for a continuous-time SSM are:

$$h'(t) = A\,h(t) + B\,x(t), \qquad y(t) = C\,h(t)$$

where $A$, $B$, and $C$ are learnable matrices. Because these parameters are the same at every time step, such an SSM can be unrolled as a convolution during training for parallelism, and run as a recurrence during inference for constant-memory generation. Earlier SSM architectures such as S4 (Structured State Space sequence model) applied these ideas to deep learning but struggled to match Transformer performance on language tasks because they treated every token identically: the model could not selectively focus on or ignore specific inputs.
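The equivalence between the recurrent and convolutional views of a fixed-parameter SSM can be checked numerically. The following is a minimal sketch, not the S4 implementation: it assumes a diagonal state matrix, a single input channel, and a zero-order-hold discretisation with step size `delta`. The function names and toy values are illustrative assumptions.

```python
import math

def discretize_zoh(a, b, delta):
    """Zero-order-hold discretisation of a diagonal SSM (assumed scheme).

    a, b hold the diagonal of A and the entries of B; every a[i] must be nonzero.
    """
    a_bar = [math.exp(delta * ai) for ai in a]
    b_bar = [(math.exp(delta * ai) - 1.0) / ai * bi for ai, bi in zip(a, b)]
    return a_bar, b_bar

def ssm_recurrence(a_bar, b_bar, c, xs):
    """Inference view: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = [0.0] * len(a_bar)
    ys = []
    for x in xs:
        h = [ab * hi + bb * x for ab, bb, hi in zip(a_bar, b_bar, h)]
        ys.append(sum(ci * hi for ci, hi in zip(c, h)))
    return ys

def ssm_convolution(a_bar, b_bar, c, xs):
    """Training view: y = K * x with kernel K[k] = C A_bar^k B_bar."""
    length = len(xs)
    kernel = [
        sum(ci * ab ** k * bb for ci, ab, bb in zip(c, a_bar, b_bar))
        for k in range(length)
    ]
    # Causal convolution: output t only sees inputs 0..t.
    return [sum(kernel[k] * xs[t - k] for k in range(t + 1)) for t in range(length)]

# Toy parameters, chosen only for illustration.
a, b, c = [-1.0, -0.5], [1.0, 1.0], [0.5, -0.25]
a_bar, b_bar = discretize_zoh(a, b, delta=0.1)
xs = [1.0, 0.0, 2.0, -1.0]
y_rec = ssm_recurrence(a_bar, b_bar, c, xs)
y_conv = ssm_convolution(a_bar, b_bar, c, xs)
```

Both functions return the same sequence up to floating-point rounding. The convolution kernel can only be precomputed because `a_bar`, `b_bar`, and `c` do not change from one token to the next.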
The Mamba Innovation: Selective State Spaces
The defining contribution of Mamba is input-dependent SSM parameters. Rather than holding the input and output matrices $B$ and $C$ and the discretisation step $\Delta$ fixed across all tokens, Mamba makes these parameters functions of the current input $x_t$. This mechanism, termed selective state spaces, allows the model to retain relevant context and discard irrelevant information at each step. The selection mechanism is analogous to what attention provides in Transformers, namely content-aware routing of information, but without the quadratic cost of computing pairwise token similarities.
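The selection mechanism can be illustrated with a single-channel toy recurrence. This is a sketch under stated assumptions rather than the reference implementation: the scalar weights `w_b`, `w_c`, `w_delta`, and `bias_delta` are hypothetical stand-ins for the learned projections, the state matrix is taken to be diagonal, and the input matrix is discretised with the simplified rule `delta * B_t`.

```python
import math

def softplus(z):
    """Smooth positive map, used here to keep the step size delta > 0."""
    return math.log1p(math.exp(z))

def selective_ssm(xs, a, w_b, w_c, w_delta, bias_delta):
    """Run a single-channel selective SSM over the scalar sequence xs.

    a                  : diagonal of A (negative entries, shared by all tokens)
    w_b, w_c           : hypothetical weights that produce B_t and C_t from x_t
    w_delta, bias_delta: hypothetical weights that produce delta_t from x_t
    """
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        # Input-dependent parameters: recomputed for every token.
        delta = softplus(w_delta * x + bias_delta)
        b_t = [w * x for w in w_b]
        c_t = [w * x for w in w_c]
        # Per-token discretisation; a large delta forgets more of the old state.
        a_bar = [math.exp(delta * ai) for ai in a]
        h = [ab * hi + delta * bi * x for ab, hi, bi in zip(a_bar, h, b_t)]
        ys.append(sum(ci * hi for ci, hi in zip(c_t, h)))
    return ys

ys = selective_ssm(
    [1.0, 0.0, -0.5, 2.0],
    a=[-1.0, -2.0], w_b=[1.0, 0.5], w_c=[0.3, -0.2],
    w_delta=1.0, bias_delta=0.0,
)
```

Because `a_bar`, `b_t`, and `c_t` now differ from token to token, the output can no longer be written as a convolution with one fixed kernel, so the computation has to proceed as a scan over the sequence.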
Because input-dependent parameters break the time invariance that the convolutional form relies on, the architecture pairs the selective SSM with a hardware-aware parallel scan algorithm that avoids materialising large intermediate tensors, enabling efficient GPU execution despite the recurrent structure.
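A parallel scan is possible because the recurrence $h_t = a_t h_{t-1} + b_t$ can be expressed with an associative combine operation on pairs $(a_t, b_t)$. The sketch below shows that idea with a simple recursive-doubling scan in plain Python. It illustrates the algebra only and is not the fused GPU kernel; the function names are illustrative assumptions.

```python
def combine(left, right):
    """Compose two affine steps h -> a*h + b, applying `left` first."""
    a1, b1 = left
    a2, b2 = right
    return (a1 * a2, a2 * b1 + b2)

def sequential_scan(a, b):
    """Reference: evaluate h_t = a_t * h_{t-1} + b_t one step at a time."""
    h, out = 0.0, []
    for a_t, b_t in zip(a, b):
        h = a_t * h + b_t
        out.append(h)
    return out

def doubling_scan(a, b):
    """Inclusive scan in O(log n) rounds.

    Within a round every position is updated independently of the others,
    which is what a parallel device would exploit.
    """
    elems = list(zip(a, b))
    n = len(elems)
    step = 1
    while step < n:
        elems = [
            combine(elems[t - step], elems[t]) if t >= step else elems[t]
            for t in range(n)
        ]
        step *= 2
    # With h_{-1} = 0, the state h_t is the offset of the composed step.
    return [b_t for _, b_t in elems]

a = [0.5, 0.9, 0.1, 0.7, 0.3]
b = [1.0, 2.0, 3.0, 4.0, 5.0]
h_seq = sequential_scan(a, b)
h_par = doubling_scan(a, b)
```

The two results agree up to rounding. A work-efficient variant would perform fewer total combines; the doubling form is used here only because it is short.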
Mamba-2 and Structured State Space Duality
In 2024, Gu and Dao published a follow-up paper introducing Mamba-2, which established a formal connection between structured SSMs and a restricted class of linear attention mechanisms, a result called Structured State Space Duality (SSD). This result allowed the authors to design a simplified SSM layer, the SSD layer, that supports larger state dimensions, has a core computation reported to be 2 to 8 times faster than Mamba-1's selective scan, and integrates more naturally with the tensor-parallel training strategies used for large models.
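The duality can be illustrated on a toy case. The sketch below assumes a scalar per-token decay `a[t]` and a single input channel, and compares the recurrent form with an attention-like form in which `C[t]` plays the role of a query, `B[s]` the role of a key, and a causal decay factor replaces the softmax. Names and values are illustrative assumptions, not the SSD layer itself.

```python
def ssd_recurrent(a, B, C, xs):
    """Recurrent view: h_t = a_t * h_{t-1} + B_t * x_t, y_t = C_t . h_t."""
    n = len(B[0])
    h = [0.0] * n
    ys = []
    for t, x in enumerate(xs):
        h = [a[t] * h[i] + B[t][i] * x for i in range(n)]
        ys.append(sum(C[t][i] * h[i] for i in range(n)))
    return ys

def ssd_attention_like(a, B, C, xs):
    """Dual view: y_t = sum over s <= t of decay(t, s) * (C_t . B_s) * x_s."""
    n = len(B[0])
    ys = []
    for t in range(len(xs)):
        total = 0.0
        for s in range(t + 1):
            # Causal mask entry: product of decays a_{s+1} ... a_t (1.0 when s == t).
            decay = 1.0
            for r in range(s + 1, t + 1):
                decay *= a[r]
            score = sum(C[t][i] * B[s][i] for i in range(n))
            total += decay * score * xs[s]
        ys.append(total)
    return ys

# Toy per-token parameters, chosen only for illustration.
a = [0.9, 0.5, 0.8, 0.7]
B = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [1.0, -1.0]]
C = [[0.2, 0.1], [0.3, -0.4], [1.0, 1.0], [0.5, 0.0]]
xs = [1.0, -2.0, 0.5, 3.0]
y_rec = ssd_recurrent(a, B, C, xs)
y_att = ssd_attention_like(a, B, C, xs)
```

The recurrent form costs time linear in sequence length, while the attention-like form is quadratic but consists of dense products that map well onto matrix-multiplication hardware.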
Performance Characteristics
The original paper reports up to 5 times higher inference throughput than Transformer models of similar size at long sequence lengths. In terms of language modelling perplexity, a Mamba model at roughly 3 billion parameters is reported to match a Transformer of about twice its size (roughly 6 billion parameters), while reportedly being approximately 40 percent cheaper to run. These gains become more pronounced as sequence length grows, because a Transformer's attention compute grows quadratically and its KV cache grows linearly with sequence length, whereas Mamba's compute grows linearly and its inference state stays a fixed size.
| Property | Transformer | Mamba |
|---|---|---|
| Attention complexity | O(n^2) | O(n) |
| Memory at inference | O(n) KV cache | O(state size), fixed |
| Parallelism in training | Full | Via parallel scan |
| Content-aware routing | Yes (attention) | Yes (selective SSM) |
| Positional encoding needed | Yes | No |
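The memory row of the table can be made concrete with a back-of-the-envelope calculation. The layer count, widths, state size, and 2-byte precision below are hypothetical values chosen for illustration and do not describe any released model. The sketch also ignores details such as grouped-query attention and Mamba's small convolution buffer.

```python
def kv_cache_bytes(seq_len, n_layers, d_model, bytes_per_value=2):
    """Keys and values: two (seq_len, d_model) tensors cached per layer."""
    return 2 * n_layers * seq_len * d_model * bytes_per_value

def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=2):
    """One (d_inner, d_state) recurrent state per layer, independent of seq_len."""
    return n_layers * d_inner * d_state * bytes_per_value

# Hypothetical model dimensions (assumptions for illustration only).
N_LAYERS, D_MODEL = 32, 2560
D_INNER, D_STATE = 2 * D_MODEL, 16

for n in (1_024, 16_384, 262_144):
    kv = kv_cache_bytes(n, N_LAYERS, D_MODEL)
    ssm = ssm_state_bytes(N_LAYERS, D_INNER, D_STATE)
    print(f"n={n:>7}: KV cache {kv / 2**20:10.1f} MiB | SSM state {ssm / 2**20:6.1f} MiB")
```

Under these assumptions the cache grows in proportion to the number of tokens generated, while the recurrent state is the same size at every sequence length.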
Hybrid Architectures
Following Mamba's release, several research groups proposed hybrid architectures that interleave Mamba layers with Transformer attention layers. Models such as Jamba (from AI21 Labs) and Zamba (from Zyphra) combine the long-range efficiency of SSM layers with the associative recall strengths of attention layers. These hybrids have been reported to outperform pure Mamba or pure Transformer models of equivalent parameter count, suggesting that the two mechanisms are complementary rather than mutually exclusive.
Applications Beyond Language
The linear-time property of Mamba makes it especially valuable in domains with very long sequences. In genomics, sequences of DNA bases can extend to millions of tokens; the Caduceus model applied Mamba to DNA language modelling. In audio, SSMs have modelled raw waveforms at high sample rates without the memory bottlenecks that constrain Transformer-based audio models. Vision applications include video understanding, where frame-level tokens accumulate rapidly.
References
- Gu, A., and Dao, T. (2023). Mamba: Linear-Time Sequence Modeling with Selective State Spaces. arXiv:2312.00752.
- Dao, T., and Gu, A. (2024). Transformers are SSMs: Generalized Models and Efficient Algorithms through Structured State Space Duality. arXiv:2405.21060.
- Gu, A., Goel, K., and Ré, C. (2021). Efficiently Modeling Long Sequences with Structured State Spaces. arXiv:2111.00396.
- MindStudio. (2025). What Is Mamba 3? The State Space Model Architecture That Challenges Transformers. MindStudio Blog.