MambAttention: Hybrid SSM & Attention

This presentation explores MambAttention, a hybrid architecture that combines Mamba-style selective state space models (SSMs) with attention mechanisms. We examine how this fusion achieves linear-time complexity while outperforming pure Transformer architectures across speech enhancement, time series forecasting, and vision tasks, with particular emphasis on the architectural choices that improve out-of-domain generalization and computational efficiency.
Script
Transformers dominate modern AI, but the cost of self-attention grows quadratically with sequence length, which makes them prohibitively expensive for long sequences. MambAttention addresses this by fusing Mamba's linear-time state space modeling with selectively placed attention mechanisms, aiming for both efficiency and expressiveness in a single architecture.
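To make the scaling gap concrete, here is a rough back-of-the-envelope sketch. It counts only the multiply-adds needed to form the attention score matrix versus one recurrent state update per step, and it ignores constants, projections, and hardware effects. The function names and the state size of 16 are illustrative assumptions, not figures from the work discussed here.

```python
def attention_score_ops(seq_len: int, dim: int) -> int:
    """Multiply-adds to form the (seq_len x seq_len) score matrix Q K^T."""
    return seq_len * seq_len * dim


def ssm_scan_ops(seq_len: int, dim: int, state_size: int) -> int:
    """Multiply-adds for one state update per time step and channel."""
    return seq_len * dim * state_size


# Hypothetical sizes chosen only to show the trend.
for seq_len in (1_000, 10_000, 100_000):
    attn = attention_score_ops(seq_len, dim=64)
    scan = ssm_scan_ops(seq_len, dim=64, state_size=16)
    print(f"L={seq_len:>7}: attention={attn:.2e}  scan={scan:.2e}  ratio={attn / scan:.1f}")
```

Doubling the sequence length quadruples the attention count but only doubles the scan count, which is the gap the hybrid design tries to exploit.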
The architecture rests on three pillars. Mamba blocks handle sequence modeling through state-dependent recurrences that scale linearly. Attention modules are applied selectively, targeting time, frequency, or spatial dimensions depending on the task. The benefit comes from interleaving the two: the SSM provides efficient global memory, while attention adds expressive, context-aware weighting where it matters most.
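The interleaving idea can be sketched in a few lines of NumPy. This is a heavily simplified illustration, not the published implementation: it uses single-head attention, a plain Python loop instead of a hardware-efficient scan, and no normalization, gating, or convolution. All names (`selective_scan`, `hybrid_block`, `W_B`, `W_C`, `W_dt`) and the attention-then-scan ordering are assumptions made for the example.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x, Wq, Wk, Wv):
    """Single-head, non-causal scaled dot-product attention. x: (L, d)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def selective_scan(x, A, W_B, W_C, W_dt):
    """Simplified selective SSM: step size, B, and C depend on the current input.

    x: (L, d); A: (d, n) with negative entries; W_B, W_C: (d, n); W_dt: (d, d).
    Cost is one state update per time step, so it grows linearly in L.
    """
    seq_len, d = x.shape
    h = np.zeros((d, A.shape[1]))            # hidden state, one row per channel
    y = np.empty_like(x)
    for t in range(seq_len):
        dt = np.log1p(np.exp(x[t] @ W_dt))   # softplus step size, shape (d,)
        B = x[t] @ W_B                       # input-dependent input map, (n,)
        C = x[t] @ W_C                       # input-dependent readout, (n,)
        decay = np.exp(dt[:, None] * A)      # per-channel state decay, (d, n)
        h = decay * h + (dt[:, None] * B[None, :]) * x[t][:, None]
        y[t] = h @ C
    return y


def hybrid_block(x, attn_params, ssm_params):
    """Residual attention followed by a residual selective scan."""
    x = x + self_attention(x, *attn_params)
    x = x + selective_scan(x, *ssm_params)
    return x


rng = np.random.default_rng(0)
seq_len, d, n = 32, 8, 4
x = rng.normal(size=(seq_len, d))
attn_params = tuple(rng.normal(size=(d, d)) * 0.1 for _ in range(3))
ssm_params = (
    -np.exp(rng.normal(size=(d, n))),        # A < 0 keeps the state stable
    rng.normal(size=(d, n)) * 0.1,           # W_B
    rng.normal(size=(d, n)) * 0.1,           # W_C
    rng.normal(size=(d, d)) * 0.1,           # W_dt
)
out = hybrid_block(x, attn_params, ssm_params)
print(out.shape)
```

Stacking several such blocks alternates the two mechanisms: the scan carries a compressed running memory forward, and the attention step lets every position reweight the others directly.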
Let's see how this plays out in one of the most challenging real-world domains: cleaning up noisy speech.
For speech enhancement, MambAttention applies attention in both time and frequency, reshaping the input to alternate between these views. The critical insight: sharing weights between time and frequency attention heads acts as a powerful regularizer. Trained on the difficult VB-DemandEx dataset, the model achieves state-of-the-art results on completely unseen test sets, outperforming Conformer by 1.5 decibels on DNS 2020 and reaching the highest reported PESQ scores on EARS-WHAM_v2.
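A minimal sketch of the alternating time and frequency views with shared attention weights might look like the following. It assumes a spectrogram-like feature tensor of shape (time, frequency, channels) and uses single-head attention for brevity; the function names and tensor layout are illustrative assumptions rather than the model's actual code.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attend(x, Wq, Wk, Wv):
    """Batched single-head attention. x: (batch, seq, channels)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ np.swapaxes(k, -1, -2) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def time_frequency_attention(spec, Wq, Wk, Wv):
    """Apply the SAME attention weights along time, then along frequency.

    spec: (T, F, C). Only one set of (Wq, Wk, Wv) exists, so both views
    must be served by the same parameters.
    """
    # Time view: each frequency bin is a batch item, sequence axis is time.
    x = np.swapaxes(spec, 0, 1)              # (F, T, C)
    x = x + attend(x, Wq, Wk, Wv)
    # Frequency view: each time frame is a batch item, sequence axis is frequency.
    x = np.swapaxes(x, 0, 1)                 # (T, F, C)
    x = x + attend(x, Wq, Wk, Wv)
    return x


rng = np.random.default_rng(0)
T, F, C = 6, 5, 4
spec = rng.normal(size=(T, F, C))
Wq, Wk, Wv = (rng.normal(size=(C, C)) * 0.1 for _ in range(3))
enhanced = time_frequency_attention(spec, Wq, Wk, Wv)
print(enhanced.shape)
```

Because one parameter set has to work for both views, the module cannot specialize to quirks of either axis, which is one intuition for why sharing could act as a regularizer.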
The same principles scale across modalities. In time series forecasting, adding fast-attention to channel-independent Mamba modeling enables the model to learn inter-variable relationships, slashing prediction error by 26 percent on benchmarks like traffic and electricity demand. For video generation, Matten achieves a 56 percent reduction in distortion compared to transformer baselines while maintaining linear complexity. In vision, StableMamba demonstrates that interleaving attention with SSM blocks overcomes the scalability ceiling of pure Mamba architectures, delivering measurable accuracy gains on ImageNet without requiring knowledge distillation.
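For the forecasting case, the following toy sketch shows why an attention step across variables matters. It is not the fast-attention mechanism itself: ordinary softmax attention is swapped in for clarity, and a shared `tanh` projection stands in for the channel-independent Mamba encoder. All names and sizes are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def embed_per_variable(series, W_in):
    """Channel-independent step: each variable is encoded on its own.

    series: (V, L) with V variables and a lookback window of L steps.
    """
    return np.tanh(series @ W_in)            # (V, d)


def mix_variables(tokens, Wq, Wk, Wv):
    """Attention over the variable axis: each variable can read the others."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return tokens + softmax(scores) @ v


rng = np.random.default_rng(0)
V, L, d = 4, 24, 8
series = rng.normal(size=(V, L))
W_in = rng.normal(size=(L, d)) * 0.1
Wq, Wk, Wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

independent = embed_per_variable(series, W_in)
mixed = mix_variables(independent, Wq, Wk, Wv)

# Perturb only variable 0 and see which representations react.
perturbed = series.copy()
perturbed[0] += 1.0
independent_p = embed_per_variable(perturbed, W_in)
mixed_p = mix_variables(independent_p, Wq, Wk, Wv)

print("other variables change without mixing:",
      not np.allclose(independent[1:], independent_p[1:]))
print("other variables change with mixing:",
      not np.allclose(mixed[1:], mixed_p[1:]))
```

Without the mixing step, a change in one variable is invisible to the representations of the others; with it, that information propagates, which is the kind of inter-variable relationship the script refers to.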
Why does interleaving work so well? Ablation studies reveal that placing attention before Mamba blocks consistently improves out-of-domain performance. In vision, pure SSM stacks suffer from training instability and poor gradient flow at scale; interleaving attention layers regularizes the spectral properties of the network and stabilizes optimization. The result is an architecture that matches or exceeds transformer expressiveness while retaining the linear complexity and parameter efficiency of state space models.
MambAttention demonstrates that hybrid architectures can escape the efficiency-expressiveness tradeoff, achieving state-of-the-art results across speech, time series, and vision with linear complexity. To explore this work in depth and create your own research videos, visit EmergentMind.com.