Mamba: Selective State Space Models

This lightning talk introduces Mamba, a selective state space model architecture that provides a linear-complexity alternative to Transformers for deep learning. We explore how Mamba achieves efficient long-sequence processing through input-dependent dynamics and hardware-optimized parallel scanning, then examine its rapid adoption across time series, vision, tabular data, and scientific computing with concrete performance gains.
Script
Transformers changed deep learning, but their attention mechanism costs quadratic time and memory in sequence length. That constraint means processing a high-resolution image or a million-token document becomes prohibitively expensive. Mamba offers a fundamentally different path: linear complexity with input-dependent state propagation.
Mamba builds on discrete state space models but makes them selective. The matrices B and C, which control how input enters the state and how the state produces output, become functions of the input itself, as does the discretization step size. This selectivity lets the model decide what to remember and what to forget at every step, propagating global context with linear cost instead of quadratic.
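To make the mechanism concrete, here is a minimal sequential sketch of one channel of a selective scan. The weight names (w_B, w_C, w_dt) and the simplified Euler-style input discretization are illustrative assumptions, not the official Mamba parameterization, which adds gating, convolution, and channel mixing around this core.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_scan(x, A, w_B, w_C, w_dt, b_dt):
    """Sequential reference scan for one channel of a selective SSM.

    x         : (T,) input sequence
    A         : (n,) diagonal of the fixed (non-selective) state matrix
    w_B, w_C  : (n,) projections making B_t and C_t depend on the input
    w_dt, b_dt: scalars making the step size depend on the input
    """
    T, n = x.shape[0], A.shape[0]
    h = np.zeros(n)
    y = np.empty(T)
    for t in range(T):
        # Selectivity: the step size, B, and C are all functions of x_t.
        dt = softplus(w_dt * x[t] + b_dt)   # input-dependent positive step
        B_t = w_B * x[t]
        C_t = w_C * x[t]
        A_bar = np.exp(dt * A)              # ZOH discretization of diagonal A
        B_bar = dt * B_t                    # simplified input discretization
        h = A_bar * h + B_bar * x[t]        # remember/forget via A_bar gate
        y[t] = C_t @ h                      # input-dependent readout
    return y
```

Because A_bar depends on the current input through dt, the state can decay quickly on irrelevant tokens and persist on important ones, which is precisely the remember/forget behavior described above.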
That selectivity would mean nothing without practical speed.
Transformers achieve global context through all-to-all attention, but that explodes in cost. Mamba recurrently accumulates information in a hidden state, scanning the sequence in parallel with kernel fusion that keeps intermediate results in fast on-chip memory. The result is true linear throughput even for million-token inputs.
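The parallelism rests on a simple algebraic fact: the recurrence h_t = a_t * h_{t-1} + b_t composes associatively, so it can be evaluated as a prefix scan rather than a strictly serial loop. The sketch below demonstrates the idea with a Hillis-Steele-style scan; it is a didactic assumption-laden stand-in, since the real Mamba kernel is a fused O(T)-work scan that keeps the recomputed state in on-chip SRAM.

```python
import numpy as np

def combine(left, right):
    """Associative combine for the recurrence h = a * h_prev + b.

    Applying (a1, b1) then (a2, b2) gives h = a2*(a1*h + b1) + b2
    = (a2*a1)*h + (a2*b1 + b2), which has the same form again --
    this closure under composition is what permits a parallel scan.
    """
    a1, b1 = left
    a2, b2 = right
    return (a2 * a1, a2 * b1 + b2)

def parallel_scan(a, b):
    """Inclusive scan over the recurrence (O(T log T) work, O(log T) depth)."""
    elems = list(zip(a, b))
    T = len(elems)
    step = 1
    while step < T:
        # Backward iteration so elems[i - step] is still the previous round.
        for i in range(T - 1, step - 1, -1):
            elems[i] = combine(elems[i - step], elems[i])
        step *= 2
    # With h_{-1} = 0, the accumulated b-component is exactly h_t.
    return np.array([e[1] for e in elems])
```

Each round doubles the span of history folded into every position, so log2(T) rounds suffice; on a GPU the inner loop runs as one parallel step per round instead of T sequential updates.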
The efficiency translates to measurable wins. In time series forecasting, S-Mamba surpasses strong Transformer baselines on periodic, high-dimensional datasets while cutting GPU memory. In vision, locally bidirectional Mamba improves ImageNet accuracy and semantic segmentation scores with no throughput penalty. For optical flow, MambaFlow delivers state-of-the-art precision at real-time speed.
Mamba is not without trade-offs. Converting images or 3D point clouds into sequences can sacrifice spatial coherence, and bidirectional or zigzag scanning patterns introduce gradient dynamics that require careful tuning. Researchers are exploring hybrid models that pair Mamba layers with selective attention, balancing efficiency and fine-grained context.
Mamba demonstrates that recurrence, selectivity, and hardware awareness together can challenge the Transformer paradigm. As models scale to longer contexts and higher dimensions, linear complexity is not just convenient—it is essential. Visit EmergentMind.com to explore more cutting-edge research and create your own videos.