Mamba-3: Optimized Sequence Modeling

Updated 2 July 2026

Mamba-3 is a sequence modeling architecture that leverages exponential-trapezoidal discretization and complex state updates to enhance both expressivity and inference efficiency for long-context tasks.
It replaces Transformer self-attention with a linear-compute, constant-memory SSM layer using MIMO formulations and a RoPE trick, significantly boosting hardware utilization.
Empirical results show Mamba-3 outperforms prior models in language modeling, state tracking, and 3D medical segmentation with lower latency and scalable throughput.

Mamba-3 is a sequence modeling architecture rooted in state space model (SSM) principles, designed to reconcile the trade-offs between sequence modeling quality and inference-time efficiency. Its innovations address the computational and modeling limitations of both classical Transformer architectures and earlier linear SSM-inspired models, establishing a new Pareto frontier for throughput and accuracy in both natural language and high-dimensional data domains (Lahoti et al., 16 Mar 2026).

1. Foundations and Rationale

Mamba-3 is motivated by the high inference cost of modern LLMs, particularly the $O(T^2)$ time and $O(T)$ memory complexity of Transformer self-attention (where $T$ is the input sequence length). Mamba-3 adheres to an “inference-first” principle, eschewing self-attention in favor of a linear-compute, constant-memory, hardware-optimized mixer. This is achieved by extending the state space modeling approach used in prior variants (e.g., Mamba-2, Gated DeltaNet), with the goal of improving modeling expressivity—especially for state-tracking and long-context tasks—while saturating hardware utilization during autoregressive decoding (Lahoti et al., 16 Mar 2026).

2. Core Innovations in State Space Modeling

2.1 Exponential-Trapezoidal Discretization

Mamba-3 generalizes the SSM discrete recurrence using an exponential-trapezoidal scheme. Given the ODE

$\dot h(t) = A(t) h(t) + B(t) x(t),\quad y(t)=C(t)^\top h(t),$

the model discretizes the interval $\Delta_t$ between inputs via a learnable $\lambda_t$-weighted two-point trapezoidal rule:

$h_t = e^{\Delta_t A_t} h_{t-1} + (1-\lambda_t)\Delta_t e^{\Delta_tA_t} B_{t-1} x_{t-1} + \lambda_t \Delta_t B_t x_t,$

where $\lambda_t \in [0,1]$ is a learnable parameter. Setting $\lambda_t = 1$ drops the $x_{t-1}$ term and recovers the exponential-Euler update, while $\lambda_t = \tfrac{1}{2}$ gives the classical trapezoidal rule. This richer recurrence improves expressivity over earlier exponential-Euler SSMs because both the previous and the current input contribute to each transition (Lahoti et al., 16 Mar 2026).
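The recurrence above can be sketched for a scalar state as follows. This is a toy version under stated assumptions: $A$, $B$, $\Delta$, and $\lambda$ are held constant over time (in Mamba-3 they are data-dependent), the input before the first step is taken to be zero, and the name `exp_trapezoidal_scan` is illustrative rather than taken from the paper.

```python
import math

def exp_trapezoidal_scan(xs, a, b, delta, lam):
    """Scalar exponential-trapezoidal recurrence with constant parameters.

    h_t = e^{delta*a} h_{t-1}
          + (1 - lam) * delta * e^{delta*a} * b * x_{t-1}
          + lam * delta * b * x_t
    Assumes h_{-1} = 0 and x_{-1} = 0 (an illustrative choice).
    """
    decay = math.exp(delta * a)
    h, x_prev = 0.0, 0.0
    states = []
    for x in xs:
        h = (decay * h
             + (1.0 - lam) * delta * decay * b * x_prev
             + lam * delta * b * x)
        states.append(h)
        x_prev = x
    return states

# lam = 1.0 reduces to the exponential-Euler update; lam = 0.5 is the trapezoidal rule.
euler = exp_trapezoidal_scan([1.0, 0.0], a=-1.0, b=1.0, delta=0.5, lam=1.0)
trapezoid = exp_trapezoidal_scan([1.0, 0.0], a=-1.0, b=1.0, delta=0.5, lam=0.5)
```

With `lam=0.5`, half of the first input's contribution arrives one step late (scaled by the decay), which is the two-point weighting the formula describes.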

2.2 Complex-Valued State Updates

To overcome the real-valued decay's inability to capture rotational dynamics, Mamba-3 employs a complex-valued state update. The formulation introduces a complex drift:

$\dot h(t) = (A(t) + i\theta(t)) h(t) + (B(t) + i\hat B(t)) x(t),\quad y(t) = \Re[(C(t)+i\hat C(t))^\top h(t)].$

Through exponential-Euler discretization, the update is efficiently reducible to a real-valued block-diagonal matrix of $2\times2$ rotations, which are implemented via a data-dependent Rotary Positional Embedding (RoPE) trick. This complex expressivity is essential for tasks involving periodicity or parity tracking, with negligible asymptotic cost increase (Lahoti et al., 16 Mar 2026).
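The equivalence between a complex state update and a real $2\times2$ rotation can be checked numerically. The sketch below covers only the drift term for a single complex state, with arbitrary illustrative values; the function names are hypothetical and not from the paper.

```python
import cmath
import math

def complex_step(h, a, theta, delta):
    """Advance a complex state by pure drift: h <- e^{delta*(a + i*theta)} * h."""
    return cmath.exp(delta * complex(a, theta)) * h

def rotation_step(re, im, a, theta, delta):
    """Same update on the (real, imag) pair: scalar decay times a 2x2 rotation."""
    decay = math.exp(delta * a)
    c, s = math.cos(delta * theta), math.sin(delta * theta)
    return decay * (c * re - s * im), decay * (s * re + c * im)

h = complex(0.3, -1.2)
z = complex_step(h, a=-0.5, theta=2.0, delta=0.1)
re, im = rotation_step(h.real, h.imag, a=-0.5, theta=2.0, delta=0.1)
assert abs(z.real - re) < 1e-12 and abs(z.imag - im) < 1e-12
```

Because the rotation angle `delta * theta` can vary with the input, the real-valued form resembles a rotary embedding whose angle is data-dependent rather than fixed by position.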

2.3 Multi-Input Multi-Output (MIMO) SSMs

Mamba-3 extends the single-input single-output (SISO) SSM to a higher-rank multi-input multi-output (MIMO) formulation. Rather than updating one state per input channel, it applies a matmul update:

$H_t = e^{\Delta_t A_t} H_{t-1} + \Delta_t B_t X_t^\top,$

where $B_t \in \mathbb{R}^{N \times R}$, $X_t \in \mathbb{R}^{P \times R}$ (with state size $N$ and head dimension $P$), and $R$ is the MIMO rank. This boosts hardware arithmetic intensity (FLOPs per byte) and enables the model to approach device throughput saturation without incurring significant memory overhead or increased decode latency (Lahoti et al., 16 Mar 2026).
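A rough sketch of a rank-$R$ matmul state update follows. The shapes are assumptions made for illustration (a state of shape $N \times P$, an input projection of shape $N \times R$, and an input block of shape $P \times R$), as are the function names; none of this is the paper's kernel.

```python
def mimo_update(H, B, X, decay):
    """Return decay * H + B @ X^T using nested lists.

    H: N x P state, B: N x R, X: P x R (illustrative, assumed shapes).
    """
    n, p, r = len(H), len(X), len(B[0])
    return [
        [decay * H[i][j] + sum(B[i][k] * X[j][k] for k in range(r))
         for j in range(p)]
        for i in range(n)
    ]

def flops_per_state_element(r):
    """Arithmetic per state element touched: r multiplies and r adds for the
    rank-r term, plus one multiply for the decay."""
    return 2 * r + 1

H = [[2.0, 2.0], [2.0, 2.0]]
B = [[1.0], [2.0]]          # R = 1 reduces to a plain outer-product update
X = [[3.0], [4.0]]
H_next = mimo_update(H, B, X, decay=0.5)
```

The state is read and written once per step regardless of `r`, while the arithmetic per element grows with `r`, which is the sense in which a higher rank raises FLOPs per byte.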

3. Architectural Integration

Mamba-3 layers replace self-attention in Transformer-like pre-norm blocks. Each block consists of a Mamba-3 SSM layer (with exponential-trapezoidal discretization, complex-valued state updates, and optionally MIMO), followed by residual gating, BC/QK normalization (RMSNorm), and a SwiGLU feed-forward module with expansion factor 2. Channel-wise learnable biases are added to both the $B$ and $C$ projections after normalization, so no explicit convolutional modules or additional non-linearities are required beyond those built into the SSM and feed-forward blocks (Lahoti et al., 16 Mar 2026).

4. Computational Complexity and Hardware Efficiency

Relative to Transformers, Mamba-3 achieves sub-quadratic runtime and constant memory. The scaling properties are:

Self-attention: Compute $O(T^2)$, Memory $O(T)$
Mamba-3 SISO: Compute $O(T)$, Memory $O(1)$ (constant in $T$)
Mamba-3 MIMO: Compute $O(T)$, Memory $O(1)$ (constant in $T$)

Arithmetic intensity is increased, as MIMO SSMs approach the hardware peak arithmetic intensity (ops/byte) of H100 accelerators. Optimized kernels achieve decoding times of 0.127 ms for SISO (bf16) and 0.156 ms for MIMO, compared to Mamba-2's 0.203 ms. The recurrent state, which plays the role of a Transformer's KV cache, stays constant in size as the sequence grows, which is critical for million-token applications (Lahoti et al., 16 Mar 2026).
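To make the memory scaling concrete, the following back-of-the-envelope sketch compares a Transformer KV cache with a fixed-size recurrent state. The layer count, head count, head dimension, and state dimension are hypothetical values chosen for illustration, not figures from the paper.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Keys and values cached for every past token: grows linearly in seq_len."""
    return 2 * seq_len * n_layers * n_kv_heads * head_dim * bytes_per_elem

def ssm_state_bytes(n_layers, n_heads, head_dim, state_dim, bytes_per_elem=2):
    """One fixed-size recurrent state per head and layer: independent of seq_len."""
    return n_layers * n_heads * head_dim * state_dim * bytes_per_elem

# Hypothetical configuration (bf16, 2 bytes per element).
cfg = dict(n_layers=24, head_dim=64)
for seq_len in (2_048, 1_000_000):
    kv = kv_cache_bytes(seq_len, n_kv_heads=16, **cfg)
    ssm = ssm_state_bytes(n_heads=16, state_dim=128, **cfg)
    print(f"T={seq_len}: KV cache {kv / 1e6:.1f} MB, SSM state {ssm / 1e6:.1f} MB")
```

Only the KV cache figure changes between the two sequence lengths; the recurrent state footprint is the same at every $T$.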

5. Empirical Performance and Evaluation

5.1 Language Modeling and State Tracking

Pretrained on 100B FineWeb-Edu tokens (2K context), Mamba-3 demonstrates:

- At 1.5B parameters: the SISO variant surpasses Gated DeltaNet (GDN) by +0.6 percentage points in average accuracy, and the MIMO variant adds a further +1.2 points, for +1.8 in total
- Benchmarks: outperforms Mamba-2, GDN, and Transformers on LAMBADA, HellaSwag, PIQA, ARC, and long-context “needle-in-a-haystack” tasks
- State tracking: only the complex-valued, RoPE-enhanced Mamba-3 solves formal state-tracking problems (Parity, Modular Arithmetic), where real-valued and standard RoPE variants fail (Lahoti et al., 16 Mar 2026)
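A toy illustration of why rotational dynamics help with parity: a unit complex state rotated by $\pi$ for every 1 bit flips sign exactly when the parity flips, whereas a real state scaled by a decay in $(0, 1)$ can shrink but never change sign. This is a hand-built example under those assumptions, not the construction or training setup used in the paper, and the function names are hypothetical.

```python
import cmath
import math

def parity_via_rotation(bits):
    """Track parity by rotating a unit complex state by pi for every 1 bit."""
    h = complex(1.0, 0.0)
    for bit in bits:
        h = cmath.exp(1j * math.pi * bit) * h  # data-dependent rotation angle
    return 0 if h.real > 0 else 1

def real_decay_keeps_sign(bits, decay=0.9):
    """A real state multiplied by a decay in (0, 1) stays positive for any input."""
    h = 1.0
    for bit in bits:
        h = (decay if bit else 1.0) * h
    return h > 0

bits = [1, 0, 1, 1]
assert parity_via_rotation(bits) == sum(bits) % 2
assert real_decay_keeps_sign(bits)
```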

5.2 Efficiency-Quality Frontier

Mamba-3 achieves equivalent perplexity to Mamba-2 using half the state size, yielding lower latency and lower memory usage. The model consistently shifts the perplexity–decode-speed Pareto curve downward and rightward, indicating improved quality at higher throughput (Lahoti et al., 16 Mar 2026).

5.3 Deployment in Large Models

Mamba layers serve as a core backbone in hybrid Mamba-Transformer mixture-of-experts LLMs, such as Nemotron 3 Nano and Nemotron 3 Ultra (NVIDIA et al., 23 Dec 2025, NVIDIA et al., 12 Jun 2026), facilitating:

Multi-million token context windows (1M+) due to constant memory and linear compute
Throughput improvements (up to 6×) over similarly-sized transformer-based open LLMs
High accuracy on diverse reasoning, retrieval, and agentic benchmarks

6. Applications Beyond Text: 3D Data and Medical Imaging

Mamba-based architectures also underpin models for 3D volumetric segmentation, integrating custom depthwise convolutions and multi-scale Mamba blocks for high spatial locality and context capture (Wang et al., 25 Mar 2025). Key findings include:

3D-DWConv prior to the SSM enables state-of-the-art Dice with one-third the FLOPs of transformer backbones
Multi-scale blocks (MSv4) further boost accuracy
Scanning strategies: Single-scan suffices for most scenarios; Tri-scan excels in high-class-count or complex tasks
Best practice: Use UlikeMamba_3dMT with 3D-DWConv, MSv4, and Tri-scan for maximal accuracy-efficiency trade-off in clinical segmentation pipelines (Wang et al., 25 Mar 2025)

For multimodal tasks such as text-driven 3D medical image segmentation, Mamba is combined with components like EGSC for spatial unfolding, Tri-orientated Mamba for context along all three axes, and advanced nonlinear blocks (3D-GR-KAN). This composite architecture excels in multi-organ and tumor segmentation, outperforming CNNs and transformers on both accuracy and GPU efficiency (Yang et al., 24 May 2025).

7. Significance and Implications

The Mamba-3 architecture sets a new reference point for sequence modeling in terms of quality, efficiency, and scalability. Its principled advancements (exponential-trapezoidal recurrence, complex state dynamics, and MIMO SSMs) are reported to enable state tracking, retrieval, and long-context understanding that earlier sub-quadratic models struggled with. The design choices enable both text-centric and vision-centric applications to match or exceed transformer-based counterparts, with markedly lower memory and compute requirements (Lahoti et al., 16 Mar 2026, Wang et al., 25 Mar 2025, Yang et al., 24 May 2025, NVIDIA et al., 23 Dec 2025, NVIDIA et al., 12 Jun 2026).

A plausible implication is that the architectural paradigm established by Mamba-3 (hybrid complex SSMs with optimized numerics and hardware alignment) will persist as a favored solution for large-scale, real-time sequence modeling in both language and high-dimensional data, especially as context windows and model sizes continue to grow.


References

Lahoti et al. (2026). Mamba-3: Improved Sequence Modeling using State Space Principles.

NVIDIA et al. (2025). Nemotron 3 Nano: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning.

NVIDIA et al. (2026). Nemotron 3 Ultra: Open, Efficient Mixture-of-Experts Hybrid Mamba-Transformer Model for Agentic Reasoning.

Wang et al. (2025). A Comprehensive Analysis of Mamba for 3D Volumetric Medical Image Segmentation.

Yang et al. (2025). TK-Mamba: Marrying KAN with Mamba for Text-Driven 3D Medical Image Segmentation.
