Decision Mamba in Reinforcement Learning
- Decision Mamba is a reinforcement learning framework that uses selective state-space models to achieve linear time and memory complexity in sequential decision-making.
- It integrates hierarchical and hybrid architectures to efficiently handle long-horizon trajectories, demonstrating strong empirical results on benchmarks like Atari and MuJoCo.
- Empirical studies reveal up to a 30% reduction in parameters and significant inference speedups, offering a scalable alternative to transformer-based models.
Decision Mamba is a family of reinforcement learning (RL) agents based on selective state space models (SSMs), designed to overcome the scalability and context-length bottlenecks of transformer-based sequence modeling in sequential decision-making problems. Integrating the Mamba SSM with RL’s reward-conditioned sequence modeling, Decision Mamba variants achieve linear time and memory complexity in context length, strong empirical results across imitation and trajectory optimization benchmarks, and a foundation for hierarchical, hybrid, and multi-scale RL architectures (Huang et al., 2024).
1. Selective State Space Models for RL Sequence Modeling
Decision Mamba replaces the self-attention backbone of the Decision Transformer with a Mamba block, a selective SSM inspired by general state space sequence modeling. The basic dynamic is given by the continuous-time model $h'(t) = A\,h(t) + B\,x(t)$, $y(t) = C\,h(t)$, where $h(t)$ is a hidden state, $x(t)$ denotes the embedded input (state, action, or reward), and $A, B, C$ are learnable parameters. After zero-order-hold discretization with step size $\Delta$, the sequence can be modeled in discrete time as $h_t = \bar{A}\,h_{t-1} + \bar{B}\,x_t$, $y_t = C\,h_t$, with $\bar{A} = \exp(\Delta A)$ and $\bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I)\,\Delta B$. Mamba introduces data-dependent selection: each token computes an input-dependent gating vector $(\Delta_t, B_t, C_t)$ from $x_t$, then applies $h_t = \bar{A}_t\,h_{t-1} + \bar{B}_t\,x_t$, $y_t = C_t\,h_t$, where the recurrence is evaluated with a "ParallelScan", a hardware-friendly parallel prefix sum enabling efficient $O(L)$ scaling in time and memory (Huang et al., 2024). No explicit causal masking or positional encoding is needed, as the state-space kernel inherently models temporal order and distance, and hidden attention decays exponentially with lookback distance (Dai et al., 2024).
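As a concrete illustration of the recurrence above, the following is a minimal NumPy sketch of the discretized selective scan. The parameter shapes, the softplus step-size gate, and the sequential Python loop (standing in for the fused hardware parallel scan) are simplifying assumptions, not the actual Mamba kernel.

```python
import numpy as np

def selective_scan(x, A, W_B, W_C, W_dt):
    """Sequential sketch of the selective SSM recurrence (illustrative, not the fused kernel).

    x    : (L, D) embedded trajectory tokens (RTG / state / action embeddings)
    A    : (D, N) per-channel (diagonal) state matrix, learned, input-independent
    W_B  : (D, N) projection giving the token-dependent input matrix  B_t = x_t @ W_B
    W_C  : (D, N) projection giving the token-dependent output matrix C_t = x_t @ W_C
    W_dt : (D, D) projection giving the token-dependent step size Delta_t
    """
    L, D = x.shape
    h = np.zeros((D, A.shape[1]))                 # hidden state: one length-N state per channel
    y = np.empty_like(x)
    for t in range(L):
        xt = x[t]                                 # (D,)
        delta = np.logaddexp(0.0, xt @ W_dt)      # softplus gate: data-dependent step size, (D,)
        B_t = xt @ W_B                            # (N,) selects what to write into the state
        C_t = xt @ W_C                            # (N,) selects what to read from the state
        A_bar = np.exp(delta[:, None] * A)        # zero-order-hold discretization of A
        B_bar = delta[:, None] * B_t[None, :]     # simplified (Euler) discretization of B
        h = A_bar * h + B_bar * xt[:, None]       # h_t = A_bar_t h_{t-1} + B_bar_t x_t
        y[t] = h @ C_t                            # y_t = C_t h_t
    return y
```

In practice the loop is replaced by an associative parallel scan over the per-token $(\bar{A}_t, \bar{B}_t x_t)$ pairs, which is what yields the $O(L)$ time and memory behavior described above.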
2. Architectural Variants and Hybrid Integration
Multiple architectural variants of Decision Mamba have been developed for different RL regimes:
- Vanilla Decision Mamba (DM): A stack of SSM-based token-mixing Mamba blocks, alternating with feedforward (MLP) channel-mixing, ingesting reward-conditioned or plain state-action trajectories without return-to-go (RTG) conditioning as a mandatory input (Ota, 2024, Correia et al., 2024, Cao et al., 2024); a minimal layout sketch follows this list.
- Multi-Grained/Hierarchical Decision Mamba: Combines a coarse-grained inter-step SSM (tracking trajectory history) and a fine-grained intra-step SSM (encoding the structural relationship between RTG, state, and action within each step). This architecture improves utilization of temporal correlations and intra-step structure; fusion is performed at each layer (Lv et al., 2024).
- Decision Mamba-Hybrid (DM-H): A two-stage architecture where the Mamba backbone processes long-horizon, across-episode context and generates sub-goal embeddings every $k$ steps, and a local GPT-style transformer decodes actions over contextual windows of size $k$ (Huang et al., 2024). Sub-goals are learned by maximizing downstream value.
- Hybrid Mamba–Attention/Diffusion: In 3D manipulation and diffusion policy settings, a hybrid Mamba + self-attention UNet (“X-Mamba”) achieves parameter and FLOPs reduction while retaining state-of-the-art performance (Cao et al., 2024).
- Decision MetaMamba: Adds a windowed, multimodal token mixer before the Mamba input, compensating for information loss in the selective scan by explicitly fusing adjacent state, action, and RTG tokens (Kim, 2024).
- Hierarchical Decision Mamba (HDM): Implements explicit sub-goal planning via a meta Mamba (high-level) and control Mamba (low-level) stacked hierarchy, and can be deployed in agentic frameworks, such as RAN slicing for 6G networks, coordinating sub-agents through LLM-interpreted operator goals (Habib et al., 29 Dec 2025).
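As a layout sketch for the vanilla variant above (and a baseline the other variants modify), the following PyTorch module interleaves RTG, state, and action embeddings and alternates selective-SSM token mixing with MLP channel mixing. The use of the reference `mamba_ssm` package's `Mamba` block, the embedding scheme, and all sizes are illustrative assumptions rather than the exact configuration of any cited variant.

```python
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # reference selective-SSM block (requires the mamba_ssm package, CUDA)

class DecisionMambaBlock(nn.Module):
    """One layer of the vanilla DM stack: SSM token mixing followed by MLP channel mixing."""
    def __init__(self, d_model: int, expand: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.mixer = Mamba(d_model=d_model)       # replaces causal self-attention in DT
        self.norm2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, expand * d_model),
            nn.GELU(),
            nn.Linear(expand * d_model, d_model),
        )

    def forward(self, tokens):                    # tokens: (B, 3K, d_model)
        tokens = tokens + self.mixer(self.norm1(tokens))
        return tokens + self.mlp(self.norm2(tokens))

class DecisionMamba(nn.Module):
    """Interleaved RTG/state/action tokens in, next-action predictions out (illustrative sizes)."""
    def __init__(self, state_dim: int, act_dim: int, d_model: int = 128, n_layers: int = 4):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.blocks = nn.ModuleList([DecisionMambaBlock(d_model) for _ in range(n_layers)])
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, K, 1), states: (B, K, state_dim), actions: (B, K, act_dim)
        tok = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states), self.embed_action(actions)], dim=2
        ).flatten(1, 2)                           # interleave per step -> (B, 3K, d_model)
        for blk in self.blocks:                   # no positional encoding or causal mask needed
            tok = blk(tok)
        return self.head(tok[:, 1::3])            # predict a_t from the state-token positions
```

Because the scan is inherently causal, predicting $a_t$ from the state-token position conditions only on $(\hat{R}_1, s_1, a_1, \ldots, \hat{R}_t, s_t)$, mirroring the Decision Transformer's masking without an explicit mask.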
3. Mathematical Formulation and Learning Objectives
Typical Decision Mamba models operate on reward-conditioned trajectories $\tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T)$ with returns-to-go $\hat{R}_t = \sum_{t'=t}^{T} r_{t'}$. Inputs are separately embedded, concatenated, and passed through a stack of Mamba layers operating on token sequences of length $3K$ (one RTG, state, and action token per timestep over a context of $K$ steps).
- Losses: For continuous actions, mean squared error between the predicted and dataset action; for discrete actions, cross-entropy loss on action logits. Multi-task extensions also predict the next state and next RTG. (A minimal training-step sketch follows this list.)
- Self-Evolution Regularization (PSER): Combines imitation of ground-truth actions with the model’s previous outputs, mitigating overfitting to noisy or suboptimal trajectories (Lv et al., 2024).
- Sub-goal selection (in DM-H) is performed by scoring candidate future windows by expected downstream return and selecting the return-maximizing states as sub-goals, which are injected as additional context for the local transformer.
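As a minimal training-step sketch for the continuous-action objective above, the following reuses the illustrative `DecisionMamba` module from Section 2; the return-to-go computation follows the trajectory definition in the text, while the batch layout and the absence of padding or masking are simplifications.

```python
import torch
import torch.nn.functional as F

def returns_to_go(rewards):
    """R_hat_t = sum_{t' >= t} r_{t'}, via a reversed cumulative sum. rewards: (B, K)."""
    return torch.flip(torch.cumsum(torch.flip(rewards, dims=[1]), dim=1), dims=[1])

def train_step(model, optimizer, batch):
    """One supervised update on reward-conditioned trajectory segments (behavior-cloning style)."""
    states, actions, rewards = batch              # (B, K, S), (B, K, A), (B, K)
    # In practice RTGs are precomputed over the full episode; recomputing them over the
    # sampled window is a simplification for this sketch.
    rtg = returns_to_go(rewards).unsqueeze(-1)    # (B, K, 1)
    pred = model(rtg, states, actions)            # (B, K, A) predicted actions
    loss = F.mse_loss(pred, actions)              # continuous actions: MSE to the dataset action
    # For discrete actions the head would emit logits and F.cross_entropy would replace mse_loss.
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```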
4. Computational and Parameter Efficiency
Mamba’s linear-complexity kernel allows strict $O(L)$ scaling in both time and memory, contrasting with transformers’ $O(L^2)$ cost for context length $L$. Empirical results show Decision Mamba uses ≈30% fewer parameters and 20% fewer MACs than Decision Transformer on Atari, and up to a quarter of the parameters in MuJoCo (Dai et al., 2024). Hybrid Decision Mamba reduces online inference time by 28× in long-horizon tasks by restricting expensive attention to local prediction windows (Huang et al., 2024). In imitation learning, the parameter-to-score ratio is dramatically improved over transformer baselines (Kim, 2024).
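A back-of-the-envelope cost model makes the contrast concrete: per-layer token mixing costs roughly $O(L \cdot D \cdot N)$ multiply-accumulates for the selective scan versus $O(L^2 \cdot D)$ for causal self-attention. The constants below are illustrative only and ignore kernel-level details such as fusion and the convolutional branch.

```python
def mixing_macs(L: int, D: int, N: int = 16):
    """Rough per-layer token-mixing cost in multiply-accumulates (constants omitted).

    Selective scan : each of L tokens updates a (D, N) state       -> ~ L * D * N
    Self-attention : QK^T plus attention-weighted V over L keys    -> ~ L * L * D
    """
    return L * D * N, L * L * D

for L in (100, 1_000, 10_000):
    scan, attn = mixing_macs(L, D=128)
    print(f"L={L:>6}: scan ~{scan:.1e} MACs, attention ~{attn:.1e} MACs, ratio ~{attn / scan:.0f}x")
```

The ratio grows linearly with $L$, which is why the reported speedups, and the benefit of restricting attention to short local windows, become most pronounced at long horizons.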
5. Empirical Performance Across Tasks
Decision Mamba variants consistently match or exceed state-of-the-art returns in a wide range of settings:
- On D4RL locomotion, AntMaze, and Kitchen, DM and HDM surpass Decision Transformer (DT) and Algorithm Distillation, especially in long-horizon and hierarchical benchmarks; in some settings HDM performs best without needing return-to-go conditioning at test time (Correia et al., 2024, Huang et al., 2024).
- On Atari 1% DQN-replay, hidden-attention Mamba (DeMa) outperforms DT by 80% on average score, with improved sample efficiency (Dai et al., 2024). MambaDM’s global-local (GLoMa) mixer yields up to 45.6% improvement on Qbert (Cao et al., 2024).
- In in-context RL and Grid World/Tmaze, DM-H achieves optimal recall at sequence lengths (horizons) where DT/DM baselines fail due to memory constraints, maintaining 30× speedup in online testing (Huang et al., 2024).
- In 3D diffusion manipulation, hybrid Mamba-UNet achieves ≥80% parameter reduction and matches/exceeds UNet-based DP3 baseline on Adroit, MetaWorld, and DexArt; robustness to horizon scaling is demonstrated (Cao et al., 2024).
- Domain-specific applications, e.g., agentic orchestration for 6G RAN slicing, show that HDM-based AI outperforms transformer-based and HRL baselines in throughput, cell-edge fairness, latency, and self-healing (Habib et al., 29 Dec 2025).
- Ablation and scaling law studies indicate performance is more sensitive to dataset size than to model size in RL; further, DM’s benefits are pronounced in simple action/visual regimes, with transformer attention retaining an edge at high complexity (Yan, 2024, Cao et al., 2024).
6. Comparative Analysis and Applicability
Random Forest regression and correlation analyses on Atari reveal that DM excels in environments with low action- and visual-complexity (e.g., Breakout), where sequence modeling benefits from long-horizon context and gating, while transformers remain preferable for high-entropy, high-cardinality action spaces or visually complex domains (Yan, 2024). For tasks with small action sets, highly compressible visual observations, or long, memory-demanding horizons, DM is an effective, efficient alternative (Huang et al., 2024).
7. Open Directions and Limitations
Current Decision Mamba architectures are limited by:
- Potential information loss through the selective scan, partially alleviated via multimodal mixers at the input layer (e.g., Decision MetaMamba's local linear/convolutional fusion) (Kim, 2024).
- Reliance on offline batch learning or in-context adaptation; direct on-policy extensions and policy-gradient integration remain underexplored (Ota, 2024).
- Sensitivity to choice of local prediction window and sub-goal extraction; adaptive selection and richer gating present promising future directions (Huang et al., 2024).
- Robustness to non-Markov, high-entropy, or nonstationary contexts is an ongoing challenge (Lv et al., 2024, Habib et al., 29 Dec 2025).
- Empirical and protocol analyses indicate dataset scaling is more beneficial to performance than model capacity scaling in standard RL benchmarks (Cao et al., 2024).
Collectively, Decision Mamba establishes selective state-space sequence modeling as a highly efficient and effective foundation for RL, supporting rapid progress in trajectory optimization, in-context RL, hierarchical/agentic architectures, and parameter-efficient imitation learning across diverse sequential tasks.