Mamba-CL: Continual Learning with S6 Models

Updated 22 April 2026

Mamba-CL is a family of approaches that integrates S6 models into continual learning, offering fixed-memory operation and linear compute scaling.
It employs selectivity regularization and null-space consistency constraints to maintain sample efficiency and prevent catastrophic forgetting.
Mamba-CL reports state-of-the-art results in reinforcement learning and vision domains, outperforming Transformer-based and rehearsal-free methods.

Mamba-CL denotes a family of approaches that use Selective State-Space Models (S6) within the Mamba architecture to address continual learning (CL), meta-continual learning (MCL), and scalable algorithm distillation, including in complex continuous-control and vision domains. These methods exploit the S6 model's linear-scaling, attention-free sequence modeling and add CL-specific mechanisms such as selectivity regularization and null-space consistency constraints. The reported result is high sample efficiency, fixed-memory operation, and robustness against catastrophic forgetting, with gains over conventional Transformer-based and rehearsal-free CL systems across benchmarks (Beaussant et al., 16 Jun 2025, Zhao et al., 2024, Cheng et al., 2024).

1. Foundations and Model Architecture

Mamba-CL adopts the S6 (Selective Structured State-Space) model, a discrete-time recurrent architecture parameterized as: $h_t = S_t \odot h_{t-1} + G_t \odot z_t, \quad u_t = C_t h_t + D z_t$ where $z_t \in \mathbb{R}^M$ is the input embedding, $S_t, G_t \in \mathbb{R}^M$ are input-dependent, per-channel gates produced by small convolutional and linear projections (followed by a sigmoid), and $C_t, D$ are readout parameters. The model eschews explicit attention in favor of rich, tokenwise dynamics, supporting long-sequence modeling at linear compute and memory cost. Each transition tuple, e.g., $c_t = (s_t, a_t, r_t, s_{t+1})$ for RL or $(x_t, y_t)$ for classification, is projected via a learned embedding to $d_\mathrm{model}$ and fed to stacked S6 blocks with LayerNorm, feed-forward MLP (GELU activations), and residual connections (Beaussant et al., 16 Jun 2025, Zhao et al., 2024).
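The recurrence above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the projection matrices `W_s`, `W_g`, `W_c` are hypothetical stand-ins for the gate and readout projections, the convolutional branch is omitted, and the readout $C_t h_t$ is assumed to act per channel.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def s6_scan(z, W_s, W_g, W_c, d, h0=None):
    """Run h_t = S_t * h_{t-1} + G_t * z_t and u_t = C_t * h_t + d * z_t.

    z: (L, M) input embeddings.
    W_s, W_g, W_c: (M, M) hypothetical projections making the gates and
        readout input-dependent (assumption; the conv branch is omitted).
    d: (M,) skip-connection weights.
    h0: optional (M,) initial state, so a stream can be processed in chunks.
    """
    L, M = z.shape
    h = np.zeros(M) if h0 is None else h0.copy()
    outputs = np.empty((L, M))
    for t in range(L):
        S_t = sigmoid(z[t] @ W_s)   # per-channel retention gate
        G_t = sigmoid(z[t] @ W_g)   # per-channel input gate
        C_t = z[t] @ W_c            # per-channel readout (assumed elementwise)
        h = S_t * h + G_t * z[t]    # state update: O(M) memory, independent of L
        outputs[t] = C_t * h + d * z[t]
    return outputs, h

rng = np.random.default_rng(0)
M, L = 8, 32
z = rng.normal(size=(L, M))
W_s, W_g, W_c = (rng.normal(scale=0.3, size=(M, M)) for _ in range(3))
d = np.ones(M)

u_full, h_full = s6_scan(z, W_s, W_g, W_c, d)
# Processing the same stream in two chunks, carrying only the state h.
u_a, h_a = s6_scan(z[:20], W_s, W_g, W_c, d)
u_b, h_b = s6_scan(z[20:], W_s, W_g, W_c, d, h0=h_a)
```

Carrying only `h` between chunks reproduces the full-sequence outputs, which is the property that lets the model consume long streams without storing past tokens.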

2. Continual and Meta-Continual Learning with Mamba-CL

Mamba-CL addresses the CL problem by processing a non-stationary data stream as a sequential prediction task. For each new sample or task, the recurrent hidden state $h_t$ is updated; new predictions $\hat{y}_t = f_\theta(h_{t-1}, x_t)$ are emitted with no explicit storage or recomputation of past samples (Zhao et al., 2024). Unlike Transformers, which require a linearly growing memory for historical key-value pairs, the fixed-size hidden state of SSMs enables strict memory control and efficient adaptation.

For MCL, meta-learning is performed episodically: a train stream $E^{\mathrm{train}}$ and a test set $E^{\mathrm{test}}$ are sampled per episode. The meta-objective takes the form $\mathcal{L} = \mathcal{L}_{\mathrm{main}} + \lambda \mathcal{L}_{\mathrm{sel}}$, where $\mathcal{L}_{\mathrm{main}}$ is the main loss (e.g., cross-entropy, MSE), $\mathcal{L}_{\mathrm{sel}}$ is the selectivity regularizer enforcing content-based gating analogous to attention, and $\lambda$ is the regularization weight (Zhao et al., 2024).

3. Selectivity Regularization and Null-Space Consistency

The selectivity regularizer is motivated by the soft-attention weights in Transformers, which encode relevance of all prior representations to the current query. For Mamba-CL, this property is emulated by associating the model's gating patterns with a pseudo-attention matrix; the KL divergence between their softmaxed values and “ground-truth” association vectors (e.g., label equality matches) guides the network to focus on semantically relevant past tokens without explicit caching (Zhao et al., 2024).
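A KL-based regularizer of this kind can be sketched as follows. This is an illustrative assumption rather than the published loss: the function name `selectivity_loss`, the `scores` input (a pseudo-attention matrix of queries over past tokens), the use of label equality as the target, and the direction of the KL divergence are all choices made here for the example.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def selectivity_loss(scores, query_labels, context_labels, eps=1e-8):
    """Mean KL(target || softmax(scores)) over queries.

    scores: (Q, T) pseudo-attention scores of Q queries over T past tokens
        (hypothetical input, assumed to be derived from the gating patterns).
    query_labels: (Q,) labels of the queries.
    context_labels: (T,) labels of the past tokens.
    """
    # Target association: uniform mass on past tokens sharing the query label.
    match = (query_labels[:, None] == context_labels[None, :]).astype(float)
    target = match / np.maximum(match.sum(axis=1, keepdims=True), 1.0)
    pred = softmax(scores)
    # Entries with target == 0 contribute nothing; rows without any match give 0.
    kl = np.sum(target * (np.log(target + eps) - np.log(pred + eps)), axis=1)
    return float(kl.mean())

context_labels = np.array([0, 1, 2, 1])
query_labels = np.array([1, 2])
match = (query_labels[:, None] == context_labels[None, :]).astype(float)

loss_aligned = selectivity_loss(10.0 * match, query_labels, context_labels)
loss_uniform = selectivity_loss(np.zeros((2, 4)), query_labels, context_labels)
```

Scores concentrated on same-label tokens give a loss near zero, while uniform scores are penalized, which is the behaviour the regularizer is meant to encourage.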

In another Mamba-CL instance, updates to the core SSM parameters are restricted to the orthogonal complement (null space) of the feature subspaces of previous tasks, theoretically guaranteeing output consistency and preventing catastrophic forgetting. The constraints require that each parameter update leave the SSM block's outputs on previous-task features unchanged. Gradient projections enforce these null-space updates, requiring only the storage of low-rank projectors per SSM block (Cheng et al., 2024).
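The idea can be made concrete for the simplest case, a single linear map. The derivation below is a sketch under that assumption and is not the full set of S6 conditions from the cited work; the symbols $X$ (stacked previous-task input features of rank $r$), $W$ (a weight matrix), $G$ (its gradient), and $\eta$ (a step size) are introduced here for illustration.

```latex
\begin{aligned}
X &= U \Sigma V_r^\top, \qquad P = I - V_r V_r^\top, \\
X P &= U \Sigma V_r^\top - U \Sigma \,(V_r^\top V_r)\, V_r^\top = 0, \\
\Delta W &= -\eta\, P\, G \;\Longrightarrow\;
X (W + \Delta W) = X W - \eta\, (X P)\, G = X W .
\end{aligned}
```

Because $V_r^\top V_r = I$, the projector $P$ annihilates every previous-task feature, so any projected update leaves old-task outputs of this linear map unchanged.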

4. Training Procedure and Computational Efficiency

The Mamba-CL training loop proceeds by sampling meta-train episodes or tasks; for each, the hidden state is initialized, and the sequence is processed one token at a time. Training minimizes the primary loss and selectivity regularizer (if present). In the orthogonality-based CL version, after each task, new projectors are constructed via SVD of old-task features, and future gradients are projected into their null spaces (Cheng et al., 2024).
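The projector construction and projected update can be sketched in NumPy for a single linear layer. This is a simplified illustration, not the procedure of Cheng et al.: it uses the exact row space of a small feature matrix, whereas a practical system would likely use an approximate, low-rank null space; the function names and the `tol` threshold are hypothetical.

```python
import numpy as np

def null_space_projector(features, tol=1e-10):
    """Return P = I - V_r V_r^T, where V_r spans the row space of `features`.

    features: (n, d) matrix of previous-task input features.
    tol: relative singular-value threshold used to estimate the rank.
    """
    _, s, vt = np.linalg.svd(features, full_matrices=False)
    rank = int(np.sum(s > tol * s.max()))
    v_r = vt[:rank].T                       # (d, rank) basis of the row space
    return np.eye(features.shape[1]) - v_r @ v_r.T

def projected_step(weight, grad, projector, lr=0.1):
    """Gradient step restricted to the null space of old-task features."""
    return weight - lr * projector @ grad

rng = np.random.default_rng(0)
old_features = rng.normal(size=(5, 8))      # 5 old-task samples, 8 input dims
weight = rng.normal(size=(8, 3))
grad = rng.normal(size=(8, 3))              # stand-in for a new-task gradient

P = null_space_projector(old_features)
new_weight = projected_step(weight, grad, P)
plain_weight = weight - 0.1 * grad          # unconstrained step, for contrast
```

After the projected step, `old_features @ new_weight` matches `old_features @ weight` up to numerical error, while the unconstrained step changes the old-task outputs.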

Key efficiency gains are as follows:

Model	Time Complexity	Memory Complexity	Maximum Context Length (single RTX 3060)
Transformer	O(L²·d_model)	O(L·d_model)	∼400 tokens
Mamba-CL (S6)	O(L·d_model)	O(d_model)	2,000+ tokens

Linear scaling in sequence length enables Mamba-CL to operate on long, multi-episode RL histories and long CL task streams that are infeasible for quadratic-cost Transformers (Beaussant et al., 16 Jun 2025, Zhao et al., 2024).
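The scaling gap in the table can be checked with simple arithmetic. The sketch below counts only the leading-order terms and ignores constants, layer counts, and implementation details; `d_model = 256` is an arbitrary value chosen for the example.

```python
def attention_cost(seq_len, d_model):
    """Leading-order time cost of self-attention: O(L^2 * d_model)."""
    return seq_len * seq_len * d_model

def s6_cost(seq_len, d_model):
    """Leading-order time cost of an S6 scan: O(L * d_model)."""
    return seq_len * d_model

def kv_cache_entries(seq_len, d_model):
    """Keys and values stored per layer grow with the sequence length."""
    return 2 * seq_len * d_model

def s6_state_entries(d_model):
    """The recurrent state size does not depend on the sequence length."""
    return d_model

d_model = 256  # arbitrary example width
attention_growth = attention_cost(2000, d_model) / attention_cost(400, d_model)
s6_growth = s6_cost(2000, d_model) / s6_cost(400, d_model)
```

Extending the context from 400 to 2,000 tokens multiplies the attention term by 25 but the S6 term by only 5, and the S6 state stays the same size while the key-value cache grows fivefold.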

5. Applications: Algorithm Distillation, Reinforcement Learning, and Vision CL

In in-context RL, Mamba-CL utilizes large-scale offline trajectories from SOTA RL agents (e.g., PPO, SAC, DroQ) across multiple continuous-control tasks. The network autoregressively predicts next actions given all past transitions, allowing test-time adaptation to unseen tasks through in-sequence contextualization alone—no online gradient steps are taken during evaluation (Beaussant et al., 16 Jun 2025). In vision CL, a pretrained SSM backbone is fine-tuned on sequential class-incremental tasks with feature-consistency constraints, providing state-of-the-art anti-forgetting on benchmarks such as 10-/20-split ImageNet-R, CIFAR-100, and DomainNet (Cheng et al., 2024).
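The evaluation protocol for in-context RL can be illustrated with a toy loop. Everything here is a hypothetical stand-in: `FrozenRecurrentPolicy` uses random weights and a simple gated state in place of a pretrained S6 stack, and the environment dynamics and reward are invented for the example, so the returns carry no meaning. The point is only that adaptation happens through the recurrent state while the weights stay fixed.

```python
import numpy as np

class FrozenRecurrentPolicy:
    """Toy stand-in for a pretrained sequence model; weights are never updated."""

    def __init__(self, state_dim, action_dim, d_model, seed=0):
        rng = np.random.default_rng(seed)
        in_dim = 2 * state_dim + action_dim + 1   # (s, a, r, s') concatenated
        self.W_embed = rng.normal(scale=0.1, size=(in_dim, d_model))
        self.W_gate = rng.normal(scale=0.1, size=(d_model, d_model))
        self.W_act = rng.normal(scale=0.1, size=(d_model + state_dim, action_dim))
        self.h = np.zeros(d_model)                # fixed-size context summary

    def act(self, state):
        """Predict the next action from the context state and current state."""
        return np.tanh(np.concatenate([self.h, state]) @ self.W_act)

    def observe(self, s, a, r, s_next):
        """Fold one transition into the recurrent state (no gradient step)."""
        z = np.concatenate([s, a, [r], s_next]) @ self.W_embed
        gate = 1.0 / (1.0 + np.exp(-(z @ self.W_gate)))
        self.h = gate * self.h + (1.0 - gate) * z

def evaluate_in_context(policy, episodes=3, horizon=5, state_dim=2, seed=1):
    """Roll out several episodes; the state h persists across episodes."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(episodes):
        s = rng.normal(size=state_dim)
        total = 0.0
        for _ in range(horizon):
            a = policy.act(s)
            s_next = s + 0.1 * a                  # toy dynamics (assumption)
            r = -float(np.sum(s_next ** 2))       # toy reward (assumption)
            policy.observe(s, a, r, s_next)
            s, total = s_next, total + r
        returns.append(total)
    return returns

policy = FrozenRecurrentPolicy(state_dim=2, action_dim=2, d_model=8)
weights_before = [w.copy() for w in (policy.W_embed, policy.W_gate, policy.W_act)]
returns = evaluate_in_context(policy)
weights_after = [policy.W_embed, policy.W_gate, policy.W_act]
```

The weights are identical before and after evaluation; only `policy.h` changes, which mirrors the claim that test-time adaptation relies on in-sequence context alone.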

Empirical performance improvements over Decision Transformer and rehearsal-free CL baselines are consistent, with Mamba-CL showing better asymptotic adaptation, less catastrophic forgetting (e.g., accuracy up from 63.7% to 81.7% on 10-split ImageNet-R), and higher sample efficiency across both RL and vision domains (Beaussant et al., 16 Jun 2025, Zhao et al., 2024, Cheng et al., 2024).

6. Hyperparameters, Ablations, and Robustness

Critical hyperparameters include the S6 state size (set separately for RL, and up to 2048 for vision), the convolutional kernel size for gates, the meta-batch size, the selectivity regularization weight, learning rates (Adam optimizer), and the balance factor for null-space projection. Experiments confirm that S6 state sizes above 128 yield diminishing returns, while selectivity regularization is stable over a broad range of its weight (Zhao et al., 2024). Mamba-CL is reported to be robust to input noise, unlike Transformers, which fail under high-noise streams, and to perform well under large domain shifts and across long or short meta-test episodes.

Ablation studies show that shorter context lengths (L ≪ trajectory length) sharply degrade performance in complex RL tasks, confirming the need for long in-context horizons. Replacing S6 blocks with standard self-attention degrades both performance and feasible sequence length, underscoring the advantage of SSMs for these regimes (Beaussant et al., 16 Jun 2025).

7. Limitations and Future Prospects

Mamba-CL requires access to extended, high-quality learning trajectories for pretraining, which may be prohibitively expensive in resource-constrained or real-world systems. While substantially advancing context length in RL and vision CL, most benchmarks still involve modest sequence horizons (≤200 for RL tasks). Understanding performance on extreme long-horizon tasks (≫10,000 steps), investigating data mix strategies (e.g., expert versus diverse rollouts), and fusing SSMs with architectures such as diffusion-based predictors, prompt-tuned modules, or MoE (Mixture-of-Experts) S6 variants represent active directions for future research (Beaussant et al., 16 Jun 2025, Zhao et al., 2024).

Overall, the Mamba-CL family demonstrates that SSM-based architectures, when equipped with CL-oriented regularization and update constraints, can deliver state-of-the-art continual adaptation accuracy, robust anti-forgetting, linear sequence scalability, and practical efficiency across continual, meta-continual, and in-context learning in multiple domains.