Routing Mamba (RoM)
- Routing Mamba (RoM) is an advanced SSM extension that integrates a sparse Mixture-of-Experts mechanism with a shared router for enhanced expressivity and efficiency.
- It employs a unified router to select the top-K experts across all projection layers, minimizing computational overhead and preventing learning instability.
- Empirical results reveal that RoM matches dense model performance in language modeling while significantly lowering active parameter costs and inference time.
Routing Mamba (RoM) is an advanced architectural extension of linear State Space Models (SSMs), designed to expand model expressivity and computational efficiency for large-scale and long-range sequence modeling tasks. By integrating sparse Mixture of Experts (MoE) mechanisms into the Mamba SSM framework, RoM enables robust scaling of active parameter counts while maintaining constant inference-time computational and memory complexity. RoM resolves longstanding challenges in efficiently combining MoE with SSMs, achieving superior language modeling performance at a fraction of the active parameter cost and computational overhead compared to dense SSM baselines.
1. Foundations: State Space Models, Mamba, and the Emergence of Routing
Classical SSMs model sequences using recurrent updates with time-invariant, linear dynamics. Mamba extends this paradigm by introducing input-dependent gating (sometimes termed content-aware “routing”), enabling each token’s parameters—state updates, projections, and gating functions—to adapt to the data. This selective mechanism allows SSMs like Mamba to achieve Transformer-like modeling power for long dependencies while retaining linear scalability with respect to sequence length (2408.01129). However, while Mamba’s expressive power is significant, scaling SSMs to the regime of billions of parameters remains challenging: naïvely duplicating Mamba’s dense projections to build larger models leads to inefficiencies in both learning and computation.
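As a schematic contrast (simplified, with discretization details omitted), a classical SSM applies a fixed linear recurrence, while Mamba's selective variant lets the discretized parameters depend on the current token:

$$h_t = \bar{A}\, h_{t-1} + \bar{B}\, x_t, \qquad y_t = C\, h_t \quad \text{(time-invariant SSM)}$$

$$h_t = \bar{A}(x_t)\, h_{t-1} + \bar{B}(x_t)\, x_t, \qquad y_t = C(x_t)\, h_t \quad \text{(selective, input-dependent SSM)}$$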
RoM addresses these limitations by integrating sparse Mixture of Experts (MoE) directly into the SSM architecture. This adaptation leverages a shared routing strategy across the key Mamba projection submodules, ensuring expert activation remains efficient and effective.
2. Architectural Design and Sparse Expert Integration
RoM is constructed atop the canonical Mamba block, which consists of sequential projection layers: a convolutional projection (Conv Proj), a nonlinearity-gated projection (Gate Proj), and a standard output projection (Out Proj), all interfacing with a state-space operator. In RoM, each of these projections is replaced by a collection of expert projections, and a token-dependent router determines which experts to activate at each time step.
For an input token $x_t$ at time step $t$:
- A router computes a softmax over projected token features using learnable routing weights $W_r$.
- The router selects the top-K experts (e.g., via a TopK operation or similar sparse selection).
- Only the weights and computations for the selected top-K experts in Conv Proj, Gate Proj, and Out Proj are used for that token.
Formally, with router scores $R(x_t) = \operatorname{softmax}(x_t W_r)$ and a top-K mask $m_i(x_t) \in \{0, 1\}$, the routing and expert mechanism for the Gate projection is expressed as:

$$\operatorname{GateProj}(x_t) = \sum_{i=1}^{E} m_i(x_t)\, R_i(x_t)\, x_t W_{\text{gate}}^{(i)},$$

where $E$ is the total number of experts and $W_{\text{gate}}^{(i)}$ is the weight matrix for expert $i$ in the gating projection.
This expertization is consistently applied across the involved projections, with a shared router ensuring the same routing decision synchronizes across the Mamba block. The output for each time step is an expert-weighted sum:

$$y_t = \sum_{i=1}^{E} m_i(x_t)\, R_i(x_t)\, f_i(x_t),$$

where $f_i(\cdot)$ denotes the expert-specific output operation and $m_i(x_t)$ is the routing mask, equal to 1 for the top-K experts selected by the shared router and 0 otherwise.
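The following is a minimal PyTorch sketch of this shared-routing expertization. The class name `SharedRouterMoEProjections`, the tensor shapes, the dense gather of expert weights, and the omission of the conv1d and selective-scan steps are simplifying assumptions for illustration, not the reference implementation from the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedRouterMoEProjections(nn.Module):
    """Expertized Conv/Gate/Out projections driven by one shared router (sketch)."""

    def __init__(self, d_model: int, d_inner: int, num_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)  # shared W_r
        # One weight bank per projection family; expert i supplies one slice.
        self.conv_proj = nn.Parameter(torch.randn(num_experts, d_model, d_inner) * 0.02)
        self.gate_proj = nn.Parameter(torch.randn(num_experts, d_model, d_inner) * 0.02)
        self.out_proj  = nn.Parameter(torch.randn(num_experts, d_inner, d_model) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        probs = F.softmax(self.router(x), dim=-1)                 # (B, T, E)
        topk_p, topk_i = probs.topk(self.top_k, dim=-1)           # (B, T, K)
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)        # renormalize over active experts

        def mix(bank: torch.Tensor, inp: torch.Tensor) -> torch.Tensor:
            # bank: (E, d_in, d_out); inp: (B, T, d_in)
            w = bank[topk_i]                                      # gather selected experts: (B, T, K, d_in, d_out)
            y = torch.einsum('bti,btkio->btko', inp, w)           # per-expert projections
            return (topk_p.unsqueeze(-1) * y).sum(dim=-2)         # routing-weighted sum

        conv_in = mix(self.conv_proj, x)      # would feed the conv1d + selective scan (omitted here)
        gate    = mix(self.gate_proj, x)      # gating branch
        hidden  = conv_in * F.silu(gate)      # gated combination, as in a Mamba block
        return mix(self.out_proj, hidden)     # project back to d_model

# Quick shape check with illustrative dimensions.
block = SharedRouterMoEProjections(d_model=256, d_inner=512, num_experts=8, top_k=2)
print(block(torch.randn(2, 16, 256)).shape)   # torch.Size([2, 16, 256])
```

Because `topk_i` is computed once and reused by all three `mix` calls, the Conv, Gate, and Out projections of a given token always land on the same experts, which is the shared-routing property described above.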
3. Shared Routing: Coherence and Efficiency
A salient feature of RoM is its shared routing mechanism, in contrast to assigning independent routers for each projection layer. Experimental evidence shows that independent routing introduces conflicts and learning instability in the context of interdependent projections within Mamba. RoM’s single-router design guarantees routing decisions are applied synchronously to all sparse expert projections per token. This unified approach reduces overhead and fosters parameter synergy, closely paralleling the routing behavior in transformer feedforward-MoE systems.
The selection process, for each token, thus involves:
- A router computing a distribution over experts,
- Masking so that only the selected top-K experts process the token,
- Reusing the decision for all projections, enabling computational and implementation efficiencies.
This yields significant compute savings, as only a subset of the expert parameters is active per forward pass; empirical analysis shows up to 23% FLOP savings relative to dense Mamba models of comparable performance (2506.18145).
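As a rough illustration of where the savings come from, the toy accounting below compares active versus total projection parameters under top-K routing; all dimensions are hypothetical and are not the configuration measured in the paper.

```python
# Toy accounting of active vs. total projection parameters under top-K routing.
# All dimensions are illustrative assumptions, chosen only to make the ratio concrete.
d_model, d_inner = 2048, 4096
num_experts, top_k = 8, 2

per_expert = (
    d_model * d_inner      # Conv Proj expert
    + d_model * d_inner    # Gate Proj expert
    + d_inner * d_model    # Out Proj expert
)
router = d_model * num_experts  # shared router weights

total_params  = num_experts * per_expert + router
active_params = top_k * per_expert + router   # only top_k experts run per token

print(f"total:  {total_params / 1e6:.1f}M projection parameters")
print(f"active: {active_params / 1e6:.1f}M projection parameters "
      f"({active_params / total_params:.0%} of total per token)")
```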
4. Empirical Performance and Scaling Laws
RoM demonstrates strong empirical gains in both efficiency and modeling performance. For language modeling benchmarks on SlimPajama and other datasets:
- RoM with 1.3B active parameters (10B total, with only the top-K experts per token active) matches the perplexity of a dense Mamba with over 2.3× more active parameters.
- RoM maintains context-length robustness, exhibiting consistent perplexity when evaluated on sequences longer than those observed during training, which is important for tasks involving extended documents or code generation.
- In hybrid architectures such as Samba (combining SSM and attention blocks), RoM maintains its efficiency advantage, with decreased active parameter cost and inference-time requirements.
Empirical results support the conclusion that the RoM approach enables scalable SSMs, achieving the benefits of MoE familiar from large transformer models, but without the substantial overhead that previously afflicted naïve combinations of SSMs and MoE (2506.18145).
5. Applications Across Sequence Modeling Domains
RoM’s approach to scalable expert routing within SSMs unlocks new capabilities for a range of long-context sequence modeling use cases:
- Language Modeling: RoM enables the construction of large, efficient generative models, suited to understanding and producing very long documents, code, or conversational contexts.
- Hybrid Architectures: RoM’s expert routing can be combined with other scalable MoE mechanisms, as in systems mixing attention (transformer) and SSM (Mamba) modules, yielding compound efficiency and performance benefits.
- Long-sequence Domains: The efficiency and scalability inherent to RoM are anticipated to benefit time series analysis, video modeling, and bioinformatics, particularly scenarios demanding both model expressivity and low inference cost.
A plausible implication is that such a design could accelerate the adoption of SSMs for real-world, resource-constrained, or latency-sensitive deployments.
6. Comparison with Prior SSMs and MoE Integrations
RoM was motivated by negative outcomes from naïve application of MoE strategies to SSM projection layers. When each projection (Conv, Gate, Out) had a separate router, learning dynamics degraded, and model performance suffered. RoM’s design corrects this by enforcing shared routing, which preserves the functional interdependence required in the SSM context.
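A small schematic (with illustrative names and shapes, not the paper's ablation code) makes the conflict concrete: with independent routers, the Conv, Gate, and Out branches of the same token are typically assigned to different experts, whereas a single shared router keeps them aligned.

```python
import torch
import torch.nn as nn

d_model, num_experts, top_k = 64, 8, 2
x = torch.randn(1, 4, d_model)  # (batch, seq_len, d_model)

# Independent routing: one router per projection family; the selected expert
# indices for a given token generally disagree across the three projections.
routers = {name: nn.Linear(d_model, num_experts, bias=False)
           for name in ("conv", "gate", "out")}
independent = {name: r(x).topk(top_k, dim=-1).indices for name, r in routers.items()}
print({name: idx[0, 0].tolist() for name, idx in independent.items()})

# Shared routing (RoM-style): one decision reused everywhere, so Conv/Gate/Out
# always see the same experts for a given token.
shared = nn.Linear(d_model, num_experts, bias=False)(x).topk(top_k, dim=-1).indices
print({name: shared[0, 0].tolist() for name in ("conv", "gate", "out")})
```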
A summary comparison is provided in the following table:
| Model | Active Parameters | Perplexity (lower is better) | Relative FLOPs |
|---|---|---|---|
| Dense Mamba | High | Baseline | Baseline |
| Naive SSM+MoE | High | Degraded | High |
| RoM | Low | Matches dense Mamba | 0.77× |
This demonstrates that RoM achieves performance parity with dense models at much lower computational cost, where earlier MoE-SSM attempts failed.
7. Limitations and Future Directions
Despite its strengths, RoM exhibits several open challenges:
- Expert Configuration Tuning: Further research on optimal expert groupings, expert count, and routing sparsity will be required to maximize efficiency and capacity without negative side effects.
- Wider Applicability: Extending RoM's shared routing mechanism to self-attention or hybrid models remains prospective; the approach may carry over, but further empirical validation will be necessary.
- System-level Optimization: As RoM’s efficiency hinges on the precise implementation of sparse routing and expert parameterization, ongoing systems research will determine its practical deployment suitability in varying hardware environments.
Future investigations suggested in (2506.18145) include broader exploration of shared routing concepts in alternate architectures and more systematic integration with multimodal and cross-domain sequence modeling frameworks.
Routing Mamba represents a significant progression in the integration of sparse expert architectures with linear state space models, resolving critical bottlenecks in SSM scalability. By channeling the routing paradigm in a coherent, shared manner across interdependent projections, RoM broadens the practical feasibility of SSMs for high-capacity, efficient long-range sequence modeling.