MossNet: Dual MoE State-Space Models

Updated 18 April 2026
  • MossNet is a novel architectural paradigm that employs dual MoE mechanisms within a recurrent state-space model to replicate linear multi-head attention.
  • It integrates channel-wise MLP-MoE and time-wise SSM-MoE routing to boost model expressiveness and enable efficient long-range dependency learning.
  • Empirical evaluations demonstrate MossNet's scalability, hardware adaptability, and competitive performance, with constant memory usage in long-context scenarios.

MossNet refers to a recent architectural paradigm in LLMs that implements a dual Mixture-of-Experts (MoE) mechanism within a recurrent state-space modeling (SSM) framework. MossNet is designed to directly emulate linear multi-head attention (MHA) via parallel channel-wise and time-wise expert routing, offering a balance of expressiveness, efficiency, and hardware adaptability for language modeling. This approach addresses longstanding limitations of conventional SSM and gated-recurrent models, which typically approximate only a single attention head and thus restrict long-range contextual modeling. MossNet was introduced in "MossNet: Mixture of State-Space Experts is a Multi-Head Attention" (Tuli et al., 30 Oct 2025).

1. Theoretical Motivation and SSM-MoE Equivalence to MHA

Traditional SSM-based models (such as Mamba) formalize sequence modeling using discretized ODEs of the form:

$$x_{t} = \overline{A}_{t} x_{t-1} + \overline{B}_{t} u_{t}, \quad y_{t} = C_{t} x_{t} + D_{t} u_{t}$$

where $x_{t}$ is the latent state, $u_{t}$ is the input, $y_{t}$ is the output, and $A, B, C, D$ are parameter matrices (or functions). In prior art, SSM variants make these parameters input-dependent using lightweight gating networks, but they typically employ a single implicit attention head over the entire sequence.
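The recurrence above can be sketched in a few lines of plain Python. This is a minimal illustration, not the paper's implementation: it assumes a scalar input channel, a diagonal transition (so each state coordinate is updated independently), and a zero initial state. The function name `ssm_scan` and its argument layout are hypothetical.

```python
from typing import List, Sequence


def ssm_scan(
    a_bar: Sequence[Sequence[float]],  # diagonal of A-bar_t per step, shape (T, N)
    b_bar: Sequence[Sequence[float]],  # B-bar_t per step, shape (T, N)
    c: Sequence[Sequence[float]],      # C_t per step, shape (T, N)
    d: Sequence[float],                # scalar D_t per step, shape (T,)
    u: Sequence[float],                # scalar input sequence, shape (T,)
) -> List[float]:
    """Run x_t = A_t * x_{t-1} + B_t * u_t and y_t = <C_t, x_t> + D_t * u_t."""
    n = len(a_bar[0])
    x = [0.0] * n  # assumption: the initial state x_0 is zero
    ys = []
    for t in range(len(u)):
        # State update, one coordinate at a time (diagonal transition).
        x = [a_bar[t][j] * x[j] + b_bar[t][j] * u[t] for j in range(n)]
        # Readout plus the direct input-to-output term.
        ys.append(sum(c[t][j] * x[j] for j in range(n)) + d[t] * u[t])
    return ys


# Impulse response of a 2-dimensional state with constant parameters.
T = 3
ys = ssm_scan(
    a_bar=[[0.5, 0.5]] * T,
    b_bar=[[1.0, 2.0]] * T,
    c=[[1.0, 1.0]] * T,
    d=[0.0] * T,
    u=[1.0, 0.0, 0.0],
)
```

The state has a fixed size regardless of how many steps are processed, which is the property the later memory discussion relies on.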

MossNet augments this by integrating two MoE mechanisms:

  1. Channel-mixing (MLP-MoE): Multiple MLP experts are routed per token to increase model capacity and representational diversity in the feed-forward (channel mixing) path.
  2. Time-mixing (SSM-MoE): Multiple SSM parameter sets ("experts") are dynamically routed per token during state updates to instantiate multiple, independent "attention heads" in the temporal domain.

The crucial result demonstrated in (Tuli et al., 30 Oct 2025) is that this Mixture-of-Experts parameterization of SSM kernels mathematically recovers linear multi-head attention, as the unrolled SSM with MoE gating yields a double sum analogous to a multi-query, multi-head attention operator:

$$y_t = \sum_{m,n} \sum_{i=1}^{t} \langle q_t^m, k_i^n \rangle v_i$$

with $q_t^m$ and $k_i^n$ forming query/key projections over routing-induced expert heads, and values $v_i$ corresponding to the source input. Thus, MossNet equips the SSM with a trainable, efficient surrogate for linear MHA.
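The double-sum form can be checked numerically on a toy case. The sketch below is an illustration under simplifying assumptions that are not taken from the paper: the state transition is the identity (no decay), the skip term is zero, the input is a scalar channel, and the gate weights are arbitrary real numbers rather than the output of a trained router. Under those assumptions, mixing expert kernels for the input and output projections and then unrolling the recurrence gives the same outputs as summing attention scores over every pair of expert heads. All sizes and names are hypothetical.

```python
import random
from typing import List

random.seed(0)
T, N, M_C, N_B = 5, 4, 3, 2  # steps, state size, output-side experts, input-side experts


def rand_vec(n: int) -> List[float]:
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


u = rand_vec(T)                                # scalar inputs, playing the role of v_i
c_experts = [rand_vec(N) for _ in range(M_C)]  # expert kernels for the output projection
b_experts = [rand_vec(N) for _ in range(N_B)]  # expert kernels for the input projection
g = [rand_vec(M_C) for _ in range(T)]          # per-token gate weights, output side
h = [rand_vec(N_B) for _ in range(T)]          # per-token gate weights, input side

# Recurrent form: x_t = x_{t-1} + B_t u_t, y_t = <C_t, x_t>, with mixed B_t and C_t.
x = [0.0] * N
y_recurrent = []
for t in range(T):
    b_t = [sum(h[t][n] * b_experts[n][j] for n in range(N_B)) for j in range(N)]
    c_t = [sum(g[t][m] * c_experts[m][j] for m in range(M_C)) for j in range(N)]
    x = [x[j] + b_t[j] * u[t] for j in range(N)]
    y_recurrent.append(dot(c_t, x))

# Attention form: y_t = sum over (m, n) and i <= t of <q_t^m, k_i^n> v_i.
y_attention = []
for t in range(T):
    total = 0.0
    for m in range(M_C):
        q = [g[t][m] * cj for cj in c_experts[m]]          # query of head m at step t
        for n in range(N_B):
            for i in range(t + 1):
                k = [h[i][n] * bj for bj in b_experts[n]]  # key of head n at step i
                total += dot(q, k) * u[i]
    y_attention.append(total)

max_gap = max(abs(a - b) for a, b in zip(y_recurrent, y_attention))
```

The two outputs agree up to floating-point rounding. The recurrent form costs a fixed amount of work per step, while the explicit double sum grows with the number of past tokens.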

2. Architectural Design

MossNet incorporates dual MoE paths within a Mamba-like SSM backbone. Each block consists of:

  • MLP-MoE sublayer: The standard two-layer MLP is replaced by a top-$k$ MoE. For each token, the softmax router activates $k$ experts, and their outputs are linearly combined according to the expert-wise probabilities. A load-balancing loss, weighted by a coefficient, regularizes the router.
  • SSM-MoE sublayer: Two of the SSM parameter functions (and optionally a third) are implemented as weighted sums over their own sets of independent expert kernels. Gating is again based on the input, and only a small number of experts is active per token.
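The routing step shared by both sublayers can be sketched as follows. This is a generic top-k softmax router, not MossNet's exact implementation: renormalizing the selected weights so that they sum to one is an assumption, and `top_k_route`, `moe_forward`, and the toy experts are hypothetical names introduced for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(logits: Sequence[float], k: int) -> List[Tuple[int, float]]:
    """Return (expert index, weight) pairs for the k most probable experts."""
    probs = softmax(logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in chosen)
    # Assumption: weights of the selected experts are renormalized to sum to 1.
    return [(i, probs[i] / mass) for i in chosen]


def moe_forward(
    x: Vector,
    experts: Sequence[Callable[[Vector], Vector]],
    router_logits: Sequence[float],
    k: int,
) -> Vector:
    """Weighted sum of the outputs of the k selected experts for one token."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Hypothetical toy experts: expert e scales its input by (e + 1).
experts = [lambda v, s=e + 1: [s * vi for vi in v] for e in range(4)]
logits = [0.1, 2.0, -1.0, 1.5]
routes = top_k_route(logits, k=2)
token_out = moe_forward([1.0, -1.0], experts, logits, k=2)
dense_routes = top_k_route(logits, k=len(logits))  # k = number of experts gives dense mode
```

Setting `k` equal to the number of experts recovers a dense mixture, which gives a single knob for trading compute against capacity. For the SSM-MoE path the same weights would mix parameter kernels rather than MLP outputs.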

Key architectural hyperparameters across reported variants are:

  • Number of SSM experts per block (typ. 8)
  • Top-$k$ routing with a small $k$ (a top-3 mode is reported for the large variant)
  • Hidden width, number of heads, number of layers (e.g., 128–1024 hidden, 2–16 heads, 16–30 layers)
  • Sparse activation: only the top-$k$ experts per token are active, preserving computational efficiency

In all designs, MossNet provides standard dense and top-$k$ MoE operation modes, allowing models to trade off expressiveness with computational footprint.

3. Training Regimes and Scaling

Empirical results are provided for MossNet models of various scales:

  • MossNet-8×8M: 19.7M total/9.9M active params, 16 layers, 2 heads, Cosmopedia (22B tokens)
  • MossNet-8×20M: 63.9M total/26.1M active params, 4 heads
  • MossNet-8×66M: 325.9M total/102.9M active params, 8 heads
  • MossNet-8×200M⁺: 1.5B total/0.5–0.7B active params, 16 heads (large variant), 2.8T tokens
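As a quick arithmetic check on the figures above, the share of parameters that is active per token can be computed directly from the listed totals. The snippet uses only the numbers quoted in the list. The large variant is left out because its active count is given as a range.

```python
# (total, active) parameter counts in millions, as listed above.
variants = {
    "MossNet-8×8M": (19.7, 9.9),
    "MossNet-8×20M": (63.9, 26.1),
    "MossNet-8×66M": (325.9, 102.9),
}

# Fraction of all parameters that is used for any single token.
active_fraction = {name: active / total for name, (total, active) in variants.items()}

for name, frac in active_fraction.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

The active share falls from about half for the smallest model to under a third for the 8×66M model, so a larger portion of capacity sits in experts that are skipped for any given token as the models grow.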

All models are trained with the AdamW optimizer, using 1% warmup, cosine decay to 10% of the peak learning rate, and an auxiliary load-balancing loss for the MoE routers. Context lengths of up to 32,000 tokens are supported, with linear memory and runtime scaling guaranteed by the SSM formulation.
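A schedule with 1% warmup and cosine decay to 10% of the peak can be sketched as below. The linear shape of the warmup, the peak learning rate, and the step count are assumptions chosen for illustration, and `lr_at` is a hypothetical helper rather than part of any released training code.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float,
          warmup_frac: float = 0.01, final_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to final_frac * peak_lr."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        # Assumption: warmup is linear in the step index.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    progress = min(1.0, progress)
    floor = final_frac * peak_lr
    # Cosine interpolation from peak_lr (progress = 0) down to floor (progress = 1).
    return floor + 0.5 * (peak_lr - floor) * (1.0 + math.cos(math.pi * progress))


# peak_lr and total_steps are placeholders, not values from the source.
schedule = [lr_at(s, total_steps=10_000, peak_lr=1e-3) for s in range(10_001)]
```

With these placeholder values the rate climbs for the first 100 steps, peaks at `1e-3`, and ends at `1e-4`.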

4. Experimental Outcomes and Benchmark Results

Language Modeling

MossNet models are reported to outperform both pure SSMs and Transformer/MoE architectures of similar size. Representative results:

  • MossNet-8×8M: Perplexity 13.1 on the Cosmopedia holdout set, versus 13.4 (Mixtral-8×8M) and 13.5 (Mamba2)
  • MossNet-8×66M: MMLU (5-shot avg.) 20.1%, versus 14.7% (MoE-Mamba) and ≤16% (other ~60M baselines)

Commonsense and Downstream Tasks

For downstream zero-shot performance on tasks such as ARC, BoolQ, HellaSwag, PIQA, WinoGrande:

  • MossNet-8×8M achieves 37.1% average (vs 36.4% Mixtral-8×8M, 36.2% Mamba-8M)
  • Large variant (MossNet-8×200M⁺) outperforms Qwen2.5-0.5B by 5.8 points average on a 7-task suite (53.5% vs 47.7%), and in top-3 MoE mode (∼700M) scores 55.4% vs. Mamba-790M’s 43.8%

Device Profiling

Real-device tests on NVIDIA A100 (FP16, FlashAttention 2) and Samsung Galaxy S24 Ultra (CPU, Q8) reveal:

  • MossNet-8×200M⁺ memory footprint: ∼8.4GB at 32K tokens (A100), ∼1.6GB at 32K tokens (mobile)
  • Prefill/generation throughput competitive with or better than Llama3, Mamba
  • Constant memory and high throughput are maintained as sequence length grows, affirming suitability for long-context and on-device inference
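The constant-memory claim can be made concrete with a back-of-the-envelope comparison between an attention KV cache, which stores keys and values for every past token, and a recurrent SSM state, which has a fixed size. Every dimension below is an illustrative placeholder rather than MossNet's actual configuration, and the estimate ignores weights and activations.

```python
def kv_cache_bytes(seq_len: int, layers: int, kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Keys and values cached for every past token in every layer."""
    return 2 * seq_len * layers * kv_heads * head_dim * bytes_per_value


def ssm_state_bytes(layers: int, channels: int, state_dim: int,
                    bytes_per_value: int = 2) -> int:
    """One fixed-size recurrent state per layer, independent of sequence length."""
    return layers * channels * state_dim * bytes_per_value


# All dimensions are hypothetical; bytes_per_value=2 corresponds to FP16.
for seq_len in (1_000, 8_000, 32_000):
    kv = kv_cache_bytes(seq_len, layers=24, kv_heads=16, head_dim=64)
    ssm = ssm_state_bytes(layers=24, channels=1024, state_dim=16)
    print(f"{seq_len:>6} tokens: KV cache {kv / 2**20:8.1f} MiB, "
          f"SSM state {ssm / 2**20:6.2f} MiB")
```

The cache grows linearly with the number of tokens while the recurrent state stays the same size, which is why generation memory for an SSM does not depend on context length.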

5. Strengths, Limitations, and Distinctive Properties

Advantages

  • Expressiveness: Multi-head routing enables modeling of richer temporal phenomena than single-head SSMs, overcoming bottlenecks in capturing long-range or parallel dependencies.
  • Hardware Adaptability: Linear scaling in both runtime and memory; sparse MoE routing means only the top-$k$ experts are evaluated, maintaining practical efficiency across diverse devices.
  • Scalability: Shown to scale from ∼10M to ∼1.5B total parameters (∼500–700M active), maintaining empirical gains across size regimes.
  • Resource Stability: Fixed memory and graceful throughput degradation for long sequences, critical for both cloud and mobile/edge LLM deployment.

Limitations

  • Increased architectural and implementation complexity due to routing in both channel and temporal (SSM) domains
  • MoE efficiency gains saturate as batch heterogeneity increases (i.e., diminishing returns for highly diverse server workloads)
  • Evaluation limited to text-only tasks; no results reported for multimodal or RL extensions
  • Profiling is hardware-specific (NVIDIA A100 and Samsung Galaxy S24 Ultra); generality to other chipsets is not established

6. Comparison to Prior Models and Future Directions

MossNet's theoretical advance is the explicit recovery of linear multi-head attention within the provably efficient SSM context, a property not possessed by prior SSM or gated recurrent memory models. Unlike conventional Transformer-based MoE, which typically applies gating only in the channel-mixing path and incurs dense compute in attention, MossNet distributes expert capacity across both feed-forward and temporal modeling layers.

Research directions highlighted include:

  • MoE on additional SSM parameters for finer temporal control
  • Dense or adaptive multi-query/multi-key patterns
  • Cross-token or grouped-query router mechanisms
  • Specialist pruning and hardware-aware MoE routing
  • Extension to non-text domains (multimodal, RL)

A plausible implication is that this architectural template generalizes beyond language to any sequential modeling domain where hardware efficiency and rich attention-style modeling are required (Tuli et al., 30 Oct 2025).
