Linear-MoE: Linear Sequence Modeling Meets Mixture-of-Experts (2503.05447v2)

Published 7 Mar 2025 in cs.LG, cs.AI, cs.CL, and cs.DC

Abstract: Linear Sequence Modeling (LSM) approaches such as linear attention, state space models, and linear RNNs, together with Mixture-of-Experts (MoE), have recently emerged as significant architectural improvements. In this paper, we introduce Linear-MoE, a production-level system for modeling and training large-scale models that integrate LSM with MoE. Linear-MoE leverages the advantages of both LSM modules for linear-complexity sequence modeling and MoE layers for sparse activation, aiming to offer high performance with efficient training. The Linear-MoE system comprises: 1) a Modeling subsystem, which provides a unified framework supporting all instances of LSM, and 2) a Training subsystem, which facilitates efficient training by incorporating various advanced parallelism technologies, particularly Sequence Parallelism designed for Linear-MoE models. Additionally, we explore hybrid models that combine Linear-MoE layers with standard Transformer-MoE layers, together with their Sequence Parallelism, to further enhance model flexibility and performance. Evaluations on two model series, A0.3B-2B and A1B-7B, demonstrate that Linear-MoE achieves efficiency gains while maintaining competitive performance on various benchmarks, showcasing its potential as a next-generation foundational model architecture. Code: https://github.com/OpenSparseLLMs/Linear-MoE.

The paper introduces Linear-MoE, a system designed for efficient modeling and training of large-scale Mixture-of-Experts (MoE) models integrated with Linear Sequence Modeling (LSM) modules. The system comprises a Modeling subsystem, which supports various LSM methods under a unified framework, and a Training subsystem, which facilitates efficient training through advanced parallelism technologies, including Sequence Parallelism (SP) tailored for Linear-MoE models. The paper explores hybrid models combining Linear-MoE layers with standard Transformer-MoE layers to enhance model flexibility and performance.

The key aspects and contributions include:

  • Linear-MoE System Overview: The Linear-MoE system integrates LSM with MoE to achieve high performance with efficient training, focusing on overcoming the quadratic computational complexity associated with traditional softmax self-attention in Transformers [Vaswani2017attention]. The system is composed of two main subsystems: Modeling and Training.
  • Modeling Subsystem: This subsystem provides a unified framework that supports various LSM methods, including linear attention, State Space Models (SSM), and linear Recurrent Neural Networks (RNN). Multiple instances of each type are implemented under a unified formulation:

    $$\widehat{\mathbf{M}}_s = f(\mathbf{k}_s^{\top}, \mathbf{v}_s),$$

    $$\mathbf{M}_{s} = \mathbf{\Theta}_s \diamond \mathbf{M}_{s-1} + \widehat{\mathbf{M}}_s$$

    where:

    • $\mathbf{M}_s$ represents the memory state at the $s$-th token
    • $\widehat{\mathbf{M}}_s$ is the incremental memory update
    • $\mathbf{k}_s$ and $\mathbf{v}_s$ are the key and value vectors, respectively
    • $\mathbf{\Theta}_s$ is a coefficient matrix
    • $\diamond$ denotes either standard matrix multiplication or the Hadamard product

    Specific instances of LSM methods include basic linear attention (BLA), Lightning Attention, RetNet, GLA, DeltaNet, Rebased, GFW, GateLoop, Gated DeltaNet, TTT, Titans, S4, Mamba, Mamba2, HGRN2, RWKV6, and RWKV7. A minimal sketch of the unified recurrence is given after this list.

  • Linear-MoE Architecture: The architecture consists of $L$ stacked Linear-MoE blocks, each including an LSM layer and an MoE layer, with a normalization layer preceding each. The LSM layer supports linear attention, SSM, and linear RNN instances. The MoE layers use standard mechanisms of sparse expert activation and routing. Hybrid architectures combine Linear-MoE layers with standard Transformer-MoE layers to improve performance on recall-intensive tasks; a block-level sketch is given after this list.
  • Training Subsystem: The Training subsystem incorporates SP for LSM modules to handle long input sequences efficiently. It also supports hybrid models with distinct computational and communication strategies tailored to different layer types. The SP for LSM modules involves a single collective communication operation on the memory state $\mathbf{M}_s \in \mathbb{R}^{d \times d}$. For standard attention modules, an all-gather-based strategy is adopted, performing all-gather communication for the $\mathbf{K}_s$ and $\mathbf{V}_s$ tensors across devices; a simplified communication sketch is given after this list.
  • Parallelism Techniques: The system supports Tensor Parallelism (TP), Pipeline Parallelism (PP), and Expert Parallelism (EP) tailored for Linear-MoE models. These techniques can be combined with Data Parallelism (DP) and SP to enhance flexibility and scalability.
  • Variable Length Handling: The system simplifies handling of variable-length sequences by processing the entire batch as one continuous long sequence, effectively managing varying sequence lengths without padding; a packing sketch is given after this list.
  • Implementation Details: The Linear-MoE system is implemented based on Megatron-Core, an open-source library developed on PyTorch. It is compatible with NVIDIA Tensor Core GPUs and supports FP8 acceleration on NVIDIA Hopper architectures. MegaBlocks is incorporated to optimize MoE training on GPUs using block-sparse operations, along with the Grouped GEMM library to accelerate computational processes.
  • Experimental Validation: Two series of Linear-MoE models, A0.3B-2B and A1B-7B, were pretrained from scratch on the SlimPajama corpus and evaluated against baselines using standard attention and FlashAttention-2, maintaining competitive performance across benchmarks while training more efficiently.
  • Efficiency Results: The training efficiency was evaluated on eight A100 GPUs, measuring throughput and GPU memory requirements. Linear-MoE models exhibited relatively stable GPU memory consumption and consistent throughput as sequence length increased, in contrast to standard attention models. The inference efficiency comparison between Linear-MoE (using Basic LA) and Baseline (using FlashAttention-2) shows that Linear-MoE offers a significant speed advantage when the decoding length exceeds 16K, with constant memory usage.
  • Ablation Studies: Ablation studies on MoE optimization techniques (Grouped GEMM and MegaBlocks) and parallelism training methods demonstrate that these techniques significantly reduce the elapsed time for each iteration and provide advantages in terms of memory footprint and overall training efficiency.
  • Training Loss and Evaluation: Training loss curves for A0.3B-2B model instances, including pure and hybrid Linear-MoE models, demonstrate that the Linear-MoE architecture achieves competitive convergence performance compared to the standard attention baseline. Hybrid models exhibit more stable convergence and consistent performance.
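
To make the unified formulation in the Modeling Subsystem bullet concrete, the following is a minimal PyTorch sketch of the sequential recurrence. The readout $\mathbf{o}_s = \mathbf{q}_s \mathbf{M}_s$, the tensor shapes, and the identity-decay default are illustrative assumptions; each LSM instance (GLA, Mamba2, RetNet, DeltaNet, ...) defines $f$ and $\mathbf{\Theta}_s$ differently, and the released system uses optimized kernels rather than this Python loop.

```python
import torch

def lsm_recurrence(q, k, v, theta=None, use_hadamard=False):
    """Sequential form of the unified LSM memory update (illustrative sketch only).

    q, k, v: tensors of shape (seq_len, d)
    theta:   optional per-step coefficient of shape (seq_len, d, d); no decay if None
    """
    seq_len, d = k.shape
    M = torch.zeros(d, d, dtype=k.dtype, device=k.device)  # memory state M_0
    outputs = []
    for s in range(seq_len):
        M_hat = torch.outer(k[s], v[s])          # incremental update f(k_s^T, v_s)
        if theta is None:
            M = M + M_hat                        # no decay (e.g., basic linear attention)
        elif use_hadamard:
            M = theta[s] * M + M_hat             # Hadamard-product variant of the ⋄ operator
        else:
            M = theta[s] @ M + M_hat             # matrix-multiplication variant of the ⋄ operator
        outputs.append(q[s] @ M)                 # assumed readout: query the memory state
    return torch.stack(outputs)                  # (seq_len, d)
```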
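The block structure described under Linear-MoE Architecture can be sketched as follows. The residual connections, the use of `nn.LayerNorm` as the normalization, and the fixed hybrid ratio (`hybrid_every`) are assumptions for illustration; the `make_lsm`, `make_attn`, and `make_moe` factories are hypothetical placeholders rather than parts of the released codebase.

```python
import torch.nn as nn

class LinearMoEBlock(nn.Module):
    """One Linear-MoE block: norm -> LSM layer, norm -> MoE layer (sketch)."""
    def __init__(self, d_model, mixer, moe_layer):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)   # normalization preceding the LSM (or attention) layer
        self.mixer = mixer                   # linear attention / SSM / linear RNN / standard attention
        self.norm2 = nn.LayerNorm(d_model)   # normalization preceding the MoE layer
        self.moe = moe_layer                 # sparsely activated expert layer with router

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))    # residual around the sequence-mixing layer (assumed)
        x = x + self.moe(self.norm2(x))      # residual around the MoE layer (assumed)
        return x

def build_hybrid_stack(L, d_model, make_lsm, make_attn, make_moe, hybrid_every=4):
    """Stack L blocks; replace every `hybrid_every`-th LSM layer with standard attention."""
    blocks = []
    for i in range(L):
        mixer = make_attn(d_model) if (i + 1) % hybrid_every == 0 else make_lsm(d_model)
        blocks.append(LinearMoEBlock(d_model, mixer, make_moe(d_model)))
    return nn.Sequential(*blocks)
```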
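The Sequence Parallelism strategy from the Training Subsystem bullet can be illustrated with `torch.distributed` collectives. This sketch deliberately ignores the decay term $\mathbf{\Theta}_s$ and the causal prefix combination across ranks (which turns the reduction into a scan in practice); it only shows the key contrast the paper draws: for LSM layers a single collective acts on the fixed-size $d \times d$ memory state, while standard attention layers all-gather the $\mathbf{K}_s$ and $\mathbf{V}_s$ tensors.

```python
import torch
import torch.distributed as dist

def sp_lsm_memory(local_k, local_v):
    """SP for an LSM layer (simplified: no decay, no cross-rank causal masking).

    Each rank holds a contiguous sequence chunk of shape (chunk_len, d). The local
    memory contribution is d x d, so one collective over this fixed-size state
    suffices and communication volume does not grow with sequence length.
    """
    local_state = local_k.transpose(0, 1) @ local_v       # (d, d) chunk contribution
    dist.all_reduce(local_state, op=dist.ReduceOp.SUM)    # single collective on the memory state
    return local_state

def sp_standard_attention_kv(local_k, local_v):
    """All-gather-based SP for standard attention layers in hybrid models."""
    world_size = dist.get_world_size()
    gathered_k = [torch.empty_like(local_k) for _ in range(world_size)]
    gathered_v = [torch.empty_like(local_v) for _ in range(world_size)]
    dist.all_gather(gathered_k, local_k)                  # every rank receives the full K
    dist.all_gather(gathered_v, local_v)                  # every rank receives the full V
    return torch.cat(gathered_k, dim=0), torch.cat(gathered_v, dim=0)
```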
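Finally, the Variable Length Handling bullet can be pictured as simple sequence packing. The cumulative-length (`cu_seqlens`) boundary format is an assumption borrowed from common variable-length attention interfaces; the actual Linear-MoE implementation may expose a different interface.

```python
import torch

def pack_batch(sequences):
    """Pack variable-length sequences into one continuous long sequence (no padding).

    Returns the concatenated tokens and cumulative sequence lengths, which mark the
    boundaries where the recurrent memory state should be reset so tokens from
    different documents do not mix. Illustrative sketch only.
    """
    packed = torch.cat(sequences, dim=0)                        # (total_len, d)
    lengths = torch.tensor([s.shape[0] for s in sequences])
    cu_seqlens = torch.cat([torch.zeros(1, dtype=torch.long),
                            torch.cumsum(lengths, dim=0)])      # boundary offsets
    return packed, cu_seqlens
```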

In conclusion, Linear-MoE integrates LSM with MoE, offering a system for efficient and scalable large model training. It supports various LSM methods and advanced parallelism techniques, showing significant efficiency gains while maintaining strong performance across benchmarks.

Authors (5)
  1. Weigao Sun (19 papers)
  2. Disen Lan (7 papers)
  3. Tong Zhu (43 papers)
  4. Xiaoye Qu (62 papers)
  5. Yu Cheng (354 papers)