
Linear Attention Sequence Parallelism (LASP)

Updated 10 April 2026
  • Linear Attention Sequence Parallelism (LASP) is a distributed training strategy that exploits the right-product-first computation to efficiently scale linear attention architectures.
  • It offers multiple algorithm variants (LASP-1, LASP-2, and ZeCO) that optimize communication patterns and maintain constant per-device memory even for multi-million token sequences.
  • Empirical results show that LASP achieves significant throughput improvements and near-linear scalability, making it ideal for large-scale linear sequence modeling.

Linear Attention Sequence Parallelism (LASP) is a specialized distributed training strategy designed to support efficient, scalable computation for linear sequence modeling architectures. These include linear attention-based transformers, linear RNNs, and state space models. LASP specifically addresses the challenge of scaling to extremely long sequences—on the order of 10⁵ to 10⁷ tokens—where conventional parallelism techniques such as Data Parallel (DP) and Tensor Parallel (TP) become impractical due to memory and communication bottlenecks. The core contribution of LASP is to exploit the “right-product-first” structure inherent to linear attention mechanisms to minimize communication volume, maintain constant per-device memory, and enable near-linear scaling with both sequence length and device count (Sun et al., 2024, Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025, Chou et al., 1 Jul 2025).

1. Mathematical Foundations and the “Right-Product-First” Property

LASP is grounded in the mathematical associativity of linear sequence modeling, which allows a fundamental transformation in computation order compared to softmax attention. In standard softmax-based attention, the output is computed as

O = \mathrm{Softmax}(Q K^\top) V,

costing $\mathcal{O}(N^2 d)$ for a sequence of length $N$ and hidden size $d$. In linear attention, the key observation is that

O = Q (K^\top V),

enabling the deferred evaluation of the “right product” $K^\top V$ as a global state $M \in \mathbb{R}^{d \times d}$ that summarizes all past key/value pairs. This state can be updated recurrently as $M_s = M_{s-1} + k_s v_s^\top$, so that each token output requires only $\mathcal{O}(d^2)$ additional computation and storage, independent of $N$ (Sun et al., 2024, Sun et al., 11 Feb 2025).
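A minimal NumPy sketch makes the reordering concrete (shapes and names are illustrative, not taken from the cited papers): the recurrent form maintains the $d \times d$ state $M$, and a final assertion checks the associativity that justifies the right-product order in the unmasked case.

```python
import numpy as np

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention in right-product-first form.
    Q, K, V: arrays of shape (N, d)."""
    N, d = Q.shape
    M = np.zeros((d, d))                 # state summarizing past key/value pairs
    O = np.empty_like(V)
    for s in range(N):
        M = M + np.outer(K[s], V[s])     # M_s = M_{s-1} + k_s v_s^T, O(d^2) work
        O[s] = Q[s] @ M                  # per-token cost independent of N
    return O

# Associativity underpinning the reordering (unmasked case):
Q, K, V = (np.random.randn(8, 4) for _ in range(3))
assert np.allclose(Q @ (K.T @ V), (Q @ K.T) @ V)
```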

By exploiting this “right-product-first” structure, LASP reduces the dependency on sequence length for both forward and backward pass memory, making it feasible to handle ultra-long sequences even across modest GPU clusters (Sun et al., 7 Mar 2025).

2. Communication Patterns and LASP Algorithm Family

The LASP algorithm family has evolved through several stages—LASP-1 (ring P2P), LASP-2 (AllGather), and modern variants such as ZeCO (All-Scan)—each optimizing the balance between computation, communication cost, and overlap.

LASP-1 (Ring P2P):

  • Each device computes local chunk results and passes the cumulative memory state $M$ to the next device in a unidirectional or bidirectional ring (see the sketch below).
  • Achieves communication volume independent of sequence length—sending only a $d \times d$ state per device per layer per step—but incurs $\mathcal{O}(W)$ sequential communication steps for $W$ devices, limiting overlap and scalability as $W$ increases (Sun et al., 2024).
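A hedged sketch of the LASP-1 exchange, assuming torch.distributed (NCCL) is initialized and each rank holds one contiguous sequence chunk; function and variable names are illustrative, and the intra-chunk causal term is omitted:

```python
import torch
import torch.distributed as dist

def lasp1_ring_forward(q, k, v):
    """LASP-1-style forward for one layer: receive the accumulated prefix
    state from the previous rank, apply it, then pass the updated state on.
    q, k, v: (C, d) tensors for this rank's chunk."""
    rank, world = dist.get_rank(), dist.get_world_size()
    d = q.shape[-1]
    m_prev = torch.zeros(d, d, device=q.device)
    if rank > 0:
        dist.recv(m_prev, src=rank - 1)          # blocking P2P: serializes ranks
    o = q @ m_prev                               # inter-chunk contribution
    m_next = m_prev + k.transpose(0, 1) @ v      # fold in this chunk's K^T V
    if rank < world - 1:
        dist.send(m_next, dst=rank + 1)          # d x d state down the chain
    return o
```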

LASP-2 (AllGather):

  • Replaces the P2P chain with a single collective AllGather on all local memory states $M_t$, so each device acquires all local memories (see the sketch below).
  • The AllGather volume remains $\mathcal{O}(d^2)$ per state, independent of sequence length, and allows all devices to proceed independently, maximizing parallel compute overlap and minimizing wall time.
  • Key for its practical success is that the $d \times d$ memory state is much smaller than the $N \times d$ activations, yielding a sequence-length-independent communication footprint (Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025).
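Under the same illustrative assumptions, a minimal sketch of the LASP-2 pattern; the single AllGather replaces the serialized chain above, so every rank proceeds independently after one collective:

```python
import torch
import torch.distributed as dist

def lasp2_forward(q, k, v):
    """LASP-2-style forward: every rank contributes its d x d chunk state to
    one AllGather, then locally sums the states of preceding chunks.
    q, k, v: (C, d) tensors for this rank's chunk."""
    rank, world = dist.get_rank(), dist.get_world_size()
    m_local = k.transpose(0, 1) @ v                      # this chunk's state
    gathered = [torch.empty_like(m_local) for _ in range(world)]
    dist.all_gather(gathered, m_local)                   # one collective, no chain
    m_prefix = sum(gathered[:rank], torch.zeros_like(m_local))
    o = q @ m_prefix                                     # inter-chunk contribution
    # intra-chunk causal term is computed locally and added (omitted here)
    return o
```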

ZeCO (All-Scan):

  • Introduces a pipelined collective (“All-Scan”) that achieves the communication-theoretic minimum—one $d \times d$ state per device—further reducing per-device latency (a simplified sketch follows this list).
  • By partitioning the state and overlapping communication at fine granularity, ZeCO achieves near-perfect scaling up to hundreds of devices with 8M+ token sequences (Chou et al., 1 Jul 2025).
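A highly simplified sketch of the All-Scan idea, under the same torch.distributed assumptions: the exclusive prefix of chunk states is scanned along ranks slice by slice, so each rank moves only one state's worth of data while successive slices pipeline down the chain. This is illustrative only; the paper's collective is a tuned kernel, not a Python loop:

```python
import torch
import torch.distributed as dist

def all_scan_exclusive(m_local, num_slices=4):
    """Pipelined exclusive scan across ranks: rank r ends with the sum of
    states 0..r-1, sending/receiving one d x d state in total.
    m_local: (d, d) local chunk state."""
    rank, world = dist.get_rank(), dist.get_world_size()
    d = m_local.shape[0]
    step = -(-d // num_slices)                        # ceil(d / num_slices)
    prefix = torch.zeros_like(m_local)
    for start in range(0, d, step):
        rows = slice(start, min(start + step, d))
        if rank > 0:
            dist.recv(prefix[rows], src=rank - 1)     # one slice of the prefix
        if rank < world - 1:
            dist.send(prefix[rows] + m_local[rows], dst=rank + 1)
    return prefix
```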

Generalized LASP Algorithm (no masking, simplified):

In the unmasked case, every query attends to every token, so the global state is simply the sum of all per-chunk states and a single collective suffices, as in the sketch below. Causal masking and prefix sum variants introduce additional local operations and communication patterns (PrefixSum, segmented reductions) (Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025).
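A minimal sketch of this unmasked case (same illustrative torch.distributed assumptions as above):

```python
import torch
import torch.distributed as dist

def lasp_forward_unmasked(q, k, v):
    """Generalized LASP forward without masking: the global state is the
    sum of every chunk's K^T V, obtained with one AllReduce.
    q, k, v: (C, d) tensors for this rank's chunk."""
    m = k.transpose(0, 1) @ v     # local d x d state
    dist.all_reduce(m)            # global state: sum over all chunks
    return q @ m                  # every query sees every token
```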

3. Complexity Analysis, Scalability, and Comparison

Theoretical and empirical analyses demonstrate that LASP and its derivatives achieve nearly ideal scaling on long-sequence tasks, subject to hardware network characteristics.

| Method | Comm/Layer (per device) | Grows with $W$? | Per-Device Memory | Compute per device |
|---|---|---|---|---|
| DP | none / weight all-reduce | no | $\mathcal{O}(Nd)$ | $\mathcal{O}(Nd^2)$ |
| TP | $\mathcal{O}(Nd)$ all-reduce | yes | $\mathcal{O}(Nd/W)$ | $\mathcal{O}(Nd^2/W)$ |
| LASP-1 (ring) | $\mathcal{O}(d^2)$ per step, $W$ steps | yes | $\mathcal{O}(Nd/W + d^2)$ | $\mathcal{O}(Nd^2/W)$ |
| LASP-2 | $\mathcal{O}(Wd^2)$ (AllGather) | yes | $\mathcal{O}(Nd/W + Wd^2)$ | $\mathcal{O}(Nd^2/W)$ |
| ZeCO | $\mathcal{O}(d^2)$ (All-Scan) | no | $\mathcal{O}(Nd/W + d^2)$ | $\mathcal{O}(Nd^2/W)$ |

Here $W$ denotes the number of devices in the SP group.

As $N$ grows, LASP-2 and ZeCO’s communication overhead becomes negligible, and memory use is dominated by local activations and model parameters (Sun et al., 2024, Sun et al., 11 Feb 2025, Chou et al., 1 Jul 2025).
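To make the claim concrete: taking the table's costs at face value, the ratio of per-device communication to per-device compute for LASP-2/ZeCO is on the order of $d^2 / (N d^2 / W) = W/N$, which vanishes as the sequence length $N$ grows for any fixed device count $W$.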

Empirically, LASP-2 achieves a 36.6% throughput improvement over “Ring Attention” and 15.2% over LASP-1 at a sequence length of 2048K on 64 GPUs. ZeCO achieves further acceleration—up to 60% speedup over LASP-2 and scaling throughput linearly with GPU count up to 256 devices and 8M sequence lengths (Sun et al., 11 Feb 2025, Chou et al., 1 Jul 2025).

4. Extensions: Hybrid Attention, Mixture-of-Experts, and Kernelized RNNs

Hybrid (LASP-2H):

  • LASP-2H generalizes the AllGather pattern to support models interleaving linear and standard softmax attention layers.
  • In standard-attention blocks, AllGather is applied to $K$ and $V$ for local softmax computation (see the sketch after this list).
  • Enables efficient sequence-parallel hybrid models such as Linear-Llama3, mixing linear and softmax attention within a unified infrastructure (Sun et al., 11 Feb 2025).
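A hedged sketch of the hybrid block's standard-attention path (illustrative names; causal masking and multi-head structure omitted):

```python
import torch
import torch.distributed as dist

def lasp2h_softmax_forward(q, k, v):
    """LASP-2H-style standard-attention block: gather all chunks' K and V,
    then run softmax attention for the local queries only.
    q, k, v: (C, d) tensors for this rank's chunk."""
    world = dist.get_world_size()
    ks = [torch.empty_like(k) for _ in range(world)]
    vs = [torch.empty_like(v) for _ in range(world)]
    dist.all_gather(ks, k)                       # same AllGather pattern as the
    dist.all_gather(vs, v)                       # linear layers, applied to K, V
    k_full, v_full = torch.cat(ks), torch.cat(vs)
    scores = q @ k_full.transpose(0, 1) / q.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ v_full
```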

Linear-MoE:

  • The SP method extends naturally to Linear Sequence Modeling (LSM) + Mixture-of-Experts architectures.
  • Partitioning the sequence dimension enables scaling MoE models to very long contexts, maintaining constant per-device state memory and $\mathcal{O}(d^2)$ communication per layer.
  • Hybrid MoE/transformer models apply SP on LSM kernels and analogous AllGather on transformer blocks (Sun et al., 7 Mar 2025).

Kernelized Linear RNNs (Tiled Flash Linear Attention – TFLA):

  • Implements two nested levels of sequence parallelism: chunkwise parallelism (across chunks) and tiled intra-chunk parallelism; a single-device sketch of the chunkwise level follows this list.
  • Achieves further hardware efficiency for models such as mLSTM, outperforming even FlashAttention 3 and Mamba kernels in both inference and training speed, and allowing arbitrary chunk size for memory/computation balance (Beck et al., 18 Mar 2025).
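A single-device NumPy sketch of the chunkwise level that TFLA builds on (the tiled intra-chunk level, gating, and mLSTM specifics are omitted; names are illustrative):

```python
import numpy as np

def chunkwise_linear_attention(Q, K, V, C=16):
    """Chunkwise-parallel causal linear attention: a short serial recurrence
    over d x d inter-chunk states plus a fully parallel intra-chunk term,
    which TFLA additionally tiles across threadblocks. Q, K, V: (N, d)."""
    N, d = Q.shape
    O = np.empty_like(V)
    M = np.zeros((d, d))                              # inter-chunk prefix state
    for s in range(0, N, C):
        q, k, v = Q[s:s+C], K[s:s+C], V[s:s+C]
        mask = np.tril(np.ones((len(q), len(q))))     # causal within the chunk
        O[s:s+C] = q @ M + ((q @ k.T) * mask) @ v     # inter- + intra-chunk terms
        M = M + k.T @ v                               # advance the recurrence
    return O
```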

5. Empirical Results and Hardware Implementation

Extensive empirical benchmarking corroborates the scalability and efficiency of LASP. Critical findings include:

  • Throughput: LASP achieves near-linear speedup in tokens/sec as both sequence length and device count increase, remaining bottleneck-free where DP/TP collapse (Sun et al., 2024, Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025).
  • Memory: Per-device memory is nearly constant in $N$, enabling multi-million token sequence training on commodity devices (e.g., 792K tokens/sec and 66 GB/GPU at 4M sequence length with 128 GPUs) (Sun et al., 2024).
  • Communication: LASP-2’s AllGather on $d \times d$ memory-state blocks consistently outperforms prior SP methods; ZeCO’s All-Scan is up to 4$\times$ faster than AllGather on 256 GPUs (Chou et al., 1 Jul 2025).
  • Convergence: No statistically significant difference in downstream loss or accuracy compared to DP or traditional attention for fixed optimizer and model configuration (Sun et al., 2024, Sun et al., 11 Feb 2025).
  • Implementation: Effective overlap of compute and communication is attained by scheduling AllGather/All-Scan collectives via NCCL on dedicated CUDA streams (sketched below); kernel fusion and caching of intermediate states further enhance hardware utilization (Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025).
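A hedged sketch of this overlap pattern with PyTorch and NCCL (stream handling simplified; production code also needs event-based ordering between the producer of the state tensor and the collective):

```python
import torch
import torch.distributed as dist

comm_stream = torch.cuda.Stream()    # dedicated stream for collectives

def overlapped_step(m_local, local_compute):
    """Launch the AllGather asynchronously on a side stream so intra-chunk
    compute on the default stream runs concurrently with communication."""
    world = dist.get_world_size()
    gathered = [torch.empty_like(m_local) for _ in range(world)]
    with torch.cuda.stream(comm_stream):
        work = dist.all_gather(gathered, m_local, async_op=True)
    out = local_compute()                            # overlaps the collective
    work.wait()                                      # gathered states now ready
    torch.cuda.current_stream().wait_stream(comm_stream)
    return out, gathered
```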

6. Limitations, Open Problems, and Future Directions

Current Limitations:

  • LASP (ring and AllGather variants) requires careful alignment of SP group size with the device count (must evenly partition).
  • Ring-style communication introduces pipeline startup penalties for small chunk sizes.
  • Extension to full softmax attention with Flash-like kernels remains challenging due to lack of an associative update rule.
  • Integration with other parallelisms (TP, Pipeline, Expert Parallel) must avoid new synchronization barriers.

Open Directions:

  • Arbitrary and load-balanced chunking beyond uniform splitting.
  • Fully asynchronous overlap of compute and communication (potential for further pipeline efficiency).
  • Generalization to broader kernelized or structured models (e.g., Retentive Networks, advanced gating mechanisms).
  • Exploration of “softmax-friendly” SP collectives and further hybridization (Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025, Chou et al., 1 Jul 2025, Beck et al., 18 Mar 2025).

7. Significance in Large-Scale Model Training

LASP and its derivatives constitute the state of the art in distributed training for ultra-long-sequence, linear attention-based architectures. They have enabled context windows and model sizes previously out of reach of mainstream hardware, facilitated efficient training of production-scale Hybrid/MoE architectures, and serve as a theoretical and practical reference point for future advances in large-scale sequence modeling (Sun et al., 11 Feb 2025, Sun et al., 7 Mar 2025, Chou et al., 1 Jul 2025, Beck et al., 18 Mar 2025, Sun et al., 2024).
