LASP: Linear Attention Sequence Parallel

Updated 4 October 2025
  • Linear Attention Sequence Parallel (LASP) is a distributed approach that leverages a right-product kernel trick to efficiently parallelize kernel-based linear attention models.
  • LASP uses ring-style and all-gather protocols to maintain fixed-size state exchanges and enable scalable training for ultra-long sequences.
  • LASP integrates kernel fusion, state caching, and compatibility with hybrid parallelism to achieve significant speedups and optimal communication efficiency.

Linear Attention Sequence Parallel (LASP) encompasses a family of algorithmic and systems techniques for efficiently parallelizing linear attention models across sequence length in distributed settings. Linear attention—the replacement of the quadratic softmax attention with a kernel-based, linear-complexity form—enables scalable sequence modeling, but practical deployment over ultra-long contexts and large clusters depends critically on tailored parallelization and communication strategies. LASP and its derivatives systematically address the bottlenecks of memory, compute, and inter-device communication in this context, with a range of innovations extending from ring-style and all-gather protocols to contemporary zero-overhead primitives.

1. Motivation and Problem Context

LASP is motivated by the need to train large-scale models (notably, LLMs and long-context sequence models) whose sequence lengths far exceed the memory capacity of a single device. While traditional sequence parallelism (SP) techniques exist for softmax attention-based architectures, their communication patterns and memory footprints are not optimal for models with linear attention. The fundamental observation is that linear attention mechanisms admit a “right-product kernel trick” permitting recurrence-style update rules and greatly reduced intermediate state. Efficiently distributing such models over clusters, while minimizing per-device memory and keeping communication independent of sequence length, is the core challenge LASP addresses (Sun et al., 3 Apr 2024).

2. Core Algorithmic Techniques

2.1 Right-Product Kernel Decomposition

In linear attention, the output at token $s$ can be computed as:

$$o_s^\top = q_s^\top \cdot \mathrm{KV}_s, \qquad \mathrm{KV}_s = \sum_{i=1}^{s} \lambda^{s-i}\, k_i v_i^\top$$

or, with chunking and right-product association,

$$\mathrm{KV}_t = \lambda^{C} \cdot \mathrm{KV}_{t-1} + \left(\lambda^{C} \Lambda^{-1} K_t\right)^\top V_t$$

where $C$ is the chunk length and $\Lambda = \mathrm{diag}(\lambda, \lambda^2, \ldots, \lambda^{C})$.

This structure allows sequence-parallel implementations to exchange only fixed-size state tensors (e.g., $d \times d$), regardless of sequence length (Sun et al., 3 Apr 2024, Sun et al., 11 Feb 2025).
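
To make the chunked recurrence concrete, the following is a minimal single-device sketch in PyTorch, assuming a scalar decay $\lambda$ and already feature-mapped queries, keys, and values; the function and variable names are illustrative and not taken from the LASP codebase.

```python
import torch

def chunkwise_linear_attention(q, k, v, lam: float, chunk: int):
    """Single-device sketch of the chunked right-product recurrence.

    q, k, v: (seq_len, d) tensors; lam: per-step decay; chunk: chunk length C.
    Returns outputs of shape (seq_len, d).
    """
    seq_len, d = q.shape
    kv = torch.zeros(d, d, dtype=q.dtype)                       # running KV state (d x d)
    decay = lam ** torch.arange(1, chunk + 1, dtype=q.dtype)    # [lam^1, ..., lam^C]
    outputs = []
    for start in range(0, seq_len, chunk):
        qt, kt, vt = (x[start:start + chunk] for x in (q, k, v))
        C = qt.shape[0]
        dC = decay[:C]
        # Inter-chunk term: each local position j attends to the carried-over
        # state KV_{t-1}, scaled by lam^j.
        o_inter = dC[:, None] * (qt @ kv)
        # Intra-chunk term: causal attention within the chunk with relative
        # decay lam^(j-i) for i <= j (zeroed above the diagonal).
        rel = torch.tril(dC[:, None] / dC[None, :])
        o_intra = ((qt @ kt.T) * rel) @ vt
        outputs.append(o_inter + o_intra)
        # State update: KV_t = lam^C * KV_{t-1} + (lam^C * Lambda^{-1} K_t)^T V_t
        kv = (lam ** C) * kv + (((lam ** C) / dC[:, None]) * kt).T @ vt
    return torch.cat(outputs, dim=0)
```

Only the final `kv` of each chunk needs to cross device boundaries, which is what makes the scheme amenable to sequence parallelism.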

2.2 Parallelization Patterns

Two main approaches have evolved:

  • Ring-style Point-to-Point (P2P): Each GPU computes intra-chunk outputs, receives the prior chunk’s state from its left neighbor, performs an inter-chunk update, then sends the new state rightward (Sun et al., 3 Apr 2024).
  • AllGather-based (LASP-2): Each GPU computes an intermediate memory state (e.g., $M_t = K_t^\top V_t$ for chunk $t$), and all memory states are gathered and reduced via one all-gather per iteration. This allows computation/communication overlap (Sun et al., 11 Feb 2025).

Both mechanisms keep the volume of data communicated per device independent of sequence length, depending only on the model width and number of layers, which enables scalable parallel decompositions.
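
A minimal sketch of the ring-style exchange is shown below, assuming one chunk per rank and a user-supplied `local_kv_update` callback (a hypothetical stand-in for the intra-/inter-chunk computation of the previous subsection). Blocking `send`/`recv` calls are used for clarity; the actual implementation overlaps state passing with intra-chunk compute.

```python
import torch
import torch.distributed as dist

def ring_state_exchange(local_kv_update, d, dtype=torch.float32):
    """Ring-style sketch: each rank holds one sequence chunk, receives the
    accumulated KV state from its left neighbour, applies its own chunk's
    update, and forwards the result to the right. Assumes torch.distributed
    has already been initialised.
    """
    rank = dist.get_rank()
    world = dist.get_world_size()
    prev_kv = torch.zeros(d, d, dtype=dtype)

    # Rank 0 starts from a zero state; everyone else waits for the left neighbour.
    if rank > 0:
        dist.recv(prev_kv, src=rank - 1)

    new_kv = local_kv_update(prev_kv)

    # Forward the fixed-size (d x d) state; the payload does not depend on how
    # many tokens this rank's chunk contains.
    if rank < world - 1:
        dist.send(new_kv.contiguous(), dst=rank + 1)

    return new_kv
```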

2.3 Kernel Fusion and State Caching

To further improve efficiency, LASP systems often aggressively fuse intra-chunk and inter-chunk computations into single, custom GPU kernels and implement intermediate memory state caching (to avoid recomputation during backward passes). This results in hardware-optimized runtime and reduced global memory traffic (Sun et al., 3 Apr 2024, Yang et al., 2023).
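
One way to picture the state-caching idea is a custom autograd function that saves the carried-over KV state during the forward pass so the backward pass reuses it rather than replaying the sequential scan. The sketch below covers only the inter-chunk term and is illustrative; it is not the fused GPU kernel described in the papers.

```python
import torch

class InterChunkOutput(torch.autograd.Function):
    """Inter-chunk term o_inter = diag(decay) . Q . KV, with the incoming KV
    state cached for the backward pass (hypothetical names)."""

    @staticmethod
    def forward(ctx, q, kv, decay):
        ctx.save_for_backward(q, kv, decay)   # cache chunk inputs and carried state
        return decay[:, None] * (q @ kv)

    @staticmethod
    def backward(ctx, grad_out):
        q, kv, decay = ctx.saved_tensors       # reuse the cached state, no recomputation
        grad_q = decay[:, None] * (grad_out @ kv.T)
        grad_kv = (decay[:, None] * q).T @ grad_out
        return grad_q, grad_kv, None           # decay constants receive no gradient

# Usage: o_inter = InterChunkOutput.apply(q_chunk, kv_state, decay_vector)
```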

3. Communication, Scalability, and Performance

3.1 Quantitative Communication Efficiency

A central goal is for the communication overhead to remain constant per layer, $O(Bd^2/h)$ (with $B$ the batch size, $d$ the model dimension, and $h$ the number of heads), independent of total sequence length. In contrast, traditional SP methods incur per-device communication that scales with the sequence length.
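
A quick back-of-envelope check of this bound, using illustrative (not measured) values:

```python
# Per-layer communication: each device exchanges one (d/h x d/h) state per head
# per sample, i.e. B * d**2 / h elements, for any sequence length.
B, d, h = 8, 4096, 32                 # assumed batch size, model width, heads
bytes_per_elem = 2                    # bf16
state_elems = B * h * (d // h) ** 2   # == B * d**2 / h
print(state_elems * bytes_per_elem / 2**20, "MiB per layer, independent of sequence length")
```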

Empirical studies show that LASP's throughput scales roughly linearly as GPUs are added, supporting sequence lengths up to 4096K tokens on clusters of 128 GPUs (8× longer than prior methods) (Sun et al., 3 Apr 2024). LASP-2 improves training speed by 15.2% over the earlier LASP design and by 36.6% over Ring Attention with explicit all-gather (Sun et al., 11 Feb 2025).

3.2 Hybrid and Heterogeneous Parallelism

Proper alignment with batch-level data-parallel strategies is critical. LASP is explicitly engineered to be compatible with PyTorch DDP, FSDP, and ZeRO-based optimizations, permitting hybrid parallelism that decomposes both the batch and sequence dimensions (Sun et al., 3 Apr 2024). FlexSP generalizes LASP by adaptively grouping sequences of similar lengths, solving a linear-programming device-assignment problem, and yields up to 1.98× end-to-end speedups on realistic LLM corpora, which exhibit highly heterogeneous sequence lengths (Wang et al., 2 Dec 2024).
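
A sketch of how such a two-dimensional decomposition might be wired up with vanilla `torch.distributed` process groups; the grouping scheme and names are assumptions for illustration, not LASP's actual integration code.

```python
import torch.distributed as dist

def build_hybrid_groups(sp_size: int):
    """Arrange the world as a (data_parallel x sequence_parallel) grid.

    Each rank joins one SP group (peers holding other chunks of the same
    sequences) and one DP group (peers holding other batches). new_group must
    be called by every rank in the same order, hence the full loops.
    """
    world = dist.get_world_size()
    rank = dist.get_rank()
    assert world % sp_size == 0
    dp_size = world // sp_size

    sp_group = dp_group = None
    # SP groups: consecutive ranks share the same sequences.
    for i in range(dp_size):
        ranks = list(range(i * sp_size, (i + 1) * sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            sp_group = g
    # DP groups: ranks holding the same chunk position across different batches.
    for j in range(sp_size):
        ranks = list(range(j, world, sp_size))
        g = dist.new_group(ranks)
        if rank in ranks:
            dp_group = g
    return sp_group, dp_group
```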

3.3 Theoretical Lower Bound and Further Advances

ZeCO demonstrates “zero communication overhead” for linear attention SP by introducing the All-Scan primitive: a blockwise, pipelined receive–scan–send pattern that provides each device exactly and only the initial state it needs. Theoretical analysis shows ZeCO meets the lower bound for data transmission per iteration, avoiding the scaling issues present in all-gather or chained ring protocols. On 256 GPUs and 8M tokens, ZeCO provides about a 60% speedup over previously best-in-class SP (Chou et al., 1 Jul 2025).
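
The receive–scan–send pattern can be sketched as below, exploiting the fact that the KV update is separable across column blocks of the state; blocking point-to-point calls are used for readability, whereas a real All-Scan would use non-blocking sends and receives to fully overlap communication with the scan. Names are illustrative, not from the ZeCO implementation.

```python
import torch
import torch.distributed as dist

def blockwise_scan_state(local_block_updates, d, dtype=torch.float32):
    """Blockwise receive-scan-send sketch: the d x d state is split into column
    blocks so a rank can forward block b to its successor while its predecessor
    is still producing block b+1, forming a pipeline across ranks.
    `local_block_updates[b]` maps the incoming state block to this rank's
    outgoing state block.
    """
    rank, world = dist.get_rank(), dist.get_world_size()
    n_blocks = len(local_block_updates)
    block_cols = d // n_blocks                    # assume d divisible by n_blocks
    out_blocks = []
    for b in range(n_blocks):
        prev_block = torch.zeros(d, block_cols, dtype=dtype)
        if rank > 0:
            dist.recv(prev_block, src=rank - 1)               # receive block b of prior state
        new_block = local_block_updates[b](prev_block)        # scan: apply this rank's update
        if rank < world - 1:
            dist.send(new_block.contiguous(), dst=rank + 1)   # forward block b immediately
        out_blocks.append(new_block)
    return torch.cat(out_blocks, dim=1)
```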

4. Extensions, Hybrids, and Alternative Models

4.1 Hybrid Attention Architectures

LASP is extensible to models that blend linear and standard softmax attention. LASP-2H generalizes the all-gather workflow to standard attention: the $K$/$V$ matrices of each chunk are all-gathered and the attention is computed locally on each device, enabling efficient SP across a spectrum of hybrid architectures (Sun et al., 11 Feb 2025).
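
A simplified (non-causal, single-head) sketch of that all-gather pattern for the softmax layers; this is an assumption-laden illustration rather than the LASP-2H implementation.

```python
import torch
import torch.nn.functional as F
import torch.distributed as dist

def sp_softmax_attention(q_local, k_local, v_local, sp_group=None):
    """Each SP rank holds one sequence chunk of Q/K/V. The K and V chunks are
    all-gathered so each rank can attend from its local queries to the full
    sequence; causal masking is omitted for brevity.
    """
    world = dist.get_world_size(group=sp_group)
    k_chunks = [torch.empty_like(k_local) for _ in range(world)]
    v_chunks = [torch.empty_like(v_local) for _ in range(world)]
    dist.all_gather(k_chunks, k_local.contiguous(), group=sp_group)
    dist.all_gather(v_chunks, v_local.contiguous(), group=sp_group)
    k_full = torch.cat(k_chunks, dim=0)            # (seq_len, d)
    v_full = torch.cat(v_chunks, dim=0)
    scores = (q_local @ k_full.T) / k_full.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v_full      # (local_chunk_len, d)
```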

4.2 State Space Models and Linear-MoE

The sequence parallelism ideas of LASP are also applicable to modern state space models (SSMs), matrix-valued RNNs, and MoE-based architectures, as described in the ParallelFlow framework (Cirone et al., 1 Apr 2025) and Linear-MoE system (Sun et al., 7 Mar 2025). These systems exploit parallel scan/flow discretizations, signature kernel techniques, and blockwise or tiled update schemes for efficient scaling across extremely long sequences.

4.3 Adaptive Memory and Expressivity

Recent approaches, such as log-linear attention (Guo et al., 5 Jun 2025) and LoLA (McDermott et al., 29 May 2025), focus on mitigating the limited expressivity and recall of fixed-size state linear attention. While not strictly communication or parallel scaling techniques, these improvements (e.g., hierarchical hidden state growth, sparse-cached KV memories) are synergistic with LASP infrastructure, as they preserve the low communication and memory footprint central to the approach.

5. Empirical Evaluation and Practical Impact

| SP Method | Comms Scaling | Max Seq. Length (on 128 GPUs) | Throughput Speedup |
|---|---|---|---|
| Megatron-SP/DS | O(seq len) | Up to 512K | Baseline |
| LASP-1 (Ring) | O(1) | Up to 512K–1M | 2–4× |
| LASP-2 (AllGather) | O(1) | Up to 2M–4M | 15–36% over Ring; 4–8× over baseline |
| ZeCO (All-Scan) | Optimal, O(1) | Up to 8M | 60% speedup over LASP-2 |

Benchmarks across TransNormerLLM, Linear-Llama3, and MoE/SSM-backed models demonstrate that LASP and its descendants retain near-constant memory and computation per device as sequence length and world size increase, with convergence and loss curves essentially matched to single-device, non-SP baselines (Sun et al., 3 Apr 2024, Sun et al., 11 Feb 2025, Chou et al., 1 Jul 2025).

6. Generalization and Future Directions

LASP’s foundational insight—a right-product factorization and parallel scan-based communication protocol—has broad applicability. It is suitable for arbitrary linearized sequence modeling methods including kernel-based attention, cosine or polynomial approximations, delta-networks, or SSMs. Emerging research emphasizes extending LASP to handle chunked, bidirectional, and hybrid recurrences (as in LION or log-linear attention) (Afzal et al., 22 Feb 2025, Guo et al., 5 Jun 2025), and to integrate further adaptive, resource-driven workload assignment as in FlexSP (Wang et al., 2 Dec 2024).

Anticipated future developments include deeper system-level optimizations (e.g., deeper kernel fusion, intra/inter-chunk tiling (Beck et al., 18 Mar 2025)), theoretical advances linking rough path theory to scan-based algorithms (Cirone et al., 1 Apr 2025), and deployment in domains beyond NLP, such as large-scale vision and multimodal sequence tasks (Liao et al., 28 May 2024).

7. Summary

Linear Attention Sequence Parallelism (LASP) represents a suite of optimization strategies and system designs to train linear attention-based sequence models on ultra-long contexts over distributed environments. By leveraging the mathematical structure of linear attention for communication minimization, kernel fusion, and compatibility with both batch and hybrid parallelism, LASP achieves scalable, efficient training while retaining or improving model performance. Continuing evolution in this area, exemplified by ZeCO, FlexSP, and the generalization to hybrid and alternative architectures, positions LASP techniques as foundational tools in next-generation large-scale sequence modeling (Sun et al., 3 Apr 2024, Sun et al., 11 Feb 2025, Chou et al., 1 Jul 2025, Wang et al., 2 Dec 2024).
