LASP-2: Efficient Sequence Parallelism
- LASP-2 is a method for sequence parallelism that minimizes hardware bottlenecks by replacing serialized point-to-point communication with collective AllGather operations.
- It improves training throughput by up to 36.6% over traditional ring-based methods, reducing communication rounds to a fixed count regardless of sequence length.
- LASP-2H extends this approach to hybrid attention models, unifying the communication design across linear and softmax attention layers to enable scalable training of transformers with ultra-long input sequences.
LASP-2 is a sequence parallelism method for linear attention and hybrid attention transformer architectures, designed to maximize efficiency and scalability during distributed training on very long input sequences. The method addresses the hardware bottlenecks and communication overhead that limit existing sequence parallelism algorithms by reorganizing their communication-computation workflow. LASP-2 also generalizes to hybrid networks containing both linear and standard (softmax-based) attention layers and has been empirically shown to improve distributed training throughput on very long contexts (Sun et al., 11 Feb 2025).
1. Motivation and Design Principles
Existing sequence parallelism (SP) approaches for linear attention, such as the original LASP (hereafter LASP-1), employ a ring-style, point-to-point communication topology. In LASP-1, each computing node (GPU) processes a segment (chunk) of the input sequence and sequentially passes intermediate key-value (KV) memory states to its neighbor. This design imposes two primary limitations: (1) communication is serialized, limiting parallelism and causing synchronization overhead, and (2) multiple small point-to-point transfers hinder effective communication-computation overlap.
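To make the serialized dependency concrete, the following is a minimal sketch of this ring-style propagation, assuming PyTorch's torch.distributed with one sequence chunk per rank; the function and variable names are illustrative and not taken from the LASP-1 codebase.

```python
# Illustrative sketch (not the authors' implementation) of the ring-style
# point-to-point pattern used by LASP-1. Assumes torch.distributed is
# initialized and each rank holds one sequence chunk; the prefix memory
# state must hop through the ranks one by one, serializing communication.
import torch
import torch.distributed as dist

def ring_prefix_state(local_state: torch.Tensor) -> torch.Tensor:
    """local_state: [B, H, d, d] memory state of this rank's chunk.
    Returns the accumulated memory state of all preceding chunks."""
    rank, world_size = dist.get_rank(), dist.get_world_size()
    prefix = torch.zeros_like(local_state)

    if rank > 0:
        # Wait for the accumulated state from the previous rank (blocking).
        dist.recv(prefix, src=rank - 1)
    if rank < world_size - 1:
        # Forward the updated accumulation; downstream ranks sit idle until
        # this chain of W-1 sequential transfers reaches them.
        dist.send(prefix + local_state, dst=rank + 1)
    return prefix
```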
LASP-2 introduces an architecture that rethinks the minimum necessary communication for linear attention. The key principle is that, for linear attention layers, the required communication is fully characterized by the per-chunk [B, H, d, d] memory state (where B is the batch size, H the number of attention heads, and d the per-head dimension), which is independent of the overall sequence length. By explicitly leveraging this, LASP-2 collapses many sequential communications into a single collective operation per pass.
2. Communication-Computation Workflow
The core change in LASP-2 is the replacement of multiple point-to-point send/recv operations with a single AllGather collective communication for intermediate memory states:
- Each chunk (the sequence segment processed on a device) computes its local memory state Mt = Ktᵀ Vt, where Kt and Vt are the key and value matrices of chunk t.
- After independent computation, all devices perform a single AllGather of Mt, producing a globally consistent set of intermediate states on every device.
- The communication overhead per iteration is thus reduced to two AllGather steps (one each for the forward and backward passes), as opposed to 2(W–1) point-to-point steps in LASP-1 (W is the number of devices).
- Since the per-device communicated data is O(B·H·d²) and does not scale with sequence length, this architectural choice keeps communication cost flat as the context length grows.
This redesign enables greater computation parallelism because all devices can process their sequence chunks in parallel, waiting only at the AllGather barrier. It also allows for better overlap between computation and communication as the AllGather is collectively scheduled, reducing idle time.
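The forward-pass workflow above can be sketched as follows, assuming PyTorch's torch.distributed, a simplified unnormalized linear attention, and intra-chunk causal masking omitted for brevity; names are illustrative rather than the released API.

```python
# Minimal sketch of the AllGather-based forward pass described above.
# Assumes torch.distributed is initialized; causal masking within the
# chunk and output normalization are omitted.
import torch
import torch.distributed as dist

def lasp2_forward(q_t, k_t, v_t):
    """q_t, k_t, v_t: [B, H, C, d] tensors for this rank's chunk (C = chunk length)."""
    rank, world_size = dist.get_rank(), dist.get_world_size()

    # Local memory state: [B, H, d, d], independent of chunk length C.
    m_t = torch.einsum("bhck,bhcv->bhkv", k_t, v_t)

    # One AllGather of the fixed-size states; no serialized ring traffic.
    gathered = [torch.empty_like(m_t) for _ in range(world_size)]
    dist.all_gather(gathered, m_t)

    # Inter-chunk contribution: sum of memory states from preceding chunks.
    m_prefix = torch.stack(gathered[:rank]).sum(dim=0) if rank > 0 else torch.zeros_like(m_t)

    # Intra-chunk contribution (intra-chunk causal masking omitted here).
    out_intra = torch.einsum("bhck,bhkv->bhcv", q_t, m_t)
    out_inter = torch.einsum("bhck,bhkv->bhcv", q_t, m_prefix)
    return out_inter + out_intra
```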
3. Performance Metrics and Empirical Results
Empirical evaluation was performed on Linear-Llama3, a variant of Llama3 with linear attention in place of softmax attention. With a sequence length of 2048K tokens across 64 GPUs:
- LASP-2 improved training throughput by 15.2% compared to LASP-1.
- LASP-2 yielded a 36.6% throughput enhancement relative to Ring Attention (a standard ring-style sequence parallelism).
- According to theoretical analysis, communication cost was reduced from 2(W–1)·I·B·H·d² in LASP-1 to 2·I·B·H·d² in LASP-2, where I is the number of iterations.
- The method achieves these gains without sacrificing model convergence or scalability, as measured by throughput, memory consumption, and loss curves.
A summary comparison of communication cost is given in the following table:
| Method | # Communication Steps | Per-step Data Size | Scaling with Sequence Length |
|---|---|---|---|
| LASP-1 (Ring) | 2(W–1) point-to-point | O(B·H·d²) | Independent |
| LASP-2 (AllGather) | 2 AllGather | O(B·H·d²) | Independent |
This pattern enables practical scaling to sequence lengths in the multi-million token range and efficient use of large GPU clusters.
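For a rough sense of scale, the back-of-the-envelope calculation below uses hypothetical dimensions (not configurations reported in the paper) to size the fixed AllGather payload and count communication rounds.

```python
# Back-of-the-envelope communication sizing under hypothetical settings.
W = 64               # devices
B = 1                # batch size per device
H = 32               # attention heads
d = 128              # per-head dimension
bytes_per_elem = 2   # bf16

# One [B, H, d, d] memory state, independent of sequence length.
state_bytes = B * H * d * d * bytes_per_elem
print(f"Per-step payload: {state_bytes / 2**20:.1f} MiB")  # ~1.0 MiB

# Communication rounds per training iteration (forward + backward):
print("LASP-1 rounds:", 2 * (W - 1))   # 126 serialized point-to-point steps
print("LASP-2 rounds:", 2)             # 2 collective AllGather steps
```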
4. Extension to Hybrid Attention Models (LASP-2H)
Standard softmax attention, though effective for recall and long-context reasoning, incurs time and memory costs that grow quadratically with sequence length, so large-scale training on long contexts depends on efficient sequence parallelism. LASP-2H extends the collective-communication paradigm of LASP-2 to standard attention layers by restructuring their communication workflow:
- For hybrid models containing both linear and standard attention layers, LASP-2H applies AllGather-based communication not just to linear attention modules but also to softmax attention, replacing ring-based context parallelism.
- This unified approach enables efficient sequence parallelism for architectures that interleave linear and softmax layers—a common pattern in contemporary transformer designs optimized for both efficiency and recall.
- The net effect is a single, hardware-friendly parallelism protocol that supports heterogeneous stacking of attention mechanisms.
This extension positions LASP-2(H) as a scalable default SP scheme for future ultra-long-context transformers with hybrid attention blocks.
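As a rough illustration of the unified protocol, the sketch below shows an AllGather-based softmax attention layer as it might appear inside a hybrid block, assuming PyTorch's torch.distributed and omitting causal masking; it is a simplified reading of the scheme, not the reference LASP-2H implementation.

```python
# Hedged sketch of AllGather-style sequence parallelism for a standard
# softmax attention layer inside a hybrid block. Assumes torch.distributed
# is initialized; causal masking is omitted and names are illustrative.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def softmax_attention_sp(q_t, k_t, v_t):
    """q_t, k_t, v_t: [B, H, C, d] local chunk; returns the local chunk's output."""
    world_size = dist.get_world_size()

    # Gather key/value chunks from every device. Unlike the linear-attention
    # memory states, this payload does grow with total sequence length.
    k_all = [torch.empty_like(k_t) for _ in range(world_size)]
    v_all = [torch.empty_like(v_t) for _ in range(world_size)]
    dist.all_gather(k_all, k_t)
    dist.all_gather(v_all, v_t)
    k_full = torch.cat(k_all, dim=2)  # [B, H, W*C, d]
    v_full = torch.cat(v_all, dim=2)

    # Local queries attend over the gathered global context.
    scale = q_t.shape[-1] ** -0.5
    scores = torch.einsum("bhcd,bhsd->bhcs", q_t, k_full) * scale
    return torch.einsum("bhcs,bhsd->bhcd", F.softmax(scores, dim=-1), v_full)
```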
5. Applicability and Software Availability
LASP-2 and LASP-2H are relevant in distributed training and inference scenarios that involve very long input sequences, such as:
- LLMs with million-token contexts,
- Genomic sequence modeling,
- Multi-modal data with high temporal granularity.
The codebase implementing LASP-2 and its hybrid version (LASP-2H) is publicly available as part of the Linear-MoE repository: https://github.com/OpenSparseLLMs/Linear-MoE.
6. Significance and Limitations
LASP-2 addresses a principal bottleneck in SP for linear attention by converting communication from serialized point-to-point steps to a single AllGather per forward or backward pass, with data volume invariant to sequence length. This is especially impactful for scaling distributed training to ultra-long contexts, where hardware and bandwidth efficiency become critical.
By providing a communication-computation pattern capable of handling mixed attention architectures, LASP-2(H) is positioned as a fundamental building block for next-generation large-scale transformers.
A plausible implication is that this paradigm—minimizing collective communication rounds and decoupling data volume from sequence length—could influence a broader class of data-parallel and sequence-parallel training methods in long-context neural modeling.