Sequence Parallelism in Transformers
- Sequence Parallelism is a distributed training paradigm that partitions the input sequence across multiple devices, overcoming memory and computation bottlenecks in transformers.
- It employs ring-style and collective communication patterns to efficiently compute cross-chunk dependencies in self-attention mechanisms.
- SP integrates seamlessly with data, tensor, and pipeline parallelism, driving scalability for long-context models in NLP, vision, and other applications.
Sequence Parallelism (SP) is a distributed training and inference paradigm developed to overcome the memory and computational scaling limitations of self-attention in large transformer models when operating on long input sequences. Rather than requiring each device to store and process the entire sequence, SP partitions the input sequence dimension across multiple accelerators. This allows memory consumption and compute to be distributed, significantly increasing the maximum trainable or inferable sequence length. SP is orthogonal to data, tensor, and pipeline parallelism, and is a foundational technique for long-context transformer models and multi-dimensional architectures.
1. Principle of Sequence Parallelism
Sequence Parallelism divides an input sequence of length $L$ into contiguous or blockwise chunks along the sequence dimension, so that each of $N$ devices processes $L/N$ tokens per forward pass. Each GPU $i$ holds its local chunk $X_i$ and computes the local query, key, and value projections $Q_i = X_i W_Q$, $K_i = X_i W_K$, $V_i = X_i W_V$. While queries are always local, the complete attention output generally requires cross-chunk dependencies. SP addresses this via communication patterns, most notably ring-style peer-to-peer (P2P) or collective operations (AllToAll or AllGather), to facilitate efficient global attention computation across partitioned K and V representations (a minimal single-process sketch of the partitioning step follows the list below). This scheme ensures that:
- No device needs to hold all activations for all tokens.
- Self-attention (and MLP) activation memory scales with the per-device sequence length: $O(L/N)$ rather than $O(L)$.
- Sequence lengths achievable in practice increase by up to 3×, and batch sizes by up to 13.7× compared to purely tensor-parallel approaches (Li et al., 2021).
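The partitioning step can be illustrated with a minimal single-process sketch in NumPy, where a list index stands in for a device; the array names, chunking helper, and tensor sizes are assumptions for illustration rather than details from any of the cited systems:

```python
import numpy as np

# Illustrative sizes (assumptions): full sequence L, model dim d, N "devices".
L, d, N = 4096, 64, 4
rng = np.random.default_rng(0)

X = rng.standard_normal((L, d))            # full input sequence
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

# Sequence parallelism: split the sequence dimension into N contiguous chunks.
chunks = np.split(X, N, axis=0)            # each chunk has shape (L // N, d)

# Each (simulated) device projects only its local tokens.
local_qkv = [(Xi @ W_q, Xi @ W_k, Xi @ W_v) for Xi in chunks]

for i, (Qi, Ki, Vi) in enumerate(local_qkv):
    print(f"device {i}: Q {Qi.shape}, K {Ki.shape}, V {Vi.shape}")  # (1024, 64) each
```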
SP generalizes to multi-dimensional sequences, where splits may occur along temporal, spatial, or arbitrary axes and the active parallel dimension may be switched dynamically (Zhao et al., 15 Mar 2024).
2. Ring Self-Attention (RSA) and Communication Patterns
The canonical implementation of SP for standard transformers is Ring Self-Attention (RSA) (Li et al., 2021). The sequence is split across $N$ devices, and each GPU communicates its local K and V embeddings to the others in "ring steps." At each ring step $t$, GPU $i$ receives a key chunk $K_j$ from its ring neighbor, computes the partial attention scores $Q_i K_j^\top$, and accumulates partial attention outputs. After completing all $N-1$ rounds, each GPU has computed attention scores covering all sequence tokens, and a similar ring exchange is performed for the value chunks $V_j$ to compose the final output $O_i$.
This yields per-device activation memory that scales with the local chunk length $L/N$ rather than the full sequence length $L$, dramatically extending possible sequence lengths.
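The ring exchange can be simulated in a single process, with a list index standing in for a device and chunks passed around the ring one step at a time. Two simplifications in this sketch are assumptions rather than details of RSA: K and V are exchanged together at each step (RSA circulates K first, then V), and the partial results are combined with a standard online-softmax accumulator for numerical stability.

```python
import numpy as np

def ring_self_attention(Q_chunks, K_chunks, V_chunks):
    """Simulate ring self-attention: "device" i holds (Q_i, K_i, V_i) and
    receives K/V chunks from its ring neighbours one step at a time."""
    N = len(Q_chunks)
    d = Q_chunks[0].shape[-1]
    outputs = []
    for i in range(N):
        Qi = Q_chunks[i]
        # Running max, normalizer, and weighted sum for a streaming (online) softmax.
        m = np.full((Qi.shape[0], 1), -np.inf)
        l = np.zeros((Qi.shape[0], 1))
        acc = np.zeros_like(Qi)
        for step in range(N):
            j = (i + step) % N                    # chunk arriving at this ring step
            S = Qi @ K_chunks[j].T / np.sqrt(d)   # partial attention scores
            m_new = np.maximum(m, S.max(axis=-1, keepdims=True))
            P = np.exp(S - m_new)
            l = l * np.exp(m - m_new) + P.sum(axis=-1, keepdims=True)
            acc = acc * np.exp(m - m_new) + P @ V_chunks[j]
            m = m_new
        outputs.append(acc / l)
    return np.concatenate(outputs, axis=0)

# Check against a dense single-device reference on random data.
rng = np.random.default_rng(0)
L, d, N = 256, 32, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))
S = Q @ K.T / np.sqrt(d)
P = np.exp(S - S.max(axis=-1, keepdims=True))
ref = (P / P.sum(axis=-1, keepdims=True)) @ V
out = ring_self_attention(np.split(Q, N), np.split(K, N), np.split(V, N))
print(np.allclose(out, ref, atol=1e-6))  # True
```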
Advanced variants include:
- AllToAll- and AllGather-based strategies improve efficiency in specific hardware topologies or groupings (e.g., DeepSpeed-Ulysses, USP, 360-LLaMA-Factory); a data-layout sketch of the AllToAll re-partitioning follows this list.
- TokenRing uses bidirectional P2P to concurrently transmit attention inputs (forward) and block outputs (reverse), reducing idle time and improving load balancing (Wang et al., 29 Dec 2024).
- TASP further decomposes the communication to utilize all links in AllToAll topologies via Hamiltonian cycle decomposition, orchestrating concurrent, non-overlapping ring transfers for additional speedup (Wang et al., 30 Sep 2025).
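The AllToAll strategy can be understood through the data re-layout it performs: each device starts with all heads for its local slice of tokens and ends with the full sequence for a subset of heads, so attention for those heads becomes entirely local. The single-process sketch below simulates that exchange with lists of arrays; the shapes, names, and the assignment of contiguous head ranges to devices are illustrative assumptions, not an implementation of DeepSpeed-Ulysses itself.

```python
import numpy as np

# Assumed illustrative sizes: N devices, local tokens per device, H heads, head dim d_h.
N, local_len, H, d_h = 4, 8, 8, 16
rng = np.random.default_rng(0)

# Before AllToAll: device i holds [local_len, H, d_h] -- all heads, a slice of tokens.
before = [rng.standard_normal((local_len, H, d_h)) for _ in range(N)]

# Simulated AllToAll along the head dimension: device i ends up with the full
# sequence but only heads [i*H//N : (i+1)*H//N], so attention per head is local.
after = []
for i in range(N):
    head_slice = slice(i * H // N, (i + 1) * H // N)
    gathered = np.concatenate([before[j][:, head_slice, :] for j in range(N)], axis=0)
    after.append(gathered)                 # shape: [N * local_len, H // N, d_h]

print(before[0].shape, "->", after[0].shape)   # (8, 8, 16) -> (32, 2, 16)
```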
3. Integration with Other Parallelism Techniques
SP is designed for composability with existing parallelisms:
- Data Parallelism (DP): SP splits the sequence within each replica's batch; DP independently splits the batch across replicas.
- Tensor Parallelism (TP): TP partitions model parameters (e.g., attention heads); SP partitions along the sequence. SP circumvents the scalability upper bounds set by the limited number of attention heads in TP (Li et al., 2021, Fang et al., 13 May 2024).
- Pipeline Parallelism (PP): Layers are partitioned across pipeline stages, orthogonally to SP's sequence splits, enabling 4D parallelism (Fujii et al., 10 Nov 2024).
- Context/Sequence Parallelism (CP): In some frameworks, CP refers to partitioning all activations along the sequence, with SP specifically targeting unpartitioned activations left by TP (e.g., RMSNorm outputs or FFN activations).
- Unified SP approaches (e.g., USP) arrange devices in a mesh and combine AllToAll and ring P2P communications, balancing the weaknesses and strengths of the base strategies (Fang et al., 13 May 2024).
The 4D parallelism configuration enables resource allocation across axes, supporting both very large models and multi-million-token contexts (Fujii et al., 10 Nov 2024).
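One way to make this composability concrete is to lay ranks out on a 4D grid and read off each rank's coordinate along the data-, pipeline-, tensor-, and sequence-parallel axes. The sketch below uses an illustrative axis ordering and degrees that are assumptions of this example, not the layout of any particular framework:

```python
import itertools

# Assumed illustrative degrees for each parallel axis.
dp, pp, tp, sp = 2, 2, 2, 4          # 2 * 2 * 2 * 4 = 32 ranks in total

# Enumerate ranks in (dp, pp, tp, sp) order; each rank gets one coordinate per axis.
coords = {}
for rank, (d, p, t, s) in enumerate(
        itertools.product(range(dp), range(pp), range(tp), range(sp))):
    coords[rank] = {"dp": d, "pp": p, "tp": t, "sp": s}

# The sequence-parallel group of a rank: all ranks sharing its dp/pp/tp coordinates.
def sp_group(rank):
    c = coords[rank]
    return [r for r, rc in coords.items()
            if (rc["dp"], rc["pp"], rc["tp"]) == (c["dp"], c["pp"], c["tp"])]

print(coords[5])       # {'dp': 0, 'pp': 0, 'tp': 1, 'sp': 1}
print(sp_group(5))     # the 4 ranks that exchange sequence chunks with rank 5
```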
4. Memory and Communication Efficiency
SP's primary benefit is a reduction in activation memory, enabling much longer sequences (e.g., from 3K up to 114K tokens on 64 GPUs versus 4K on a single device (Li et al., 2021)). Memory cost formulas for the attention block replace the full sequence length $L$ with an effective $L/N$ per device; for example, the dominant attention-score term shrinks from $O(L^2)$ to $O(L^2/N)$ per device.
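A back-of-the-envelope comparison of that dominant score term, under assumed sizes and fp16 storage, makes the per-device reduction concrete. This ignores all other activation terms and also ignores kernels (e.g., FlashAttention-style) that never materialize the full score matrix, so it is purely illustrative:

```python
# Assumed illustrative sizes: sequence length L, heads H, SP degree N, fp16 elements.
L, H, N, bytes_per_el = 65_536, 32, 8, 2

full_scores = H * L * L * bytes_per_el            # score matrix on a single device
sp_scores   = H * (L // N) * L * bytes_per_el     # rows local (L/N), columns global (L)

print(f"without SP: {full_scores / 2**30:.1f} GiB per device")   # 256.0 GiB
print(f"with SP   : {sp_scores / 2**30:.1f} GiB per device ({N}x reduction)")  # 32.0 GiB
```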
Communication cost in classic SP strategies can, however, become a bottleneck as the SP degree increases: standard ring-based approaches, for example, require per-device communication volume that does not shrink as the number of devices grows, since every device must eventually see all K and V chunks. Recent developments provide optimal or near-optimal communication volume:
- DSP reduces communication per block to $2M/N$, a quarter of the volume required by even efficient AllToAll-based methods (Zhao et al., 15 Mar 2024).
- LASP/LASP-2 for linear attention leverage the right-product-first property to confine communication to intermediate states of constant size, independent of the sequence length $L$, requiring only a single AllGather (LASP-2 reduces the ring steps to two AllGather rounds) (Sun et al., 3 Apr 2024; Sun et al., 11 Feb 2025); a single-process sketch of this constant-size state appears after the list.
- ZeCO eliminates redundant data movement by introducing All-Scan, transmitting only minimal state via pipelined block transmission for near-zero communication overhead, maintaining close-to-single-device runtimes in multi-GPU settings (Chou et al., 1 Jul 2025).
- In multi-dimensional settings, DSP dynamically adapts the split axis according to attention stage, reducing tensor reshaping and communication (Zhao et al., 15 Mar 2024).
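The right-product-first property can be seen in a single-process sketch of chunked linear attention: the only quantity that would need to cross device boundaries is a $d \times d$ state $\sum_j K_j^\top V_j$, whose size is independent of $L$. The variable names and the omission of causal masking are simplifying assumptions of this sketch, not details of LASP/LASP-2.

```python
import numpy as np

rng = np.random.default_rng(0)
L, d, N = 1024, 32, 4
Q, K, V = (rng.standard_normal((L, d)) for _ in range(3))

# Dense (non-causal) linear attention reference: O = Q @ (K^T V).
ref = Q @ (K.T @ V)

# Chunked version: each "device" contributes a d x d partial state K_i^T V_i.
# Only these constant-size states (not L-length activations) need to be exchanged.
states = [Ki.T @ Vi for Ki, Vi in zip(np.split(K, N), np.split(V, N))]
global_state = sum(states)                       # what a collective would assemble
out = np.concatenate([Qi @ global_state for Qi in np.split(Q, N)], axis=0)

print(np.allclose(out, ref))                     # True
print("state bytes per device:", states[0].nbytes, "(independent of L)")
```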
5. Extensions: Adaptivity, Topology, and Heterogeneity
Recent research enhances SP along multiple axes:
- Adaptive and heterogeneity-aware SP: FlexSP formulates sequence-to-SP-group allocation as a linear programming problem, allowing varying group sizes per token bucket depending on the sequence-length distribution and resource constraints (Wang et al., 2 Dec 2024). This strategy is critical for real-world data (with long-tail sequence-length distributions) and delivers substantial speedups over static, homogeneous methods; a simplified sketch of the idea follows this list.
- Elastic SP for Serving: LoongServe’s Elastic Sequence Parallelism dynamically adjusts SP degree per request phase (prefill vs. decode), using proactive KV migration and multi-master decoding to minimize communication overhead and maximize GPU efficiency in dynamic workloads (Wu et al., 15 Apr 2024).
- Shift Parallelism: For inference, SP is adapted to ensure KV cache invariance with TP, enabling dynamic switching between configurations for optimal latency/throughput tradeoff under variable workloads (Hidayetoglu et al., 20 Sep 2025).
- Topology-aware SP: TASP decomposes communication into multiple concurrent ring-style data transfers matched to the network topology, utilizing all links and providing strong speedups in H100 and MI300X clusters (Wang et al., 30 Sep 2025).
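FlexSP solves the assignment as an optimization problem; the greedy heuristic below is only a simplified stand-in that conveys the underlying idea of giving long sequences larger SP groups while keeping short ones on few devices. The group sizes, the per-device token budget, and the heuristic itself are assumptions of this sketch, not FlexSP's actual formulation.

```python
# Simplified stand-in for heterogeneity-aware SP-group assignment (not FlexSP's LP).
# Long sequences get larger SP groups for memory headroom; short sequences stay on
# small groups so fewer devices sit idle processing tiny chunks.

def assign_sp_groups(seq_lens, max_tokens_per_device=16_384, group_sizes=(1, 2, 4, 8)):
    plan = []
    for length in sorted(seq_lens, reverse=True):
        # Smallest SP degree whose per-device share fits the token budget
        # (falls back to the largest available group if nothing fits).
        group = next((g for g in group_sizes if length / g <= max_tokens_per_device),
                     group_sizes[-1])
        plan.append((length, group))
    return plan

batch = [120_000, 30_000, 9_000, 4_000, 2_500, 800, 512]   # long-tail length distribution
for length, group in assign_sp_groups(batch):
    print(f"seq of {length:>7} tokens -> SP degree {group} "
          f"({length // group} tokens per device)")
```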
6. Applications, Performance Metrics, and Benchmarks
SP is central for:
- Training and serving transformers with sequence lengths extending up to several million tokens per batch (e.g., ALST: 15M tokens over 32 GPUs) (Bekman et al., 16 Jun 2025).
- LLMs for RAG, document summarization, code generation, and long-form conversation.
- Video, image, and multi-modal diffusion models requiring long-range visual token dependencies (Fang et al., 4 Nov 2024).
- Protein folding, spatio-temporal forecasting, DNA sequence modeling, and any task with extensive context requirements.
Performance highlights from published studies include:
- Up to 13.7× larger maximum batch size and up to 3× longer sequence length relative to tensor parallelism at the same device count (Li et al., 2021).
- Training throughput improvements for LASP-2 over prior linear-attention SP methods on 64 GPUs with sequence lengths up to 2048K tokens (Sun et al., 11 Feb 2025).
- Speedups over Zigzag-Ring Attention in communication-bound scenarios for topology-matched TASP (Wang et al., 30 Sep 2025).
- ALST supports substantially longer sequences on Hugging Face models without source modification (Bekman et al., 16 Jun 2025).
- ZeCO achieves a marked speedup over the previous state-of-the-art on 256 GPUs with 8M-token sequences (Chou et al., 1 Jul 2025).
- FlexSP achieves overall iteration speedups over static SP strategies across heterogeneous training data (Wang et al., 2 Dec 2024).
Key metrics studied in these works are memory consumption per device, communication latency (per attention block), effective throughput (TFLOPS and tokens/sec), and end-to-end iteration time under varying sequence lengths and device counts.
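As an illustration of how the throughput metrics are typically derived, the common "about 6 FLOPs per parameter per token" approximation for a training step converts a measured iteration time into tokens/sec and model TFLOPS. All numbers below are placeholder assumptions, not results from any of the cited papers:

```python
# Placeholder measurements (assumptions, not reported results).
params          = 7e9          # model parameters
seq_len         = 1_048_576    # tokens per sequence
batch_sequences = 1
iter_time_s     = 42.0         # measured end-to-end iteration time
num_gpus        = 32

tokens = seq_len * batch_sequences
tokens_per_sec = tokens / iter_time_s

# Common estimate: ~6 FLOPs per parameter per token for forward + backward;
# this ignores the attention-specific O(L^2) term, which is non-trivial at long context.
model_tflops_per_gpu = 6 * params * tokens / iter_time_s / num_gpus / 1e12

print(f"{tokens_per_sec:,.0f} tokens/sec, ~{model_tflops_per_gpu:.1f} model TFLOPS per GPU")
```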
7. Limitations, Challenges, and Future Directions
Challenges for classical SP include communication overhead at scale (especially under ring P2P or suboptimal collective implementations), divisibility constraints in head-parallel variants, and inefficient load-balancing with heterogeneous sequence lengths. Solutions have centered on reducing communication to minimal, sequence-length-independent kernel states (e.g., LASP-2, ZeCO), developing adaptive grouping strategies (FlexSP), and optimizing for hardware topology (TASP).
Future work includes the extension of all-scan and tree-based communication primitives to novel attention operators, further automated optimization of SP group formation (potentially via reinforcement learning or Bayesian search), and full integration of SP with emerging 4D and 5D parallelism frameworks—encompassing model, sequence, batch, context, and spatial axes. Continued advances in SP are fundamental for scaling transformers and multi-dimensional models in domains where context length is the critical bottleneck for both memory and system scaling.