Ulysses Sequence Parallelism

Updated 1 July 2025
  • Ulysses Sequence Parallelism (Ulysses SP) is an architectural method that partitions the sequence dimension of transformer models across multiple GPUs to enable training with extremely long inputs.
  • It utilizes an efficient all-to-all communication strategy where per-GPU communication volume remains constant when sequence length and devices scale proportionally, overcoming bottlenecks of prior methods.
  • This scalable approach makes training models with millions of tokens feasible for applications like long document analysis, genomics, and advanced conversational AI.

Ulysses Sequence Parallelism (Ulysses SP) is an architectural and systems parallelization method developed to enable efficient, scalable training of transformer models with extremely long input sequences, particularly in the context of LLMs and similar architectures. It was introduced in the DeepSpeed-Ulysses system (2309.14509), establishing a new paradigm for sequence-dimension parallelism in contrast to traditional techniques focused on batch size, hidden dimension, or model depth.

1. Fundamentals of Ulysses Sequence Parallelism

Ulysses SP addresses the need for training transformer models with sequence lengths ranging from hundreds of thousands to millions of tokens. Traditional parallelism techniques—data parallelism (batch), tensor parallelism (hidden size), and pipeline parallelism (model depth)—are inadequate for this axis due to memory or communication bottlenecks when handling extreme sequence lengths. Ulysses SP introduces a principled method of partitioning the sequence dimension itself across multiple GPUs, enabling computation and memory scaling along the sequence axis.

The core idea involves organizing the training workflow such that each GPU processes a distinct shard of the sequence for each input sample. For a total sequence length $N$ and $P$ GPUs, each device operates on $N/P$ tokens. This design is agnostic to the underlying attention mechanism and can be applied to both dense and sparse attention operations, including modern optimized kernels such as FlashAttention v2.
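
As a concrete illustration of this partitioning, the minimal sketch below slices a [batch, N, hidden] activation so that each rank keeps an $N/P$-token shard. It assumes torch.distributed has already been initialized and that N divides evenly by P; the function name shard_sequence is illustrative, not part of any released API.

```python
import torch
import torch.distributed as dist

def shard_sequence(x: torch.Tensor) -> torch.Tensor:
    """Return this rank's slice of a [batch, N, hidden] tensor along the sequence dim."""
    P = dist.get_world_size()
    rank = dist.get_rank()
    N = x.shape[1]
    assert N % P == 0, "sequence length must be divisible by the SP degree"
    shard = N // P
    return x[:, rank * shard:(rank + 1) * shard, :].contiguous()
```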

2. Distributed Attention and Communication Analysis

Within Ulysses SP, attention computation requires that each GPU eventually gain access to the full sequence for a subset of attention heads, while still maintaining sequence partitioning for memory efficiency. This is achieved through the following distributed workflow; a simplified code sketch follows the list:

  1. Each GPU computes $Q$, $K$, $V$ projections for its local sequence slice.
  2. An all-to-all collective communication operation efficiently re-partitions these projections such that each device receives the full sequence for an assigned subset of heads.
  3. Local attention computations are performed independently per head across GPUs.
  4. Another all-to-all operation redistributes the output back to the original sequence split for subsequent feedforward and normalization layers.
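
The sketch below walks through steps 1–4 for a single sample per rank, using two torch.distributed.all_to_all exchanges and PyTorch's scaled_dot_product_attention as the local kernel. The helper names (seq_to_head_parallel, head_to_seq_parallel, ulysses_attention) are illustrative assumptions, not the DeepSpeed-Ulysses API, and the head count H is assumed to be divisible by the parallel degree P.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def seq_to_head_parallel(x: torch.Tensor) -> torch.Tensor:
    """All-to-all: [N/P, H, d] (local tokens, all heads) -> [N, H/P, d] (all tokens, local heads)."""
    P = dist.get_world_size()
    send = [t.contiguous() for t in x.chunk(P, dim=1)]    # P chunks of [N/P, H/P, d]
    recv = [torch.empty_like(send[0]) for _ in range(P)]
    dist.all_to_all(recv, send)
    # recv[i] holds rank i's token shard for this rank's head group;
    # concatenating by source rank restores full sequence order.
    return torch.cat(recv, dim=0)                          # [N, H/P, d]

def head_to_seq_parallel(x: torch.Tensor) -> torch.Tensor:
    """Inverse all-to-all: [N, H/P, d] -> [N/P, H, d]."""
    P = dist.get_world_size()
    send = [t.contiguous() for t in x.chunk(P, dim=0)]     # P chunks of [N/P, H/P, d]
    recv = [torch.empty_like(send[0]) for _ in range(P)]
    dist.all_to_all(recv, send)
    return torch.cat(recv, dim=1)                           # [N/P, H, d]

def ulysses_attention(q, k, v):                             # each: [N/P, H, d]
    q, k, v = (seq_to_head_parallel(t) for t in (q, k, v))  # each: [N, H/P, d]
    out = F.scaled_dot_product_attention(                   # local attention per head
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)                                       # [N, H/P, d]
    return head_to_seq_parallel(out)                        # back to [N/P, H, d]
```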

The communication volume for each step is tightly bounded. For a transformer with sequence length $N$, hidden dimension $h$, and parallel degree $P$, the per-layer total message size is $4Nh$. However, in all-to-all, each device sends and receives only $1/P$ of the total volume. Thus, the per-link communication complexity becomes:

$$\text{Communication per GPU/link} = \frac{4Nh}{P}$$

A salient property of Ulysses SP, demonstrated through both theoretical analysis and empirical measurement, is that when the sequence length $N$ and the number of devices $P$ are increased proportionally, the per-link communication volume remains constant. This enables highly scalable, high-throughput training at extreme sequence lengths. In contrast, prior sequence parallelism approaches, such as Megatron-LM SP, incur $O(N)$ per-link communication, which becomes the principal bottleneck at scale.
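
A small worked example of this property, assuming an illustrative hidden dimension of 4096 and bf16 (2-byte) activations: scaling $N$ and $P$ by the same factor leaves the per-link volume unchanged.

```python
def per_link_bytes(N: int, h: int, P: int, bytes_per_elem: int = 2) -> int:
    """Per-GPU all-to-all volume per transformer layer: 4*N*h/P elements."""
    return 4 * N * h * bytes_per_elem // P

# Scaling N and P by the same factor (here 8x) leaves the per-link volume unchanged.
assert per_link_bytes(N=262_144, h=4096, P=8) == per_link_bytes(N=2_097_152, h=4096, P=64)
```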

3. Performance Characteristics and Benchmark Results

Ulysses Sequence Parallelism has been empirically evaluated on GPT-style models up to 1.2B parameters and sequence lengths of up to 1 million tokens. In benchmarking against established baselines, key performance results include:

  • Throughput: Ulysses SP trains models 2.5× faster than Megatron-LM Sequence Parallelism for identical sequence lengths, and supports 4× longer sequence lengths on the same hardware. For sparse attention, the throughput improvement is at least 2× for the same sequence length.
  • Scaling Efficiency: Per-GPU throughput remains above 54% of hardware theoretical peak at extreme scales. Both strong scaling (fixed sequence length, increased GPUs) and weak scaling (increase in both GPUs and sequence length) demonstrate near-linear scaling and high parallel efficiency.
  • Training Convergence: Models trained under the Ulysses SP regime exhibit convergence rates and final quality equivalent to non-sequence-parallel baselines; for example, GPT-1.3B trained with a 32K sequence length on 8 GPUs showed no degradation in convergence.
  • Extensibility: Integration with ZeRO-3 allows distributed parameter and optimizer state sharding, further enabling the simultaneous training of large models and long context windows.

4. Comparison with Other Sequence Parallel Methods

A comparative analysis highlights Ulysses SP’s computational and system advantages:

Method         Comm. complexity   Attention-agnostic   Memory-efficient   Usability / ease of use
ColAI-SP       O(N)               No                   No                 Nontrivial
Megatron-SP    O(N)               Limited              Limited            Tied to Megatron-LM
Ulysses SP     O(N/P)             Yes                  Yes                Easy

The all-to-all-based approach allows per-link communication volume to decrease as the number of devices grows, unlike prior approaches whose communication overhead is invariant to the parallel degree. In addition, Ulysses is agnostic to specific attention implementations, supporting both dense and sparse variants, and requires minimal code changes for integration.

5. Scalability, Practical Implications, and Supported Applications

The principal scaling feature of Ulysses SP is its constant per-GPU communication cost for a fixed ratio of sequence length to GPU count. Empirical validation extends up to 256 GPUs, encompassing both small- and large-scale cluster topologies. This architectural property makes it uniquely suitable for domains demanding long-context models, including:

  • Long document and book summarization
  • Conversational AI with extended context memory
  • Multimodal foundation models spanning disparate modalities, e.g., text–image–audio
  • Genomics (e.g., entire human genome spanning ~6 billion tokens)
  • Electronic health records and patient longitudinal data
  • High-resolution weather and climate modeling

Ulysses SP also provides system portability: implementation is attention-kernel agnostic, compatible with standard distributed infrastructures, and can often be adopted by applying simple wrappers to existing transformer layers.
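
One plausible form of such a wrapper is sketched below; it reuses the seq_to_head_parallel and head_to_seq_parallel helpers from the sketch in Section 2, and the class name SequenceParallelAttention is an illustrative assumption rather than the DeepSpeed interface.

```python
import torch.nn as nn

class SequenceParallelAttention(nn.Module):
    """Wraps an existing per-head attention callable with the two all-to-all exchanges."""
    def __init__(self, local_attention):
        super().__init__()
        self.local_attention = local_attention   # e.g. a FlashAttention-style kernel

    def forward(self, q, k, v):                  # each: [N/P, H, d] for the local token shard
        q, k, v = (seq_to_head_parallel(t) for t in (q, k, v))   # [N, H/P, d]
        out = self.local_attention(q, k, v)      # full-sequence attention on local heads
        return head_to_seq_parallel(out)         # back to [N/P, H, d]
```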

6. Limitations and Evolution

While Ulysses SP offers a substantial improvement in scalability, there are limitations and areas where further innovation has emerged:

  • The degree of sequence parallelism is upper-bounded by the number of attention heads (due to splitting QKV along the head dimension). This constraint can be problematic in architectures adopting grouped or multi-query attention with few KV heads.
  • In heterogeneous or multi-dimensional sequence domains (e.g., video transformers with spatial–temporal grids, or workloads with highly variable sequence lengths), follow-up work such as Dynamic Sequence Parallelism (DSP) (2403.10266) and heterogeneity-adaptive SP (2412.01523) has extended the static, single-dimension splitting of Ulysses to dynamic and workload-adaptive modes.
  • For post-training and fine-tuning scenarios, practical tooling (e.g., 360-LLaMA-Factory (2505.22296)) has introduced variants like Dummy-Head-Ulysses to sidestep head-count division constraints and to improve compatibility with arbitrary model architectures; a minimal sketch of the padding idea follows this list.
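
The following sketch illustrates the head-count constraint and the general padding idea behind dummy-head variants; it is an assumption-laden illustration, not the 360-LLaMA-Factory implementation.

```python
import torch

def pad_kv_heads(kv: torch.Tensor, sp_degree: int) -> torch.Tensor:
    """Illustration only: if the KV head count does not divide evenly across the SP
    degree, append zero-valued dummy heads so the head-dimension all-to-all split is
    possible; outputs of the dummy heads would be discarded after the reverse all-to-all."""
    n_local, num_kv_heads, d = kv.shape
    remainder = num_kv_heads % sp_degree
    if remainder == 0:
        return kv                                # constraint already satisfied
    pad = sp_degree - remainder
    dummy = torch.zeros(n_local, pad, d, dtype=kv.dtype, device=kv.device)
    return torch.cat([kv, dummy], dim=1)
```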

7. Impact and Role within the Sequence Parallelism Ecosystem

Ulysses Sequence Parallelism serves as a foundational innovation in the LLM systems literature, with subsequent advances generalizing or building upon its communication-scalable all-to-all scheme. It has shifted the paradigm from model-scaling to context-scaling, enabling not only longer contexts but also making these capabilities feasible outside of the largest industrial research labs.

Implementations incorporating or derived from Ulysses SP (DeepSpeed-Ulysses, Arctic Long Sequence Training (2506.13996), 360-LLaMA-Factory, and others) have become standard in modern high-efficiency LLM training pipelines, particularly for research, open-source development, and deployment of context-rich models in diverse domains.


Ulysses SP provides the key architectural and system framework for scalable long-sequence transformer training, overcoming both memory and interconnect bottlenecks inherent in prior approaches. As the field of large sequence models continues to mature, Ulysses SP and its communication-optimized derivatives remain central to sustaining further advances in long-context neural language modeling.