Ulysses Sequence Parallelism

Updated 1 July 2025
  • Ulysses Sequence Parallelism (Ulysses SP) is an architectural method that partitions the sequence dimension of transformer models across multiple GPUs to enable training with extremely long inputs.
  • It utilizes an efficient all-to-all communication strategy where per-GPU communication volume remains constant when sequence length and devices scale proportionally, overcoming bottlenecks of prior methods.
  • This scalable approach makes training models with millions of tokens feasible for applications like long document analysis, genomics, and advanced conversational AI.

Ulysses Sequence Parallelism (Ulysses SP) is an architectural and systems parallelization method developed to enable efficient, scalable training of transformer models with extremely long input sequences, particularly in the context of LLMs and similar architectures. It was introduced in the DeepSpeed-Ulysses system (2309.14509), establishing a new paradigm for sequence-dimension parallelism in contrast to traditional techniques focused on batch size, hidden dimension, or model depth.

1. Fundamentals of Ulysses Sequence Parallelism

Ulysses SP addresses the need for training transformer models with sequence lengths ranging from hundreds of thousands to millions of tokens. Traditional parallelism techniques—data parallelism (batch), tensor parallelism (hidden size), and pipeline parallelism (model depth)—are inadequate for this axis due to memory or communication bottlenecks when handling extreme sequence lengths. Ulysses SP introduces a principled method of partitioning the sequence dimension itself across multiple GPUs, enabling computation and memory scaling along the sequence axis.

The core idea involves organizing the training workflow such that each GPU processes a distinct shard of the sequence for each input sample. For a total sequence length $N$ and $P$ GPUs, each device operates on $N/P$ tokens. This design is agnostic to the underlying attention mechanism and can be applied to both dense and sparse attention operations, including modern optimized kernels such as FlashAttention v2.
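
As a concrete illustration of this partitioning, the minimal sketch below slices a [batch, N, hidden] activation so that each rank keeps an $N/P$-token shard. It assumes torch.distributed has already been initialized and that N divides evenly by P; the function name shard_sequence is illustrative, not part of any released API.

```python
import torch
import torch.distributed as dist

def shard_sequence(x: torch.Tensor) -> torch.Tensor:
    """Return this rank's slice of a [batch, N, hidden] tensor along the sequence dim."""
    P = dist.get_world_size()
    rank = dist.get_rank()
    N = x.shape[1]
    assert N % P == 0, "sequence length must be divisible by the SP degree"
    shard = N // P
    return x[:, rank * shard:(rank + 1) * shard, :].contiguous()
```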

2. Distributed Attention and Communication Analysis

Within Ulysses SP, attention computation requires that each GPU eventually gain access to the full sequence for a subset of attention heads, while still maintaining sequence partitioning for memory efficiency. This is achieved through the following distributed workflow; a simplified code sketch follows the list:

  1. Each GPU computes $Q$, $K$, $V$ projections for its local sequence slice.
  2. An all-to-all collective communication operation efficiently re-partitions these projections such that each device receives the full sequence for an assigned subset of heads.
  3. Local attention computations are performed independently per head across GPUs.
  4. Another all-to-all operation redistributes the output back to the original sequence split for subsequent feedforward and normalization layers.
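
The sketch below walks through steps 1–4 for a single sample per rank, using two torch.distributed.all_to_all exchanges and PyTorch's scaled_dot_product_attention as the local kernel. The helper names (seq_to_head_parallel, head_to_seq_parallel, ulysses_attention) are illustrative assumptions, not the DeepSpeed-Ulysses API, and the head count H is assumed to be divisible by the parallel degree P.

```python
import torch
import torch.distributed as dist
import torch.nn.functional as F

def seq_to_head_parallel(x: torch.Tensor) -> torch.Tensor:
    """All-to-all: [N/P, H, d] (local tokens, all heads) -> [N, H/P, d] (all tokens, local heads)."""
    P = dist.get_world_size()
    send = [t.contiguous() for t in x.chunk(P, dim=1)]    # P chunks of [N/P, H/P, d]
    recv = [torch.empty_like(send[0]) for _ in range(P)]
    dist.all_to_all(recv, send)
    # recv[i] holds rank i's token shard for this rank's head group;
    # concatenating by source rank restores full sequence order.
    return torch.cat(recv, dim=0)                          # [N, H/P, d]

def head_to_seq_parallel(x: torch.Tensor) -> torch.Tensor:
    """Inverse all-to-all: [N, H/P, d] -> [N/P, H, d]."""
    P = dist.get_world_size()
    send = [t.contiguous() for t in x.chunk(P, dim=0)]     # P chunks of [N/P, H/P, d]
    recv = [torch.empty_like(send[0]) for _ in range(P)]
    dist.all_to_all(recv, send)
    return torch.cat(recv, dim=1)                           # [N/P, H, d]

def ulysses_attention(q, k, v):                             # each: [N/P, H, d]
    q, k, v = (seq_to_head_parallel(t) for t in (q, k, v))  # each: [N, H/P, d]
    out = F.scaled_dot_product_attention(                   # local attention per head
        q.transpose(0, 1), k.transpose(0, 1), v.transpose(0, 1)
    ).transpose(0, 1)                                       # [N, H/P, d]
    return head_to_seq_parallel(out)                        # back to [N/P, H, d]
```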

The communication volume for each step is tightly bounded. For a transformer with sequence length $N$, hidden dimension $h$, and parallel degree $P$, the per-layer total message size is $4Nh$. However, in all-to-all, each device sends and receives only $1/P$ of the total volume. Thus, the per-link communication complexity becomes:

$$\text{Communication per GPU/link} = \frac{4Nh}{P}$$

A salient property of Ulysses SP, demonstrated through both theoretical analysis and empirical measurement, is that when the sequence length $N$ and the number of devices $P$ are increased proportionally, the per-link communication volume remains constant. This enables highly scalable, high-throughput training at extreme sequence lengths. In contrast, prior sequence parallelism approaches, such as Megatron-LM SP, incur $O(N)$ per-link communication, which becomes the principal bottleneck at scale.
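
A small worked example of this property, assuming an illustrative hidden dimension of 4096 and bf16 (2-byte) activations: scaling $N$ and $P$ by the same factor leaves the per-link volume unchanged.

```python
def per_link_bytes(N: int, h: int, P: int, bytes_per_elem: int = 2) -> int:
    """Per-GPU all-to-all volume per transformer layer: 4*N*h/P elements."""
    return 4 * N * h * bytes_per_elem // P

# Scaling N and P by the same factor (here 8x) leaves the per-link volume unchanged.
assert per_link_bytes(N=262_144, h=4096, P=8) == per_link_bytes(N=2_097_152, h=4096, P=64)
```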

3. Performance Characteristics and Benchmark Results

Ulysses Sequence Parallelism has been empirically evaluated on GPT-style models up to 1.2B parameters and sequence lengths of up to 1 million tokens. In benchmarking against established baselines, key performance results include:

  • Throughput: Ulysses SP trains models 2.5× faster than Megatron-LM Sequence Parallelism for identical sequence lengths, and supports 4× longer sequence lengths on the same hardware. For sparse attention, the throughput improvement is at least 2× for the same sequence length.
  • Scaling Efficiency: Per-GPU throughput remains above 54% of hardware theoretical peak at extreme scales. Both strong scaling (fixed sequence length, increased GPUs) and weak scaling (increase in both GPUs and sequence length) demonstrate near-linear scaling and high parallel efficiency.
  • Training Convergence: Models trained under the Ulysses SP regime exhibit convergence rates and final quality equivalent to non-sequence-parallel baselines; for example, GPT-1.3B trained with a 32K sequence length on 8 GPUs showed no degradation in convergence.
  • Extensibility: Integration with ZeRO-3 allows distributed parameter and optimizer state sharding, further enabling the simultaneous training of large models and long context windows.

4. Comparison with Other Sequence Parallel Methods

A comparative analysis highlights Ulysses SP’s computational and system advantages:

Method         Comm. complexity   Attention-agnostic   Memory-efficient   Usability / ease of use
ColAI-SP       O(N)               No                   No                 Nontrivial
Megatron-SP    O(N)               Limited              Limited            Tied to Megatron-LM
Ulysses SP     O(N/P)             Yes                  Yes                Easy

The all-to-all-based approach allows per-link communication volume to decrease as the number of devices grows, unlike prior approaches whose communication overhead is invariant to the parallel degree. In addition, Ulysses is agnostic to specific attention implementations, supporting both dense and sparse variants, and requires minimal code changes for integration.

5. Scalability, Practical Implications, and Supported Applications

The principal scaling feature of Ulysses SP is its constant per-GPU communication cost for a fixed ratio of sequence length to GPU count. Empirical validation extends up to 256 GPUs, encompassing both small- and large-scale cluster topologies. This architectural property makes it uniquely suitable for domains demanding long-context models, including:

  • Long document and book summarization
  • Conversational AI with extended context memory
  • Multimodal foundation models spanning disparate modalities, e.g., text–image–audio
  • Genomics (e.g., entire human genome spanning ~6 billion tokens)
  • Electronic health records and patient longitudinal data
  • High-resolution weather and climate modeling

Ulysses SP also provides system portability: implementation is attention-kernel agnostic, compatible with standard distributed infrastructures, and can often be adopted by applying simple wrappers to existing transformer layers.
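
One plausible form of such a wrapper is sketched below; it reuses the seq_to_head_parallel and head_to_seq_parallel helpers from the sketch in Section 2, and the class name SequenceParallelAttention is an illustrative assumption rather than the DeepSpeed interface.

```python
import torch.nn as nn

class SequenceParallelAttention(nn.Module):
    """Wraps an existing per-head attention callable with the two all-to-all exchanges."""
    def __init__(self, local_attention):
        super().__init__()
        self.local_attention = local_attention   # e.g. a FlashAttention-style kernel

    def forward(self, q, k, v):                  # each: [N/P, H, d] for the local token shard
        q, k, v = (seq_to_head_parallel(t) for t in (q, k, v))   # [N, H/P, d]
        out = self.local_attention(q, k, v)      # full-sequence attention on local heads
        return head_to_seq_parallel(out)         # back to [N/P, H, d]
```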

6. Limitations and Evolution

While Ulysses SP offers a substantial improvement in scalability, there are limitations and areas where further innovation has emerged:

  • The degree of sequence parallelism is upper-bounded by the number of attention heads (due to splitting QKV along the head dimension). This constraint can be problematic in architectures adopting grouped or multi-query attention with few KV heads.
  • In heterogeneous or multi-dimensional sequence domains (e.g., video transformers with spatial–temporal grids, or workloads with highly variable sequence lengths), follow-up work such as Dynamic Sequence Parallelism (DSP) (2403.10266) and heterogeneity-adaptive SP (2412.01523) has extended the static, single-dimension splitting of Ulysses to dynamic and workload-adaptive modes.
  • For post-training and fine-tuning scenarios, practical tooling (e.g., 360-LLaMA-Factory (2505.22296)) has introduced variants like Dummy-Head-Ulysses to sidestep head-count division constraints and to improve compatibility with arbitrary model architectures; a minimal sketch of the padding idea follows this list.
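
The following sketch illustrates the head-count constraint and the general padding idea behind dummy-head variants; it is an assumption-laden illustration, not the 360-LLaMA-Factory implementation.

```python
import torch

def pad_kv_heads(kv: torch.Tensor, sp_degree: int) -> torch.Tensor:
    """Illustration only: if the KV head count does not divide evenly across the SP
    degree, append zero-valued dummy heads so the head-dimension all-to-all split is
    possible; outputs of the dummy heads would be discarded after the reverse all-to-all."""
    n_local, num_kv_heads, d = kv.shape
    remainder = num_kv_heads % sp_degree
    if remainder == 0:
        return kv                                # constraint already satisfied
    pad = sp_degree - remainder
    dummy = torch.zeros(n_local, pad, d, dtype=kv.dtype, device=kv.device)
    return torch.cat([kv, dummy], dim=1)
```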

7. Impact and Role within the Sequence Parallelism Ecosystem

Ulysses Sequence Parallelism serves as a foundational innovation in the LLM systems literature, with subsequent advances generalizing or building upon its communication-scalable all-to-all scheme. It has shifted the paradigm from model-scaling to context-scaling, enabling not only longer contexts but also making these capabilities feasible outside of the largest industrial research labs.

Implementations incorporating or derived from Ulysses SP (DeepSpeed-Ulysses, Arctic Long Sequence Training (2506.13996), 360-LLaMA-Factory, and others) have become standard in modern high-efficiency LLM training pipelines, particularly for research, open-source development, and deployment of context-rich models in diverse domains.


Ulysses SP provides the key architectural and system framework for scalable long-sequence transformer training, overcoming both memory and interconnect bottlenecks inherent in prior approaches. As the field of large sequence models continues to mature, Ulysses SP and its communication-optimized derivatives remain central to sustaining further advances in long-context neural language modeling.