DeepSpeed-Ulysses: LLM Long-Sequence Training

Updated 3 September 2025
  • DeepSpeed-Ulysses is a scalable framework that partitions input sequences and employs head-parallel attention, enabling efficient long-sequence training for transformer models.
  • It redistributes query, key, and value tensors through all-to-all communications to support both dense and sparse attention mechanisms, optimizing memory and compute usage.
  • The system achieves higher throughput and remarkable scaling improvements, inspiring extensions like ALST to train models with millions of tokens across multi-GPU setups.

DeepSpeed-Ulysses is a system-level optimization framework for transformer-based LLM training, designed specifically to enable efficient, scalable training with extremely long input sequences, beyond the limitations of conventional data, tensor, and pipeline parallelism. It is architected around sequence parallelism, deploys a head-parallel all-to-all strategy for the attention computation, and is attention-agnostic, integrating cleanly with both dense and sparse attention mechanisms and with optimization frameworks such as ZeRO. Its system design has inspired a new generation of long-sequence training frameworks and is a central building block for recent extensions such as ALST.

1. System Architecture and Methodological Foundations

DeepSpeed-Ulysses partitions each input sample along the sequence dimension across $P$ GPUs. For a sequence of length $N$, each GPU receives a partition of size $N/P$. This partitioning is maintained through the projection into query ($Q$), key ($K$), and value ($V$) embeddings. Immediately preceding the multi-head attention computation, DeepSpeed-Ulysses performs an efficient all-to-all collective communication in which the partitioned $Q$, $K$, $V$ are redistributed such that every GPU holds the complete sequence, but only for a non-overlapping subset of attention heads. Consequently, the attention calculation is performed in a head-parallel manner, with each GPU computing the full attention for its assigned heads.

After the attention computation, a second all-to-all collective exchanges the context tensor, re-partitioning it along the sequence dimension and restoring the layout required by subsequent layers such as MLP, layer normalization, or additional transformer blocks. This zig-zag partitioning (along the sequence for non-attention layers, by head for attention) enables memory and compute to scale with increasing sequence length and number of devices. The design is compatible with both dense and sparse attention kernels (e.g., FlashAttention2 and SDPA variants).
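
The pre-attention redistribution can be expressed as a single all-to-all. The following is a minimal PyTorch sketch of this exchange (an illustration, not the DeepSpeed-Ulysses implementation), assuming `group` is the sequence-parallel process group, the sequence-parallel degree divides both the sequence length and the head count, and `seq_shard_to_head_shard` is a name chosen here for exposition:

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x: torch.Tensor, group=None) -> torch.Tensor:
    """Pre-attention all-to-all: sequence-sharded -> head-sharded.

    x:       [N/P, h, d]   local sequence shard, all h heads
    returns: [N,   h/P, d] full sequence, local subset of heads
    """
    P = dist.get_world_size(group)
    n_local, h, d = x.shape
    assert h % P == 0, "sequence-parallel degree must divide the head count"

    # Split the heads into P contiguous groups and make the group index the
    # leading dimension, so all_to_all_single sends one head group per rank:
    # [N/P, h, d] -> [P, N/P, h/P, d]
    x = x.reshape(n_local, P, h // P, d).permute(1, 0, 2, 3).contiguous()

    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)

    # The leading dimension now indexes the sequence shards received from
    # every rank; concatenating them restores the full sequence for the
    # local heads: [P, N/P, h/P, d] -> [N, h/P, d]
    return out.reshape(P * n_local, h // P, d)
```

The post-attention all-to-all performs the inverse mapping, scattering the sequence back out and gathering the heads.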

A typical workflow per transformer layer comprises the following steps (a code sketch of this flow follows the list):

  1. Sequence-parallel projection: Partition the input tensor along the sequence dimension.
  2. All-to-all collective (pre-attention): Exchange the $Q$, $K$, $V$ tensors among devices for head-parallel attention.
  3. Head-parallel attention computation: Each GPU computes attention for its assigned heads over the full (reassembled) sequence, applying the standard attention equation:

\text{Attention}(Q, K, V) = \text{Softmax}\left(\frac{QK^\top}{\sqrt{d}}\right)V

  4. All-to-all collective (post-attention): Redistribute the resulting context tensor to sequence-aligned partitions.
  5. Downstream processing: Proceed with layer normalization, MLP, etc., using the newly re-partitioned activations.
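
A minimal sketch of steps 2-4 (an illustration, not DeepSpeed's actual kernels) can wrap PyTorch's scaled dot-product attention between the two exchanges. It reuses the `seq_shard_to_head_shard` helper sketched above and assumes a symmetric, hypothetical `head_shard_to_seq_shard` inverse that is not shown:

```python
import torch.nn.functional as F

def ulysses_attention(q, k, v, group=None):
    """Steps 2-4: pre-attention all-to-all, head-parallel attention,
    post-attention all-to-all. q, k, v: [N/P, h, d] local sequence shards."""
    # Step 2: gather the full sequence while keeping only a local head subset.
    q = seq_shard_to_head_shard(q, group)   # [N, h/P, d]
    k = seq_shard_to_head_shard(k, group)
    v = seq_shard_to_head_shard(v, group)

    # Step 3: any dense or sparse attention kernel works here; SDPA expects
    # [batch, heads, seq, dim], so add a batch dimension of 1.
    ctx = F.scaled_dot_product_attention(
        q.transpose(0, 1).unsqueeze(0),
        k.transpose(0, 1).unsqueeze(0),
        v.transpose(0, 1).unsqueeze(0),
        is_causal=True,
    ).squeeze(0).transpose(0, 1)            # back to [N, h/P, d]

    # Step 4: return to the sequence-sharded layout for LayerNorm / MLP.
    return head_shard_to_seq_shard(ctx, group)  # hypothetical inverse helper
```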

This strategy is generalizable and attention-mechanism-agnostic, making it adaptable to new attention architectures and hybrid parallel training stacks (Jacobs et al., 2023, Fang et al., 13 May 2024, Bekman et al., 16 Jun 2025).

2. Communication and Memory Scalability

DeepSpeed-Ulysses achieves a communication model that provides constant communication volume per link as sequence length $N$ and device count $P$ are scaled proportionally. The pre-attention all-to-all transmits $3Nh$ elements (for $Q$, $K$, $V$), and the post-attention exchange sends $Nh$ more, yielding a total per-link communication cost per layer of:

\text{Total communication/link} = \frac{3Nh + Nh}{P} = \frac{4Nh}{P}

This per-link volume remains constant when $N \propto P$, in contrast to approaches such as Megatron-LM, where communication overhead grows linearly with $N$ regardless of device scaling.

For memory optimization, partitioning the activations along the sequence dimension ensures that each device only stores a fraction of the total activation tensor ($\mathcal{O}(bs \cdot L \cdot d / P)$), directly alleviating per-device memory bottlenecks inherent in long-sequence processing. The system avoids redundant storage, and its all-to-all exchanges are designed to match the underlying high-bandwidth interconnect topologies typical in contemporary clusters.
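
A back-of-the-envelope illustration of both expressions, with assumed example values for $N$, $h$, and $P$ (not figures from the paper):

```python
# Assumed example values: 1M-token sequence, hidden size 8192, 64-way SP, bf16.
N, h, P, bytes_per_elem = 1_000_000, 8192, 64, 2

per_link = 4 * N * h / P * bytes_per_elem          # 4Nh/P elements per layer
print(f"per-link all-to-all volume: {per_link / 2**30:.2f} GiB")  # ~0.95 GiB

# Doubling N and P together leaves the per-link volume unchanged.
assert 4 * (2 * N) * h / (2 * P) == 4 * N * h / P

# Per-device activation footprint per layer scales as bs * L * d / P:
bs, L, d = 1, N, h
print(f"activations/layer/device: {bs * L * d / P * bytes_per_elem / 2**30:.2f} GiB")
```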

In later developments such as Arctic Long Sequence Training (ALST), this foundational sequence-parallel, attention-head-sharded approach is further augmented via sequence tiling (sharding large tensors, especially logits and MLP outputs, into tiles), PyTorch memory fragmentation optimizations, and activation checkpoint offloading (Bekman et al., 16 Jun 2025).
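
The tiling idea is easiest to see for the loss computation, where the $[N, V]$ logits tensor is otherwise the largest single activation. The sketch below (an illustration of the technique, not the ALST implementation) computes the vocabulary projection and cross-entropy tile by tile:

```python
import torch
import torch.nn.functional as F

def tiled_lm_loss(hidden, lm_head_weight, labels, tile_len=8192, ignore_index=-100):
    """Compute the LM loss in sequence tiles so the full [N, V] logits tensor
    is never materialized. Illustrative only; real systems also recompute or
    checkpoint the tiles to bound backward-pass memory.

    hidden:         [N, d] final hidden states of the local shard
    lm_head_weight: [V, d] output projection weight
    labels:         [N]    pre-shifted target token ids
    """
    total_loss = hidden.new_zeros(())
    total_tokens = 0
    for start in range(0, hidden.size(0), tile_len):
        h_tile = hidden[start:start + tile_len]
        y_tile = labels[start:start + tile_len]
        logits = h_tile @ lm_head_weight.t()   # only [tile_len, V] lives at once
        total_loss = total_loss + F.cross_entropy(
            logits.float(), y_tile, reduction="sum", ignore_index=ignore_index
        )
        total_tokens += int((y_tile != ignore_index).sum())
    return total_loss / max(total_tokens, 1)
```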

3. Performance Analysis

Empirical results demonstrate that DeepSpeed-Ulysses achieves up to a 2.5× training throughput improvement over previous state-of-the-art baselines, and is able to train with sequence lengths up to 4× longer. On dense attention, throughput remains consistently higher for both 7B and 30B parameter models, with competitive TFLOPS utilization (175 TFLOPS/GPU, approximately 54% of theoretical peak). For sparse attention, throughput gains over Megatron-LM exceed 2×.

ALST, a system built atop DeepSpeed-Ulysses, allows models such as Llama-8B to be trained with 500K tokens on a single H100 80GB GPU, and up to 15M tokens on a 4-node (32 GPU) cluster—hence providing over 400× sequence length scaling relative to open-source Hugging Face Transformers baselines. Table-based results in ALST show that tiled computation and sequence parallelism lower iteration times and maintain high TFLOPS efficiency even as sequence lengths increase by two orders of magnitude (Bekman et al., 16 Jun 2025).

| Hardware | Baseline Max Seq Len | ALST Max Seq Len | Iter Time (at M tokens) |
|---|---|---|---|
| Single H100 GPU | 32K | 500K | ~1:47:35 (3.7M) |
| 8×H100 node | 32K | 3.7M | |
| 32×H100 (4 nodes) | 32K | 15M | |

4. Extensions, Best Practices, and Integrations

Subsequent research synthesizes DeepSpeed-Ulysses with additional forms of parallelism and memory management. Unified approaches, such as that of (Fang et al., 13 May 2024), combine sequence parallelism (Ulysses and ring-based variants) with data, tensor, and pipeline parallelism into a "hybrid 4D parallelism" strategy, arranging devices as a 2D mesh to balance all-to-all and peer-to-peer communication. Load-balance reordering and dynamic token remapping are introduced to address causal mask–induced computational imbalances.
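
As an illustration of such a 2D layout (assumed dimension sizes and names, expressed with PyTorch's DeviceMesh rather than the cited systems' own APIs):

```python
from torch.distributed.device_mesh import init_device_mesh

# Hypothetical 16-GPU job launched with torchrun: 4-way data parallelism
# x 4-way Ulysses sequence parallelism. Dimension names are arbitrary labels.
mesh = init_device_mesh("cuda", (4, 4), mesh_dim_names=("dp", "sp"))

dp_group = mesh.get_group("dp")  # used for gradient reduction / ZeRO sharding
sp_group = mesh.get_group("sp")  # used for the Ulysses all-to-all exchanges
```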

ALST integrates DeepSpeed-Ulysses with Hugging Face Transformers via UlyssesSPAttentionHF, provides specialized DataLoader adapters (UlyssesSPDataLoaderAdapter), and mitigates causal loss misalignment by supporting pre-shifted labels. DeepSpeed ZeRO Stage 3 weight sharding further minimizes model state footprint, making very large-scale long-sequence training feasible using widely adopted frameworks (Bekman et al., 16 Jun 2025).
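
One way to realize pre-shifted labels (a sketch of the idea, not the UlyssesSPDataLoaderAdapter implementation) is to shift targets before the sequence is sharded, so every shard already carries its own labels and no cross-shard communication is needed for the loss:

```python
import torch

def preshift_labels(input_ids: torch.Tensor, ignore_index: int = -100) -> torch.Tensor:
    """Standard causal-LM label shift applied *before* sequence sharding."""
    labels = torch.full_like(input_ids, ignore_index)
    labels[..., :-1] = input_ids[..., 1:]   # token t predicts token t+1
    return labels
```

Each sequence shard can then pair `input_ids[i:j]` directly with `labels[i:j]`.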

Other best practices include:

  • Using grouped query attention (GQA) to reduce communication cost by a factor of $1/G$.
  • Exploiting activation checkpointing with offloading for further memory reduction, and leveraging PyTorch memory allocator improvements.
  • Hybridizing Ulysses with context-parallelism strategies such as 2D- and double-ring communication (as in LoongTrain (Gu et al., 26 Jun 2024)) to break head-count scalability limits.

5. Comparative Systems and Limitations

Compared to contemporary approaches (Megatron-LM, ColAI-SP, Ring-Attention, LoongTrain, SPPO), DeepSpeed-Ulysses offers unique head-parallel, attention-agnostic scalability, with constant per-link communication under proportional scaling. However, scaling degree is presently capped by the number of attention heads, and, in some scenarios (e.g., extremely high GQA ratios or when communication topologies are poorly optimized), other methods such as LoongTrain's 2D-Attention (head- and context-parallel) or double-ring communication may provide higher Model FLOPs Utilization (MFU).

SPPO introduces adaptive offloading and pipeline scheduling, achieving as much as 3.38× throughput improvement by minimizing pipeline bubbles and optimizing offload overlap, offering a competitive pathway for memory-bound long-sequence regimes (Chen et al., 13 Mar 2025).

ALST, as an extension of DeepSpeed-Ulysses, pushes practical sequence length boundaries—by 16× on single GPUs and nearly 470× on multi-node clusters versus Hugging Face baselines. The current limit on sequence parallelism degree—equal to the number of query heads—is an open area for further research, as is the overlap of offloading and computation (Bekman et al., 16 Jun 2025).

6. Scientific Applications and Broader Impact

DeepSpeed-Ulysses substantially expands the tractable domain for LLM training on tasks requiring long contexts, including retrieval-augmented generation, long document summarization, multi-turn conversation, and large-scale scientific modeling. The system is demonstrated in genomics, where genome-scale foundation models benefit from long-sequence support, and structural biology, where fused kernels and tiling allow efficient training of Evoformer-centric models such as OpenFold. The system's architecture underpins key progress in the DeepSpeed4Science initiative, which adapts and extends sequence-parallel and memory-optimization strategies to diverse scientific applications (e.g., genomics, drug discovery, climate, and catalyst modeling) (Song et al., 2023).

The ability to train LLMs or foundation models with multi-million token sequences without prohibitive memory requirements or excessive hardware, as enabled by DeepSpeed-Ulysses and its derivatives, allows researchers and practitioners outside large industrial labs to push the limits of long-context modeling.

7. Future Directions

Planned and suggested improvements to DeepSpeed-Ulysses and its ecosystem include:

  • Relaxed Sequence Parallelism Limits: Lifting the constraint that the number of parallel devices is capped by the number of attention heads.
  • Optimized Activation Offloading: Employing asynchronous transfer mechanisms (e.g., CUDA streams) to eliminate communication–computation stalls during checkpoint offload/reload; a minimal pattern is sketched after this list.
  • Deeper Integration: Embedding support natively within Hugging Face Accelerate, the Transformers Trainer API, and similar libraries for broader adoption.
  • Beyond Matrix Multiplications: Optimizing computational and communication components beyond GEMM-dominated flows, including loss computation, data loading, and non-GPU overheads.
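
For the offloading item above, one possible pattern (an assumption for illustration, not DeepSpeed's implementation) uses a dedicated CUDA stream and pinned host buffers so copies overlap with ongoing compute:

```python
import torch

offload_stream = torch.cuda.Stream()

def offload_activation(act: torch.Tensor) -> torch.Tensor:
    """Asynchronously copy an activation to pinned host memory on a side
    stream so the transfer overlaps with ongoing computation."""
    cpu_buf = torch.empty(act.shape, dtype=act.dtype, device="cpu", pin_memory=True)
    offload_stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(offload_stream):
        cpu_buf.copy_(act, non_blocking=True)
    # Keep act's memory alive until the side-stream copy has finished.
    act.record_stream(offload_stream)
    return cpu_buf

def reload_activation(cpu_buf: torch.Tensor, device: str = "cuda") -> torch.Tensor:
    """Prefetch the activation back to the GPU ahead of the backward pass."""
    with torch.cuda.stream(offload_stream):
        gpu = cpu_buf.to(device, non_blocking=True)
    torch.cuda.current_stream().wait_stream(offload_stream)
    return gpu
```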

These directions are aimed at closing remaining efficiency gaps and further democratizing long-sequence model training in both open-source and scientific research contexts (Bekman et al., 16 Jun 2025).


DeepSpeed-Ulysses represents a key development in the long-sequence transformer systems landscape, providing an extensible and scalable template for efficient long-context generative modeling and domain-specific foundation models. Its design principles—sequence-dimension partitioning, head-parallel attention computation, and scalable, attention-agnostic communication—form the core of current and emerging frameworks for extreme sequence LLM training.