DeepSpeed-Ulysses Sequence Parallelism
- DeepSpeed-Ulysses Sequence Parallelism is a method that shards long input sequences across GPUs to enable full-attention training on contexts up to millions of tokens.
- It leverages sequence tiling, activation checkpoint offload, and two-phase all-to-all communication to reduce per-GPU memory usage while maintaining high throughput.
- Empirical results demonstrate over 400× longer context scaling for models like Llama-8B, with applications spanning NLP and high-resolution vision tasks.
DeepSpeed-Ulysses Sequence Parallelism denotes a collection of system-level, memory- and communication-efficient techniques for training transformer models on extremely long sequences (hundreds of thousands up to multi-million-token contexts) by sharding the sequence dimension across multiple GPUs. The approach relies on all-to-all collectives to redistribute attention-related activations among devices, unlocking $O(L/G)$ per-device activation memory scaling, where $L$ is the sequence length and $G$ the number of devices. Ulysses Sequence Parallelism integrates seamlessly with modern Hugging Face Transformers, ZeRO optimizer parameter partitioning, and high-efficiency attention kernels such as FlashAttention 2. Its flagship instantiation in Arctic Long Sequence Training (ALST) demonstrates over 400-fold improvement in trainable context length for Llama-8B models running on vanilla H100 clusters and generalizes to scientific vision transformers and unified multi-parallel setups (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
1. Core Algorithm and Sequence Sharding
The central principle of DeepSpeed-Ulysses Sequence Parallelism is sharding the input sequence along its length such that each of $G$ GPUs receives a local segment of the batch of shape $L/G \times H$, with $L$ the global sequence length and $H$ the hidden size. All non-attention network layers (embeddings, layernorm, feed-forward) operate independently on their local fragment. When processing a self-attention layer, global sequence context is required because each position may attend to all others. Ulysses SP implements a two-phase all-to-all procedure:
- During "pre-attention redistribution", each GPU sends its local QKV () and receives the entire sequence ( tokens) but only a $1/G$ fraction of the heads. This switches computation into a "head-parallel" layout.
- Self-attention computation, typically via fused kernels like FlashAttention2, is executed locally over the full sequence, but only for the GPU’s assigned head subset.
- A "post-attention restoration" all-to-all reverses the transformation, returning activations to the original "sequence-parallel" layout: each GPU resumes working on its original sequence segment, but all heads.
This pattern repeats across each transformer block. In between attention layers, computation remains in the sequence-sharded layout, while attention layers require dynamic repartitioning (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
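To make the layout transitions concrete, the following is a minimal sketch of the two redistributions using plain `torch.distributed` all-to-all collectives. It is illustrative only, not the DeepSpeed implementation: the function names, the assumed local tensor shape `[L/G, heads, head_dim]`, and the requirement that the head count is divisible by the group size are simplifications made for this sketch.

```python
# Sketch of the Ulysses pre-/post-attention layout swaps (not the DeepSpeed code).
# Assumes an initialized process group `group` of size G and h % G == 0.
import torch
import torch.distributed as dist

def seq_to_head_parallel(x, group):
    """[L/G, h, d] (sequence-sharded) -> [L, h/G, d] (head-sharded)."""
    G = dist.get_world_size(group)
    Lg, h, d = x.shape                          # local sequence shard, all heads
    # Chunk g holds our sequence shard restricted to head group g; it is sent to rank g.
    x = x.reshape(Lg, G, h // G, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # We received every rank's sequence shard for our own head group; stitch the sequence.
    return out.reshape(G * Lg, h // G, d)

def head_to_seq_parallel(x, group):
    """[L, h/G, d] (head-sharded) -> [L/G, h, d] (sequence-sharded)."""
    G = dist.get_world_size(group)
    L, hg, d = x.shape                          # full sequence, local head group
    # Chunk g holds sequence shard g for our head group; it is sent back to rank g.
    x = x.reshape(G, L // G, hg, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # We received our own sequence shard from every rank, one head group per rank.
    return out.permute(1, 0, 2, 3).reshape(L // G, G * hg, d)
```

A production implementation additionally handles batch dimensions, uneven shards, and packing Q, K, and V into fewer collectives; the sketch omits these details.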
Integration into popular frameworks is explicit: in Hugging Face, the UlyssesSPAttentionHF class is injected as a valid backend, and sequence sharding is orchestrated via a specialized DataLoader adapter. Loss computation is also sharded on a per-segment basis to maintain memory efficiency (Bekman et al., 16 Jun 2025).
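As an illustration of per-segment loss sharding, the snippet below computes a masked cross-entropy over the local sequence shard and combines shards with a token-count-weighted all-reduce. It is a hedged sketch assuming labels have already been shifted and sharded alongside the inputs; `sharded_causal_lm_loss` is a hypothetical helper, not part of the ALST code.

```python
# Shard-local causal LM loss: each rank reduces only over its own sequence segment,
# then ranks combine (sum of losses) / (sum of unmasked tokens) into the global mean.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sharded_causal_lm_loss(shard_logits, shard_labels, group=None):
    # shard_logits: [batch, local_seq, vocab]; shard_labels: [batch, local_seq], -100 = masked.
    loss_sum = F.cross_entropy(
        shard_logits.flatten(0, -2), shard_labels.flatten(),
        ignore_index=-100, reduction="sum",
    )
    n_tokens = (shard_labels != -100).sum()
    stats = torch.stack([loss_sum, n_tokens.to(loss_sum.dtype)])
    dist.all_reduce(stats, group=group)          # sum losses and token counts across shards
    return stats[0] / stats[1]
```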
2. Memory Scaling and Optimization Strategies
DeepSpeed-Ulysses unlocks long-sequence training by reducing per-GPU memory pressure in two ways:
Single-GPU Memory Optimization
- Sequence Tiling: Any layer operating independently per token (e.g., MLPs, embeddings, logits/loss) is split into tiles along the sequence dimension. The memory footprint becomes $O(tH)$ per tile (with tile size $t \ll L$), instead of $O(LH)$ globally (see the sketch after this list).
- Activation Checkpoint Offload: Standard checkpointing retains an $L \times H$ input activation on GPU for each checkpointed layer. Ulysses monkey-patches checkpointing to offload these activations to CPU after the forward pass and asynchronously retrieve them for the backward pass, eliminating the "memory hill" on GPU (Bekman et al., 16 Jun 2025).
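A minimal sketch of the tiling idea, assuming a purely token-wise `layer` (MLP, embedding, or logits head) and plain PyTorch checkpointing; `tiled_forward` and `tile_size` are illustrative names, not the ALST API.

```python
# Sequence tiling sketch: run a token-wise layer tile by tile so that intermediate
# activations scale with the tile size rather than the full local sequence length.
import torch
from torch.utils.checkpoint import checkpoint

def tiled_forward(layer, hidden_states, tile_size):
    # hidden_states: [batch, local_seq_len, hidden]
    outputs = []
    for start in range(0, hidden_states.size(1), tile_size):
        tile = hidden_states[:, start:start + tile_size]
        # Checkpoint each tile: its intermediates are recomputed in backward, not stored.
        outputs.append(checkpoint(layer, tile, use_reentrant=False))
    return torch.cat(outputs, dim=1)
```

For the offload side, PyTorch's generic `torch.autograd.graph.save_on_cpu(pin_memory=True)` context manager provides a comparable (if less specialized) mechanism: tensors saved for backward are parked in pinned host memory and copied back to GPU on demand during the backward pass.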
Multi-GPU Sequence Parallelism
- Outside Attention: Each GPU holds only $1/G$ of the activations; for $N$ layers, this is roughly $N \cdot (L/G) \cdot H \cdot s$ bytes (with $s$ bytes per element).
- During Attention: Full sequence is present but only a local subset of heads.
- Peak Per-GPU Activation Memory: on the order of $L H s / hp$ during the attention phase (where $hp$ is the head-parallel degree, typically $hp = G$) (Bekman et al., 16 Jun 2025).
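As a worked example of the peak formula (illustrative values: $H = 4096$ and bf16 so $s = 2$ bytes, roughly matching Llama-8B; $L = 2^{20}$ tokens; $G = hp = 8$):

$$
\frac{LHs}{G} = \frac{2^{20}\cdot 4096\cdot 2}{8}\ \text{bytes} = 1\ \text{GiB},
\qquad
\frac{LHs}{hp} = 1\ \text{GiB},
$$

so a per-layer activation tensor stays around 1 GiB per GPU in both layouts, and the attention phase does not raise the peak when $hp = G$.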
Table: Peak sequence length handled at various GPU scales for Llama-8B (bf16, batch size 1) (Bekman et al., 16 Jun 2025):
| Hardware | Baseline HF+ZeRO3 (max seq len) | ALST (max seq len) | Seq. Length Gain | Iteration Time (ALST) | TFLOPS (ALST) |
|---|---|---|---|---|---|
| 1×H100 | 32K | 500K | 15.6× | 16m 50s | 548.1 |
| 8×H100 | 32K | 3.7M | 116× | 1h 47m 35s | 590.6 |
| 32×H100 | 32K | 15M | 469× | 7h 25m 09s | 590.6 |
Empirically, this enables up to $469\times$ longer context windows without exceeding GPU memory (Bekman et al., 16 Jun 2025).
3. Communication Patterns and Scaling Characteristics
Each attention layer introduces two bandwidth-intensive all-to-all operations per forward and backward pass:
- All-to-all volume per layer: Each all-to-all moves $O(LH/G)$-sized tensors per GPU. Total per-layer communication per GPU is $2(LH/G)s$; across $N$ layers, $2N(LH/G)s$.
- Closed-Form Total Communication (per block, all devices): summing the per-GPU volume over all ranks and the microbatch gives $G \cdot 2\,b\,(LH/G)\,s = 2\,b\,L\,H\,s$, where $G$ is the number of devices and $b$ the microbatch size; the factor of 2 reflects the pre- and post-attention all-to-all phases of a transformer block (Fang et al., 13 May 2024).
Importantly, if $L$ is scaled proportionally with $G$ (i.e., $L/G$ is held constant), then per-device activation communication remains constant as the system is scaled out, enabling arbitrarily long sequence lengths limited only by available networking bandwidth (Jacobs et al., 2023, Fang et al., 13 May 2024). This scaling property is not present in all-gather/reduce methods (e.g., Megatron-LM), whose communication volume grows linearly with $L$ per device (Jacobs et al., 2023). Fast all-to-all fabrics (NVLink, Infinity Fabric) are required to avoid communication bottlenecks.
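A worked example with the same illustrative values as above ($L = 2^{20}$, $H = 4096$, $G = 8$, $s = 2$ bytes):

$$
2\,\frac{LH}{G}\,s = 2\cdot\frac{2^{20}\cdot 4096}{8}\cdot 2\ \text{bytes} = 2\ \text{GiB per GPU per layer},
$$

and doubling both $L$ and $G$ (e.g., $L = 2^{21}$, $G = 16$) leaves this per-GPU volume unchanged, which is exactly the scale-out property described above.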
4. Integration with Hybrid Parallelism and Best Practices
DeepSpeed-Ulysses is fully compatible with other forms of parallelism:
- ZeRO Stages 1–3: Parameter, gradient, and optimizer state partitioning further reduce per-device memory (Bekman et al., 16 Jun 2025, Fang et al., 13 May 2024).
- Pipeline Parallelism (PP): Multiple transformer stages across GPUs. Essential for small-head-count models because Ulysses requires that the number of sequence-parallel ranks does not exceed the number of attention heads (Tsaris et al., 17 Apr 2024).
- Tensor Parallelism (TP): Compatible; moreover, the "Unified SP" approach of (Fang et al., 13 May 2024) shows that Ulysses and Ring-Attention are orthogonal and can be composed arbitrarily, forming hybrid 4D meshes (Tensor, Ulysses, Ring, Data/Pipeline parallelism), as sketched below.
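The sketch below shows only how such a mesh and its process groups might be carved out with PyTorch's `DeviceMesh`; the dimension sizes and names are illustrative assumptions, not the USP or DeepSpeed API.

```python
# Hedged sketch: splitting a 16-GPU job into data x ring x ulysses process groups.
# Run under torchrun with 16 ranks; sizes and dimension names are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (2, 2, 4), mesh_dim_names=("dp", "ring", "ulysses"))

dp_group      = mesh.get_group("dp")       # gradient all-reduce / ZeRO partitioning
ring_group    = mesh.get_group("ring")     # ring-attention K/V block exchange (P2P)
ulysses_group = mesh.get_group("ulysses")  # Ulysses all-to-all head/sequence swap
```

Each attention layer would then issue its all-to-all collectives over `ulysses_group` and its blockwise K/V passes over `ring_group`, while gradient reduction and optimizer-state sharding run over `dp_group`.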
Best-practice recommendations include:
- Prefer Unified SP (Ulysses+Ring) over pure Ulysses or Ring for network adaptability and head-count flexibility.
- Use data parallelism first; resort to SP if the batch is too small.
- Deploy ZeRO-1/2/3 for model and optimizer state memory reduction.
- For grouped query/multi-query attention (GQA/MQA), SP's activation communication costs shrink in proportion to the reduced number of KV heads.
- To overcome Ulysses’ rank cap, increase the Ring degree, distributing sequence further (Fang et al., 13 May 2024).
5. Empirical Results and Domain Applications
DeepSpeed-Ulysses-based approaches report large gains in practical scalability and throughput compared to alternatives:
- Transformer LLMs: Training Llama-8B with up to 15M tokens of context across 32 H100 GPUs, sustaining roughly 590 TFLOPS per GPU and achieving a 469× improvement in maximum sequence length over the HF+ZeRO3 baseline (Bekman et al., 16 Jun 2025).
- Vision Transformers (ViTs): Training full-attention ViTs at 188K–1M tokens on climate datasets on as many as 2,048 MI250X GPUs with 94% weak scaling efficiency and high sustained GPU utilization, delivering a 20% improvement in downstream prediction accuracy (Tsaris et al., 17 Apr 2024).
- MFU Achievements: On a 2×8×A800 cluster, LLAMA3-8B training with Ulysses reaches 0.86 MFU ($269.8$ TFLOPS/GPU), and Unified SP reaches up to 0.92 MFU, running 10–15% faster than the Ring-only mode (Fang et al., 13 May 2024).
Sequence-parallel methods are responsible for extending full-attention training to ultra-long contexts in RAG, multi-document summarization, scientific modeling, and high-resolution vision domains, surpassing the memory and compute bottlenecks of previous sparse or window-based approaches.
6. Implementation Considerations and Code Integration
Activation of DeepSpeed-Ulysses in practice involves:
- DeepSpeed API: `deepspeed.initialize` accepts a `sequence_parallelism` config with the parallel degree $G$, tile size $t$, and an activation offload flag.
- Hugging Face Transformers:
  - Inject UlyssesSPAttentionHF via `attn_implementation='ulysses'`.
  - Wrap the DataLoader with `UlyssesSPDataLoaderAdapter` for sequence sharding.
  - Patch label shifting so the masked loss remains correct after sharding.
  - Set environment variables such as `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.
- Pseudocode:

```python
# Load a causal LM with the Ulysses attention backend (model name and backend
# string as given in the source text).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-8B-Instruct',
    attn_implementation='ulysses'
)
```
```python
# One transformer block in Ulysses SP (schematic): "_r" tensors live in the
# sequence-sharded layout (L/G tokens, all heads); "_hp" tensors live in the
# head-parallel layout (all L tokens, 1/G of the heads).

# Pre-attention work in the sequence-sharded layout
hidden_states_r = LayerNorm(hidden_states_r)
hidden_states_r = TiledMLP(hidden_states_r, tile_size=t)

# QKV projection, then all-to-all into the head-parallel layout
q_r, k_r, v_r = linear_qkv(hidden_states_r)
q_hp, k_hp, v_hp = all_to_all_head_parallel(q_r, k_r, v_r)

# Full-sequence attention over the local head subset
attn_out_hp = SelfAttention(q_hp, k_hp, v_hp)

# All-to-all back to the sequence-sharded layout, then residual and MLP
attn_out_r = all_to_all_seq_parallel(attn_out_hp)
hidden_states_r = hidden_states_r + attn_out_r
hidden_states_r = LayerNorm(hidden_states_r)
hidden_states_r = TiledMLP(hidden_states_r, tile_size=t)
```
7. Limitations and Trade-offs
DeepSpeed-Ulysses’ addressable sequence length is fundamentally capped by the interplay of GPU memory, attention kernel efficiency, and the number of available attention heads (each rank must process at least one head in pure Ulysses SP, mitigated in hybrid Unified SP). High-bandwidth, low-latency all-to-all communication hardware is essential to avoid communication bottlenecks, especially as the sequence-parallel degree $G$ increases. With small head counts and massive sequence lengths $L$, pipeline parallelism or hybrid Ulysses–Ring SP is necessary to scale sequence length further. These approaches are orthogonal to data parallelism, tensor model parallelism, and optimizer-state sharding, and can be composed for maximal memory and compute efficiency (Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
DeepSpeed-Ulysses Sequence Parallelism, as adopted in ALST and Unified SP frameworks, represents the current apex of efficient, scalable full-attention training on ultra-long sequences, and generalizes to a wide array of transformer architectures and scientific domains (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).