DeepSpeed-Ulysses Sequence Parallelism
- DeepSpeed-Ulysses Sequence Parallelism is a method that shards long input sequences across GPUs to enable full-attention training on contexts up to millions of tokens.
- It leverages sequence tiling, activation checkpoint offload, and two-phase all-to-all communication to reduce per-GPU memory usage while maintaining high throughput.
- Empirical results demonstrate over 400× longer context scaling for models like Llama-8B, with applications spanning NLP and high-resolution vision tasks.
DeepSpeed-Ulysses Sequence Parallelism denotes a collection of system-level, memory- and communication-efficient techniques for training transformer models on extremely long sequences (hundreds of thousands up to multi-million-token contexts) by sharding the sequence dimension across multiple GPUs. The approach relies on all-to-all collectives to redistribute attention-related activations among devices, unlocking $O(L/G)$ per-device activation memory scaling, where $L$ is the sequence length and $G$ the number of devices. Ulysses Sequence Parallelism integrates seamlessly with modern Hugging Face Transformers, ZeRO optimizer parameter partitioning, and high-efficiency attention kernels such as FlashAttention 2. Its flagship instantiation in Arctic Long Sequence Training (ALST) demonstrates over 400-fold improvement in trainable context length for Llama-8B models running on vanilla H100 clusters and generalizes to scientific vision transformers and unified multi-parallel setups (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
1. Core Algorithm and Sequence Sharding
The central principle of DeepSpeed-Ulysses Sequence Parallelism is sharding the input sequence along its length such that each of $G$ GPUs receives a local segment of the batch of shape $L/G \times H$, with $L$ the global sequence length and $H$ the hidden size. All non-attention network layers (embeddings, layernorm, feed-forward) operate independently on their local fragment. When processing a self-attention layer, global sequence context is required because each position may attend to all others. Ulysses SP implements a two-phase all-to-all procedure:
- During "pre-attention redistribution", each GPU sends its local QKV () and receives the entire sequence ( tokens) but only a $1/G$ fraction of the heads. This switches computation into a "head-parallel" layout.
- Self-attention computation, typically via fused kernels like FlashAttention2, is executed locally over the full sequence, but only for the GPU’s assigned head subset.
- A "post-attention restoration" all-to-all reverses the transformation, returning activations to the original "sequence-parallel" layout: each GPU resumes working on its original sequence segment, but all heads.
This pattern repeats across each transformer block. In between attention layers, computation remains in the sequence-sharded layout, while attention layers require dynamic repartitioning (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
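To make the layout transitions concrete, the following is a minimal sketch of the two redistributions using plain `torch.distributed` all-to-all collectives. It is illustrative only, not the DeepSpeed implementation: the function names, the assumed local tensor shape `[L/G, heads, head_dim]`, and the requirement that the head count is divisible by the group size are simplifications made for this sketch.

```python
# Sketch of the Ulysses pre-/post-attention layout swaps (not the DeepSpeed code).
# Assumes an initialized process group `group` of size G and h % G == 0.
import torch
import torch.distributed as dist

def seq_to_head_parallel(x, group):
    """[L/G, h, d] (sequence-sharded) -> [L, h/G, d] (head-sharded)."""
    G = dist.get_world_size(group)
    Lg, h, d = x.shape                          # local sequence shard, all heads
    # Chunk g holds our sequence shard restricted to head group g; it is sent to rank g.
    x = x.reshape(Lg, G, h // G, d).permute(1, 0, 2, 3).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # We received every rank's sequence shard for our own head group; stitch the sequence.
    return out.reshape(G * Lg, h // G, d)

def head_to_seq_parallel(x, group):
    """[L, h/G, d] (head-sharded) -> [L/G, h, d] (sequence-sharded)."""
    G = dist.get_world_size(group)
    L, hg, d = x.shape                          # full sequence, local head group
    # Chunk g holds sequence shard g for our head group; it is sent back to rank g.
    x = x.reshape(G, L // G, hg, d).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)
    # We received our own sequence shard from every rank, one head group per rank.
    return out.permute(1, 0, 2, 3).reshape(L // G, G * hg, d)
```

A production implementation additionally handles batch dimensions, uneven shards, and packing Q, K, and V into fewer collectives; the sketch omits these details.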
Integration into popular frameworks is explicit: in Hugging Face, the UlyssesSPAttentionHF class is injected as a valid backend, and sequence sharding is orchestrated via a specialized DataLoader adapter. Loss computation is also sharded on a per-segment basis to maintain memory efficiency (Bekman et al., 16 Jun 2025).
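As an illustration of per-segment loss sharding, the snippet below computes a masked cross-entropy over the local sequence shard and combines shards with a token-count-weighted all-reduce. It is a hedged sketch assuming labels have already been shifted and sharded alongside the inputs; `sharded_causal_lm_loss` is a hypothetical helper, not part of the ALST code.

```python
# Shard-local causal LM loss: each rank reduces only over its own sequence segment,
# then ranks combine (sum of losses) / (sum of unmasked tokens) into the global mean.
import torch
import torch.distributed as dist
import torch.nn.functional as F

def sharded_causal_lm_loss(shard_logits, shard_labels, group=None):
    # shard_logits: [batch, local_seq, vocab]; shard_labels: [batch, local_seq], -100 = masked.
    loss_sum = F.cross_entropy(
        shard_logits.flatten(0, -2), shard_labels.flatten(),
        ignore_index=-100, reduction="sum",
    )
    n_tokens = (shard_labels != -100).sum()
    stats = torch.stack([loss_sum, n_tokens.to(loss_sum.dtype)])
    dist.all_reduce(stats, group=group)          # sum losses and token counts across shards
    return stats[0] / stats[1]
```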
2. Memory Scaling and Optimization Strategies
DeepSpeed-Ulysses unlocks long-sequence training by reducing per-GPU memory pressure in two ways:
Single-GPU Memory Optimization
- Sequence Tiling: Any layer operating independently per token (e.g., MLPs, embeddings, logits/loss) is split into tiles along the sequence dimension. The memory footprint becomes $O(tH)$ per tile (with tile size $t \ll L$), instead of $O(LH)$ globally (see the sketch after this list).
- Activation Checkpoint Offload: Standard checkpointing retains an $L \times H$ input activation on GPU for each checkpointed layer. Ulysses monkey-patches checkpointing to offload these activations to CPU after the forward pass and asynchronously retrieve them for the backward pass, eliminating the "memory hill" on GPU (Bekman et al., 16 Jun 2025).
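A minimal sketch of the tiling idea, assuming a purely token-wise `layer` (MLP, embedding, or logits head) and plain PyTorch checkpointing; `tiled_forward` and `tile_size` are illustrative names, not the ALST API.

```python
# Sequence tiling sketch: run a token-wise layer tile by tile so that intermediate
# activations scale with the tile size rather than the full local sequence length.
import torch
from torch.utils.checkpoint import checkpoint

def tiled_forward(layer, hidden_states, tile_size):
    # hidden_states: [batch, local_seq_len, hidden]
    outputs = []
    for start in range(0, hidden_states.size(1), tile_size):
        tile = hidden_states[:, start:start + tile_size]
        # Checkpoint each tile: its intermediates are recomputed in backward, not stored.
        outputs.append(checkpoint(layer, tile, use_reentrant=False))
    return torch.cat(outputs, dim=1)
```

For the offload side, PyTorch's generic `torch.autograd.graph.save_on_cpu(pin_memory=True)` context manager provides a comparable (if less specialized) mechanism: tensors saved for backward are parked in pinned host memory and copied back to GPU on demand during the backward pass.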
Multi-GPU Sequence Parallelism
- Outside Attention: Each GPU holds only $1/G$ of the activations; for $N$ layers, this is roughly $N \cdot (L/G) \cdot H \cdot s$ bytes (with $s$ bytes per element).
- During Attention: Full sequence is present but only a local subset of heads.
- Peak Per-GPU Activation Memory: on the order of $L H s / hp$ during the attention phase (where $hp$ is the head-parallel degree, typically $hp = G$) (Bekman et al., 16 Jun 2025).
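As a worked example of the peak formula (illustrative values: $H = 4096$ and bf16 so $s = 2$ bytes, roughly matching Llama-8B; $L = 2^{20}$ tokens; $G = hp = 8$):

$$
\frac{LHs}{G} = \frac{2^{20}\cdot 4096\cdot 2}{8}\ \text{bytes} = 1\ \text{GiB},
\qquad
\frac{LHs}{hp} = 1\ \text{GiB},
$$

so a per-layer activation tensor stays around 1 GiB per GPU in both layouts, and the attention phase does not raise the peak when $hp = G$.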
Table: Peak sequence length handled at various GPU scales for Llama-8B (bf16, batch size 1) (Bekman et al., 16 Jun 2025):
| Hardware | Baseline HF+ZeRO3 (max seq len) | ALST (max seq len) | Seq. Length Gain | Iteration Time (ALST) | TFLOPS (ALST) |
|---|---|---|---|---|---|
| 1×H100 | 32K | 500K | 15.6× | 16m 50s | 548.1 |
| 8×H100 | 32K | 3.7M | 116× | 1h 47m 35s | 590.6 |
| 32×H100 | 32K | 15M | 469× | 7h 25m 09s | 590.6 |
Empirically, this enables up to $469\times$ longer context windows without exceeding GPU memory (Bekman et al., 16 Jun 2025).
3. Communication Patterns and Scaling Characteristics
Each attention layer introduces two bandwidth-intensive all-to-all operations per forward and backward pass:
- All-to-all volume per layer: Each all-to-all moves $O(LH/G)$-sized tensors per GPU. Total per-layer communication per GPU is $2(LH/G)s$; across $N$ layers, $2N(LH/G)s$.
- Closed-Form Total Communication (per block, all devices): summing the per-GPU volume over all ranks and the microbatch gives $G \cdot 2\,b\,(LH/G)\,s = 2\,b\,L\,H\,s$, where $G$ is the number of devices and $b$ the microbatch size; the factor of 2 reflects the pre- and post-attention all-to-all phases of a transformer block (Fang et al., 13 May 2024).
Importantly, if $L$ is scaled proportionally with $G$ (i.e., $L/G$ is held constant), then per-device activation communication remains constant as the system is scaled out, enabling arbitrarily long sequence lengths limited only by available networking bandwidth (Jacobs et al., 2023, Fang et al., 13 May 2024). This scaling property is not present in all-gather/reduce methods (e.g., Megatron-LM), whose communication volume grows linearly with $L$ per device (Jacobs et al., 2023). Fast all-to-all fabrics (NVLink, Infinity Fabric) are required to avoid communication bottlenecks.
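A worked example with the same illustrative values as above ($L = 2^{20}$, $H = 4096$, $G = 8$, $s = 2$ bytes):

$$
2\,\frac{LH}{G}\,s = 2\cdot\frac{2^{20}\cdot 4096}{8}\cdot 2\ \text{bytes} = 2\ \text{GiB per GPU per layer},
$$

and doubling both $L$ and $G$ (e.g., $L = 2^{21}$, $G = 16$) leaves this per-GPU volume unchanged, which is exactly the scale-out property described above.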
4. Integration with Hybrid Parallelism and Best Practices
DeepSpeed-Ulysses is fully compatible with other forms of parallelism:
- ZeRO Stages 1–3: Parameter, gradient, and optimizer state partitioning further reduce per-device memory (Bekman et al., 16 Jun 2025, Fang et al., 13 May 2024).
- Pipeline Parallelism (PP): Multiple transformer stages across GPUs. Essential for small-head-count models because Ulysses requires that the number of sequence-parallel ranks does not exceed the number of attention heads (Tsaris et al., 17 Apr 2024).
- Tensor Parallelism (TP): Compatible; moreover, the "Unified SP" approach of (Fang et al., 13 May 2024) shows that Ulysses and Ring-Attention are orthogonal and can be composed arbitrarily, forming hybrid 4D meshes (Tensor, Ulysses, Ring, Data/Pipeline parallelism), as sketched below.
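The sketch below shows only how such a mesh and its process groups might be carved out with PyTorch's `DeviceMesh`; the dimension sizes and names are illustrative assumptions, not the USP or DeepSpeed API.

```python
# Hedged sketch: splitting a 16-GPU job into data x ring x ulysses process groups.
# Run under torchrun with 16 ranks; sizes and dimension names are illustrative.
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh

dist.init_process_group("nccl")
mesh = init_device_mesh("cuda", (2, 2, 4), mesh_dim_names=("dp", "ring", "ulysses"))

dp_group      = mesh.get_group("dp")       # gradient all-reduce / ZeRO partitioning
ring_group    = mesh.get_group("ring")     # ring-attention K/V block exchange (P2P)
ulysses_group = mesh.get_group("ulysses")  # Ulysses all-to-all head/sequence swap
```

Each attention layer would then issue its all-to-all collectives over `ulysses_group` and its blockwise K/V passes over `ring_group`, while gradient reduction and optimizer-state sharding run over `dp_group`.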
Best-practice recommendations include:
- Prefer Unified SP (Ulysses+Ring) over pure Ulysses or Ring for network adaptability and head-count flexibility.
- Use data parallelism first; resort to SP if the batch is too small.
- Deploy ZeRO-1/2/3 for model and optimizer state memory reduction.
- For grouped query/multi-query attention (GQA/MQA), SP's activation communication costs shrink in proportion to the reduced number of KV heads.
- To overcome Ulysses’ rank cap, increase the Ring degree, distributing sequence further (Fang et al., 13 May 2024).
5. Empirical Results and Domain Applications
DeepSpeed-Ulysses-based approaches report large gains in practical scalability and throughput compared to alternatives:
- Transformer LLMs: Training Llama-8B with up to 15M tokens of context across 32 H100 GPUs, sustaining roughly 590 TFLOPS per GPU and achieving a 469× improvement in maximum sequence length over the HF+ZeRO3 baseline (Bekman et al., 16 Jun 2025).
- Vision Transformers (ViTs): Training full-attention ViTs at 188K–1M tokens on climate datasets on as many as 2,048 MI250X GPUs with 94% weak scaling efficiency and high sustained GPU utilization, delivering a 20% improvement in downstream prediction accuracy (Tsaris et al., 17 Apr 2024).
- MFU Achievements: On a 2×8×A800 cluster, LLAMA3-8B training with Ulysses reaches 0.86 MFU ($269.8$ TFLOPS/GPU), and Unified SP reaches up to 0.92 MFU, running 10–15% faster than the Ring-only mode (Fang et al., 13 May 2024).
Sequence-parallel methods are responsible for extending full-attention training to ultra-long contexts in RAG, multi-document summarization, scientific modeling, and high-resolution vision domains, surpassing the memory and compute bottlenecks of previous sparse or window-based approaches.
6. Implementation Considerations and Code Integration
Activation of DeepSpeed-Ulysses in practice involves:
- DeepSpeed API: `deepspeed.initialize` accepts a `sequence_parallelism` config with the parallel degree $G$, tile size $t$, and an activation offload flag.
- Hugging Face Transformers:
  - Inject UlyssesSPAttentionHF via `attn_implementation='ulysses'`.
  - Wrap the DataLoader with `UlyssesSPDataLoaderAdapter` for sequence sharding.
  - Patch label shifting so the masked loss remains correct after sharding.
  - Set environment variables such as `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`.
- Pseudocode:

```python
# Load a causal LM with the Ulysses attention backend (model name and backend
# string as given in the source text).
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    'meta-llama/Llama-8B-Instruct',
    attn_implementation='ulysses'
)
```
```python
# One transformer block in Ulysses SP (schematic): "_r" tensors live in the
# sequence-sharded layout (L/G tokens, all heads); "_hp" tensors live in the
# head-parallel layout (all L tokens, 1/G of the heads).

# Pre-attention work in the sequence-sharded layout
hidden_states_r = LayerNorm(hidden_states_r)
hidden_states_r = TiledMLP(hidden_states_r, tile_size=t)

# QKV projection, then all-to-all into the head-parallel layout
q_r, k_r, v_r = linear_qkv(hidden_states_r)
q_hp, k_hp, v_hp = all_to_all_head_parallel(q_r, k_r, v_r)

# Full-sequence attention over the local head subset
attn_out_hp = SelfAttention(q_hp, k_hp, v_hp)

# All-to-all back to the sequence-sharded layout, then residual and MLP
attn_out_r = all_to_all_seq_parallel(attn_out_hp)
hidden_states_r = hidden_states_r + attn_out_r
hidden_states_r = LayerNorm(hidden_states_r)
hidden_states_r = TiledMLP(hidden_states_r, tile_size=t)
```
7. Limitations and Trade-offs
DeepSpeed-Ulysses’ addressable sequence length is fundamentally capped by the interplay of GPU memory, attention kernel efficiency, and the number of available attention heads (each rank must process at least one head in pure Ulysses SP, mitigated in hybrid Unified SP). High-bandwidth, low-latency all-to-all communication hardware is essential to avoid communication bottlenecks, especially as the sequence-parallel degree $G$ increases. With small head counts and massive sequence lengths $L$, pipeline parallelism or hybrid Ulysses–Ring SP is necessary to scale sequence length further. These approaches are orthogonal to data parallelism, tensor model parallelism, and optimizer-state sharding, and can be composed for maximal memory and compute efficiency (Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).
DeepSpeed-Ulysses Sequence Parallelism, as adopted in ALST and Unified SP frameworks, represents the current apex of efficient, scalable full-attention training on ultra-long sequences, and generalizes to a wide array of transformer architectures and scientific domains (Bekman et al., 16 Jun 2025, Jacobs et al., 2023, Fang et al., 13 May 2024, Tsaris et al., 17 Apr 2024).