Arctic Long Sequence Training (ALST) is a set of techniques designed to enable efficient and scalable training of LLMs on multi-million token sequences using open-source frameworks like Hugging Face Transformers. The paper addresses the challenge that, despite recent LLMs supporting extremely long contexts (up to 10 million tokens), training them at these lengths is often inaccessible outside of large enterprise labs due to GPU memory limitations and lack of system support in open-source tools.
The core problem is memory exhaustion during training, particularly for activations. While weights, optimizer states, and gradients consume significant memory (e.g., 144GiB for Llama-3.1-8B with BF16 and Adam without offloading), activation memory grows with sequence length and becomes the primary bottleneck. Training Llama 8B with sequences longer than 32K tokens quickly leads to out-of-memory errors on standard hardware like NVIDIA H100 80GB GPUs.
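A plausible back-of-the-envelope for that figure (the per-tensor accounting here is our assumption, not the paper's breakdown): bf16 weights plus fp32 gradients and fp32 Adam master weights and two moments come to roughly 18 bytes per parameter,

$$
8 \times 10^{9}\ \text{params} \times (2 + 4 + 4 + 4 + 4)\ \text{bytes/param} \approx 144\ \text{GB}.
$$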
ALST tackles this by combining three key memory optimization techniques:
- Sequence Tiling for Memory Efficiency: This technique reduces peak memory usage for operations that have no dependencies across the sequence dimension (such as the logits, loss, and MLP computations). Instead of processing the entire sequence at once, the computation is broken into smaller tiles, so intermediate tensors are only materialized one tile at a time, significantly lowering the memory footprint. For instance, tiling the cross-entropy loss computation for Llama-3.1-8B at a 16K sequence length saves over 14GiB of memory. The paper introduces a generic `TiledCompute` autograd function and a specialized `TiledMLP`, which can reduce MLP memory usage by approximately 10x for a single layer at very long sequence lengths (a tiled-loss sketch follows this list).
- Ulysses Sequence Parallelism (Ulysses SP) compatible with Hugging Face Transformers: Adapted from Megatron-DeepSpeed's Ulysses, this technique leverages the aggregate GPU memory across multiple devices for activations. The sequence is initially split across GPUs; at the attention layer, an all-to-all communication re-arranges the data so that each GPU receives the full sequence but only a subset of the attention heads (attention head parallelism), and after attention another all-to-all switches back to the original sequence-parallel layout (see the all-to-all sketch after this list). A key advantage is that Ulysses SP is agnostic to the attention implementation; it can wrap existing kernels such as FlashAttention2 or SDPA. The authors extended the original Ulysses SP to support the modern attention types prevalent in Hugging Face models (MHA, GQA, MQA), handling cases where the number of query or key/value heads is not evenly divisible by the sequence parallelism degree.
- Activation Offloading and Other PyTorch Optimizations:
  - Non-invasive: The authors identified significant memory overheads stemming from specific PyTorch versions (standardizing on 2.7.1 or nightly), collective communication methods (`all_reduce_object` vs `all_reduce`), and memory fragmentation (mitigated with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`). Activation checkpointing was also used to trade compute for memory.
  - Invasive: A crucial optimization is offloading the checkpointed `hidden_states` tensors to CPU memory during training. This dramatically reduces peak GPU memory usage by removing the dependency of peak memory on the number of model layers, allowing much longer sequence lengths; it is particularly impactful for models with many layers. However, it can shift the bottleneck to CPU memory, especially for large models and long sequence lengths (see the offloading sketch after this list).
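A minimal sketch of the sequence-tiling idea applied to the logits/loss computation. This is not ALST's `TiledCompute` implementation: it gets the same effect by recomputing each tile's logits during backward via `torch.utils.checkpoint`, and the function name and tile count are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

def tiled_causal_lm_loss(hidden_states, lm_head, shifted_labels, num_tiles=8):
    """Next-token cross-entropy without materializing full-sequence logits.

    hidden_states:  [bs, seqlen, hidden] produced by the model body
    lm_head:        nn.Linear(hidden, vocab)
    shifted_labels: [bs, seqlen], already shifted, with -100 on ignored positions
    """
    total_loss = torch.zeros((), device=hidden_states.device)
    num_targets = (shifted_labels != -100).sum().clamp(min=1)

    def tile_loss(h_tile, y_tile):
        logits = lm_head(h_tile)  # [bs, tile_len, vocab] -- only one tile at a time
        return torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), y_tile.flatten(),
            ignore_index=-100, reduction="sum",
        )

    for h_tile, y_tile in zip(hidden_states.chunk(num_tiles, dim=1),
                              shifted_labels.chunk(num_tiles, dim=1)):
        # checkpoint() discards the tile's logits after the forward pass and
        # recomputes them during backward, so at most one tile is alive at once
        total_loss = total_loss + checkpoint(tile_loss, h_tile, y_tile,
                                             use_reentrant=False).float()
    return total_loss / num_targets
```

Because only one tile's `[bs, tile_len, vocab]` logits tensor exists at any moment, peak memory no longer scales with the full sequence length times the vocabulary size.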
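A sketch of the Ulysses SP all-to-all that trades a sequence shard for a head shard before attention. It is a simplification of the DeepSpeed implementation (function name is ours), assumes the number of heads is divisible by the SP degree, and needs an initialized `torch.distributed` process group (e.g. under `torchrun`).

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x, group=None):
    """[bs, seq/P, n_heads, head_dim] -> [bs, seq, n_heads/P, head_dim].

    Assumes the global sequence is sharded contiguously in rank order.
    """
    P = dist.get_world_size(group)
    bs, local_seq, n_heads, head_dim = x.shape
    # Slice the head dimension into P groups; head group p is destined for rank p.
    x = x.reshape(bs, local_seq, P, n_heads // P, head_dim)
    x = x.permute(2, 0, 1, 3, 4).contiguous()      # [P, bs, seq/P, heads/P, hd]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)    # exchange shards across ranks
    # out[p] is rank p's sequence shard restricted to this rank's head group.
    out = out.permute(1, 0, 2, 3, 4).reshape(bs, P * local_seq, n_heads // P, head_dim)
    return out  # full sequence, subset of heads: run any attention kernel on it
```

After attention, the inverse all-to-all (the same pattern with the sequence and head roles swapped) restores the sequence-parallel layout for the rest of the layer.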
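For the offloading side, a generic (and less targeted) way to get a similar effect is PyTorch's `torch.autograd.graph.save_on_cpu` hook, which parks tensors saved for backward in pinned CPU memory; ALST's own implementation instead offloads only the checkpointed `hidden_states` of each layer. The model and shapes below are made up for illustration.

```python
import torch
from torch import nn

# Illustrative only: a toy stack standing in for a transformer block; needs a CUDA device.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(2, 8192, 4096, device="cuda", requires_grad=True)

# All tensors saved for backward inside this context live in pinned CPU memory
# and are copied back to the GPU only when backward actually needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)

loss = y.float().pow(2).mean()
loss.backward()
```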
The paper also addresses practical implementation challenges for Hugging Face Transformers:
- Attention Mask: For very long sequences, the traditional 4D causal attention mask (shape `[bs, 1, seqlen, seqlen]`) becomes prohibitively large (e.g., 116GiB at a 250K sequence length). ALST instead relies on `position_ids` (shape `[bs, seqlen]`) and monkey-patches the model's `_update_causal_mask` function to prevent the large mask from being created (see the sketch after this list).
- HF Integration: ALST integrates into Hugging Face Transformers by:
  - Injecting a custom wrapper (`UlyssesSPAttentionHF`) into the attention mechanism via `transformers.modeling_utils.ALL_ATTENTION_FUNCTIONS`.
  - Introducing a `UlyssesSPDataLoaderAdapter` to efficiently shard long sequences from any existing DataLoader.
  - Processing the non-attention layers independently on each GPU's sequence shard.
- Loss Sharding: Causal LLMs require labels shifted by one position for next-token prediction. When sharding sequences for SP, naively shifting within each shard loses the first label of each shard. The authors modified the Hugging Face causal loss API to accept pre-shifted labels, so the loss is computed correctly across shards (see the second sketch below).
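One way to realize the mask-avoidance patch mentioned above, as a sketch only: the exact `_update_causal_mask` signature and patch point vary across Transformers versions, and the model choice and shapes here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

# Any Llama-style HF causal LM whose decoder stack lives at `model.model` works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)

# Never build the [bs, 1, seqlen, seqlen] mask; with no mask, the SDPA /
# FlashAttention2 paths fall back to their built-in causal handling, and
# position_ids carry the positional information.
model.model._update_causal_mask = lambda *args, **kwargs: None

input_ids = torch.randint(0, model.config.vocab_size, (1, 4096))
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)  # shape [bs, seqlen]
out = model(input_ids=input_ids, position_ids=position_ids)
```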
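And a sketch of the loss-sharding fix: shift the labels once on the full sequence, then shard, so the cross-shard targets are preserved (function and argument names are illustrative, not ALST's API).

```python
import torch

def shift_then_shard_labels(input_ids, sp_rank, sp_world_size, ignore_index=-100):
    """Pre-shift labels on the *full* sequence, then slice out this rank's shard.

    Shifting after sharding would drop the first label of every shard, leaving the
    last position of each shard without a target (its next token lives in the
    neighbouring shard); shifting first keeps every cross-shard target intact.
    """
    labels = torch.full_like(input_ids, ignore_index)
    labels[:, :-1] = input_ids[:, 1:]          # position i predicts token i+1
    return labels.chunk(sp_world_size, dim=1)[sp_rank]

# Example: 8 tokens sharded across 2 SP ranks
ids = torch.arange(8).unsqueeze(0)
print(shift_then_shard_labels(ids, sp_rank=0, sp_world_size=2))  # [[1, 2, 3, 4]]
print(shift_then_shard_labels(ids, sp_rank=1, sp_world_size=2))  # [[5, 6, 7, -100]]
```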
The evaluation demonstrates the effectiveness of ALST. Using Llama-8B on H100 GPUs with DeepSpeed ZeRO Stage 3 and optimizer states offload to CPU as a baseline (which supports 32K sequence length), ALST achieves significant improvements:
- 1 GPU: 500K sequence length (16x improvement). This configuration additionally requires offloading the weights to CPU, on top of the optimizer states offload, to fit the model.
- 8 GPUs (1 node): 3.7M sequence length (116x improvement).
- 32 GPUs (4 nodes): 15M sequence length (469x improvement).
Evaluations with Llama-3.1-70B and Qwen/Qwen3-32B also show multi-million sequence length capabilities, demonstrating roughly linear scaling of sequence length with the number of GPUs. For these larger models, the bottleneck at very high sequence lengths on 4 and 8 nodes became the available CPU memory for activation checkpoint offloading.
Feature ablations confirm the contribution of each ALST component. While Tiled Logits/Loss and Ulysses SP provide initial gains, Tiled MLP and particularly Activation Checkpoint Offload to CPU enable the largest sequence lengths. Tiled MLP becomes increasingly important at sequence lengths over 5M where hidden states become very large.
Training correctness validation showed that ALST achieves nearly identical loss curves compared to the 32K baseline, confirming the preservation of training quality.
Users can try ALST using the open-sourced implementation within the ArcticTraining framework [arctictraining] and DeepSpeed [deepspeed], specifically via the tutorials provided on the DeepSpeed website. ArcticTraining includes pre-configured recipes for long-sequence post-training.
Limitations and Notes:
- Current maximum Sequence Parallelism degree is limited by the number of query heads.
- The number of query/key/value heads must be divisible by the SP degree for balanced computation.
- Training with very long sequences requires a dataset with samples of corresponding length; packing short samples into a long sequence without respecting sample boundaries during training will not teach the model to handle long contexts during inference.
- FlashAttention2 handles packed samples correctly when given per-sample position ids, whereas SDPA in Hugging Face may attend across the entire packed sequence (see the sketch below).
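A small illustration of the packing caveat: position ids that restart at every packed sample boundary, which FlashAttention2 in Transformers can use to keep attention within each sample (the helper name is ours).

```python
import torch

def packed_position_ids(sample_lengths):
    """Position ids that restart at 0 for each packed sample.
    e.g. lengths [3, 2] -> tensor([[0, 1, 2, 0, 1]])
    """
    return torch.cat([torch.arange(n) for n in sample_lengths]).unsqueeze(0)

print(packed_position_ids([3, 2]))  # tensor([[0, 1, 2, 0, 1]])
```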
Future work aims to improve performance, remove current limitations on SP degree and head count divisibility, and integrate ALST more broadly into frameworks like Hugging Face Accelerate and Trainer.