Arctic Long Sequence Training (ALST) is a set of techniques designed to enable efficient and scalable training of LLMs on multi-million token sequences using open-source frameworks like Hugging Face Transformers. The paper addresses the challenge that, despite recent LLMs supporting extremely long contexts (up to 10 million tokens), training them at these lengths is often inaccessible outside of large enterprise labs due to GPU memory limitations and lack of system support in open-source tools.
The core problem is memory exhaustion during training, particularly for activations. While weights, optimizer states, and gradients consume significant memory (e.g., 144GiB for Llama-3.1-8B with BF16 and Adam without offloading), activation memory grows with sequence length and becomes the primary bottleneck. Training Llama 8B with sequences longer than 32K tokens quickly leads to out-of-memory errors on standard hardware like NVIDIA H100 80GB GPUs.
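A plausible back-of-the-envelope for that figure (the per-tensor accounting here is our assumption, not the paper's breakdown): bf16 weights plus fp32 gradients and fp32 Adam master weights and two moments come to roughly 18 bytes per parameter,

$$
8 \times 10^{9}\ \text{params} \times (2 + 4 + 4 + 4 + 4)\ \text{bytes/param} \approx 144\ \text{GB}.
$$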
ALST tackles this by combining three key memory optimization techniques:
- Sequence Tiling for Memory Efficiency: This technique reduces peak memory usage for operations that have no dependencies across the sequence dimension (such as the logits, loss, and MLP computations). Instead of processing the entire sequence at once, the computation is broken into smaller tiles, so intermediate tensors are only materialized one tile at a time, significantly lowering the memory footprint. For instance, tiling the cross-entropy loss computation for Llama-3.1-8B at a 16K sequence length saves over 14GiB of memory. The paper introduces a generic `TiledCompute` autograd function and a specialized `TiledMLP`, which can reduce MLP memory usage by approximately 10x for a single layer at very long sequence lengths (a tiled-loss sketch follows this list).
- Ulysses Sequence Parallelism (Ulysses SP) compatible with Hugging Face Transformers: Adapted from Megatron-DeepSpeed's Ulysses, this technique leverages the aggregate GPU memory across multiple devices for activations. The sequence is initially split across GPUs; at the attention layer, an all-to-all communication re-arranges the data so that each GPU receives the full sequence but only a subset of the attention heads (attention head parallelism), and after attention another all-to-all switches back to the original sequence-parallel layout (see the all-to-all sketch after this list). A key advantage is that Ulysses SP is agnostic to the attention implementation; it can wrap existing kernels such as FlashAttention2 or SDPA. The authors extended the original Ulysses SP to support the modern attention types prevalent in Hugging Face models (MHA, GQA, MQA), handling cases where the number of query or key/value heads is not evenly divisible by the sequence parallelism degree.
- Activation Offloading and Other PyTorch Optimizations:
  - Non-invasive: The authors identified significant memory overheads stemming from specific PyTorch versions (standardizing on 2.7.1 or nightly), collective communication methods (`all_reduce_object` vs `all_reduce`), and memory fragmentation (mitigated with `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`). Activation checkpointing was also used to trade compute for memory.
  - Invasive: A crucial optimization is offloading the checkpointed `hidden_states` tensors to CPU memory during training. This dramatically reduces peak GPU memory usage by removing the dependency of peak memory on the number of model layers, allowing much longer sequence lengths; it is particularly impactful for models with many layers. However, it can shift the bottleneck to CPU memory, especially for large models and long sequence lengths (see the offloading sketch after this list).
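A minimal sketch of the sequence-tiling idea applied to the logits/loss computation. This is not ALST's `TiledCompute` implementation: it gets the same effect by recomputing each tile's logits during backward via `torch.utils.checkpoint`, and the function name and tile count are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

def tiled_causal_lm_loss(hidden_states, lm_head, shifted_labels, num_tiles=8):
    """Next-token cross-entropy without materializing full-sequence logits.

    hidden_states:  [bs, seqlen, hidden] produced by the model body
    lm_head:        nn.Linear(hidden, vocab)
    shifted_labels: [bs, seqlen], already shifted, with -100 on ignored positions
    """
    total_loss = torch.zeros((), device=hidden_states.device)
    num_targets = (shifted_labels != -100).sum().clamp(min=1)

    def tile_loss(h_tile, y_tile):
        logits = lm_head(h_tile)  # [bs, tile_len, vocab] -- only one tile at a time
        return torch.nn.functional.cross_entropy(
            logits.flatten(0, 1), y_tile.flatten(),
            ignore_index=-100, reduction="sum",
        )

    for h_tile, y_tile in zip(hidden_states.chunk(num_tiles, dim=1),
                              shifted_labels.chunk(num_tiles, dim=1)):
        # checkpoint() discards the tile's logits after the forward pass and
        # recomputes them during backward, so at most one tile is alive at once
        total_loss = total_loss + checkpoint(tile_loss, h_tile, y_tile,
                                             use_reentrant=False).float()
    return total_loss / num_targets
```

Because only one tile's `[bs, tile_len, vocab]` logits tensor exists at any moment, peak memory no longer scales with the full sequence length times the vocabulary size.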
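A sketch of the Ulysses SP all-to-all that trades a sequence shard for a head shard before attention. It is a simplification of the DeepSpeed implementation (function name is ours), assumes the number of heads is divisible by the SP degree, and needs an initialized `torch.distributed` process group (e.g. under `torchrun`).

```python
import torch
import torch.distributed as dist

def seq_shard_to_head_shard(x, group=None):
    """[bs, seq/P, n_heads, head_dim] -> [bs, seq, n_heads/P, head_dim].

    Assumes the global sequence is sharded contiguously in rank order.
    """
    P = dist.get_world_size(group)
    bs, local_seq, n_heads, head_dim = x.shape
    # Slice the head dimension into P groups; head group p is destined for rank p.
    x = x.reshape(bs, local_seq, P, n_heads // P, head_dim)
    x = x.permute(2, 0, 1, 3, 4).contiguous()      # [P, bs, seq/P, heads/P, hd]
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=group)    # exchange shards across ranks
    # out[p] is rank p's sequence shard restricted to this rank's head group.
    out = out.permute(1, 0, 2, 3, 4).reshape(bs, P * local_seq, n_heads // P, head_dim)
    return out  # full sequence, subset of heads: run any attention kernel on it
```

After attention, the inverse all-to-all (the same pattern with the sequence and head roles swapped) restores the sequence-parallel layout for the rest of the layer.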
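For the offloading side, a generic (and less targeted) way to get a similar effect is PyTorch's `torch.autograd.graph.save_on_cpu` hook, which parks tensors saved for backward in pinned CPU memory; ALST's own implementation instead offloads only the checkpointed `hidden_states` of each layer. The model and shapes below are made up for illustration.

```python
import torch
from torch import nn

# Illustrative only: a toy stack standing in for a transformer block; needs a CUDA device.
model = nn.Sequential(nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)).cuda()
x = torch.randn(2, 8192, 4096, device="cuda", requires_grad=True)

# All tensors saved for backward inside this context live in pinned CPU memory
# and are copied back to the GPU only when backward actually needs them.
with torch.autograd.graph.save_on_cpu(pin_memory=True):
    y = model(x)

loss = y.float().pow(2).mean()
loss.backward()
```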
The paper also addresses practical implementation challenges for Hugging Face Transformers:
- Attention Mask: For very long sequences, the traditional 4D causal attention mask (shape `[bs, 1, seqlen, seqlen]`) becomes prohibitively large (e.g., 116GiB at a 250K sequence length). ALST instead relies on `position_ids` (shape `[bs, seqlen]`) and monkey-patches the model's `_update_causal_mask` function to prevent the large mask from being created (see the sketch after this list).
- HF Integration: ALST integrates into Hugging Face Transformers by:
  - Injecting a custom wrapper (`UlyssesSPAttentionHF`) into the attention mechanism via `transformers.modeling_utils.ALL_ATTENTION_FUNCTIONS`.
  - Introducing a `UlyssesSPDataLoaderAdapter` to efficiently shard long sequences from any existing DataLoader.
  - Processing the non-attention layers independently on each GPU's sequence shard.
- Loss Sharding: Causal LLMs require labels shifted by one position for next-token prediction. When sharding sequences for SP, naively shifting within each shard loses the first label of each shard. The authors modified the Hugging Face causal loss API to accept pre-shifted labels, so the loss is computed correctly across shards (see the second sketch below).
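One way to realize the mask-avoidance patch mentioned above, as a sketch only: the exact `_update_causal_mask` signature and patch point vary across Transformers versions, and the model choice and shapes here are illustrative.

```python
import torch
from transformers import AutoModelForCausalLM

# Any Llama-style HF causal LM whose decoder stack lives at `model.model` works the same way.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B", torch_dtype=torch.bfloat16, attn_implementation="sdpa"
)

# Never build the [bs, 1, seqlen, seqlen] mask; with no mask, the SDPA /
# FlashAttention2 paths fall back to their built-in causal handling, and
# position_ids carry the positional information.
model.model._update_causal_mask = lambda *args, **kwargs: None

input_ids = torch.randint(0, model.config.vocab_size, (1, 4096))
position_ids = torch.arange(input_ids.shape[1]).unsqueeze(0)  # shape [bs, seqlen]
out = model(input_ids=input_ids, position_ids=position_ids)
```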
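And a sketch of the loss-sharding fix: shift the labels once on the full sequence, then shard, so the cross-shard targets are preserved (function and argument names are illustrative, not ALST's API).

```python
import torch

def shift_then_shard_labels(input_ids, sp_rank, sp_world_size, ignore_index=-100):
    """Pre-shift labels on the *full* sequence, then slice out this rank's shard.

    Shifting after sharding would drop the first label of every shard, leaving the
    last position of each shard without a target (its next token lives in the
    neighbouring shard); shifting first keeps every cross-shard target intact.
    """
    labels = torch.full_like(input_ids, ignore_index)
    labels[:, :-1] = input_ids[:, 1:]          # position i predicts token i+1
    return labels.chunk(sp_world_size, dim=1)[sp_rank]

# Example: 8 tokens sharded across 2 SP ranks
ids = torch.arange(8).unsqueeze(0)
print(shift_then_shard_labels(ids, sp_rank=0, sp_world_size=2))  # [[1, 2, 3, 4]]
print(shift_then_shard_labels(ids, sp_rank=1, sp_world_size=2))  # [[5, 6, 7, -100]]
```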
The evaluation demonstrates the effectiveness of ALST. Using Llama-8B on H100 GPUs with DeepSpeed ZeRO Stage 3 and optimizer states offload to CPU as a baseline (which supports 32K sequence length), ALST achieves significant improvements:
- 1 GPU: 500K sequence length (16x improvement). This configuration additionally requires offloading the weights to CPU, on top of the optimizer states offload, to fit the model.
- 8 GPUs (1 node): 3.7M sequence length (116x improvement).
- 32 GPUs (4 nodes): 15M sequence length (469x improvement).
Evaluations with Llama-3.1-70B and Qwen/Qwen3-32B also show multi-million sequence length capabilities, demonstrating roughly linear scaling of sequence length with the number of GPUs. For these larger models, the bottleneck at very high sequence lengths on 4 and 8 nodes became the available CPU memory for activation checkpoint offloading.
Feature ablations confirm the contribution of each ALST component. While Tiled Logits/Loss and Ulysses SP provide initial gains, Tiled MLP and particularly Activation Checkpoint Offload to CPU enable the largest sequence lengths. Tiled MLP becomes increasingly important at sequence lengths over 5M where hidden states become very large.
Training correctness validation showed that ALST achieves nearly identical loss curves compared to the 32K baseline, confirming the preservation of training quality.
Users can try ALST using the open-sourced implementation within the ArcticTraining framework [arctictraining] and DeepSpeed [deepspeed], specifically via the tutorials provided on the DeepSpeed website. ArcticTraining includes pre-configured recipes for long-sequence post-training.
Limitations and Notes:
- Current maximum Sequence Parallelism degree is limited by the number of query heads.
- The number of query/key/value heads must be divisible by the SP degree for balanced computation.
- Training with very long sequences requires a dataset with samples of corresponding length; packing short samples into a long sequence without respecting sample boundaries during training will not teach the model to handle long contexts during inference.
- FlashAttention2 handles packed samples correctly when given per-sample position ids, whereas SDPA in Hugging Face may attend across the entire packed sequence (see the sketch below).
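A small illustration of the packing caveat: position ids that restart at every packed sample boundary, which FlashAttention2 in Transformers can use to keep attention within each sample (the helper name is ours).

```python
import torch

def packed_position_ids(sample_lengths):
    """Position ids that restart at 0 for each packed sample.
    e.g. lengths [3, 2] -> tensor([[0, 1, 2, 0, 1]])
    """
    return torch.cat([torch.arange(n) for n in sample_lengths]).unsqueeze(0)

print(packed_position_ids([3, 2]))  # tensor([[0, 1, 2, 0, 1]])
```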
Future work aims to improve performance, remove current limitations on SP degree and head count divisibility, and integrate ALST more broadly into frameworks like Hugging Face Accelerate and Trainer.