This paper presents an efficient two-stage training recipe for extending the context window of existing instruction-tuned LLMs to ultra-long lengths, specifically 1 million (1M), 2 million (2M), and 4 million (4M) tokens, starting from a model with a 128K context window (Llama-3.1-8B-Instruct). The key goal is to achieve strong long-context performance without degrading performance on standard, short-context tasks (Xu et al., 8 Apr 2025).
Training Pipeline Overview
The method consists of two main stages:
- Continued Pretraining for Context Extension: This stage focuses solely on adapting the model to handle longer sequences.
- Instruction Tuning: This stage refines the model's ability to follow instructions and reason, using only short-context data.
Figure 1: Overview of the training pipeline (Xu et al., 8 Apr 2025).
Stage 1: Continued Pretraining
- Base Model: Llama-3.1-8B-Instruct (128K context).
- Data: A 1 billion token corpus derived from SlimPajama [cerebras2023slimpajama], engineered for long contexts by downsampling documents < 4K tokens and upsampling documents > 8K tokens, following Fu et al. [fu2024data].
- Document Concatenation: Documents are concatenated to match the target context length (1M, 2M, or 4M). Crucially, a special separator token (`<s>`) is inserted between documents instead of the standard `begin_of_text`/`end_of_text` tokens (see the data-packing sketch after this list).
- Attention: Full attention is applied across the entire concatenated sequence. No cross-document attention masking is used, so the model can attend across document boundaries marked by `<s>`.
- Positional Embeddings (RoPE Scaling): YaRN (Yet another RoPE extensioN) [peng2023yarn] is used to scale the Rotary Position Embeddings (RoPE) [su2024roformer] to the target lengths. The YaRN hyperparameters are held fixed, and the scaling factor is computed from the target context length. This choice was found to be more effective than the NTK-aware scaling used in prior work (a YaRN frequency-scaling sketch also follows this list).
- Training Strategy: A one-step continued pretraining approach is used, directly extending from 128K to the target length (1M, 2M, or 4M) in a single stage. This was found to be more effective than multi-step extension (e.g., 128K -> 512K -> 1M).
- Implementation: Trained using Megatron-LM [shoeybi2019megatron] on 256 NVIDIA H100 GPUs, combining Tensor Parallelism (TP=8) with Context Parallelism (CP=4 for 1M, CP=16 for 2M/4M) to manage the activation memory of the long sequences; at these settings each context-parallel rank holds at most roughly 256K tokens of the sequence (e.g., 4M / 16 ≈ 256K). Continued pretraining on the 1B-token corpus takes approximately 5, 6, and 13 hours for the 1M, 2M, and 4M models, respectively.
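To make the data handling concrete, here is a minimal Python sketch of the length-based resampling and of packing documents into fixed-length training sequences with a dedicated separator token. The keep probability, repeat count, separator id, and target length are illustrative assumptions, not values from the paper; only the <4K / >8K thresholds come from the text above.

```python
import random

SEP_ID = 1              # assumed id of the "<s>" separator token (model-dependent)
TARGET_LEN = 1_048_576  # example target context length in tokens (1M)

def resample(docs, short_keep_prob=0.3, long_repeat=2):
    """Downsample documents shorter than 4K tokens and upsample those longer than 8K.
    The probability and repeat count are illustrative, not the paper's values."""
    out = []
    for tokens in docs:
        if len(tokens) < 4096:
            if random.random() < short_keep_prob:
                out.append(tokens)
        elif len(tokens) > 8192:
            out.extend([tokens] * long_repeat)
        else:
            out.append(tokens)
    random.shuffle(out)
    return out

def pack(docs):
    """Concatenate documents with the separator token between them and emit
    fixed-length sequences of TARGET_LEN tokens. No cross-document attention
    mask is built, so full attention spans the <s> boundaries."""
    buf = []
    for tokens in docs:
        if buf:
            buf.append(SEP_ID)
        buf.extend(tokens)
        while len(buf) >= TARGET_LEN:
            yield buf[:TARGET_LEN]
            buf = buf[TARGET_LEN:]
```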
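The sketch below illustrates YaRN's "NTK-by-parts" frequency adjustment, following the formulation in the YaRN paper rather than this paper's exact hyperparameters: the beta_fast/beta_slow values are YaRN defaults, and the RoPE base and head dimension follow Llama-3.1 conventions, all of which should be read as assumptions here.

```python
import math
import torch

def yarn_inv_freq(head_dim=128, base=500_000.0,
                  orig_ctx=131_072, target_ctx=4_194_304,
                  beta_fast=32.0, beta_slow=1.0):
    """Simplified YaRN 'NTK-by-parts' RoPE scaling sketch."""
    s = target_ctx / orig_ctx  # scaling factor, e.g. 32 for 128K -> 4M
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # number of rotations each frequency completes over the ORIGINAL context window
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # weight on interpolation: 0 for fast-rotating dims (kept as-is),
    # 1 for slow-rotating dims (frequency divided by s), linear ramp in between
    w = torch.clamp((beta_fast - rotations) / (beta_fast - beta_slow), 0.0, 1.0)
    scaled_inv_freq = (1 - w) * inv_freq + w * (inv_freq / s)
    # YaRN additionally rescales attention logits by (0.1 * ln(s) + 1); omitted here
    return scaled_inv_freq
```

Recent Hugging Face transformers releases expose comparable behavior through a `rope_scaling` configuration of type `"yarn"`, though the exact parameter names vary by version.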
Stage 2: Instruction Tuning
- Goal: Restore/enhance instruction-following and reasoning abilities after the context extension pretraining.
- Data: A high-quality blend of 100K short-context (< 8K tokens) instruction-following examples covering general, math, and code domains. Sources include ShareGPT, SlimOrca, EvolInstruct, GPTeacher, OrcaMath, MathInstruct, MetaMath, Magicoder, and WizardCoder. Responses were refined using GPT-4o and GPT-4o-mini and decontaminated against the evaluation benchmarks. Notably, no synthetic long-context instruction data was used (a minimal filtering sketch follows this list).
- Implementation: Trained using Megatron-LM on 256 H100 GPUs (TP=8) for approximately 30 minutes with a batch size of 128.
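As a rough illustration of how such a short-context SFT blend can be assembled, here is a minimal sketch. The 8K limit and 100K total come from the paper; the tokenizer id (a gated Hugging Face repo requiring access), blend weights, and data format are assumptions.

```python
import random
from transformers import AutoTokenizer

# Assumed tokenizer; any Llama-3.1-compatible tokenizer would do for length filtering.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
MAX_SFT_TOKENS = 8192

def keep_short(example):
    """Keep only examples whose full chat transcript fits the short-context budget."""
    text = "\n".join(turn["content"] for turn in example["messages"])
    return len(tok(text).input_ids) <= MAX_SFT_TOKENS

def build_blend(general, math_data, code, weights=(0.5, 0.25, 0.25), total=100_000):
    """Mix domain subsets into a fixed-size blend (weights are illustrative)."""
    blend = []
    for pool, w in zip((general, math_data, code), weights):
        pool = [ex for ex in pool if keep_short(ex)]
        blend.extend(random.sample(pool, min(int(total * w), len(pool))))
    random.shuffle(blend)
    return blend
```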
Key Results and Findings
- Long-Context Performance: The resulting UltraLong-8B models achieve state-of-the-art performance on long-context benchmarks like RULER [hsieh2024ruler], LV-Eval [yuan2024lv], and InfiniteBench [zhang-etal-2024-bench], significantly outperforming baseline Llama-3 models with extended context windows trained using other methods (ProLong [gao2024train], Gradient [gradientlongcontextllama3]).
- Needle-in-a-Haystack (NIAH): UltraLong models achieve 100% accuracy on the NIAH passkey retrieval task up to their maximum context lengths (1M, 2M, 4M), demonstrating robust information retrieval over long distances, whereas the baselines showed significant failures (see the passkey-prompt sketch after this list).
- Standard Benchmark Performance: Unlike other long-context models which often suffer degradation on standard tasks, the UltraLong models maintain or even slightly improve performance compared to the original Llama-3.1-8B-Instruct base model on benchmarks like MMLU, MATH, GSM-8K, and HumanEval. This highlights the effectiveness of the two-stage approach and the use of short-context SFT data.
- Ablation Studies:
  - Using the special document separator (`<s>`) and allowing cross-document attention during continued pretraining significantly improves performance compared to removing separators.
  - YaRN-based RoPE scaling extends the context more robustly than NTK-aware scaling, especially at ultra-long lengths.
- The one-step continued pretraining strategy is more efficient and yields better results than multi-step strategies for the same compute budget.
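For readers unfamiliar with the passkey variant of NIAH, the sketch below shows how such a test prompt is typically constructed: a random key is hidden at an arbitrary position inside long filler text and the model is asked to retrieve it. The filler sentences and phrasing are illustrative, not the exact RULER/NIAH templates.

```python
import random

def build_passkey_example(context_len_tokens, tokenizer, passkey=None):
    """Build a single passkey-retrieval prompt of roughly context_len_tokens tokens."""
    passkey = passkey or str(random.randint(10_000, 99_999))
    needle = f" The pass key is {passkey}. Remember it: {passkey} is the pass key. "
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    # Repeat filler until the prompt is near the target token budget.
    reps = context_len_tokens // max(len(tokenizer(filler).input_ids), 1)
    body = filler * reps
    insert_at = random.randint(0, len(body))  # hide the needle at a random depth
    prompt = body[:insert_at] + needle + body[insert_at:] + "\nWhat is the pass key?"
    return prompt, passkey
```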
Practical Implications and Implementation
- Efficient Context Extension: The paper provides a validated, efficient recipe (1B tokens of continued pretraining plus a brief short-context SFT stage) to extend capable instruction-tuned models like Llama-3.1 to handle much longer contexts (up to 4M tokens).
- Preserving Short-Context Abilities: The use of only short-context data during instruction tuning is sufficient to maintain strong performance on standard tasks, avoiding the common trade-off where long-context ability comes at the cost of general capability.
- Key Techniques: The specific choices of using special document separators without attention masking and employing YaRN for RoPE scaling are crucial for the recipe's success.
- Hardware Considerations: Training requires significant computational resources (256 H100 GPUs), primarily due to the memory demands of the long context lengths. Context Parallelism (CP) is essential for distributing the activation memory.
Limitations
- The instruction tuning stage only uses Supervised Fine-Tuning (SFT). It does not explore reinforcement learning or preference optimization techniques (like RLHF or DPO).
- The work does not explicitly address safety alignment for the ultra-long context models.
In summary, this paper offers a practical and efficient methodology for significantly extending the context length of LLMs while preserving their core capabilities. The UltraLong-8B models demonstrate strong performance on both long-context and standard benchmarks, validating the effectiveness of the proposed two-stage training recipe involving specific data handling, YaRN scaling, and short-context instruction tuning (Xu et al., 8 Apr 2025).