This paper presents an efficient two-stage training recipe for extending the context window of existing instruction-tuned LLMs to ultra-long lengths, specifically 1 million (1M), 2 million (2M), and 4 million (4M) tokens, starting from a model with a 128K context window (Llama-3.1-8B-Instruct). The key goal is to achieve strong long-context performance without degrading performance on standard, short-context tasks (Xu et al., 8 Apr 2025).
Training Pipeline Overview
The method consists of two main stages:
- Continued Pretraining for Context Extension: This stage focuses solely on adapting the model to handle longer sequences.
- Instruction Tuning: This stage refines the model's ability to follow instructions and reason, using only short-context data.
Figure 1: Overview of the training pipeline (Xu et al., 8 Apr 2025).
Stage 1: Continued Pretraining
- Base Model: Llama-3.1-8B-Instruct (128K context).
- Data: A 1 billion token corpus derived from SlimPajama [cerebras2023slimpajama], engineered for long contexts by downsampling documents < 4K tokens and upsampling documents > 8K tokens, following Fu et al. [fu2024data].
- Document Concatenation: Documents are concatenated to match the target context length (1M, 2M, or 4M). Crucially, a special separator token (`<s>`) is inserted between documents instead of the standard `begin_of_text`/`end_of_text` tokens (see the data-packing sketch after this list).
- Attention: Full attention is applied across the entire concatenated sequence. No cross-document attention masking is used, so the model can attend across document boundaries marked by `<s>`.
- Positional Embeddings (RoPE Scaling): YaRN (Yet another RoPE extensioN) [peng2023yarn] is used to scale the Rotary Position Embeddings (RoPE) [su2024roformer] to the target lengths. The YaRN hyperparameters are held fixed, and the scaling factor is computed from the target context length. This choice was found to be more effective than the NTK-aware scaling used in prior work (a YaRN frequency-scaling sketch also follows this list).
- Training Strategy: A one-step continued pretraining approach is used, directly extending from 128K to the target length (1M, 2M, or 4M) in a single stage. This was found to be more effective than multi-step extension (e.g., 128K -> 512K -> 1M).
- Implementation: Trained using Megatron-LM [shoeybi2019megatron] on 256 NVIDIA H100 GPUs, combining Tensor Parallelism (TP=8) with Context Parallelism (CP=4 for 1M, CP=16 for 2M/4M) to manage the activation memory of the long sequences; at these settings each context-parallel rank holds at most roughly 256K tokens of the sequence (e.g., 4M / 16 ≈ 256K). Continued pretraining on the 1B-token corpus takes approximately 5, 6, and 13 hours for the 1M, 2M, and 4M models, respectively.
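To make the data handling concrete, here is a minimal Python sketch of the length-based resampling and of packing documents into fixed-length training sequences with a dedicated separator token. The keep probability, repeat count, separator id, and target length are illustrative assumptions, not values from the paper; only the <4K / >8K thresholds come from the text above.

```python
import random

SEP_ID = 1              # assumed id of the "<s>" separator token (model-dependent)
TARGET_LEN = 1_048_576  # example target context length in tokens (1M)

def resample(docs, short_keep_prob=0.3, long_repeat=2):
    """Downsample documents shorter than 4K tokens and upsample those longer than 8K.
    The probability and repeat count are illustrative, not the paper's values."""
    out = []
    for tokens in docs:
        if len(tokens) < 4096:
            if random.random() < short_keep_prob:
                out.append(tokens)
        elif len(tokens) > 8192:
            out.extend([tokens] * long_repeat)
        else:
            out.append(tokens)
    random.shuffle(out)
    return out

def pack(docs):
    """Concatenate documents with the separator token between them and emit
    fixed-length sequences of TARGET_LEN tokens. No cross-document attention
    mask is built, so full attention spans the <s> boundaries."""
    buf = []
    for tokens in docs:
        if buf:
            buf.append(SEP_ID)
        buf.extend(tokens)
        while len(buf) >= TARGET_LEN:
            yield buf[:TARGET_LEN]
            buf = buf[TARGET_LEN:]
```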
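The sketch below illustrates YaRN's "NTK-by-parts" frequency adjustment, following the formulation in the YaRN paper rather than this paper's exact hyperparameters: the beta_fast/beta_slow values are YaRN defaults, and the RoPE base and head dimension follow Llama-3.1 conventions, all of which should be read as assumptions here.

```python
import math
import torch

def yarn_inv_freq(head_dim=128, base=500_000.0,
                  orig_ctx=131_072, target_ctx=4_194_304,
                  beta_fast=32.0, beta_slow=1.0):
    """Simplified YaRN 'NTK-by-parts' RoPE scaling sketch."""
    s = target_ctx / orig_ctx  # scaling factor, e.g. 32 for 128K -> 4M
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # number of rotations each frequency completes over the ORIGINAL context window
    rotations = orig_ctx * inv_freq / (2 * math.pi)
    # weight on interpolation: 0 for fast-rotating dims (kept as-is),
    # 1 for slow-rotating dims (frequency divided by s), linear ramp in between
    w = torch.clamp((beta_fast - rotations) / (beta_fast - beta_slow), 0.0, 1.0)
    scaled_inv_freq = (1 - w) * inv_freq + w * (inv_freq / s)
    # YaRN additionally rescales attention logits by (0.1 * ln(s) + 1); omitted here
    return scaled_inv_freq
```

Recent Hugging Face transformers releases expose comparable behavior through a `rope_scaling` configuration of type `"yarn"`, though the exact parameter names vary by version.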
Stage 2: Instruction Tuning
- Goal: Restore/enhance instruction-following and reasoning abilities after the context extension pretraining.
- Data: A high-quality blend of 100K short-context (< 8K tokens) instruction-following examples covering general, math, and code domains. Sources include ShareGPT, SlimOrca, EvolInstruct, GPTeacher, OrcaMath, MathInstruct, MetaMath, Magicoder, and WizardCoder. Responses were refined using GPT-4o and GPT-4o-mini and decontaminated against the evaluation benchmarks. Notably, no synthetic long-context instruction data was used (a minimal filtering sketch follows this list).
- Implementation: Trained using Megatron-LM on 256 H100 GPUs (TP=8) for approximately 30 minutes with a batch size of 128.
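As a rough illustration of how such a short-context SFT blend can be assembled, here is a minimal sketch. The 8K limit and 100K total come from the paper; the tokenizer id (a gated Hugging Face repo requiring access), blend weights, and data format are assumptions.

```python
import random
from transformers import AutoTokenizer

# Assumed tokenizer; any Llama-3.1-compatible tokenizer would do for length filtering.
tok = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
MAX_SFT_TOKENS = 8192

def keep_short(example):
    """Keep only examples whose full chat transcript fits the short-context budget."""
    text = "\n".join(turn["content"] for turn in example["messages"])
    return len(tok(text).input_ids) <= MAX_SFT_TOKENS

def build_blend(general, math_data, code, weights=(0.5, 0.25, 0.25), total=100_000):
    """Mix domain subsets into a fixed-size blend (weights are illustrative)."""
    blend = []
    for pool, w in zip((general, math_data, code), weights):
        pool = [ex for ex in pool if keep_short(ex)]
        blend.extend(random.sample(pool, min(int(total * w), len(pool))))
    random.shuffle(blend)
    return blend
```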
Key Results and Findings
- Long-Context Performance: The resulting UltraLong-8B models achieve state-of-the-art performance on long-context benchmarks like RULER [hsieh2024ruler], LV-Eval [yuan2024lv], and InfiniteBench [zhang-etal-2024-bench], significantly outperforming baseline Llama-3 models with extended context windows trained using other methods (ProLong [gao2024train], Gradient [gradientlongcontextllama3]).
- Needle-in-a-Haystack (NIAH): UltraLong models achieve 100% accuracy on the NIAH passkey retrieval task up to their maximum context lengths (1M, 2M, 4M), demonstrating robust information retrieval over long distances, whereas the baselines showed significant failures (see the passkey-prompt sketch after this list).
- Standard Benchmark Performance: Unlike other long-context models which often suffer degradation on standard tasks, the UltraLong models maintain or even slightly improve performance compared to the original Llama-3.1-8B-Instruct base model on benchmarks like MMLU, MATH, GSM-8K, and HumanEval. This highlights the effectiveness of the two-stage approach and the use of short-context SFT data.
- Ablation Studies:
  - Using the special document separator (`<s>`) and allowing cross-document attention during continued pretraining significantly improves performance compared to removing separators.
  - YaRN-based RoPE scaling extends the context more robustly than NTK-aware scaling, especially at ultra-long lengths.
- The one-step continued pretraining strategy is more efficient and yields better results than multi-step strategies for the same compute budget.
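For readers unfamiliar with the passkey variant of NIAH, the sketch below shows how such a test prompt is typically constructed: a random key is hidden at an arbitrary position inside long filler text and the model is asked to retrieve it. The filler sentences and phrasing are illustrative, not the exact RULER/NIAH templates.

```python
import random

def build_passkey_example(context_len_tokens, tokenizer, passkey=None):
    """Build a single passkey-retrieval prompt of roughly context_len_tokens tokens."""
    passkey = passkey or str(random.randint(10_000, 99_999))
    needle = f" The pass key is {passkey}. Remember it: {passkey} is the pass key. "
    filler = "The grass is green. The sky is blue. The sun is yellow. "
    # Repeat filler until the prompt is near the target token budget.
    reps = context_len_tokens // max(len(tokenizer(filler).input_ids), 1)
    body = filler * reps
    insert_at = random.randint(0, len(body))  # hide the needle at a random depth
    prompt = body[:insert_at] + needle + body[insert_at:] + "\nWhat is the pass key?"
    return prompt, passkey
```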
Practical Implications and Implementation
- Efficient Context Extension: The paper provides a validated, efficient recipe (1B tokens of continued pretraining plus a brief short-context SFT stage) to extend capable instruction-tuned models like Llama-3.1 to handle much longer contexts (up to 4M tokens).
- Preserving Short-Context Abilities: The use of only short-context data during instruction tuning is sufficient to maintain strong performance on standard tasks, avoiding the common trade-off where long-context ability comes at the cost of general capability.
- Key Techniques: The specific choices of using special document separators without attention masking and employing YaRN for RoPE scaling are crucial for the recipe's success.
- Hardware Considerations: Training requires significant computational resources (256 H100 GPUs), primarily due to the memory demands of the long context lengths. Context Parallelism (CP) is essential for distributing the activation memory.
Limitations
- The instruction tuning stage only uses Supervised Fine-Tuning (SFT). It does not explore reinforcement learning or preference optimization techniques (like RLHF or DPO).
- The work does not explicitly address safety alignment for the ultra-long context models.
In summary, this paper offers a practical and efficient methodology for significantly extending the context length of LLMs while preserving their core capabilities. The UltraLong-8B models demonstrate strong performance on both long-context and standard benchmarks, validating the effectiveness of the proposed two-stage training recipe involving specific data handling, YaRN scaling, and short-context instruction tuning (Xu et al., 8 Apr 2025).