LongRoPE2: Scaling Context for LLMs

Updated 27 June 2026
  • LongRoPE2 is an approach for scaling transformer-based LLMs by refining rotary positional embeddings with an evolutionary needle-driven search and mixed context-window training.
  • It introduces a new hypothesis on the effective critical dimension, optimizing individual RoPE head scaling to mitigate out-of-distribution drift in high-dimensional embedding spaces.
  • Empirical results on LLaMA3-8B and Phi3-mini models show improved retrieval and benchmark scores, achieving near-lossless short-context performance with far fewer training tokens than competing methods.

LongRoPE2 is an approach for scaling the context window of pre-trained transformer-based LLMs leveraging rotary positional embedding (RoPE), achieving near-lossless retention of short-context performance, and extending effective context to targets such as 128K tokens. Developed by the LLMpresso team at Microsoft, it addresses limitations of analytic RoPE-rescaling rules and their out-of-distribution (OOD) drift in high-dimensional embedding spaces, through a combination of a new hypothesis on critical dimension, evolutionary search with “needle-driven” perplexity objectives, and mixed context-window training. LongRoPE2 has been empirically validated on LLaMA3-8B and Phi3-mini-3.8B, outperforming baselines on both synthetic and real-world long-context benchmarks, while requiring orders of magnitude fewer training tokens than competing approaches (Shang et al., 27 Feb 2025).

1. Problem Statement and Theoretical Motivation

When pre-trained LLMs are naively extrapolated to longer context windows using standard RoPE or analytic extensions such as position interpolation (PI), NTK-aware interpolation, and YaRN, the resulting models exhibit high perplexity, loss of retrieval accuracy, and substantial performance degradation on both long and short contexts. The root cause is the periodicity of RoPE per embedding dimension: higher-index dimensions have longer rotation periods $T_i = 2\pi/\theta_i$, with $\theta_i = \theta_{base}^{-2i/d}$. Pre-training on a maximum sequence length $L_{train}$ does not cover a full period for the high dimensions $i \geq d_{tcd}$, where the theoretical critical dimension $d_{tcd}$ is the first index whose period exceeds $L_{train}$. These subspaces are left under-trained and prone to OOD behavior.
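The period argument above can be checked numerically. The following is a minimal sketch, not the authors' code: it takes the critical index to be the smallest $i$ with $T_i > L_{train}$, and it assumes $\theta_{base}$ values of 10,000 for Phi3-mini and 500,000 for LLaMA3-8B (public model configuration values, not stated in this article) together with pre-training lengths of 2048 and 8192 tokens. The function names are hypothetical.

```python
import math

def rope_periods(d, theta_base):
    """Rotation period T_i = 2*pi / theta_i for each of the d/2 RoPE frequency indices,
    with theta_i = theta_base ** (-2i/d)."""
    return [2 * math.pi * theta_base ** (2 * i / d) for i in range(d // 2)]

def analytic_critical_index(d, theta_base, l_train):
    """Smallest index i whose period exceeds l_train, i.e. the first dimension that
    never completes a full rotation within the pre-training window."""
    for i, period in enumerate(rope_periods(d, theta_base)):
        if period > l_train:
            return i
    return None  # every dimension completes at least one full period

# ASSUMPTION: theta_base and l_train values below are not given in the article.
phi3 = analytic_critical_index(d=96, theta_base=10_000, l_train=2048)
llama3 = analytic_critical_index(d=128, theta_base=500_000, l_train=8192)
print(phi3, llama3)  # prints: 31 35
```

Under these assumptions, every index at or above the printed value sees less than one full rotation during pre-training.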

LongRoPE2 introduces the hypothesis that the effective OOD critical dimension, denoted $d_{rcd}$, is significantly lower than the analytically derived $d_{tcd}$, because the higher-dimensional RoPE subspaces barely experience rotation during pre-training. Analytic rescaling (e.g., uniform $\lambda_i = L/L_{train}$ for all $i \geq d_{tcd}$, where $L$ is the target context length) under-corrects this OOD drift, causing unrecoverable errors as the context window grows (Shang et al., 27 Feb 2025).

The LongRoPE2 framework rescales each RoPE dimension individually using learned scaling factors $\lambda_i$:

$$\hat{\theta}_i = \frac{1}{\lambda_i \cdot \theta_{base}^{2i/d}}$$
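As an illustration of the rescaling formula, the sketch below computes $\hat{\theta}_i$ from a list of per-dimension factors. The function name, the toy dimension, and the factor values are hypothetical and chosen only to show the mechanics: setting every $\lambda_i = 1$ recovers the original frequencies, and a uniform factor reproduces PI-style interpolation.

```python
def rescaled_frequencies(d, theta_base, lambdas):
    """theta_hat_i = 1 / (lambda_i * theta_base ** (2i/d)) for i = 0 .. d/2 - 1."""
    if len(lambdas) != d // 2:
        raise ValueError("expected one scaling factor per frequency index")
    return [1.0 / (lam * theta_base ** (2 * i / d)) for i, lam in enumerate(lambdas)]

d, theta_base = 8, 10_000.0  # toy sizes, for illustration only

# lambda_i = 1 everywhere leaves the original RoPE frequencies unchanged.
original = rescaled_frequencies(d, theta_base, [1.0] * (d // 2))

# A uniform factor slows every dimension equally (PI-style interpolation).
uniform = rescaled_frequencies(d, theta_base, [4.0] * (d // 2))

# Hypothetical per-dimension factors: low dimensions untouched, high ones slowed more.
per_dim = rescaled_frequencies(d, theta_base, [1.0, 1.5, 3.0, 4.0])
```

Because $\hat{\theta}_i = \theta_i / \lambda_i$, each rotation period is stretched by exactly $\lambda_i$.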

Rather than trusting analytic scaling alone, an evolutionary strategy searches the subset of dimensions $i \geq d_{rcd}$, with each $\lambda_i$ sampled from a search range under a monotonic non-decrease constraint ($\lambda_i \leq \lambda_{i+1}$). For the lower dimensions $i < d_{rcd}$, NTK-derived scalings are applied, using the adjusted base implied by $d_{rcd}$.

The core search objective employs “needle-driven” perplexity rather than global PPL, focusing on tokens inserted at the start (“needle”) of long texts and queried at the end. This isolates retrieval dependencies across long-range sequences, whereas standard perplexity would be dominated by local token prediction. The goal is to minimize the needle-driven perplexity over the search space (Shang et al., 27 Feb 2025).
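The difference between the two objectives can be seen with a toy calculation. The sketch below assumes per-token log-probabilities are already available from some model and that the positions of the needle-answer tokens are known. Both the helper and the numbers are hypothetical, not taken from the paper.

```python
import math

def perplexity(token_logprobs, positions=None):
    """exp(mean negative log-likelihood) over all tokens, or only over `positions`."""
    if positions is None:
        positions = range(len(token_logprobs))
    selected = [token_logprobs[p] for p in positions]
    if not selected:
        raise ValueError("no tokens selected")
    return math.exp(-sum(selected) / len(selected))

# Toy data: 98 easy, locally predictable tokens followed by 2 needle-answer tokens
# that the model retrieves poorly from the start of the context.
logprobs = [math.log(0.9)] * 98 + [math.log(0.1)] * 2
needle_positions = [98, 99]

global_ppl = perplexity(logprobs)                    # dominated by the easy tokens
needle_ppl = perplexity(logprobs, needle_positions)  # exposes the retrieval failure
```

In this toy case the global value stays close to 1 while the needle-restricted value is about 10, which is why a search guided by global perplexity can miss long-range retrieval failures.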

Algorithmic Overview

Initialization:

  • Compute the analytic $d_{tcd}$ and an extended lower bound for the candidate critical dimension (the dimension that completes 10 full periods within the pre-training window).
  • For dimensions $i \geq d_{rcd}$, sample $\lambda_i$ and build candidate solutions; for $i < d_{rcd}$, set the factors via NTK.

Evolution:

  • Iteratively mutate the candidate scaling factors, re-derive the NTK scaling for the lower dimensions from the updated candidate, and evaluate the needle-driven perplexity. Only superior candidates are retained.
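The initialization and evolution steps can be sketched as a simple mutate-and-keep-if-better loop. This is a simplified illustration, not the paper's algorithm: `d_rcd` is held fixed, the objective is a stand-in function rather than a real needle-driven perplexity, and the NTK fill for lower dimensions is assumed to be a geometric ramp anchored at the first searched factor. All names and parameter values are hypothetical.

```python
import random

def ntk_fill(d_rcd, first_high):
    """ASSUMPTION: NTK-style geometric ramp for i < d_rcd, anchored so that it
    would reach the first searched factor at i = d_rcd."""
    return [first_high ** (i / d_rcd) for i in range(d_rcd)]

def mutate(high, lo, hi, rng, rate=0.3):
    """Perturb some factors, clip to [lo, hi], then restore monotonic non-decrease."""
    out = [min(hi, max(lo, x * rng.uniform(0.9, 1.1))) if rng.random() < rate else x
           for x in high]
    for k in range(1, len(out)):
        out[k] = max(out[k], out[k - 1])
    return out

def evolve(score, d_rcd, n_dims, lo, hi, iters=200, seed=0):
    """Search factors for i >= d_rcd; keep a candidate only if it scores lower."""
    rng = random.Random(seed)
    best_high = sorted(rng.uniform(lo, hi) for _ in range(n_dims - d_rcd))
    best = ntk_fill(d_rcd, best_high[0]) + best_high
    best_score = score(best)
    for _ in range(iters):
        cand_high = mutate(best_high, lo, hi, rng)
        cand = ntk_fill(d_rcd, cand_high[0]) + cand_high  # reapply NTK to lower dims
        cand_score = score(cand)
        if cand_score < best_score:
            best_high, best, best_score = cand_high, cand, cand_score
    return best, best_score

# Stand-in objective: distance of the searched factors from an arbitrary target.
# A real run would evaluate needle-driven perplexity with the candidate factors.
toy_score = lambda lams: sum((x - 4.0) ** 2 for x in lams[3:])
factors, final_score = evolve(toy_score, d_rcd=3, n_dims=8, lo=1.0, hi=8.0)
```

Sorting the initial sample and repairing order after each mutation keeps every candidate inside the monotonic search space, so no evaluations are spent on invalid candidates.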

On Phi3-mini (d=96), this process found a $d_{rcd}$ below the analytic $d_{tcd}$; the same held on LLaMA3-8B (d=128) (Shang et al., 27 Feb 2025).

3. Mixed Context-Window Training

To align the model to both original (short context) and rescaled (long context) RoPE parameterizations without loss on original tasks, LongRoPE2 employs mixed context-window fine-tuning:

  • Training examples are divided into fixed-length short and long blocks.
  • Short blocks: original RoPE, block-diagonal attention masking (no attention across document boundaries).
  • Long blocks: rescaled RoPE as selected by evolutionary search, with full attention allowed.

A single set of model weights is optimized using cross-entropy over all tokens, with only the positional encoding (original or rescaled RoPE) varying by block. The training data comprises 3B tokens each in short and mid-long blocks and 4B in extra-long blocks, for 10B tokens total (one epoch), requiring only 40–54 hours on 64×A100 GPUs (Shang et al., 27 Feb 2025).
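The per-block choice of positional encoding and attention mask can be sketched as follows. This is a minimal illustration under stated assumptions: masks are plain boolean matrices, "full attention" in long blocks is taken to mean ordinary causal attention across the whole block (as in a decoder-only model), and $\lambda_i = 1$ denotes the original RoPE. The function names and the dictionary layout are hypothetical.

```python
def causal_mask(n):
    """mask[i][j] is True when token i may attend to token j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def block_diagonal_causal_mask(doc_lengths):
    """Causal mask that also forbids attention across document boundaries."""
    doc_id = [k for k, length in enumerate(doc_lengths) for _ in range(length)]
    n = len(doc_id)
    return [[j <= i and doc_id[i] == doc_id[j] for j in range(n)] for i in range(n)]

def block_setup(doc_lengths, is_long, searched_lambdas):
    """Pick RoPE factors and attention mask for one training block."""
    n = sum(doc_lengths)
    if is_long:
        # Long block: rescaled RoPE, attention allowed across the whole block.
        return {"lambdas": list(searched_lambdas), "mask": causal_mask(n)}
    # Short block: original RoPE (all factors 1), documents isolated from each other.
    return {"lambdas": [1.0] * len(searched_lambdas),
            "mask": block_diagonal_causal_mask(doc_lengths)}

short = block_setup([2, 3], is_long=False, searched_lambdas=[2.0, 4.0])
long_ = block_setup([2, 3], is_long=True, searched_lambdas=[2.0, 4.0])
```

In the short block, token 2 (the first token of the second document) cannot see token 1, while in the long block it can. The model weights are shared, and only these two inputs differ between block types.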

4. Empirical Results and Benchmarking

LongRoPE2 demonstrates superior or near-lossless context scaling performance, preserving short-context task results while extending long-context generalization:

RULER synthetic (128K context)

Method                | Phi3-mini-3.8B | LLaMA3-8B
YaRN                  | 39.37          | 49.39
NTK                   | 49.37          | 73.19
LongRoPE              | 53.71          | 73.40
LongRoPE2 (LLMpresso) | 58.81          | 82.03

Average score on RULER at 128K context. NTK and YaRN fall behind past 32–64K. LongRoPE2 is nearly lossless up to 64K; above this, performance degrades gracefully (Shang et al., 27 Feb 2025).

Needle in a Haystack

LongRoPE2 (LLMpresso) achieves near-perfect retrieval at all depths up to 128K, outperforming all baselines. Search objectives that are not needle-driven identify suboptimal rescaling factors and yield lower retrieval accuracy (Shang et al., 27 Feb 2025).

Real-world long-context (LOFT, InfiniteBench)

LongRoPE2 outperforms other methods by 3–10 points across 14 tasks (Shang et al., 27 Feb 2025).

Short-context retention

Model (original context) | Original score | Score at 128K (LLMpresso) | % retained
Phi3-mini (2K)           | 63.2           | 61.7                      | 97.6%
LLaMA3-8B (8K)           | 56.5           | 55.7                      | 98.6%
Meta-LLaMA3.1-8B (128K)  | 57.2           | -                         | -

LongRoPE2 retains nearly all short-context performance with only 10B additional tokens, compared to 800B for Meta’s method (Shang et al., 27 Feb 2025).

5. Ablation Analyses

A series of ablation studies confirm the specific contributions of LongRoPE2’s innovations (Shang et al., 27 Feb 2025):

  • Critical Dimension: Using the search-found $d_{rcd}$ in YaRN/NTK increases long-context scores by 3–5 points; the analytic $d_{tcd}$ overestimates the boundary.
  • Needle Objective: Searching with global perplexity (PG19-only) does not find the correct rescaling and weakens 128K context generalization.
  • Mixed Training: Omitting mixed context-window blocks leads to a 5% drop in both short (MMLU) and long-context (RULER) metrics.
  • Dimensionality Search Range: Restricting the evolutionary search to the dimensions $i \geq d_{rcd}$ yields slightly superior outcomes and practical efficiency.

6. Practical Implementation and Guidelines

LongRoPE2 requires minimal intervention to be applied to existing RoPE-based LLMs:

  • During inference, a “scale switch” from the original to the rescaled RoPE is applied once the prompt length plus generated output exceeds the original pre-training context window, which requires a one-time KV-cache recomputation.
  • Recipe for adaptation:

    1. Compute the analytic $d_{tcd}$ and extend the candidate range for the critical dimension below it.
    2. Conduct needle-driven evolutionary search over the candidate scaling factors to learn $d_{rcd}$ and the per-dimension $\lambda_i$.
    3. Apply NTK scaling for $i < d_{rcd}$, using the new base implied by $d_{rcd}$.
    4. Fine-tune model weights via mixed context-window training.
  • Libraries used include FlashAttention-2 and nnScaler.

  • The approach is portable to both open-source and proprietary models due to the isolated nature of positional logic changes (Shang et al., 27 Feb 2025).
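The inference-time scale switch from the guidelines above can be sketched as a small selection function. This is a hedged illustration: it assumes the switch threshold is the original pre-training length (`l_train`) and represents the original RoPE as all factors equal to 1. The function name and return convention are hypothetical and do not correspond to any specific library API.

```python
def select_rope_factors(prompt_len, max_new_tokens, l_train, searched_lambdas):
    """Return (factors, needs_kv_recompute) for one generation request.

    ASSUMPTION: the switch happens when prompt plus generated tokens would
    exceed the original pre-training window l_train.
    """
    total = prompt_len + max_new_tokens
    if total > l_train:
        # Rescaled RoPE: any KV cache built with the original RoPE is stale
        # and must be recomputed once.
        return list(searched_lambdas), True
    # Short request: keep the original RoPE, no cache invalidation needed.
    return [1.0] * len(searched_lambdas), False

short_request = select_rope_factors(1000, 500, 2048, [2.0, 4.0])
long_request = select_rope_factors(2000, 500, 2048, [2.0, 4.0])
```

Deciding on the total budget up front, rather than switching mid-generation, is one way to keep the cache recomputation a one-time cost per request.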

7. Significance, Comparison, and Limitations

LongRoPE2 closes a previously unaddressed generalization gap by directly tackling higher-dimensional OOD drift in RoPE, enabled by an empirically validated critical dimension hypothesis and a search objective tightly coupled to the demands of long-range retrieval. Compared to analytic and grid-search tuning methods, it achieves greater context window scaling with nearly lossless preservation of short-context accuracy and an 80× reduction in the number of required mid-training tokens (10B vs 800B for the comparable Meta model).

LongRoPE2’s design and results highlight the inadequacy of theoretical boundaries derived from RoPE periodicity alone and illustrate the utility of targeted search procedures aligned to real-world compositional reasoning tasks. The mixed context-window routine preserves backward compatibility while providing a pathway for robust, scalable adaptation of legacy LLMs to future long-context benchmarks (Shang et al., 27 Feb 2025).
