
LongRoPE: Extending RoPE for LLMs

Updated 21 February 2026
  • LongRoPE is a methodology that extends rotary positional embeddings for LLMs, significantly increasing the practical context window while maintaining robust performance.
  • It employs dimension-specific rescaling and preserves early-token information through a start-token threshold, optimized with evolutionary search to mitigate nonuniformities.
  • The approach uses progressive fine-tuning and remains compatible with efficient inference kernels like FlashAttention-2, ensuring scalable improvements in long-context tasks.

LongRoPE is a family of methods for extending the practical context window of LLMs that use rotary positional embeddings (RoPE) to lengths far beyond the original pretraining regime. By exploiting dimension- and token-specific nonuniformities and leveraging efficient search and fine-tuning strategies, LongRoPE achieves multi-order-of-magnitude increases in context window size while maintaining strong or near-original performance on both long and short contexts. The approach requires minimal architectural changes and retains compatibility with highly optimized inference kernels such as FlashAttention-2.

1. Rotary Positional Embedding (RoPE) Foundations

RoPE is a class of absolute positional encodings that imparts relative position awareness to transformer-based attention mechanisms. For a hidden-state dimensionality $d$ per attention head, RoPE assigns each dimension pair $(2j, 2j+1)$ a frequency $\theta_j = b^{-2j/d}$ (usually $b = 10\,000$), forming $d/2$ two-dimensional subspaces. At sequence position $m$, the embedding applies a $2 \times 2$ rotation block:

$$R(\theta_j, m) = \begin{bmatrix} \cos(m\theta_j) & -\sin(m\theta_j) \\ \sin(m\theta_j) & \cos(m\theta_j) \end{bmatrix}$$

The full position-dependent query/key is constructed as a block-diagonal transformation aggregating all subspaces. Attention scores then depend only on the relative offset $(n - m)$:

$$q_m^\top k_n = q^\top R(\theta, n-m)\, k,$$

with $R(\theta, n-m)$ block-diagonal over all frequency bands. This construction makes RoPE highly compatible with relative positional reasoning and compositional generalization.
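The relative-position property above can be checked directly in a minimal NumPy sketch (illustrative, not the paper's implementation): rotating a query and a key at two different position pairs with the same offset yields identical dot products.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Apply RoPE to a vector x of even dimension d at sequence position `pos`."""
    d = x.shape[0]
    j = np.arange(d // 2)
    theta = base ** (-2.0 * j / d)          # frequency of each 2-D subspace
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin  # 2x2 rotation per subspace
    out[1::2] = x_even * sin + x_odd * cos
    return out

# The attention score depends only on the relative offset n - m:
rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)
s1 = rope_rotate(q, 5) @ rope_rotate(k, 17)     # offset 12
s2 = rope_rotate(q, 100) @ rope_rotate(k, 112)  # same offset 12
assert np.isclose(s1, s2)
```

Because each subspace rotation satisfies $R(m)^\top R(n) = R(n-m)$, the two scores agree regardless of the absolute positions.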

2. Non-Uniformities in RoPE and the Core LongRoPE Algorithm

LongRoPE addresses two previously under-exploited non-uniformities in RoPE's positional encoding that become especially critical for scaling to very large context windows:

  1. Dimension-wise sensitivity: High-frequency RoPE dimensions (short wavelengths, small $j$) tolerate extrapolation well because they complete many full rotation periods within the pretraining length; low-frequency dimensions (long wavelengths, large $j$) degrade rapidly outside the pretrained length due to underexposure during training.
  2. Early-token importance: Initial token positions contribute disproportionately to attention and should be preserved in their original parameterization during extrapolation.

To address these, LongRoPE parameterizes RoPE extension with:

  • A per-dimension scaling vector $\{\lambda_i\}$, whereby each dimension $i$ applies an individualized rescale factor in the rotary angles.
  • A start-token threshold $\hat n$ below which positional encodings are left uninterpolated and unrescaled.

The generalized rotary angle is then defined as

$$\alpha'_{n,i} = \mathbb{I}(n < \hat n)\, n\theta_i + \mathbb{I}(n \geq \hat n)\, \frac{n\theta_i}{\lambda_i}.$$

The optimal $\{\lambda_i\}, \hat n$ are identified via evolutionary search, seeded with known baseline schemes (e.g., PI, NTK, YaRN) plus random perturbations, under a monotonicity constraint ($\lambda_0 \leq \lambda_1 \leq \cdots$). Each candidate is evaluated on token-level perplexity over held-out documents at the new, extended context window.
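The generalized angle formula can be sketched as follows. The `lam` vector and `n_hat` values here are hypothetical placeholders; real values come from the evolutionary search described above.

```python
import numpy as np

def longrope_angles(n_positions: int, d: int, lam: np.ndarray, n_hat: int,
                    base: float = 10000.0) -> np.ndarray:
    """Rescaled rotary angles alpha'_{n,i}: positions below n_hat keep the
    original angles; positions at or above n_hat divide each dimension's
    angle by its per-dimension factor lambda_i."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)   # (d/2,) frequencies
    n = np.arange(n_positions)[:, None]              # (n, 1) positions
    plain = n * theta                                # original RoPE angles
    scaled = n * theta / lam                         # per-dimension rescale
    return np.where(n < n_hat, plain, scaled)        # (n, d/2)

# Illustrative parameters only: a monotone lambda vector (lam[0] <= lam[1] <= ...)
# and a small start-token threshold, mimicking the search's constraint.
d, n_hat = 128, 4
lam = np.linspace(1.0, 8.0, d // 2)
angles = longrope_angles(4096, d, lam, n_hat)
```

Since every $\lambda_i \geq 1$, the rescaled angles are never larger than the originals, which is what compresses positions back into the pretrained range.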

3. Progressive Extension, Fine-Tuning, and Evaluation

LongRoPE employs a progressive extension and fine-tuning pipeline:

  1. Stepwise length extension: Starting from the pretrained model, LongRoPE conducts evolutionary search and fine-tuning at intermediate context lengths (e.g., from 4K to 128K, then 256K), with each stage comprising a search for the optimal rescale parameters followed by brief fine-tuning (e.g., 400–600 steps) on extended-length sequences.
  2. Final “no-finetune” extrapolation: On the final fine-tuned checkpoint, an additional search yields parameters for the maximum window (e.g., 2.048M tokens), which are applied without further weight updates.
  3. Short-window readjustment: After large-scale extension, the original short context windows are over-compressed. A final search (with tight $\lambda_i$ caps, e.g., $\leq 8$) readjusts parameters at short lengths; the appropriate setting is selected dynamically at inference.

This regimen ensures robust performance at dramatically longer contexts while recovering near-original accuracy on “canonical” inference windows.
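The three stages above can be summarized in a schematic loop. `evolutionary_search` and `finetune` are stubs standing in for the paper's components, not real implementations; the schedule mirrors the example lengths in the text.

```python
# Schematic sketch of the progressive extension pipeline (stubs, not a real trainer).

def evolutionary_search(model, target_len, lam_cap=None):
    """Placeholder: search for (lambda_i, n_hat) minimizing perplexity at target_len,
    optionally capping each lambda_i (used for short-window readjustment)."""
    return {"lam": [1.0], "n_hat": 0}

def finetune(model, params, target_len, steps):
    """Placeholder: brief fine-tuning on sequences of length target_len."""
    return model

model = object()                        # stands in for the pretrained LLM
schedule = [128_000, 256_000]           # intermediate stages with fine-tuning

for target in schedule:                 # 1. stepwise search + fine-tune
    params = evolutionary_search(model, target)
    model = finetune(model, params, target, steps=500)

# 2. final "no-finetune" extrapolation to the maximum window
params_2m = evolutionary_search(model, 2_048_000)

# 3. short-window readjustment with tight lambda caps
params_short = evolutionary_search(model, 4_096, lam_cap=8.0)
```

At inference, either `params_2m` or `params_short` would be chosen based on the incoming sequence length.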

4. Architectural Compatibility, Efficiency, and Implementation

LongRoPE modifies only the RoPE angle lookup, not the transformer weights, residuals, MLPs, or attention implementation. All optimizations (such as FlashAttention-2, quantization, and distributed parallelization) remain fully usable. The $d \times n$ angle table is swapped at inference time according to context length.
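One way to realize the table swap, as a sketch: precompute a cos/sin table per supported context length and select one by incoming sequence length. The dimensions, lengths, and parameter values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def build_table(n, d, lam, n_hat, base=10000.0):
    """Precompute the cos/sin angle table for one context-length setting."""
    theta = base ** (-2.0 * np.arange(d // 2) / d)
    pos = np.arange(n)[:, None]
    ang = np.where(pos < n_hat, pos * theta, pos * theta / lam)
    return np.cos(ang), np.sin(ang)

# Hypothetical cache keyed by maximum context length; the model weights are
# untouched -- only the table fed to the attention kernel changes.
d = 128
tables = {
    4_096:  build_table(4_096, d, np.ones(d // 2), 0),                  # original RoPE
    16_384: build_table(16_384, d, np.linspace(1, 4, d // 2), 4),       # illustrative searched params
}

def select_table(seq_len):
    """Pick the smallest precomputed table that covers seq_len."""
    key = min(k for k in tables if k >= seq_len)
    return tables[key]
```

Because only the lookup changes, kernels that consume precomputed cos/sin caches continue to work unmodified.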

On hardware, the evolutionary search up to 256K tokens typically requires about 3 days on a single A100 GPU. Fine-tuning at each stage involves modest compute (e.g., for LLaMA2-7B at batch size 32: 8×A100 for 1 week at 128K, 16×A100 for 2 weeks at 256K). The final jump to the maximum length (e.g., 2M) can take up to 8×A100 for 5 days.

5. Empirical Performance

LongRoPE delivers state-of-the-art extension results across multiple LLMs with minimal resource cost. Notable empirical findings include (Ding et al., 2024):

  • Long-document perplexity: LLaMA2-7B with LongRoPE-2048K (trained at 256K, then extrapolated to 2M) achieves a Proof-Pile perplexity of 1.87 at 262K tokens. The original LLaMA2-7B collapses beyond 8K due to extrapolation failure.
  • Synthetic “passkey retrieval” tasks: LongRoPE sustains $\geq 90\%$ accuracy from 4K to 2.048M tokens; baselines drop to zero above 128K.
  • Standard benchmarks (short context): After full extension, the drop is small: ARC-Challenge 53.1 → 51.0, HellaSwag 78.6 → 75.3, MMLU 46.6 → 39.6, TruthfulQA 39.0 → 37.3.
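For concreteness, a passkey-retrieval prompt of the kind used in such evaluations can be constructed like this; the exact wording, filler text, and function name are illustrative assumptions rather than the benchmark's actual format.

```python
import random

def make_passkey_prompt(passkey: str, n_filler: int, seed: int = 0) -> str:
    """Build a synthetic passkey-retrieval prompt: repetitive filler text with
    the passkey buried at a random depth, followed by a retrieval question."""
    rng = random.Random(seed)
    filler = ["The grass is green. The sky is blue."] * n_filler
    filler.insert(rng.randrange(len(filler) + 1),
                  f"The pass key is {passkey}. Remember it.")
    return " ".join(filler) + " What is the pass key?"

prompt = make_passkey_prompt("71432", n_filler=1000)
```

Scaling `n_filler` stretches the same task from a few thousand tokens to millions, which is how retrieval accuracy is probed across the extended window.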

Table: Sample Proof-Pile perplexity (LLaMA2-7B)

| Model                    | 8K   | 256K | 2M   |
|--------------------------|------|------|------|
| Original                 | 3.58 | N/A  | N/A  |
| LongRoPE-2048K (ft=256K) | 3.65 | 2.26 | 1.87 |
| YaRN-128K                | 6.38 | N/A  | N/A  |

Further, LongRoPE demonstrates that exploiting both the “dim-wise” and “start-token” nonuniformities is necessary: ablations in Ding et al. (2024) confirm incremental perplexity reductions from each component.

6. Limitations, Discussion, and Future Prospects

Principal limitations are the computational cost of the search at very large (>1M) context lengths, which remains substantial (several days of multi-GPU time per model/length pair), and a minor degradation of 1–2 points on a small set of few-shot tasks after extension. The utility of the start-token threshold $\hat n$ also appears to diminish beyond 1M tokens.

LongRoPE's dimension-weighted, progressive strategy for RoPE extension represents a state-of-the-art approach to context scaling in LLMs, providing a scalable and practically robust foundation for very long-context autoregressive modeling (Ding et al., 2024).

References

Ding, Y., Zhang, L. L., Zhang, C., Xu, Y., Shang, N., Xu, J., Yang, F., & Yang, M. (2024). LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens. arXiv:2402.13753.
