LongRoPE2: Near-Lossless LLM Context Window Scaling (2502.20082v1)

Published 27 Feb 2025 in cs.CL

Abstract: LongRoPE2 is a novel approach that extends the effective context window of pre-trained LLMs to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.

The paper introduces LongRoPE2, a method for extending the context window of pre-trained LLMs while preserving performance on shorter contexts. LongRoPE2 addresses the out-of-distribution (OOD) issue in rotary positional embeddings (RoPE), starting from the hypothesis that higher RoPE dimensions are insufficiently trained, which limits the effectiveness of existing rescaling methods. The method combines a RoPE rescaling algorithm based on evolutionary search guided by "needle-driven" perplexity (PPL) with mixed context window training.

The authors identify two major challenges in extending LLM context windows:

  • Existing rescaling methods fail to reach the target effective context length
  • Performance degrades on the original short context window

The authors attribute these issues to insufficient training in higher RoPE dimensions, resulting in shorter effective RoPE rotation ranges.

LongRoPE2 includes the following innovations:

  • A RoPE rescaling algorithm that uses evolutionary search to identify critical RoPE dimensions and optimal rescaling factors, guided by a "needle-driven" perplexity evaluation.
  • A mixed context window training approach, which fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving short-context performance with the original RoPE.
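
To make the mixed context window idea concrete, here is a minimal sketch (an illustration under assumptions, not the authors' implementation): the original RoPE angles are used for sequences that fit the pre-trained window, and the rescaled angles for longer ones. The function names, head dimension, base, and rescaling vector are all hypothetical.

```python
import numpy as np

def rope_angles(head_dim, base, rescale=None):
    """theta_i = base^(-2i/d) for i = 0..d/2-1, optionally divided by rescaling factors lambda_i."""
    i = np.arange(head_dim // 2)
    theta = base ** (-2.0 * i / head_dim)
    return theta if rescale is None else theta / rescale

def select_rope(seq_len, train_len, head_dim, base, rescale):
    """Mixed context window idea: keep the original RoPE for sequences that fit the
    pre-trained window, and switch to the rescaled RoPE for longer sequences."""
    if seq_len <= train_len:
        return rope_angles(head_dim, base)           # original RoPE (short contexts)
    return rope_angles(head_dim, base, rescale)      # rescaled RoPE (long contexts)

# Example with a hypothetical 128-dim head, base 500000, pre-trained on 8192 tokens
angles = select_rope(seq_len=65536, train_len=8192, head_dim=128,
                     base=500_000, rescale=np.full(64, 16.0))
```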

The RoPE is calculated as follows:

$\mathbf{q}_m=f_q(\mathbf{x}_m,m);\quad f_q(\mathbf{x}_m,m)=e^{im\theta}\mathbf{W}_q\mathbf{x}_m$

$\mathbf{k}_n=f_k(\mathbf{x}_n,n);\quad f_k(\mathbf{x}_n,n)=e^{in\theta}\mathbf{W}_k\mathbf{x}_n$

  • $\mathbf{q}_m$: query representation at position $m$
  • $\mathbf{x}_m$: input embedding vector at position $m$
  • $m$: position index
  • $f_q$: function that incorporates position information into the word embedding and transforms it into the query representation
  • $i$: imaginary unit
  • $\theta$: per-dimensional rotation angle
  • $\mathbf{W}_q$: query projection matrix
  • $\mathbf{k}_n$: key representation at position $n$
  • $f_k$: function that incorporates position information into the word embedding and transforms it into the key representation
  • $n$: position index
  • $\mathbf{W}_k$: key projection matrix

The attention weights are computed as:

$\text{softmax}\left(\frac{\mathbf{q}_m^T \mathbf{k}_n}{\sqrt{d}}\right)$

  • $\mathbf{q}_m$: query representation at position $m$
  • $\mathbf{k}_n$: key representation at position $n$
  • $d$: attention head dimension
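
The sketch below applies these definitions in the real-valued (paired-dimension) form rather than the complex notation above. The projection matrices $\mathbf{W}_q$ and $\mathbf{W}_k$ are omitted and random vectors stand in for the projected embeddings, and the head dimension and base are assumed values, so it only illustrates the rotation and the scaled dot product.

```python
import numpy as np

def apply_rope(x, pos, theta):
    """Rotate each pair (x_{2i}, x_{2i+1}) of a (head_dim,) vector by the angle pos * theta_i."""
    pairs = x.reshape(-1, 2)                              # (d/2, 2)
    cos, sin = np.cos(pos * theta), np.sin(pos * theta)
    rotated = np.stack([pairs[:, 0] * cos - pairs[:, 1] * sin,
                        pairs[:, 0] * sin + pairs[:, 1] * cos], axis=-1)
    return rotated.reshape(-1)

d, base = 128, 500_000                                     # illustrative head dim and RoPE base
theta = base ** (-2.0 * np.arange(d // 2) / d)             # theta_i = base^(-2i/d)

q_m = apply_rope(np.random.randn(d), pos=10, theta=theta)  # stands in for f_q(x_m, m)
k_n = apply_rope(np.random.randn(d), pos=3, theta=theta)   # stands in for f_k(x_n, n)
logit = q_m @ k_n / np.sqrt(d)                             # pre-softmax attention weight
```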

The per-dimensional rotation matrix and rotation angle, for $i=0,1,\dots,d/2-1$, are defined as:

$f_{q,k}(n)_i=\begin{pmatrix}\cos n\theta_i & -\sin n\theta_i \\ \sin n\theta_i & \cos n\theta_i\end{pmatrix};\quad \theta_i=\theta_{base}^{-2i/d}$

  • $f_{q,k}(n)_i$: per-dimensional rotation matrix applied at position $n$
  • $n$: position index
  • $\theta_i$: per-dimensional rotation angle for $i=0,1,\dots,d/2-1$
  • $\theta_{base}$: a predefined RoPE base value

The corresponding period length $T_i$ can be calculated as:

$T_{i}=\frac{2\pi}{\theta_i}$

  • $T_i$: the corresponding period length
  • $\theta_i$: per-dimensional rotation angle for $i=0,1,\dots,d/2-1$
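
A quick numerical illustration (with assumed head dimension, base, and training length, not values taken from the paper) shows why this matters for the insufficient-training hypothesis: the periods of the highest RoPE dimensions are far longer than a typical pre-training window, so those dimensions never complete a full rotation during pre-training.

```python
import numpy as np

d, base, train_len = 128, 500_000, 8192        # illustrative values
theta = base ** (-2.0 * np.arange(d // 2) / d) # theta_i
T = 2 * np.pi / theta                          # period T_i per dimension

print(T[0])                    # ~6.3 tokens: lowest dimension rotates many times per window
print(T[-1])                   # ~2.5e6 tokens: highest dimension covers only a tiny fraction
print(np.sum(T > train_len))   # number of dimensions whose period exceeds the training window
```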

The critical dimension can be computed as:

$d_{\text{tcd}}=2\left\lceil \frac{d}{2}\log_{\theta_{base}} \frac{L_{\text{train}}}{2\pi} \right\rceil$

  • $d_{\text{tcd}}$: theoretical critical dimension
  • $d$: attention head dimension
  • $\theta_{base}$: a predefined RoPE base value
  • $L_{\text{train}}$: pre-trained context window length
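
As a worked example of this formula (under illustrative parameters, not numbers quoted from the paper), the snippet below evaluates it for a hypothetical 128-dimensional head with base 500000 pre-trained on 8192 tokens.

```python
import math

def theoretical_critical_dim(d, base, train_len):
    """d_tcd = 2 * ceil((d / 2) * log_base(train_len / (2 * pi)))."""
    return 2 * math.ceil((d / 2) * math.log(train_len / (2 * math.pi), base))

# Illustrative configuration (an assumption, not a value reported in the paper):
print(theoretical_critical_dim(d=128, base=500_000, train_len=8192))
```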

The rescaled per-dimensional rotation angle is defined as:

$\hat\theta_i=\frac{1}{\lambda_i\times\theta_{base}^{2i/d}}$

  • $\hat\theta_i$: rescaled per-dimensional rotation angle
  • $\lambda_i$: rescaling factor for the $i^{\text{th}}$ RoPE dimension
  • $\theta_{base}$: a predefined RoPE base value
  • $d$: attention head dimension

The constraint to avoid OOD is defined as:

$\lambda_{i}\ge \frac{L}{L_\text{train}}; \quad \text{for} \quad i\ge d_{\text{tcd}}$

  • $\lambda_i$: rescaling factor for the $i^{\text{th}}$ RoPE dimension
  • $L$: target context window size
  • $L_\text{train}$: pre-trained context window size
  • $d_{\text{tcd}}$: theoretical critical dimension
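
Putting the rescaled angle and the OOD constraint together, the sketch below builds a rescaled angle vector under assumed values for the target window, the pre-trained window, and the critical dimension; the factors for the lower dimensions are placeholders, since in LongRoPE2 they come from the evolutionary search.

```python
import numpy as np

d, base = 128, 500_000                 # illustrative head dim and RoPE base
L_train, L = 8192, 131_072             # assumed pre-trained and target context windows
d_tcd = 70                             # assumed critical dimension (see the formula above)

i = np.arange(d // 2)
theta = base ** (-2.0 * i / d)         # original theta_i
lam = np.ones(d // 2)                  # rescaling factors lambda_i (placeholders below d_tcd)
lam[d_tcd // 2:] = L / L_train         # enforce lambda_i >= L / L_train for i >= d_tcd
theta_hat = theta / lam                # theta_hat_i = 1 / (lambda_i * base^(2i/d))

# The constraint stretches the periods of the critical-and-above dimensions by at least
# L / L_train, keeping their rotation angles within the range seen during pre-training.
assert np.all(lam[d_tcd // 2:] >= L / L_train)
```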

The evolutionary search identifies the real critical dimension $d_{rcd}$ and the optimal rescaling factors using the following steps:

  1. Initialize $d_{rcd}$ and the rescaling factors
  2. Generate $L$-token documents
  3. Compute PPL for each candidate by applying its rescaling factors to the LLM and evaluating the input $\mathbf{X}$.

The theta base for $d_{rcd}$ is updated after mutation, and NTK scaling is applied to the rescaling factors in the lower-dimension group.
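
The following is a simplified sketch of such a perplexity-guided evolutionary search, not the authors' exact algorithm: it omits the search over $d_{rcd}$, the theta-base update, and the NTK initialization of the lower dimensions, and `score_fn` stands in for evaluating the needle-driven PPL of the rescaled model on $L$-token documents.

```python
import numpy as np

def evolutionary_search(score_fn, d_half, pop_size=16, iters=10, mutate_std=0.1, seed=0):
    """Keep the lowest-PPL rescaling candidates each generation and mutate them.
    score_fn(candidate) is assumed to return the needle-driven PPL after applying
    the candidate rescaling factors to the model (lower is better)."""
    rng = np.random.default_rng(seed)
    population = [np.abs(1.0 + rng.normal(0.0, mutate_std, d_half)) for _ in range(pop_size)]
    for _ in range(iters):
        ranked = sorted(population, key=score_fn)          # evaluate every candidate
        parents = ranked[: pop_size // 2]                  # select the best half
        children = [np.abs(p + rng.normal(0.0, mutate_std, d_half)) for p in parents]
        population = parents + children                    # next generation
    return min(population, key=score_fn)

# Usage with a toy stand-in scorer (a real run would query the LLM):
best = evolutionary_search(score_fn=lambda lam: float(np.sum((lam - 2.0) ** 2)), d_half=64)
```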

The paper presents experiments on LLaMA3-8B and Phi3-mini-3.8B. The models were extended to a 128k context window and mid-trained on 64 A100 GPUs using a 10B-token dataset. Baselines include state-of-the-art RoPE rescaling methods such as YaRN, NTK, and LongRoPE.

The evaluation included:

  • Long-context stress tests, including RULER and Needle in a Haystack
  • Real-world long-context benchmarks including LOFT, InfiniteBench, and LongBench
  • Standard benchmarks within a 4096-token context.

Key results include:

  • LongRoPE2 consistently outperforms prior methods on RULER, achieving superior results across all evaluation lengths within the 128k window
  • LongRoPE2 achieves near-perfect accuracy across all evaluation lengths within the 128k context window on the Needle in a Haystack test
  • LongRoPE2 consistently improves performance across real-world benchmarks, demonstrating strong generalization to practical scenarios

Ablation studies validated:

  • The effectiveness of the real critical dimension $d_{rcd}$
  • The effectiveness of the needle-PPL-guided search
  • The effectiveness of mixed context window training

The authors conclude by noting that LongRoPE2 uses evolutionary search-guided rescaling and mixed context window training to achieve a 128k effective context length with just 10B tokens, retaining 97.6% of the original short-context performance.

Authors (8)
  1. Ning Shang
  2. Li Lyna Zhang
  3. Siyuan Wang
  4. Gaokai Zhang
  5. Gilsinia Lopez
  6. Fan Yang
  7. Weizhu Chen
  8. Mao Yang