The paper introduces LLMpresso, a method for extending the context window of pre-trained LLMs while preserving performance on shorter contexts. LLMpresso addresses the out-of-distribution (OOD) issue in rotary positional embeddings (RoPE) by focusing on the hypothesis that higher RoPE dimensions are insufficiently trained, which affects the effectiveness of existing rescaling methods. The method includes a RoPE rescaling algorithm using evolutionary search guided by "needle-driven" perplexity (PPL) and mixed context window training.
The authors identify two major challenges in extending LLM context windows:
- Existing rescaling methods fail to reach the intended effective context length
- Performance degrades on the original short context window
The authors attribute these issues to insufficient training in higher RoPE dimensions, resulting in shorter effective RoPE rotation ranges.
LLMpresso includes the following innovations:
- A RoPE rescaling algorithm that uses evolutionary search to identify critical RoPE dimensions and optimal rescaling factors, guided by a "needle-driven" perplexity evaluation.
- A mixed context window training approach, which fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving short-context performance with the original RoPE.
The RoPE is calculated as follows:
$\mathbf{q}_m=f_q(\mathbf{x}_m,m);\quad f_q(\mathbf{x}_m,m)=e^{im\theta}\mathbf{W}_q\mathbf{x}_m$
$\mathbf{k}_n=f_k(\mathbf{x}_n,n);\quad f_k(\mathbf{x}_n,n)=e^{in\theta}\mathbf{W}_k\mathbf{x}_n$
- $\mathbf{q}_m$: query representation at position $m$
- $\mathbf{x}_m$: input embedding vector at position $m$
- $m$, $n$: position indices
- $f_q$: function that incorporates position information into the word embedding and transforms it into the query representation
- $i$: imaginary unit
- $\theta$: per-dimensional rotation angle
- $\mathbf{W}_q$, $\mathbf{W}_k$: query and key projection matrices
- $\mathbf{k}_n$: key representation at position $n$
- $f_k$: function that incorporates position information into the word embedding and transforms it into the key representation
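To make the rotation concrete, here is a minimal NumPy sketch of the complex-rotation form above; the function name, the 8-dimensional vectors, and the default base of 10000 are illustrative assumptions, not details from the paper.

```python
import numpy as np

def rope_rotate(x, m, theta_base=10000.0):
    """Rotate vector x (even length d) to position m via RoPE."""
    d = x.shape[0]
    i = np.arange(d // 2)
    theta = theta_base ** (-2 * i / d)      # per-dimension rotation angles
    z = x[0::2] + 1j * x[1::2]              # pair (x_{2i}, x_{2i+1}) as complex
    z_rot = z * np.exp(1j * m * theta)      # multiply by e^{i m theta}
    out = np.empty_like(x)
    out[0::2], out[1::2] = z_rot.real, z_rot.imag
    return out

# The attention logit between a query at position m and a key at position n
# then depends only on the relative offset m - n.
rng = np.random.default_rng(0)
x, y = rng.normal(size=8), rng.normal(size=8)
logit = rope_rotate(x, 5) @ rope_rotate(y, 3) / np.sqrt(8)
```

The relative-position property is what makes RoPE attractive for context extension: shifting both positions by the same amount leaves the attention logit unchanged.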
The attention weights are computed as:
$\mathrm{softmax}\left(\frac{\mathbf{q}_m^{\top}\mathbf{k}_n}{\sqrt{d}}\right)$
- $\mathbf{q}_m$: query representation at position $m$
- $\mathbf{k}_n$: key representation at position $n$
- $d$: attention head dimension
The per-dimensional rotation angle $\theta_i$ for RoPE dimension $i$ is defined as:
$\theta_i=\theta_{base}^{-2i/d},\quad i\in\{0,1,\ldots,d/2-1\}$
- $i$: RoPE dimension (pair) index
- $\theta_i$: per-dimensional rotation angle for dimension $i$
- $\theta_{base}$: a predefined RoPE base value
The corresponding period length can be calculated as:
$T_i=\frac{2\pi}{\theta_i}=2\pi\,\theta_{base}^{2i/d}$
- $T_i$: the period length of dimension $i$
- $\theta_i$: per-dimensional rotation angle for dimension $i$
The theoretical critical dimension can be computed as:
$d_{\text{tcd}}=2\left\lceil \frac{d}{2}\log_{\theta_{base}}\frac{L_{\text{train}}}{2\pi}\right\rceil$
- $d_{\text{tcd}}$: theoretical critical dimension
- $d$: attention head dimension
- $\theta_{base}$: a predefined RoPE base value
- $L_{\text{train}}$: input sequence length used in pre-training
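Plugging illustrative LLaMA3-8B-style values into the definitions above (head dimension $d=128$, $\theta_{base}=500{,}000$, pre-training length 8192; these specific numbers are assumptions for this sketch, not figures quoted from the paper):

```python
import math

d, theta_base, L_train = 128, 500_000.0, 8192

def theta(i):
    return theta_base ** (-2 * i / d)    # rotation angle of dimension pair i

def period(i):
    return 2 * math.pi / theta(i)        # tokens needed for one full rotation

# theoretical critical dimension: 70 for these values
d_tcd = 2 * math.ceil((d / 2) * math.log(L_train / (2 * math.pi), theta_base))

# Dimension pairs at or beyond d_tcd / 2 have periods longer than L_train,
# so they never complete a full rotation during pre-training -- the
# "insufficiently trained higher dimensions" the paper targets.
undertrained = period(d_tcd // 2) > L_train
```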
The rescaled per-dimensional rotation angle is defined as:
$\hat\theta_i=\frac{1}{\lambda_i\times\theta_{base}^{2i/d}}=\frac{\theta_i}{\lambda_i}$
- $\hat\theta_i$: rescaled per-dimensional rotation angle
- $\lambda_i$: rescaling factor for the $i$-th RoPE dimension
- $\theta_{base}$: a predefined RoPE base value
- $d$: attention head dimension
The constraint to avoid OOD is defined as:
$\lambda_i \ge \frac{L}{L_{\text{train}}},\quad \forall\, i \ge \frac{d_{\text{tcd}}}{2}$
- $\lambda_i$: rescaling factor for the $i$-th RoPE dimension
- $L$: target context window size
- $L_{\text{train}}$: pre-trained context window size
- $d_{\text{tcd}}$: theoretical critical dimension

That is, dimensions at or beyond the critical dimension must be scaled at least by the extension ratio, so that the rotation reached at the target length stays within the range seen during pre-training.
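Under the same illustrative values as before, the rescaled angle and the OOD constraint can be checked numerically; the 8k-to-128k extension and the toy choice of factors are assumptions for this sketch.

```python
import math

d, theta_base = 128, 500_000.0
L_train, L_target = 8192, 131_072
s = L_target / L_train                  # minimum scale for higher dims (16x)
d_tcd = 2 * math.ceil((d / 2) * math.log(L_train / (2 * math.pi), theta_base))

def theta(i):
    return theta_base ** (-2 * i / d)

def theta_hat(i, lam):
    return theta(i) / lam               # rescaled per-dimension angle

for i in range(d // 2):
    lam = s if 2 * i >= d_tcd else 1.0  # toy factors: rescale only higher dims
    if 2 * i >= d_tcd:
        # rotation reached at the target length must not exceed the
        # rotation range covered during pre-training
        assert theta_hat(i, lam) * L_target <= theta(i) * L_train * (1 + 1e-12)
```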
The evolutionary search identifies the real critical dimension $d_{\text{rcd}}$ and the optimal rescaling factors using the following steps:
- Initialize $d_{\text{rcd}}$ and the rescaling factors $\lambda_i$
- Generate evaluation documents of the target context length
- Compute the needle-PPL for each candidate by applying its rescaling factors to the LLM and evaluating it on the generated documents.
The theta base for $d_{\text{rcd}}$ is updated after each mutation, and NTK scaling is applied to the rescaling factors in the lower-dimension group.
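A highly simplified sketch of such a search loop, with needle-PPL stubbed out by a toy fitness function (the real evaluation requires running the LLM over long documents, and the population sizes, mutation scheme, and target values below are invented for illustration):

```python
import random

def needle_ppl(d_rcd, factors):
    """Stub for needle-driven PPL: the paper scores perplexity only on the
    'needle' answer tokens of long documents; a toy fitness stands in here."""
    return sum((f - 16.0) ** 2 for f in factors) + abs(d_rcd - 70)

def evolutionary_search(d=128, s=16.0, pop_size=8, generations=30, seed=0):
    rng = random.Random(seed)

    def constrain(d_rcd, factors):
        # dims at/beyond d_rcd must scale at least by the extension ratio s
        return d_rcd, [max(f, s) if 2 * i >= d_rcd else f
                       for i, f in enumerate(factors)]

    def init():
        d_rcd = rng.randrange(60, 90, 2)           # candidate critical dim
        return constrain(d_rcd, [rng.uniform(1.0, 2 * s)
                                 for _ in range(d // 2)])

    def mutate(cand):
        d_rcd, factors = cand
        d_rcd = max(2, d_rcd + rng.choice([-2, 0, 2]))
        return constrain(d_rcd, [max(1.0, f + rng.gauss(0, 0.5))
                                 for f in factors])

    pop = [init() for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: needle_ppl(*c))     # lower PPL is better
        parents = pop[: pop_size // 2]
        pop = parents + [mutate(p) for p in parents]
    return min(pop, key=lambda c: needle_ppl(*c))

best_d_rcd, best_factors = evolutionary_search()
```

The key design point carried over from the paper is that the OOD constraint is enforced on every candidate, so the search only explores factors that keep higher-dimension rotations in-distribution.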
The paper presents experiments on LLaMA3-8B and Phi3-mini-3.8B. The models were extended to a 128k context window and mid-trained on 64 A100 GPUs using a 10B-token dataset. Baselines include state-of-the-art RoPE rescaling methods such as YaRN, NTK, and LongRoPE.
The evaluation included:
- Long-context stress tests, including RULER and Needle in a Haystack
- Real-world long-context benchmarks including LOFT, InfiniteBench, and LongBench
- Standard benchmarks within a 4096-token context.
Key results include:
- LLMpresso consistently outperformed prior methods on RULER, achieving superior results across all evaluation lengths within the 128k window
- LLMpresso achieves near-perfect accuracy across all evaluation lengths in the Needle in a Haystack test
- On real-world benchmarks, LLMpresso consistently improves performance across the board, demonstrating strong generalization to practical scenarios
Ablation studies validated:
- The effectiveness of real critical dimension
- The effectiveness of needle-PPL guided search
- The effectiveness of mixed context window training
The authors conclude by noting that LLMpresso uses evolutionary search-guided rescaling and mixed context window training to achieve a 128k effective context length with just 10B tokens, retaining 97.6% of the original short-context performance.