LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Abstract: A large context window is a desirable feature in LLMs. However, due to high fine-tuning costs, the scarcity of long texts, and the catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE, which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with only up to 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformity in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short-context-window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
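The non-uniform positional interpolation the abstract describes can be sketched as rescaling RoPE rotation angles with a separate factor per frequency dimension, rather than one uniform factor. The sketch below is a minimal illustration, not the paper's implementation: the `rescale` factors and the `keep_first` cutoff (initial positions left uninterpolated) stand in for the values LongRoPE finds via its evolutionary search, and uniform positional interpolation falls out as the special case where every factor equals the extension ratio.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None, keep_first=0):
    """Rotation angles for RoPE with per-dimension rescale factors.

    `rescale` holds one factor per frequency pair and `keep_first` is the
    number of leading positions left uninterpolated; both are illustrative
    stand-ins for the searched LongRoPE parameters. With rescale=None this
    reduces to the original (unextended) RoPE angles.
    """
    # theta_i = base^(-2i/d), one frequency per dimension pair
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    if rescale is None:
        rescale = np.ones(dim // 2)
    pos = np.asarray(positions, dtype=float)[:, None]
    # keep the first `keep_first` positions as-is, interpolate the rest
    scaled = np.where(pos < keep_first,
                      pos,
                      keep_first + (pos - keep_first) / np.asarray(rescale))
    return scaled * freqs  # shape: (num_positions, dim // 2)
```

For example, `rope_angles(range(4096), dim, rescale=[8.0] * (dim // 2))` squeezes 4096 positions into the angle range of the original 512, while a non-uniform `rescale` can interpolate high-frequency (low-index) dimensions less aggressively than low-frequency ones.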