This paper introduces LLMpresso (Ding et al., 21 Feb 2024), a method for significantly extending the context window of pre-trained LLMs that use Rotary Position Embedding (RoPE). Unlike previous methods, which are limited to around 128k tokens, LLMpresso extends context windows to 2048k tokens with only limited fine-tuning, while preserving performance on shorter contexts.
The core of LLMpresso lies in its exploration and exploitation of previously overlooked non-uniformities within the RoPE positional embedding. Existing methods like Position Interpolation (PI), NTK, and YaRN apply interpolation or extrapolation based on fixed rules or frequency groups. LLMpresso empirically identifies two crucial forms of non-uniformity:
- Varying RoPE dimensions: Different dimensions of the RoPE vector benefit from different interpolation or extrapolation factors. Lower dimensions (higher frequency) require less interpolation than higher dimensions (lower frequency), but the optimal factors depend on the target extended length.
- Token positions: Initial tokens in the sequence benefit from less positional interpolation, or even direct extrapolation, because they are critical to the attention mechanism. The optimal number of such tokens also depends on the target length.
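For intuition, the sketch below shows one way these two non-uniformities can be realized when building the RoPE cos/sin tables, assuming the rescale factor divides the per-dimension rotation angle and the first few positions are left un-interpolated. The function and argument names (`build_rope_angles`, `rescale`, `n_hat`) are illustrative, not the paper's released code.

```python
import numpy as np

def build_rope_angles(seq_len, head_dim, rescale, n_hat, base=10000.0):
    """Build RoPE rotation angles with per-dimension rescale factors.

    rescale : array of shape (head_dim // 2,), one interpolation factor per
              RoPE frequency dimension (1.0 = no interpolation).
    n_hat   : number of initial token positions left un-interpolated.
    """
    half = head_dim // 2
    # Standard RoPE inverse frequencies: base^(-2i/d) for i = 0..half-1.
    inv_freq = base ** (-np.arange(half) / half)            # (half,)
    pos = np.arange(seq_len)[:, None]                        # (seq_len, 1)

    # Non-uniform interpolation: divide positions by a per-dimension factor.
    angles_interp = (pos / rescale[None, :]) * inv_freq[None, :]
    # The first n_hat tokens keep the original (non-interpolated) angles.
    angles_orig = pos * inv_freq[None, :]
    angles = np.where(pos < n_hat, angles_orig, angles_interp)
    return np.cos(angles), np.sin(angles)                    # each (seq_len, half)

# Example: extend a 4k model to 32k with hypothetical (not searched) factors,
# using less interpolation for the lower, higher-frequency dimensions.
half_dim = 64
rescale = np.linspace(1.0, 8.0, half_dim)
cos, sin = build_rope_angles(32768, 128, rescale, n_hat=16)
```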
To effectively leverage these non-uniformities, LLMpresso frames the problem as a search for an optimal rescale factor for each RoPE dimension and an optimal threshold on the number of initial tokens left un-interpolated. Because this search space is vast, LLMpresso employs an evolutionary search algorithm to navigate it efficiently. Key optimizations for the search include:
- Optimized initial population: Seeding the initial search population with configurations from existing methods (PI, NTK, YaRN) and their mutations.
- Monotonically non-decreasing constraint: Requiring the rescale factors to be monotonically non-decreasing from lower (higher-frequency) to higher (lower-frequency) RoPE dimensions, which reduces the search space and aligns with the frequency dependency observed in NTK theory.
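A highly simplified version of such a search is sketched below: the population is seeded with PI-like (uniform) and NTK/YaRN-like (graded) factor profiles, every candidate is projected onto the monotonic constraint, and selection is driven by a caller-supplied perplexity evaluator. The search over the initial-token threshold is omitted for brevity, and all names (`evolutionary_search`, `eval_ppl`, etc.) are illustrative rather than the paper's implementation.

```python
import numpy as np

def enforce_monotone(factors):
    """Project a candidate onto the monotonically non-decreasing constraint."""
    return np.maximum.accumulate(factors)

def mutate(factors, scale, rng):
    """Randomly perturb factors, then re-apply the constraint and clip to [1, scale]."""
    child = factors * rng.uniform(0.9, 1.1, size=factors.shape)
    return np.clip(enforce_monotone(child), 1.0, scale)

def evolutionary_search(eval_ppl, half_dim, scale, iters=40, pop=32, topk=8, seed=0):
    """Toy evolutionary search over per-dimension RoPE rescale factors.

    eval_ppl(factors) -> validation perplexity of the model with RoPE rescaled
    by `factors`; supplied by the caller (assumed, not shown here).
    """
    rng = np.random.default_rng(seed)
    # Seed with PI-like (uniform) and NTK/YaRN-like (graded) configurations.
    pi_like = np.full(half_dim, float(scale))
    ntk_like = np.linspace(1.0, scale, half_dim)
    population = [pi_like, ntk_like] + [mutate(ntk_like, scale, rng) for _ in range(pop - 2)]

    for _ in range(iters):
        ranked = sorted(population, key=eval_ppl)            # lower perplexity wins
        parents = ranked[:topk]
        children = [mutate(parents[rng.integers(topk)], scale, rng)
                    for _ in range(pop - topk)]
        population = parents + children
    return min(population, key=eval_ppl)

# Usage with a stand-in objective (replace with real long-context perplexity):
target = np.linspace(1.0, 8.0, 64)
best = evolutionary_search(lambda f: float(np.abs(f - target).sum()),
                           half_dim=64, scale=8.0)
```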
This search-based approach to non-uniform interpolation offers significant benefits. Empirically, the paper shows that applying the searched non-uniform RoPE factors alone can enable an 8× context window extension (e.g., from 4k to 32k) without any fine-tuning, outperforming existing methods that struggle beyond 2×.
To achieve the extremely large 2048k context window, LLMpresso proposes a progressive extension strategy:
- Initial Search and Fine-tuning: The pre-trained LLM (e.g., LLaMA2-7B or Mistral-7B) is first extended to a moderately long context window (e.g., 128k or 256k) by searching for optimal RoPE factors using the evolutionary algorithm and then fine-tuning the model on datasets chunked to this length.
- Secondary Search for Extreme Length: A second evolutionary search is performed on the fine-tuned extended LLM to find new optimal RoPE rescale factors for the target 2048k context window. This step achieves the 2048k extension without requiring further fine-tuning on extremely long texts, which are scarce and computationally expensive. The success of this step relies on the fact that the fine-tuned model at 256k provides a much better starting point for extending to 2048k (an 8× extension ratio) compared to extending the original 4k model directly to 2048k (a 512× extension ratio).
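The recipe can be summarized as a two-step pipeline, sketched below with caller-supplied callables standing in for the search, RoPE patching, and fine-tuning stages; these helper signatures are assumptions for illustration, not a released API.

```python
def progressive_extension(model, search_factors, apply_factors, finetune,
                          mid_len=256 * 1024, final_len=2048 * 1024):
    """Outline of the progressive extension recipe.

    Caller-supplied callables (assumed signatures, for illustration only):
      search_factors(model, target_len) -> (per-dim rescale factors, n_hat)
      apply_factors(model, factors, n_hat) -> model with rescaled RoPE
      finetune(model, context_len) -> model fine-tuned at that context length
    """
    # Step 1: search factors for a moderate target (e.g., 256k) and fine-tune there.
    factors, n_hat = search_factors(model, mid_len)
    model = finetune(apply_factors(model, factors, n_hat), mid_len)

    # Step 2: a second search on the fine-tuned model reaches the final target
    # with no further fine-tuning (only an 8x jump from 256k, vs. 512x from 4k).
    factors, n_hat = search_factors(model, final_len)
    return apply_factors(model, factors, n_hat)
```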
A known challenge with context window extension via positional interpolation is performance degradation on the original, shorter context lengths. To address this, LLMpresso includes a recovery step:
- Shorter Context Window Recovery: An additional evolutionary search is conducted on the 2048k-extended model specifically for shorter context lengths (e.g., 4k or 8k). This search encourages less interpolation for these shorter lengths. During inference, the model dynamically adjusts the RoPE rescale factors based on the input sequence length, using the factors optimized for the current length.
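In practice this can be implemented by keeping one searched factor set per supported context length and selecting among them based on the input length at inference time. Below is a minimal sketch; the class, table, and placeholder factor names are illustrative, not the paper's API.

```python
import bisect

class DynamicRoPEFactors:
    """Select RoPE rescale factors based on the current input length.

    `table` maps a context length to the factor set searched for that length,
    e.g. {8192: short_factors, 262144: mid_factors, 2097152: long_factors}.
    """
    def __init__(self, table):
        self.lengths = sorted(table)
        self.table = dict(table)

    def factors_for(self, seq_len):
        # Use the smallest searched length that still covers the input; anything
        # beyond the largest entry falls back to the longest (2048k) factors.
        idx = bisect.bisect_left(self.lengths, seq_len)
        key = self.lengths[min(idx, len(self.lengths) - 1)]
        return self.table[key]

# Example with placeholder factor sets:
selector = DynamicRoPEFactors({
    8 * 1024: "factors_8k",
    256 * 1024: "factors_256k",
    2048 * 1024: "factors_2048k",
})
print(selector.factors_for(3_000))     # -> factors_8k (short prompts stay sharp)
print(selector.factors_for(500_000))   # -> factors_2048k
```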
The effectiveness of LLMpresso is demonstrated through extensive experiments on LLaMA2-7B and Mistral-7B models:
- Long-Sequence Language Modeling: LLMpresso-extended models achieve significantly lower perplexity on long-document datasets such as Proof-pile, PG19, and Books3 compared to baselines (PI, NTK, YaRN) across a wide range of context lengths, including the unprecedented 2048k, and perplexity continues to improve as the evaluation context length grows.
- Passkey Retrieval: LLMpresso-LLaMA2-2048k (fine-tuned at 256k) maintains high passkey retrieval accuracy (90%) up to 2048k tokens, and LLMpresso-Mistral-2048k (fine-tuned at 128k) achieves 100% accuracy up to 1800k tokens. This highlights the models' practical ability to utilize extremely long contexts for a concrete task (see the prompt-construction sketch after this list).
- Standard Benchmarks: Evaluated on standard short-context benchmarks (ARC-Challenge, HellaSwag, MMLU, TruthfulQA), LLMpresso-extended models maintain performance comparable to or even slightly better than the original non-extended models and existing long-context baselines, indicating the recovery mechanism is effective at mitigating performance degradation on short inputs.
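For context, passkey retrieval is usually evaluated with synthetic prompts in which a random 5-digit key is hidden inside long filler text and the model must recall it at the end. The sketch below follows that common recipe; the filler wording and rough tokens-per-block estimate are assumptions, and this is not the paper's evaluation code.

```python
import random

def build_passkey_prompt(target_tokens, rng=None):
    """Build a synthetic passkey-retrieval prompt of roughly `target_tokens` tokens."""
    rng = rng or random.Random(0)
    passkey = rng.randint(10000, 99999)
    filler = "The grass is green. The sky is blue. The sun is yellow. Here we go. "
    blocks = max(1, target_tokens // 16)      # ~16 tokens per filler block (rough guess)
    lines = [filler] * blocks
    # Bury the passkey at a random depth in the filler text.
    lines[rng.randint(0, blocks - 1)] = f"The pass key is {passkey}. Remember it. "
    prompt = "".join(lines) + "\nWhat is the pass key? The pass key is"
    return prompt, passkey

prompt, answer = build_passkey_prompt(target_tokens=32_000)
```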
Ablation studies confirm the contribution of the progressive extension strategy and the two forms of non-uniformity. The secondary search on the fine-tuned model is shown to be crucial for extending beyond the fine-tuning length, and the non-uniformity search, particularly when it considers both RoPE dimensions and initial tokens, significantly improves performance compared to linear interpolation and a dimension-only search.
In summary, LLMpresso presents a practical and effective approach to extending LLM context windows to unprecedented lengths by intelligently leveraging non-uniformities in RoPE via an evolutionary search and employing an efficient progressive extension strategy. This enables LLMs to process and understand extremely long documents, opening doors for new applications. The method requires only minor modifications to the positional embedding and can reuse existing optimizations, making it highly applicable in practice. The code and models are planned for release to facilitate further research and application.