LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens (2402.13753v1)

Published 21 Feb 2024 in cs.CL

Abstract: Large context window is a desirable feature in LLMs. However, due to high fine-tuning costs, scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE that, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with up to only 1k fine-tuning steps at within 256k training lengths, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformities in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.

This paper introduces LongRoPE (Ding et al., 21 Feb 2024), a method for substantially extending the context window of pre-trained LLMs that use Rotary Position Embedding (RoPE). Unlike previous methods, which are limited to around 128k tokens, LongRoPE extends context windows up to 2048k tokens with limited fine-tuning, while preserving performance on shorter contexts.

The core of LongRoPE lies in its exploration and exploitation of previously overlooked non-uniformities within the RoPE positional embedding. Existing methods such as Position Interpolation (PI), NTK, and YaRN apply interpolation or extrapolation based on fixed rules or frequency groups. LongRoPE empirically identifies two crucial forms of non-uniformity:

  1. Varying RoPE dimensions: Different dimensions of the RoPE vector benefit from different interpolation or extrapolation factors. Lower dimensions (higher frequency) require less interpolation than higher dimensions (lower frequency), but the optimal factors depend on the target extended length.
  2. Token positions: Initial tokens in the sequence benefit from less positional interpolation, or even direct extrapolation, because they are critical to the attention mechanism. The optimal number of such tokens ($\hat{n}$) also depends on the target length; a minimal sketch of this non-uniform rescaling follows the list.
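The sketch below shows one way such non-uniform rescaling could be applied when building the RoPE cos/sin tables. It is an illustrative reconstruction, not the authors' implementation; the function name and defaults are assumptions.

```python
import torch

def non_uniform_rope_angles(head_dim, max_pos, lambdas=None, n_hat=0, base=10000.0):
    """Build RoPE cos/sin tables with per-dimension rescale factors.

    lambdas: tensor of shape (head_dim // 2,) holding the rescale factor
             lambda_i for each RoPE dimension (1.0 = no interpolation).
    n_hat:   number of initial token positions that keep the original,
             uninterpolated rotary angles.
    """
    half = head_dim // 2
    # theta_i = base^(-2i/d): low dimensions rotate fast (high frequency).
    theta = base ** (-torch.arange(half, dtype=torch.float32) / half)
    if lambdas is None:
        lambdas = torch.ones(half)
    pos = torch.arange(max_pos, dtype=torch.float32)

    scaled = torch.outer(pos, theta / lambdas)    # interpolated angles
    original = torch.outer(pos, theta)            # uninterpolated angles
    keep_original = (pos < n_hat).unsqueeze(1)    # first n_hat positions
    angles = torch.where(keep_original, original, scaled)
    return angles.cos(), angles.sin()
```

Under this view, PI corresponds to setting every $\lambda_i$ to the extension ratio with $\hat{n} = 0$, whereas LongRoPE searches for a separate factor per dimension and a nonzero $\hat{n}$.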

To effectively leverage these non-uniformities, LongRoPE frames the problem as searching for an optimal rescale factor for each RoPE dimension ($\lambda_i$) and an optimal threshold for initial tokens ($\hat{n}$). The search space for these factors is vast, so LongRoPE employs an evolutionary search algorithm to navigate it efficiently. Key optimizations for the search include the following (a minimal search-loop sketch appears after the list):

  • Optimized initial population: Seeding the initial search population with configurations from existing methods (PI, NTK, YaRN) and their mutations.
  • Monotonically non-decreasing constraint: Imposing the constraint $\lambda_i \le \lambda_{i+1}$ across RoPE dimensions to reduce the search space and align with the NTK theory's observation about frequency dependency.
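The loop below is a minimal sketch of such a search under stated assumptions: the population is seeded from existing interpolation schemes, the mutation rate and population sizes are illustrative, and `eval_perplexity` stands in for scoring a candidate by perplexity on long validation samples. The initial-token threshold $\hat{n}$, which the paper searches jointly with the $\lambda_i$, is omitted here for brevity.

```python
import random

def enforce_monotonic(factors):
    """Project a candidate onto the lambda_i <= lambda_{i+1} constraint."""
    out = list(factors)
    for i in range(1, len(out)):
        out[i] = max(out[i], out[i - 1])
    return out

def mutate(factors, p=0.3, jitter=0.1):
    """Perturb a random subset of the per-dimension rescale factors."""
    child = [f * random.uniform(1 - jitter, 1 + jitter) if random.random() < p else f
             for f in factors]
    return enforce_monotonic([max(f, 1.0) for f in child])

def evolutionary_search(seed_candidates, eval_perplexity, iters=40, parents=8, offspring=16):
    """Search per-dimension RoPE rescale factors.

    seed_candidates: initial population, e.g. the factors implied by PI, NTK and YaRN.
    eval_perplexity: callback returning the model's perplexity at the target
                     length when the candidate factors are applied (lower is better).
    """
    population = [enforce_monotonic(list(c)) for c in seed_candidates]
    for _ in range(iters):
        population.sort(key=eval_perplexity)
        elite = population[:parents]
        children = [mutate(random.choice(elite)) for _ in range(offspring)]
        population = elite + children
    return min(population, key=eval_perplexity)
```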

This search-based approach to non-uniform interpolation offers significant benefits. Empirically, the paper shows that applying the searched non-uniform RoPE factors alone can enable an 8× context window extension (e.g., from 4k to 32k) without any fine-tuning, outperforming existing methods that struggle beyond 2×.

To achieve the extremely large 2048k context window, LongRoPE proposes a progressive extension strategy (sketched in code after the list):

  1. Initial Search and Fine-tuning: The pre-trained LLM (e.g., LLaMA2-7B or Mistral-7B) is first extended to a moderately long context window (e.g., 128k or 256k) by searching for optimal RoPE factors using the evolutionary algorithm and then fine-tuning the model on datasets chunked to this length.
  2. Secondary Search for Extreme Length: A second evolutionary search is performed on the fine-tuned extended LLM to find new optimal RoPE rescale factors for the target 2048k context window. This step achieves the 2048k extension without requiring further fine-tuning on extremely long texts, which are scarce and computationally expensive. The success of this step relies on the fact that the fine-tuned model at 256k provides a much better starting point for extending to 2048k (an 8× extension ratio) compared to extending the original 4k model directly to 2048k (a 512× extension ratio).
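The end-to-end flow can be summarized in the sketch below. The callbacks `search_factors`, `apply_factors`, and `finetune` are hypothetical stand-ins for the search, RoPE-patching, and fine-tuning steps described above; the target lengths and the roughly 1k-step budget mirror the numbers quoted earlier but are otherwise illustrative.

```python
def progressive_extension(model, search_factors, apply_factors, finetune):
    """Sketch of LongRoPE's progressive extension strategy.

    search_factors(model, target_len) -> per-dimension rescale factors
    apply_factors(model, factors)     -> model with patched RoPE scaling
    finetune(model, max_len, steps)   -> fine-tuned model
    """
    K = 1024
    # Step 1: search factors for a moderate target length, then fine-tune there.
    f_256k = search_factors(model, target_len=256 * K)
    model = finetune(apply_factors(model, f_256k), max_len=256 * K, steps=1000)

    # Step 2: second search on the fine-tuned model for the 2048k target
    # (only an 8x ratio from 256k), with no further fine-tuning.
    f_2048k = search_factors(model, target_len=2048 * K)
    model = apply_factors(model, f_2048k)

    # Step 3: re-search for a short length (e.g. 8k) so that original
    # short-context performance can be recovered at inference time.
    f_8k = search_factors(model, target_len=8 * K)
    return model, {8 * K: f_8k, 256 * K: f_256k, 2048 * K: f_2048k}
```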

A known challenge with context window extension via positional interpolation is performance degradation on the original, shorter context lengths. To address this, LongRoPE includes a recovery step (a brief inference-time sketch follows the item below):

  • Shorter Context Window Recovery: An additional evolutionary search is conducted on the 2048k-extended model specifically for shorter context lengths (e.g., 4k or 8k). This search encourages less interpolation for these shorter lengths. During inference, the model dynamically adjusts the RoPE rescale factors based on the input sequence length, using the factors optimized for the current length.
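One plausible implementation of this length-dependent switching, assuming a dictionary of searched factor sets keyed by the context length they were optimized for (as returned by the previous sketch):

```python
def select_rope_factors(factors_by_length, seq_len):
    """Pick the factor set searched for the smallest context window
    that still covers the current input sequence."""
    for length in sorted(factors_by_length):
        if seq_len <= length:
            return factors_by_length[length]
    # Inputs longer than every searched window fall back to the largest one.
    return factors_by_length[max(factors_by_length)]
```

For example, a 6k-token prompt would use the factors searched at 8k, while a 500k-token prompt would use those searched at 2048k.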

The effectiveness of LongRoPE is demonstrated through extensive experiments on LLaMA2-7B and Mistral-7B models.

  • Long-Sequence Language Modeling: LongRoPE-extended models achieve significantly lower perplexity on long-document datasets such as Proof-pile, PG19, and Books3 compared to baselines (PI, NTK, YaRN) across a wide range of context lengths, including the unprecedented 2048k. The models show consistent perplexity improvement as context length increases.
  • Passkey Retrieval: LongRoPE-LLaMA2-2048k (fine-tuned at 256k) maintains high passkey retrieval accuracy (≥90%) up to 2048k tokens, and LongRoPE-Mistral-2048k (fine-tuned at 128k) achieves 100% accuracy up to 1800k tokens. This highlights their practical ability to use extremely long contexts for specific tasks; a sketch of the task setup follows this list.
  • Standard Benchmarks: Evaluated on standard short-context benchmarks (ARC-Challenge, HellaSwag, MMLU, TruthfulQA), LongRoPE-extended models maintain performance comparable to, or even slightly better than, the original non-extended models and existing long-context baselines, indicating that the recovery mechanism effectively mitigates performance degradation on short inputs.
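As an illustration of the passkey-retrieval setup, the snippet below builds a synthetic prompt of roughly a target token length with a hidden key. The filler sentences and phrasing follow the commonly used formulation of the task, not the paper's verbatim prompt, and `tokenizer` is assumed to expose an `encode` method (e.g., a Hugging Face tokenizer).

```python
import random

def build_passkey_prompt(target_tokens, tokenizer, passkey=None):
    """Hide a random 5-digit passkey inside filler text of roughly
    target_tokens tokens and ask the model to repeat it."""
    passkey = passkey or str(random.randint(10000, 99999))
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    needle = f"The pass key is {passkey}. Remember it. {passkey} is the pass key. "
    question = " What is the pass key? The pass key is"

    # Repeat the filler until the prompt is roughly the desired token length,
    # then insert the needle at a random depth.
    n_repeats = max(1, target_tokens // len(tokenizer.encode(filler)))
    chunks = [filler] * n_repeats
    chunks.insert(random.randint(0, n_repeats), needle)
    return "".join(chunks) + question, passkey
```

Retrieval accuracy at a given context length is then the fraction of such prompts for which the model's continuation contains the hidden passkey.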

Ablation studies confirm the contribution of the progressive extension and the two forms of non-uniformity. The secondary search on the fine-tuned model is shown to be crucial for extending beyond the fine-tuning length. The non-uniformity search, particularly when it considers both dimensions and initial tokens, significantly improves performance compared to linear interpolation and dimension-only search.

In summary, LongRoPE presents a practical and effective approach to extending LLM context windows to unprecedented lengths by exploiting non-uniformities in RoPE via an evolutionary search and employing an efficient progressive extension strategy. This enables LLMs to process and understand extremely long documents, opening the door to new applications. The method requires only minor modifications to the positional embedding and can reuse existing optimizations, making it highly applicable in practice. The authors state that code and models are planned for release to facilitate further research and application.

Authors (8)
  1. Yiran Ding
  2. Li Lyna Zhang
  3. Chengruidong Zhang
  4. Yuanyuan Xu
  5. Ning Shang
  6. Jiahang Xu
  7. Fan Yang
  8. Mao Yang