LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens
Abstract: A large context window is a desirable feature in LLMs. However, due to high fine-tuning costs, the scarcity of long texts, and the catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE, which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with only up to 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformity in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short-context-window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
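The non-uniform positional interpolation the abstract describes can be sketched as rescaling RoPE rotation angles with a separate factor per frequency dimension, rather than one uniform factor. The sketch below is a minimal illustration, not the paper's implementation: the `rescale` factors and the `keep_first` cutoff (initial positions left uninterpolated) stand in for the values LongRoPE finds via its evolutionary search, and uniform positional interpolation falls out as the special case where every factor equals the extension ratio.

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, rescale=None, keep_first=0):
    """Rotation angles for RoPE with per-dimension rescale factors.

    `rescale` holds one factor per frequency pair and `keep_first` is the
    number of leading positions left uninterpolated; both are illustrative
    stand-ins for the searched LongRoPE parameters. With rescale=None this
    reduces to the original (unextended) RoPE angles.
    """
    # theta_i = base^(-2i/d), one frequency per dimension pair
    freqs = base ** (-np.arange(0, dim, 2) / dim)
    if rescale is None:
        rescale = np.ones(dim // 2)
    pos = np.asarray(positions, dtype=float)[:, None]
    # keep the first `keep_first` positions as-is, interpolate the rest
    scaled = np.where(pos < keep_first,
                      pos,
                      keep_first + (pos - keep_first) / np.asarray(rescale))
    return scaled * freqs  # shape: (num_positions, dim // 2)
```

For example, `rope_angles(range(4096), dim, rescale=[8.0] * (dim // 2))` squeezes 4096 positions into the angle range of the original 512, while a non-uniform `rescale` can interpolate high-frequency (low-index) dimensions less aggressively than low-frequency ones.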