Resonance RoPE: Improving Context Length Generalization of Large Language Models (2403.00071v2)

Published 29 Feb 2024 in cs.CL and cs.AI

Abstract: This paper addresses the challenge of train-short-test-long (TSTL) scenarios in LLMs equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences struggle with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving model performance without additional online computational cost. Furthermore, we present PosGen, a new synthetic benchmark designed for fine-grained behavior analysis in TSTL scenarios, which isolates the constantly increasing difficulty of token generation on long contexts from the challenge of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.
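To make the "refined interpolation of RoPE features" concrete, the sketch below illustrates one way such a refinement can work: snapping each RoPE feature's wavelength to an integer so its rotational phase repeats exactly on integer token positions, letting positions beyond the training length reuse phases already seen during training. This is a minimal, hedged NumPy sketch of that idea, not the authors' reference implementation; the function names and rounding details are illustrative assumptions.

```python
import numpy as np

def rope_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE angular frequencies theta_i = base^(-2i/dim)."""
    return base ** (-2.0 * np.arange(dim // 2) / dim)

def resonance_frequencies(dim: int, base: float = 10000.0) -> np.ndarray:
    """Round each feature's wavelength 2*pi/theta_i to the nearest integer
    (assumed 'resonance' step), so each feature's phase is periodic over an
    integer number of token positions."""
    theta = rope_frequencies(dim, base)
    wavelengths = 2.0 * np.pi / theta
    rounded = np.maximum(np.round(wavelengths), 1.0)  # keep wavelengths >= 1 token
    return 2.0 * np.pi / rounded

def rotary_angles(positions: np.ndarray, theta: np.ndarray) -> np.ndarray:
    """Angles used to rotate each 2-D feature pair at each position."""
    return np.outer(positions, theta)  # shape: (num_positions, dim // 2)

# Illustrative check at a hypothetical train length of 4096:
dim, train_len = 128, 4096
theta_res = resonance_frequencies(dim)
angles = rotary_angles(np.arange(2 * train_len), theta_res)
# For a feature with integer wavelength w <= train_len, the phase at an OOD
# position p equals the phase at p mod w, a position seen during training.
```

Under this sketch, any feature whose rounded wavelength fits inside the training window produces only phases the model has already observed, which is the intuition behind narrowing the TSTL generalization gap without extra inference-time cost.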

References (34)
  1. L-eval: Instituting standardized evaluation for long context language models. CoRR, abs/2307.11088.
  2. Zhangir Azerbayev. 2022. zhangir-azerbayev/proof-pile.
  3. LongBench: A bilingual, multitask benchmark for long context understanding. CoRR, abs/2308.14508.
  4. bloc97. 2023. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  5. Extending context window of large language models via positional interpolation. CoRR, abs/2306.15595.
  6. Tri Dao. 2023. FlashAttention-2: Faster attention with better parallelism and work partitioning. CoRR, abs/2307.08691.
  7. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436. Association for Computational Linguistics.
  8. Mistral 7B. CoRR, abs/2310.06825.
  9. Mixtral of experts. CoRR, abs/2401.04088.
  10. The impact of positional encoding on length generalization in transformers. CoRR, abs/2305.19466.
  11. Transformers learn shortcuts to automata. In The Eleventh International Conference on Learning Representations. OpenReview.net.
  12. Scaling laws of RoPE-based extrapolation. CoRR, abs/2310.05209.
  13. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In 7th International Conference on Learning Representations. OpenReview.net.
  14. Progress measures for grokking via mechanistic interpretability. In The Eleventh International Conference on Learning Representations. OpenReview.net.
  15. OpenAI. 2023. GPT-4 technical report. CoRR, abs/2303.08774.
  16. YaRN: Efficient context window extension of large language models. CoRR, abs/2309.00071.
  17. Train short, test long: Attention with linear biases enables input length extrapolation. In The Tenth International Conference on Learning Representations. OpenReview.net.
  18. Compressive transformers for long-range sequence modelling. In 8th International Conference on Learning Representations. OpenReview.net.
  19. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67.
  20. Zero-offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference, pages 551–564. USENIX Association.
  21. Code llama: Open foundation models for code. CoRR, abs/2308.12950.
  22. Randomized positional encodings boost length generalization of transformers. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1889–1903, Toronto, Canada. Association for Computational Linguistics.
  23. ZeroSCROLLS: A zero-shot benchmark for long text understanding. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 7977–7989, Singapore. Association for Computational Linguistics.
  24. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063.
  25. Long Range Arena: A benchmark for efficient transformers. In 9th International Conference on Learning Representations. OpenReview.net.
  26. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971.
  27. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288.
  28. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, pages 5998–6008.
  29. Chain-of-thought prompting elicits reasoning in large language models. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022.
  30. Memorizing transformers. In The Tenth International Conference on Learning Representations. OpenReview.net.
  31. Effective long-context scaling of foundation models. CoRR, abs/2309.16039.
  32. Length extrapolation of transformers: A survey from the perspective of position encoding. CoRR, abs/2312.17044.
  33. Least-to-most prompting enables complex reasoning in large language models. In The Eleventh International Conference on Learning Representations. OpenReview.net.
  34. PoSE: Efficient context window extension of LLMs via positional skip-wise training. CoRR, abs/2309.10400.
Authors (5)
  1. Suyuchen Wang (16 papers)
  2. Ivan Kobyzev (23 papers)
  3. Peng Lu (86 papers)
  4. Mehdi Rezagholizadeh (78 papers)
  5. Bang Liu (93 papers)
Citations (8)