Scaling Laws of RoPE-based Extrapolation (2310.05209v2)

Published 8 Oct 2023 in cs.CL and cs.AI

Abstract: The extrapolation capability of LLMs based on Rotary Position Embedding is currently a topic of considerable interest. The mainstream approach to addressing extrapolation with LLMs involves modifying RoPE by replacing 10000, the rotary base of $\theta_n = {10000}^{-2n/d}$ in the original RoPE, with a larger value and providing longer fine-tuning text. In this work, we first observe that fine-tuning a RoPE-based LLM with either a smaller or larger base in pre-training context length could significantly enhance its extrapolation performance. After that, we propose Scaling Laws of RoPE-based Extrapolation, a unified framework from the periodic perspective, to describe the relationship between the extrapolation performance and base value as well as tuning context length. In this process, we also explain the origin of the RoPE-based extrapolation issue by the critical dimension for extrapolation. Besides these observations and analyses, we achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
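As a rough illustration of the base-scaling idea in the abstract, the sketch below computes the rotary angles $\theta_n = \text{base}^{-2n/d}$ and their rotation periods for a configurable base, then counts how many dimension pairs complete a full period within the tuning length; this loosely mirrors the paper's notion of a critical dimension. The head dimension of 128, the 16K tuning length, and the specific base values are illustrative assumptions, not results taken from the paper.

```python
import math

def rope_thetas(dim: int, base: float = 10000.0):
    """Per-pair rotary angles theta_n = base^(-2n/d) for n = 0 .. d/2 - 1."""
    return [base ** (-2 * n / dim) for n in range(dim // 2)]

def periods(dim: int, base: float = 10000.0):
    """Rotation period (in tokens) of each dimension pair: 2*pi / theta_n."""
    return [2 * math.pi / theta for theta in rope_thetas(dim, base)]

if __name__ == "__main__":
    d = 128          # head dimension of LLaMA2 7B/13B
    train_len = 16384  # assumed fine-tuning context length (16K)
    for base in (500.0, 10000.0, 1_000_000.0):  # smaller / original / larger base
        p = periods(d, base)
        # Pairs whose period fits inside the tuning length see a full rotation
        # during fine-tuning; the remainder only ever see a partial rotation,
        # which is where extrapolation trouble is said to originate.
        fully_trained = sum(1 for period in p if period <= train_len)
        print(f"base={base:>9.0f}  longest period={p[-1]:>12.0f} tokens  "
              f"pairs with period <= {train_len}: {fully_trained}/{len(p)}")
```

Running the script shows that enlarging the base stretches the periods of the high-index dimensions well beyond the tuning length, while shrinking the base pulls more of them inside it, which is the trade-off the proposed scaling laws are meant to describe.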

Authors (6)
  1. Xiaoran Liu (56 papers)
  2. Hang Yan (86 papers)
  3. Shuo Zhang (256 papers)
  4. Chenxin An (17 papers)
  5. Xipeng Qiu (257 papers)
  6. Dahua Lin (336 papers)
Citations (63)