Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding (2312.17044v5)

Published 28 Dec 2023 in cs.CL

Abstract: Built upon the Transformer, LLMs have captured worldwide attention due to their remarkable abilities. Nevertheless, all Transformer-based models including LLMs suffer from a preset length limit and can hardly generalize from short training sequences to longer inference ones, namely, they cannot perform length extrapolation to handle long sequences, which severely hinders their application in scenarios demanding long input sequences such as legal or scientific documents. Thus, numerous methods have emerged to enhance the length extrapolation of Transformers. Despite the great research efforts, a systematic survey is still lacking. To fill this gap, we delve into these advances in a unified notation from the perspective of positional encoding (PE), as it has been considered the primary factor in length extrapolation. Specifically, we begin with extrapolatable PEs that have dominated this research field. Then, we dive into extrapolation methods based on them, covering position interpolation and randomized position methods. Finally, several challenges and future directions in this area are highlighted. Through this survey, we aim to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.

Authors (9)
  1. Liang Zhao (353 papers)
  2. Xiaocheng Feng (54 papers)
  3. Xiachong Feng (28 papers)
  4. Ting Liu (329 papers)
  5. Bing Qin (186 papers)
  6. Dongliang Xu (19 papers)
  7. Qing Yang (138 papers)
  8. Hongtao Liu (44 papers)
  9. Weihong Zhong (15 papers)