
Length Generalization of Causal Transformers without Position Encoding (2404.12224v2)

Published 18 Apr 2024 in cs.CL

Abstract: Generalizing to longer sentences is important for recent Transformer-based LLMs. Besides algorithms that manipulate explicit position features, the success of Transformers without position encodings (NoPE) provides a new way to overcome the challenge. In this paper, we study the length generalization property of NoPE. We find that although NoPE can extend to longer sequences than the commonly used explicit position encodings, it still has a limited context length. We identify a connection between the failure of NoPE's generalization and the distraction of its attention distributions. We propose a parameter-efficient tuning method that searches for each attention head's best temperature hyper-parameter, which substantially expands NoPE's context size. Experiments on long-sequence language modeling, the synthetic passkey retrieval task, and real-world long-context tasks show that NoPE achieves competitive performance with state-of-the-art length generalization algorithms. The source code is publicly accessible.
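
The following is a minimal sketch, not the authors' released code, of the idea the abstract describes: causal self-attention with no position encoding (NoPE) whose attention logits are rescaled by a learnable per-head temperature, so that only these few temperature parameters need to be tuned. Layer sizes, the exponential parameterization of the temperature, and all names are illustrative assumptions.

```python
# Sketch only: NoPE causal attention with a per-head temperature (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class NoPETemperatureAttention(nn.Module):
    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads = n_heads
        self.d_head = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model, bias=False)
        self.out = nn.Linear(d_model, d_model, bias=False)
        # One temperature per head; in a parameter-efficient setup only these
        # scalars would be searched/tuned while the rest of the model is frozen.
        self.log_temperature = nn.Parameter(torch.zeros(n_heads))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        # Reshape to (batch, heads, time, d_head); note that no rotary or
        # absolute position information is ever added (NoPE).
        q = q.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        k = k.view(b, t, self.n_heads, self.d_head).transpose(1, 2)
        v = v.view(b, t, self.n_heads, self.d_head).transpose(1, 2)

        logits = (q @ k.transpose(-2, -1)) * self.d_head ** -0.5
        # Per-head temperature sharpens or flattens each head's attention
        # distribution, counteracting attention "distraction" on long inputs.
        temperature = torch.exp(self.log_temperature).view(1, self.n_heads, 1, 1)
        logits = logits * temperature

        causal_mask = torch.triu(
            torch.ones(t, t, dtype=torch.bool, device=x.device), 1
        )
        logits = logits.masked_fill(causal_mask, float("-inf"))
        attn = F.softmax(logits, dim=-1)
        return self.out((attn @ v).transpose(1, 2).reshape(b, t, -1))
```

As a usage note, one could wrap such a module in place of a standard attention layer and optimize only `log_temperature` on a small amount of long-context data; how the paper actually performs the temperature search is described in the full text, not here.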

Authors (8)
  1. Jie Wang (480 papers)
  2. Tao Ji (28 papers)
  3. Yuanbin Wu (47 papers)
  4. Hang Yan (86 papers)
  5. Tao Gui (127 papers)
  6. Qi Zhang (784 papers)
  7. Xuanjing Huang (287 papers)
  8. Xiaoling Wang (42 papers)
Citations (8)