An Analysis of Continuous Length Extrapolation for LLMs
Transformer-based LLMs have set the benchmark for numerous NLP tasks. However, these models are inherently limited by the preset context window of the Transformer architecture, which constrains their performance on tasks demanding long-context dependencies. To address this limitation, the paper presents Continuous Length Extrapolation (CLEX), an approach to extending the context length of LLMs in a computationally efficient manner.
Overview of Existing Methods
The paper identifies two primary families of methods for extending the context length of LLMs: Position Embedding (PE) scaling and length extrapolation. PE scaling methods, such as those built on Rotary Position Embedding (RoPE), extend the context window by manipulating either the position indices or the frequency basis within the PE. Despite their efficacy, these methods typically rely on a fixed scaling factor, which degrades performance on sequences longer than those targeted during training.
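To make the PE-scaling idea concrete, the sketch below computes RoPE angles with a fixed scaling factor that compresses position indices back into the trained range (position-interpolation-style scaling). The function name, the NumPy formulation, and the choice of a single fixed `scale` are illustrative assumptions rather than the exact recipe of any specific method.

```python
# Illustrative sketch: RoPE angles with position-interpolation-style scaling.
# The helper name and the fixed `scale` factor are assumptions for demonstration.
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotary angles for each (position, frequency pair); `scale` > 1
    compresses position indices back into the range seen during training."""
    # Frequency basis: theta_i = base^(-2i/dim) for i = 0 .. dim/2 - 1
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    scaled_pos = np.asarray(positions, dtype=np.float64) / scale
    # Angle (m / scale) * theta_i used to rotate query/key pairs
    return np.outer(scaled_pos, inv_freq)

# Extending a model trained on 4k tokens to 8k by interpolating positions
angles_4k = rope_angles(np.arange(4096), dim=128, scale=1.0)
angles_8k = rope_angles(np.arange(8192), dim=128, scale=2.0)
print(angles_4k.max(), angles_8k.max())  # nearly identical angle ranges
```

Because the factor is fixed (here `scale=2.0`), the model is tuned to one target length; evaluating well beyond that length is exactly where such methods degrade.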
Length extrapolation methods, such as ALiBi, attempt to extend context length by incorporating additional biases into the attention scores. However, these approaches often fall short in practice because of their limited capacity to handle tasks that require long-context dependencies.
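For reference, the sketch below shows the general shape of ALiBi-style biases: a head-specific linear penalty on query-key distance added to the attention logits before the softmax. The helper name is a hypothetical choice for illustration; the geometric slope scheme follows the commonly described recipe.

```python
# Minimal sketch of ALiBi-style additive attention biases (illustrative).
import numpy as np

def alibi_bias(seq_len, num_heads):
    """Return a (num_heads, seq_len, seq_len) additive attention bias."""
    # Head-specific slopes, geometrically spaced: 2^(-8/n), 2^(-16/n), ...
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    pos = np.arange(seq_len)
    distance = pos[None, :] - pos[:, None]   # element [i, j] = j - i
    distance = np.minimum(distance, 0)       # future positions handled by the causal mask
    # Bias grows more negative with distance, at a per-head rate
    return slopes[:, None, None] * distance[None, :, :]

bias = alibi_bias(seq_len=8, num_heads=4)
# Used as: logits = q @ k.T / sqrt(d) + bias[h], then softmax
```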
Continuous Length Extrapolation (CLEX)
CLEX replaces the discrete factor scaling of PE methods with a continuous approach based on ordinary differential equations (ODEs). By framing PE scaling as a continuous dynamical system, CLEX models the transition of the frequency basis as a function of the length scaling factor. This allows LLMs to adapt smoothly to contexts beyond their training sequence length, offering improved performance without increased training or inference latency.
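A deliberately simplified sketch of this continuous view follows, under assumed dynamics: instead of swapping in a fixed rescaled frequency basis, the basis is evolved by integrating an ODE over the length scaling factor t. The dynamics function below is a hypothetical placeholder; in CLEX the dynamics are learned by a neural network (a neural ODE), and the exact parameterization is the one given in the paper.

```python
# Simplified sketch of continuous frequency-basis scaling via an ODE.
# `freq_dynamics` is a made-up placeholder, not the learned CLEX dynamics.
import numpy as np

def initial_freq_basis(dim, base=10000.0):
    # RoPE frequency basis at scaling factor t = 1 (the training length)
    return base ** (-np.arange(0, dim, 2) / dim)

def freq_dynamics(freq, t):
    # Placeholder: shrink frequencies smoothly as the scaling factor grows
    return -freq / t

def evolve_freq_basis(dim, t_target, steps=100):
    """Integrate d(freq)/dt from t = 1 to t = t_target with Euler steps."""
    freq = initial_freq_basis(dim)
    ts = np.linspace(1.0, t_target, steps)
    dt = ts[1] - ts[0]
    for t in ts[:-1]:
        freq = freq + dt * freq_dynamics(freq, t)
    return freq

# Frequency basis adapted continuously for a 4x longer context
freq_4x = evolve_freq_basis(dim=128, t_target=4.0)
```

With the placeholder dynamics above, the integration simply recovers 1/t interpolation of the basis; the idea in CLEX is that learned dynamics generalize such fixed rules and can be queried at arbitrary continuous scaling factors.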
Experimental results support the efficacy of CLEX: the method extends the context window to over four times the training length without degradation in performance. A notable numerical finding is that models trained with CLEX on 4k-length sequences perform robustly at a 16k evaluation length, remaining competitive with state-of-the-art models trained on lengths of up to 32k.
Implications and Future Directions
CLEX advances both the practical and the theoretical understanding of context length extension in LLMs. Practically, it gives existing LLMs an efficient mechanism to enhance their context-handling capabilities, which is necessary for tasks such as summarization, question answering, and few-shot learning on long documents. Theoretically, CLEX illustrates the potential of neural ODEs for modeling continuous dynamical systems within NLP, suggesting promising directions for future exploration.
Despite its promising results, continuous length extrapolation raises new challenges and questions. In particular, the scalability limits with respect to context length and model size remain to be explored, as does the amount of fine-tuning required for already-trained models to benefit from context extension as efficiently as models trained with CLEX from the start.
In conclusion, the paper offers a significant contribution to the domain of NLP with CLEX, shifting the emphasis from fixed-length context expansions to a more adaptable, continuous approach. This work paves the way for the further evolution of LLMs capable of more efficiently processing and understanding extensive textual contexts, expanding their applicability across more complex and demanding tasks.