Extending LLMs' Context Window with 100 Samples

Published 13 Jan 2024 in cs.CL (arXiv:2401.07004v1)

Abstract: LLMs are known to have limited extrapolation ability beyond their pre-trained context window, constraining their application in downstream tasks with lengthy inputs. Recent studies have sought to extend LLMs' context window by modifying rotary position embedding (RoPE), a popular position encoding method adopted by well-known LLMs such as LLaMA, PaLM, and GPT-NeoX. However, prior works like Position Interpolation (PI) and YaRN are resource-intensive and lack comparative experiments to assess their applicability. In this work, we identify the inherent need for LLMs' attention entropy (i.e. the information entropy of attention scores) to maintain stability and introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window. We validate the superiority of our method in both fine-tuning performance and robustness across different context window sizes on various context-demanding tasks. Notably, our method extends the context window of LLaMA-2-7B-Chat to 16,384 with only 100 samples and 6 training steps, showcasing extraordinary efficiency. Finally, we also explore how data compositions and training curricula affect context window extension for specific downstream tasks, suggesting fine-tuning LLMs with lengthy conversations as a good starting point. We release our code and SFT data at https://github.com/GAIR-NLP/Entropy-ABF.

References (57)
  1. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. arXiv preprint arXiv:2305.13245.
  2. Etc: Encoding long and structured inputs in transformers. arXiv preprint arXiv:2004.08483.
  3. Palm 2 technical report. arXiv preprint arXiv:2305.10403.
  4. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508.
  5. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150.
  6. Gpt-neox-20b: An open-source autoregressive language model. arXiv preprint arXiv:2204.06745.
  7. bloc97. 2023a. Add NTK-Aware interpolation "by parts" correction.
  8. bloc97. 2023b. NTK-Aware Scaled RoPE allows LLaMA models to have extended (8k+) context size without any fine-tuning and minimal perplexity degradation.
  9. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  10. Scaling transformer to 1m tokens and beyond with rmt. arXiv preprint arXiv:2304.11062.
  11. Extending context window of large language models via positional interpolation. arXiv preprint arXiv:2306.15595.
  12. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174.
  13. David Chiang and Peter Cholak. 2022. Overcoming a theoretical limitation of self-attention. arXiv preprint arXiv:2202.12172.
  14. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  15. Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509.
  16. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
  17. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  18. Tri Dao. 2023. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
  19. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359.
  20. Longnet: Scaling transformers to 1,000,000,000 tokens. arXiv preprint arXiv:2307.02486.
  21. How abilities in large language models are affected by supervised fine-tuning data composition. arXiv preprint arXiv:2310.05492.
  22. The pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027.
  23. Convolutional sequence to sequence learning. In International conference on machine learning, pages 1243–1252. PMLR.
  24. Retrieval augmented language model pre-training. In International conference on machine learning, pages 3929–3938. PMLR.
  25. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112.
  26. kaiokendev. 2023. Things I’m learning while training superhot.
  27. Transformers are rnns: Fast autoregressive transformers with linear attention. In International conference on machine learning, pages 5156–5165. PMLR.
  28. The impact of positional encoding on length generalization in transformers. arXiv preprint arXiv:2305.19466.
  29. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451.
  30. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474.
  31. How long can open-source llms truly promise on context length?
  32. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091.
  33. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101.
  34. Amirkeivan Mohtashami and Martin Jaggi. 2023. Landmark attention: Random-access infinite context length for transformers. arXiv preprint arXiv:2305.16300.
  35. Giraffe: Adventures in expanding context lengths in llms. arXiv preprint arXiv:2308.10882.
  36. Yarn: Efficient context window extension of large language models. arXiv preprint arXiv:2309.00071.
  37. Train short, test long: Attention with linear biases enables input length extrapolation. arXiv preprint arXiv:2108.12409.
  38. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  39. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE.
  40. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 3505–3506.
  41. ZeRO-Offload: Democratizing billion-scale model training. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 551–564.
  42. Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.
  43. Noam Shazeer. 2019. Fast transformer decoding: One write-head is all you need. arXiv preprint arXiv:1911.02150.
  44. Jianlin Su. 2023. Rectified rotary position embeddings. https://github.com/bojone/rerope.
  45. Roformer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864.
  46. Do long-range language models actually use long-range context? arXiv: Computation and Language.
  47. A length-extrapolatable transformer. arXiv preprint arXiv:2212.10554.
  48. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  49. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  50. Focused transformer: Contrastive training for context scaling. arXiv preprint arXiv:2307.03170.
  51. Attention is all you need. Advances in neural information processing systems, 30.
  52. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768.
  53. Memorizing transformers. arXiv preprint arXiv:2203.08913.
  54. Effective long-context scaling of foundation models. arXiv preprint arXiv:2309.16039.
  55. Bp-transformer: Modelling long-range context via binary partitioning. arXiv: Computation and Language.
  56. Big bird: Transformers for longer sequences. Advances in neural information processing systems, 33:17283–17297.
  57. Judging llm-as-a-judge with mt-bench and chatbot arena.

Summary

  • The paper proposes an innovative entropy-aware ABF method that extends the LLM context window to 16,384 tokens using only 100 samples.
  • It employs dynamic attention scaling and layer-specific adjustments to stabilize attention entropy across extended inputs.
  • Experiments on LLaMA-2-7B-Chat across 12 long-context tasks demonstrate superior efficiency and minimal resource requirements.

Extending LLMs' Context Window with Limited Data

Introduction

The paper addresses a critical limitation in LLMs: their restricted context window, which hampers performance in tasks needing extended input sequences. The authors propose an innovative extension to Rotary Position Embedding (RoPE), enhancing LLMs' ability to work with larger context windows. This study demonstrates a novel method that efficiently enlarges the context window, using minimal training data and computational resources while maintaining robust performance.

Methodology

The core contribution of the paper is the introduction of "entropy-aware ABF," a technique that combines adjusted base frequency (ABF) with a dynamic attention scalar. This approach aims to stabilize the information entropy of attention scores, which is crucial for maintaining model focus over longer inputs. The technique involves modifying RoPE's base frequency and scaling attention logits dynamically based on input position and layer-specific characteristics. This method is validated through fine-tuning and robustness tests across diverse context sizes and tasks.
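
To make the stability criterion concrete, the quantity the method seeks to keep steady is the Shannon entropy of each query's attention distribution. The snippet below is a minimal illustration of how that entropy can be computed from attention logits; the tensor shapes and function name are assumptions of this sketch, not code from the paper's release.

```python
import torch

def attention_entropy(attn_logits: torch.Tensor) -> torch.Tensor:
    """Shannon entropy of each query's attention distribution.

    attn_logits: (..., q_len, k_len) pre-softmax attention scores.
    Returns a tensor of shape (..., q_len) with one entropy value per query;
    as the context grows, an unmodified model tends to see this value drift,
    which is the instability the entropy-aware scaling is meant to counter.
    """
    probs = torch.softmax(attn_logits, dim=-1)
    return -(probs * torch.log(probs.clamp_min(1e-12))).sum(dim=-1)
```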

Implementation and Experiments

The implementation of the proposed method involves:

  1. Dynamic Attention Scaling: Unlike the fixed scaling factors used in previous methods, the technique introduces a position-dependent attention scalar that grows with the number of context tokens, so attention weights adapt as inputs lengthen.
  2. Layer-Dependent Adjustment: The scaling factor is not applied uniformly across all layers; it is applied only to the layers where it helps keep attention entropy stable, preserving the model's inherent attention patterns elsewhere.
  3. Integration with ABF: Raising RoPE's base frequency to a larger value improves the model's ability to generalize to longer sequences (a sketch combining these components follows this list).
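
The sketch below shows, under stated assumptions, how these ingredients can fit together. The specific constants (an enlarged RoPE base of 500,000, a 4,096-token pre-training window inside the log scaling, two unscaled early layers) and the max(1, log) form of the scalar are illustrative choices for this sketch, not values confirmed by the paper.

```python
import math
import torch

# RoPE inverse frequencies theta_i = base^(-2i/d); ABF simply raises the base
# (500,000 here is an illustrative value) so rotations advance more slowly and
# the model can adapt to positions beyond its original window.
def rope_inv_freq(head_dim: int, base: float = 500_000.0) -> torch.Tensor:
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def scaled_attention_logits(q: torch.Tensor, k: torch.Tensor, layer_idx: int,
                            train_ctx: int = 4096, unscaled_layers: int = 2) -> torch.Tensor:
    """Position-dependent scaling of attention logits, skipped on early layers.

    q, k: (batch, heads, seq_len, head_dim). `train_ctx` is the original
    pre-training window and `unscaled_layers` the number of early layers left
    untouched; both are assumptions of this sketch.
    """
    logits = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    if layer_idx >= unscaled_layers:
        # Scale query position t by max(1, log_{train_ctx}(t)): a no-op inside
        # the original window, a gentle sharpening beyond it, keeping attention
        # entropy roughly stable as the context grows.
        pos = torch.arange(1, logits.size(-2) + 1, dtype=logits.dtype, device=logits.device)
        scale = torch.clamp(torch.log(pos) / math.log(train_ctx), min=1.0)
        logits = logits * scale.view(1, 1, -1, 1)
    return logits
```

In a setup along these lines, the enlarged inverse frequencies would replace the stock RoPE buffer before the brief fine-tuning run, and the scaled logits would be substituted into each attention layer's forward pass.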

The experiments were conducted on LLaMA-2-7B-Chat, extending its context window to 16,384 tokens with only 100 training samples and 6 training steps. The method's efficiency is demonstrated by strong performance on 12 long-context tasks from the LongBench benchmark.

Results

The results show that models using the entropy-aware ABF method achieve higher long-context performance across different amounts of training data (Figure 1), outperforming other extension methods such as PI, YaRN, and the NTK variants in both effectiveness and data efficiency.

Figure 1: Long-Context Performance of RoPE-extending Methods with Different Amounts of Training Data.

Moreover, the research highlights the robustness of the method across varying context window sizes (Figure 2), demonstrating consistent improvements and maintaining performance even when directly applied to extended contexts not seen during training.

Figure 2: Long-Context Performance of RoPE-extending Methods with Different Context Window Sizes.

Practical Implications and Future Work

The study's implications are significant for applications requiring long-context understanding, such as document summarization, code completion, and few-shot learning. By dramatically reducing the training resources needed to extend the context window, this method opens avenues for more practical deployments in resource-constrained environments.

Future research can explore integrating this scaling approach with other attention-efficient architectures and investigate its applicability to more complex tasks involving multi-document processing or long collaborative inputs.

Conclusion

The proposed "entropy-aware ABF" method marks a substantial advancement in addressing the context window limitation of LLMs. By ensuring efficient use of data and minimal resource requirements, this approach not only extends the usability of LLMs but also sets the stage for future innovations in large-scale contextual processing.
