Efficient Extension of Context Window Sizes in Pretrained LLMs via Position Interpolation
Introduction
Expanding the context window of LLMs, including the widely used LLaMA models, is computationally and logistically challenging, particularly for applications that must process long sequences. Conventional approaches extend the context window by retraining the model, which demands substantial compute. This paper introduces Position Interpolation (PI), a novel approach that extends the context window of RoPE-based pretrained LLMs, such as LLaMA, to up to 32768 tokens with minimal fine-tuning. Remarkably, models extended with PI perform well on tasks demanding long context, including language modeling and document summarization, while largely preserving performance on tasks within the original context limit.
Methodology
Extended Context via Position Interpolation (PI)
The essence of PI lies in down-scaling input position indices to fit within the pre-existing context window limits of a model, thus bypassing the limitations of direct extrapolation methods, which have been shown to result in unstable attention scores. By interpolating the position encodings at neighboring integer positions, PI ensures that extended models can adapt to longer contexts with greater stability. This method retains the original architecture of the models, allowing for most pre-existing optimizations and infrastructure to be reused effectively.
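To make the index rescaling concrete, below is a minimal, self-contained NumPy sketch of how interpolated positions could be fed into a standard RoPE computation. The function names (rope_angles, apply_rope), the adjacent-channel pairing convention, and the head dimension of 128 are illustrative assumptions rather than the paper's reference implementation; the essential step is multiplying the position indices by L / L' before the rotary angles are computed.

import numpy as np

def rope_angles(position_ids, dim, base=10000.0, scale=1.0):
    """Rotary-embedding angles with optional Position Interpolation.

    scale = original_context_len / extended_context_len; a value below 1
    squeezes the new, longer position range back into the range seen
    during pretraining (scale=1.0 recovers plain RoPE).
    """
    # RoPE frequency for each pair of channels: base^(-2i/dim)
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    # Position Interpolation: down-scale the integer position indices
    # so that index L' - 1 maps back inside [0, L).
    scaled_positions = np.asarray(position_ids) * scale
    # One angle per (position, frequency) pair.
    return np.outer(scaled_positions, inv_freq)

def apply_rope(x, angles):
    """Rotate adjacent channel pairs of x (shape [seq_len, dim]) by the angles."""
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin
    rotated[:, 1::2] = x1 * sin + x2 * cos
    return rotated

# Example: extending a 2048-token window to 32768 tokens -> scale = 2048/32768 = 1/16.
L, L_extended, dim = 2048, 32768, 128
positions = np.arange(L_extended)
angles = rope_angles(positions, dim, scale=L / L_extended)
q = np.random.randn(L_extended, dim).astype(np.float32)
q_rotated = apply_rope(q, angles)

Because only the position indices are rescaled, the model's weights, attention layout, and serving code are untouched, which is what allows existing optimizations and infrastructure to be reused.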
Theoretical Underpinnings and Empirical Validation
A theoretical analysis of PI shows that the upper bound on the interpolated attention score is substantially smaller (roughly 600 times smaller in the LLaMA 7B setting) than the corresponding bound for extrapolated attention scores, which supports the stability of the method. Empirically, LLaMA models extended via PI exhibit improved perplexity on long-context tasks, validating the theoretical analysis.
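Stated formally, and following the original PI paper's notation as best it can be reconstructed here, the object the analysis studies is the rescaled position mapping: for an original context window of length L and a target window of length L' > L, the position-encoding function f is queried at down-scaled indices instead of being extrapolated past L,

\[
  f'(\mathbf{x}, m) \;=\; f\!\left(\mathbf{x}, \frac{m L}{L'}\right), \qquad 0 \le m < L',
\]

where x is a token embedding and m its integer position. The bound quoted above compares attention scores built from f' against those built from f evaluated at positions beyond L.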
Experimental Results
Extending the context window of LLaMA models to sizes up to 32768 via PI requires only about 1000 fine-tuning steps on the Pile dataset, a negligible cost compared to pre-training. Across the various extended context window sizes, the resulting models handled tasks requiring long contexts well while largely retaining their quality on tasks designed for the original, shorter context.
In particular, on language modeling the extended models achieved substantial perplexity improvements as the context grew, and on long document summarization they reached competitive performance against established benchmarks. A synthetic passkey retrieval task further showed that models extended via PI adapt rapidly to longer sequences during fine-tuning, suggesting they can effectively use the extended context window; an illustrative version of this task is sketched below.
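The snippet below shows what such a synthetic evaluation can look like: a random passkey is buried inside long filler text and the model is asked to repeat it, so retrieval accuracy directly probes whether distant positions are usable. The filler sentences, prompt wording, and length heuristic are assumed, simplified choices for this sketch and do not reproduce the paper's exact setup.

import random

FILLER = ("The grass is green. The sky is blue. The sun is yellow. "
          "Here we go. There and back again. ")

def make_passkey_prompt(target_words_approx=8000, seed=0):
    """Build a long prompt hiding a 5-digit passkey at a random depth.

    target_words_approx is a rough length knob counted in filler words,
    not tokenizer tokens -- an assumption for this sketch.
    """
    rng = random.Random(seed)
    passkey = rng.randint(10000, 99999)
    n_fillers = max(2, target_words_approx // 20)   # ~20 words per filler block
    insert_at = rng.randint(1, n_fillers - 1)       # random depth for the passkey

    parts = [FILLER] * n_fillers
    parts.insert(insert_at,
                 f"The pass key is {passkey}. Remember it. {passkey} is the pass key. ")
    prompt = (
        "There is important information hidden inside a lot of irrelevant text. "
        "Find it and memorize it.\n\n"
        + "".join(parts)
        + "\nWhat is the pass key? The pass key is"
    )
    return prompt, passkey

prompt, answer = make_passkey_prompt()
print(len(prompt.split()), "words; expected answer:", answer)

Sweeping the length knob and the insertion depth gives a simple grid over which retrieval accuracy can be measured for the original and extended models.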
Implications and Future Directions
The introduction of Position Interpolation as a method to extend context windows of LLMs paves the way for broader application of these models without extensive retraining or architectural modifications. The findings also highlight the inherent flexibility of Transformer models in adapting to extended sequences, potentially widening the range of LLM applications to processing long documents and conducting extended conversations. Looking forward, applying PI to models with other types of positional encodings could broaden its utility across LLM architectures, moving it toward a general-purpose tool for context window extension. The work also opens avenues for exploring further ways to reduce interpolation/extrapolation bounds, enriching the existing toolkit for enhancing the capacity of LLMs.
Conclusion
Position Interpolation presents an efficient and theoretically grounded method for extending the context window sizes of pretrained LLMs with minimal fine-tuning. Its practicality, coupled with the ability to reuse existing infrastructure, positions PI as an attractive solution for leveraging the capabilities of LLMs across a wider range of applications that require processing long sequences of text.