
YaRN: Efficient Context Window Extension of Large Language Models (2309.00071v2)

Published 31 Aug 2023 in cs.CL, cs.AI, and cs.LG

Abstract: Rotary Position Embeddings (RoPE) have been shown to effectively encode positional information in transformer-based LLMs. However, these models fail to generalize past the sequence length they were trained on. We present YaRN (Yet another RoPE extensioN method), a compute-efficient method to extend the context window of such models, requiring 10x fewer tokens and 2.5x fewer training steps than previous methods. Using YaRN, we show that LLaMA models can effectively utilize and extrapolate to context lengths much longer than their original pre-training would allow, while also surpassing the previous state-of-the-art at context window extension. In addition, we demonstrate that YaRN exhibits the capability to extrapolate beyond the limited context of a fine-tuning dataset. The models fine-tuned using YaRN have been made available and reproduced online up to 128k context length at https://github.com/jquesnelle/yarn

YaRN: An Efficient Method for Extending the Context Window of RoPE-Enabled LLMs

Introduction to RoPE and Its Limitations

LLMs have benefited significantly from Rotary Position Embeddings (RoPE), which encode positional information as rotations applied to the query and key vectors within transformer attention. While RoPE has been pivotal in enhancing the capabilities of these models, a notable limitation is that RoPE-based models fail to generalize to sequence lengths beyond those seen during training. Addressing this, we examine YaRN (Yet another RoPE extensioN method), a compute-efficient approach to extending the contextual reach of RoPE-enabled LLMs without demanding extensive training resources.
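To make the limitation concrete, recall that RoPE rotates each two-dimensional slice of the query and key vectors by an angle proportional to the token position, so relative offsets appear as relative rotations inside the attention dot product. The snippet below is a minimal NumPy sketch of that rotation; the function name and shapes are our own illustration, not code from the paper.

```python
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply a RoPE-style rotation to x of shape (seq_len, dim).

    Each pair of dimensions (2i, 2i+1) is rotated by the angle
    position * base**(-2i / dim), so a relative offset between two
    tokens shows up as a relative rotation in their dot product.
    """
    _, dim = x.shape
    positions = np.asarray(positions, dtype=np.float64)
    freqs = base ** (-np.arange(0, dim, 2) / dim)      # theta_i, shape (dim/2,)
    angles = positions[:, None] * freqs[None, :]       # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x, dtype=np.float64)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out
```

Because the rotation angles grow linearly with position, positions beyond the training range produce angle combinations the attention layers have never encountered, which is the out-of-distribution behavior the extension methods discussed below try to avoid.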

Background on Positional Encoding and Existing Solutions

Positional encoding has evolved through various iterations, moving from absolute to relative schemes, with RoPE emerging as a popular choice due to its ability to manage relative distances effectively. Despite its advantages, extending the context window beyond the pre-trained threshold remained a challenge. Prior efforts, such as Position Interpolation (PI) and "NTK-aware" interpolation, made strides towards solving this issue but were either limited in scalability or demanded substantial fine-tuning resources.
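For concreteness, the two prior schemes differ mainly in where the adjustment is applied: PI uniformly compresses positions by the extension factor s = L_new / L_orig, while "NTK-aware" interpolation keeps positions untouched and instead enlarges the RoPE base so that low-frequency dimensions are stretched more than high-frequency ones. The helpers below are a hedged sketch of those standard formulations (names are ours, not from any particular implementation).

```python
def pi_positions(positions, scale):
    """Position Interpolation: compress every position by 1/scale so the
    extended range maps back inside the originally trained range."""
    return [p / scale for p in positions]

def ntk_aware_base(base, scale, dim):
    """'NTK-aware' interpolation: leave positions alone but enlarge the
    RoPE base (base' = base * scale**(dim / (dim - 2))), which spreads
    the interpolation unevenly across frequencies."""
    return base * scale ** (dim / (dim - 2))
```

PI preserves relative ordering everywhere but blurs the high-frequency components that encode local detail, while the NTK-aware variant preserves local detail yet leaves some dimensions extrapolating slightly out of range, which is part of why both either require fine-tuning or degrade at larger scale factors.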

Within this landscape, YaRN emerges as a comprehensive solution that builds on RoPE and these earlier interpolation efforts. By combining an "NTK-by-parts" interpolation scheme with Dynamic Scaling, YaRN substantially reduces the fine-tuning data and training steps needed to extend the context window effectively.

Unveiling YaRN: Methodology and Advantages

YaRN introduces a multi-faceted approach to the challenge of context window extension. At its core, YaRN distinguishes itself by employing a targeted interpolation method that respects the frequency structure of RoPE, unlike the "blind", uniform interpolation of its predecessors. Treating frequencies separately means low-frequency dimensions are interpolated while high-frequency dimensions are largely left alone, preserving the model's ability to resolve both nearby and distant relationships in the text.
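A minimal sketch of this per-frequency treatment, in the spirit of the paper's "NTK-by-parts" interpolation: dimensions whose wavelength fits many times inside the original context window are left untouched, dimensions whose wavelength exceeds that window are fully interpolated, and a linear ramp blends the region in between. The thresholds alpha and beta follow the hyperparameters described for LLaMA-family models, but the code is our own illustration rather than the authors' implementation.

```python
import numpy as np

def ntk_by_parts_freqs(dim, scale, orig_ctx, base=10000.0,
                       alpha=1.0, beta=32.0):
    """Blend original and interpolated RoPE frequencies per dimension.

    High-frequency dimensions (many full rotations within the original
    context) keep their frequency to preserve local detail; low-frequency
    dimensions are divided by `scale`; a linear ramp covers the middle.
    """
    freqs = base ** (-np.arange(0, dim, 2) / dim)    # theta_i
    wavelengths = 2 * np.pi / freqs
    rotations = orig_ctx / wavelengths               # r_i: rotations per context
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    # ramp = 1 -> keep original frequency; ramp = 0 -> full interpolation
    return freqs * ramp + (freqs / scale) * (1.0 - ramp)
```

On top of this, the full method also rescales the attention logits by a small temperature that grows with the extension ratio, which the paper reports gives a further perplexity improvement.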

Additionally, the integration of Dynamic Scaling allows YaRN to adjust the scale factor on the fly at inference time, based on the current sequence length, promoting robustness across varying sequence lengths. This enables models to maintain performance without the abrupt degradation previous methods exhibit once the pre-trained context limit is reached.
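A sketch of that idea, assuming the fixed-scale machinery above: rather than committing to one extension factor, the factor is recomputed from the sequence length actually seen at inference time.

```python
def dynamic_scale(current_len, orig_ctx):
    """Dynamic Scaling: sequences within the original window keep
    scale = 1, and the scale grows only once the sequence actually
    exceeds the trained context length."""
    return max(1.0, current_len / orig_ctx)

# Example: a model trained on 4096 tokens that receives an 11,000-token
# prompt would use scale = 11000 / 4096 ≈ 2.69, and the value is
# recomputed as generation pushes the sequence longer.
```

This avoids penalizing short sequences with an unnecessarily large scale while still covering long ones.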

Empirical Validation and Practical Implications

Through extensive experimentation with LLaMA-family models, YaRN extends context windows well beyond the original pre-training length, and does so with remarkably lower training requirements. Specifically, YaRN reaches state-of-the-art performance in context window extension with 10x fewer tokens and 2.5x fewer training steps than its closest competitors, with released fine-tuned models reaching a 128k context length. These improvements signify a breakthrough not only in training efficiency but also in the applicability of LLMs to tasks involving significantly longer sequences.

Beyond its numerical successes, YaRN's methodological contributions suggest a promising avenue for future research into embedding interpolation and model generalization over extended contexts. It opens the door to more computationally efficient and scalable solutions for managing large sequence lengths - a critical frontier for advancement in natural language processing and understanding.

Looking Ahead: YaRN and the Future of LLMs

YaRN’s introduction represents a pivotal advancement in the manipulation of context windows for LLMs. Its ability to efficiently extend context sizes poses significant implications for both the theoretical understanding and practical utilization of transformer architectures. This approach not only enhances current models' capabilities but also lays foundational knowledge that could inspire future innovations in LLM development. As we continue to explore the boundaries and applications of LLMs, techniques like YaRN will be instrumental in shaping the next generation of AI-driven language understanding and generation.

Authors (4)
  1. Bowen Peng (10 papers)
  2. Jeffrey Quesnelle (4 papers)
  3. Honglu Fan (17 papers)
  4. Enrico Shippole (6 papers)
Citations (165)