Why Does the Effective Context Length of LLMs Fall Short? (2410.18745v1)

Published 24 Oct 2024 in cs.CL

Abstract: Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of LLMs. However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed during the pretraining and post-training stages of LLMs, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama 3.1 70B and Qwen2 72B, by over 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.

Summary

  • The paper finds that LLMs' effective context length is limited by the left-skewed frequency distribution of relative positions seen during pretraining and post-training.
  • It introduces STRING, which shifts well-trained position indices over under-trained ones at inference time, improving long-range dependency modeling without extra training.
  • On the RULER and InfiniteBench benchmarks, STRING improves Llama 3.1 70B and Qwen2 72B by over 10 points, with Llama 3.1 70B surpassing GPT-4-128K, Claude 2, and Kimi-chat.

Understanding Context Length Limitations in LLMs

The paper "Why Does the Effective Context Length of LLMs Fall Short?" by Chenxin An et al. investigates the discrepancy between the theoretical and realized context lengths in LLMs. The primary insight is that this shortfall is due to a left-skewed frequency distribution of position indices during training, which underrepresents long-range dependencies.

Core Contributions

  1. Position Frequency Analysis: The authors show that the distribution of relative position frequencies is heavily weighted toward local dependencies, which limits the effective context length of LLMs. The skew is evident in models like Llama 3.1, whose effective context lengths fall below 50% of their training lengths (a small counting sketch after this list makes the skew concrete).
  2. Shifted Rotary Position Embedding (STRING): To address this limitation, the authors introduce STRING, which at inference time shifts well-trained position indices onto the regions normally occupied by infrequent, long-range indices, improving long-context performance without any additional training.
  3. Experimental Validation: On benchmarks such as Needle-in-a-Haystack, RULER, and InfiniteBench, the authors show that STRING significantly improves performance over state-of-the-art baselines, outperforming both open-source and commercial alternatives such as GPT-4 and Claude 2.
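
The left skew in contribution 1 can be made concrete with a simple counting argument: under causal attention over a training window of length L, a relative distance d occurs exactly L - d times, so the largest distances are seen only rarely. The snippet below is a minimal illustration of this effect, not code from the paper; the training length of 8192 is an arbitrary example.

```python
import numpy as np

def relative_position_counts(train_len: int) -> np.ndarray:
    """Under causal attention over a window of `train_len` tokens, the relative
    distance d = query_pos - key_pos (0 <= d < train_len) occurs exactly
    train_len - d times, so counts fall off linearly as d grows."""
    d = np.arange(train_len)
    return train_len - d

counts = relative_position_counts(8192)   # example training length, not from the paper
total = counts.sum()
tenth = 8192 // 10
print("share of pairs in the shortest 10% of distances:", counts[:tenth].sum() / total)   # ~0.19
print("share of pairs in the longest 10% of distances:", counts[-tenth:].sum() / total)   # ~0.01
```

The shortest tenth of distances accounts for roughly 19% of all query-key pairs, while the longest tenth accounts for only about 1%, so the positions needed for long-range retrieval receive very little training signal.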

Key Findings

  • Theoretical vs. Practical Context Utilization: The paper provides empirical evidence that LLMs exploit only a fraction of their trained context length because of the left-skewed distribution of relative positions seen during training, which argues for reconsidering how training data and position schedules are structured.
  • STRING Effectiveness: With a minimal inference-time adjustment, STRING notably enhances long-range dependency modeling, gaining over 10 points on RULER and InfiniteBench relative to the unmodified baselines. This suggests that revisiting position encoding strategies alone can yield significant gains in LLM performance (a sketch of what such an adjustment can look like follows this list).
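
The following sketch shows one way an inference-time position shift of this kind can be expressed on a RoPE-style relative-position matrix: distances beyond a threshold are folded down by a fixed offset so that the rarely trained largest positions are replaced by smaller, frequently trained ones. This is an illustrative reconstruction under assumptions, not the paper's implementation; the `shift` and `window` parameters are hypothetical knobs, and the exact remapping rule STRING uses may differ.

```python
import numpy as np

def shifted_relative_positions(seq_len: int, shift: int, window: int) -> np.ndarray:
    """Causal relative-position matrix d[i, j] = i - j with a STRING-style remap:
    distances of at least (shift + window) become d - shift, so the largest and
    least-trained distances reuse smaller, well-trained position values, and the
    maximum position actually used drops from seq_len - 1 to seq_len - 1 - shift.
    Illustrative only; `shift` and `window` are assumed knobs, not the paper's recipe."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    d = i - j                                        # standard relative distance
    d = np.where(d >= shift + window, d - shift, d)  # fold distant positions down
    return np.tril(d)                                # keep only the causal half

# Example: 16 positions; distances 0..9 are kept exact, while the rarely
# trained distances 10..15 are folded down onto the well-trained values 2..7.
print(shifted_relative_positions(seq_len=16, shift=8, window=2))
```

Because the folded distances index rotary angles the model saw frequently during training, no parameter update is required: the change is purely in which positions are queried at inference time.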

Implications and Future Directions

The work has both theoretical and practical implications. Theoretically, it challenges existing assumptions about position encoding and context utilization. Practically, it offers a scalable, training-free way to improve existing models without modifying their weights. Future research might explore reshaping the distribution of relative positions during training or integrating STRING-like methods into other architectures and tasks.

The paper opens avenues for refining LLM design and training, potentially leading to more robust models capable of fully utilizing their training context. As researchers explore these directions, the insights from this paper affirm the importance of considering position encoding mechanisms in advancing the capabilities of LLMs.
