Why Does the Effective Context Length of LLMs Fall Short? (2410.18745v1)

Published 24 Oct 2024 in cs.CL

Abstract: Advancements in distributed training and efficient attention mechanisms have significantly expanded the context window sizes of LLMs. However, recent work reveals that the effective context lengths of open-source LLMs often fall short, typically not exceeding half of their training lengths. In this work, we attribute this limitation to the left-skewed frequency distribution of relative positions formed during the pretraining and post-training stages of LLMs, which impedes their ability to effectively gather distant information. To address this challenge, we introduce ShifTed Rotary position embeddING (STRING). STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths. Experimental results show that without additional training, STRING dramatically improves the performance of the latest large-scale models, such as Llama 3.1 70B and Qwen2 72B, by over 10 points on the popular long-context benchmarks RULER and InfiniteBench, establishing new state-of-the-art results for open-source LLMs. Compared to commercial models, Llama 3.1 70B with STRING even achieves better performance than GPT-4-128K and clearly surpasses Claude 2 and Kimi-chat.

Summary

  • The paper finds that LLMs' effective context length is limited by the left-skewed frequency distribution of relative positions seen during pretraining and post-training.
  • It introduces STRING, which shifts well-trained position indices over under-trained ones at inference time, improving long-range dependency modeling without extra training.
  • On the RULER and InfiniteBench benchmarks, STRING improves Llama 3.1 70B and Qwen2 72B by over 10 points, with Llama 3.1 70B surpassing GPT-4-128K, Claude 2, and Kimi-chat.

Understanding Context Length Limitations in LLMs

The paper "Why Does the Effective Context Length of LLMs Fall Short?" by Chenxin An et al. investigates the discrepancy between the theoretical and realized context lengths in LLMs. The primary insight is that this shortfall is due to a left-skewed frequency distribution of position indices during training, which underrepresents long-range dependencies.

Core Contributions

  1. Position Frequency Analysis: The authors show that the distribution of relative position frequencies is heavily weighted toward local dependencies, which limits the effective context length of LLMs. The skew is evident in models like Llama 3.1, whose effective context lengths fall below 50% of their training lengths (a small counting sketch after this list makes the skew concrete).
  2. Shifted Rotary Position Embedding (STRING): To address this limitation, the authors introduce STRING, which at inference time shifts well-trained position indices onto the regions normally occupied by infrequent, long-range indices, improving long-context performance without any additional training.
  3. Experimental Validation: On benchmarks such as Needle-in-a-Haystack, RULER, and InfiniteBench, the authors show that STRING significantly improves performance over state-of-the-art baselines, outperforming both open-source and commercial alternatives such as GPT-4 and Claude 2.
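
The left skew in contribution 1 can be made concrete with a simple counting argument: under causal attention over a training window of length L, a relative distance d occurs exactly L - d times, so the largest distances are seen only rarely. The snippet below is a minimal illustration of this effect, not code from the paper; the training length of 8192 is an arbitrary example.

```python
import numpy as np

def relative_position_counts(train_len: int) -> np.ndarray:
    """Under causal attention over a window of `train_len` tokens, the relative
    distance d = query_pos - key_pos (0 <= d < train_len) occurs exactly
    train_len - d times, so counts fall off linearly as d grows."""
    d = np.arange(train_len)
    return train_len - d

counts = relative_position_counts(8192)   # example training length, not from the paper
total = counts.sum()
tenth = 8192 // 10
print("share of pairs in the shortest 10% of distances:", counts[:tenth].sum() / total)   # ~0.19
print("share of pairs in the longest 10% of distances:", counts[-tenth:].sum() / total)   # ~0.01
```

The shortest tenth of distances accounts for roughly 19% of all query-key pairs, while the longest tenth accounts for only about 1%, so the positions needed for long-range retrieval receive very little training signal.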

Key Findings

  • Theoretical vs. Practical Context Utilization: The paper provides empirical evidence that LLMs exploit only a fraction of their trained context length because of the left-skewed distribution of relative positions seen during training, which argues for reconsidering how training data and position schedules are structured.
  • STRING Effectiveness: With a minimal inference-time adjustment, STRING notably enhances long-range dependency modeling, gaining over 10 points on RULER and InfiniteBench relative to the unmodified baselines. This suggests that revisiting position encoding strategies alone can yield significant gains in LLM performance (a sketch of what such an adjustment can look like follows this list).
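
The following sketch shows one way an inference-time position shift of this kind can be expressed on a RoPE-style relative-position matrix: distances beyond a threshold are folded down by a fixed offset so that the rarely trained largest positions are replaced by smaller, frequently trained ones. This is an illustrative reconstruction under assumptions, not the paper's implementation; the `shift` and `window` parameters are hypothetical knobs, and the exact remapping rule STRING uses may differ.

```python
import numpy as np

def shifted_relative_positions(seq_len: int, shift: int, window: int) -> np.ndarray:
    """Causal relative-position matrix d[i, j] = i - j with a STRING-style remap:
    distances of at least (shift + window) become d - shift, so the largest and
    least-trained distances reuse smaller, well-trained position values, and the
    maximum position actually used drops from seq_len - 1 to seq_len - 1 - shift.
    Illustrative only; `shift` and `window` are assumed knobs, not the paper's recipe."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    d = i - j                                        # standard relative distance
    d = np.where(d >= shift + window, d - shift, d)  # fold distant positions down
    return np.tril(d)                                # keep only the causal half

# Example: 16 positions; distances 0..9 are kept exact, while the rarely
# trained distances 10..15 are folded down onto the well-trained values 2..7.
print(shifted_relative_positions(seq_len=16, shift=8, window=2))
```

Because the folded distances index rotary angles the model saw frequently during training, no parameter update is required: the change is purely in which positions are queried at inference time.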

Implications and Future Directions

The work has both theoretical and practical implications. Theoretically, it challenges existing assumptions about position encoding and context utilization. Practically, it offers a scalable, training-free way to improve existing models without modifying their weights. Future research might explore reshaping the distribution of relative positions during training or integrating STRING-like methods into other architectures and tasks.

The paper opens avenues for refining LLM design and training, potentially leading to more robust models capable of fully utilizing their training context. As researchers explore these directions, the insights from this paper affirm the importance of considering position encoding mechanisms in advancing the capabilities of LLMs.
