
Training-Free Long-Context Scaling of Large Language Models (2402.17463v2)

Published 27 Feb 2024 in cs.CL

Abstract: The ability of LLMs to process and generate coherent text is markedly weakened when the number of input tokens exceeds their pretraining length. Given the expensive overhead of finetuning large-scale models with longer sequences, we propose Dual Chunk Attention (DCA), which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention computation for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens within the same chunk (Intra-Chunk) and across distinct chunks (Inter-Chunk), as well as integrates seamlessly with Flash Attention. In addition to its impressive extrapolation capability, DCA achieves performance on practical long-context tasks that is comparable to or even better than that of finetuned models. When compared with proprietary models, our training-free 70B model attains 94% of the performance of gpt-3.5-16k, indicating it is a viable open-source alternative. All code and data used in this work are released at \url{https://github.com/HKUNLP/ChunkLlama}.

Training-Free Long-Context Scaling of LLMs: Dual Chunk Attention

The paper, authored by Chenxin An and colleagues, introduces Dual Chunk Attention (DCA), an innovative approach designed to scale the context window of LLMs without necessitating additional training. This work focuses on expanding the effective context length of models such as Llama2, allowing them to consistently process and generate text for sequences exceeding their original training limits. The proposed method is particularly notable for enabling Llama2 70B to handle context windows of over 100k tokens, directly addressing the limitations imposed by the pretraining context length.

Introduction

At the core of this research is the challenge of maintaining coherence and processing efficiency in LLMs when inputs exceed the pretraining context window. Existing LLMs are typically pretrained with a fixed context length, and fine-tuning them on longer sequences is resource-intensive. Previous approaches to extending the context length, such as Position Interpolation (PI) and NTK-aware scaling of Rotary Position Embeddings (RoPE), either require additional training or suffer significant perplexity (PPL) inflation at extended input lengths. This paper presents an efficient, training-free alternative.
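For reference, both baselines mentioned above modify how RoPE maps token positions to rotation angles. The sketch below is illustrative only: the function names and the default head dimension are placeholders, and the NTK-aware base-rescaling exponent follows the commonly cited convention rather than any single official implementation.

```python
import numpy as np

def rope_angles(positions, head_dim=128, base=10000.0):
    """Rotation angles used by RoPE: one angle per (position, frequency) pair."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)

def pi_angles(positions, train_len, target_len, head_dim=128):
    """Position Interpolation: rescale positions so the longest target position
    maps back into the range seen during pretraining."""
    return rope_angles(positions * (train_len / target_len), head_dim)

def ntk_angles(positions, train_len, target_len, head_dim=128, base=10000.0):
    """NTK-aware scaling: keep integer positions but enlarge the RoPE base,
    stretching low-frequency components more than high-frequency ones."""
    scale = target_len / train_len
    return rope_angles(positions, head_dim, base * scale ** (head_dim / (head_dim - 2)))

# Example: angles for position 8000 when the model was pretrained on 4096 tokens.
pos = np.array([8000.0])
_ = pi_angles(pos, train_len=4096, target_len=8192)
_ = ntk_angles(pos, train_len=4096, target_len=8192)
```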

Methodology: Dual Chunk Attention (DCA)

The DCA framework segments the attention computation for long sequences into chunk-based modules, capturing both intra-chunk and inter-chunk positional information while remaining compatible with Flash Attention for efficiency. DCA consists of three components:

  1. Intra-Chunk Attention: This processes tokens within the same chunk, maintaining a fixed chunk size smaller than the pretraining window.
  2. Inter-Chunk Attention: This mechanism allows attention computations across different chunks, thereby preserving long-range dependencies.
  3. Successive Chunk Attention: This is designed to maintain locality by adjusting the position indices of tokens in neighboring chunks, ensuring accurate position representation for closely spaced tokens.

Through these components, DCA manages to retain global information and minimize perplexity across sequences, even when significantly extending the context length beyond the pretraining limits.
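To make the three components concrete, the following sketch builds a DCA-style relative-position matrix for a toy sequence. It is a simplified, dense illustration rather than the authors' implementation: `chunk_size`, `local_window`, and the constant `c` stand in for the paper's chunk size, local window, and inter-chunk query position, and the real ChunkLlama code folds these indices into a Flash Attention kernel instead of materializing an n x n matrix.

```python
import numpy as np

def dca_relative_positions(n, chunk_size, local_window, c):
    """Sketch of DCA-style relative positions for a causal sequence of length n.

    chunk_size   : chunk length, chosen smaller than the pretraining window
    local_window : span over which successive-chunk attention preserves exact locality
    c            : constant query position for inter-chunk attention
                   (chunk_size <= c < pretraining window)
    Returns an n x n integer matrix M where M[i, j] is the relative position
    fed to RoPE for query i attending to key j; -1 marks causally masked entries.
    """
    M = np.full((n, n), -1, dtype=int)
    for i in range(n):                        # query token
        q_chunk, q_off = divmod(i, chunk_size)
        for j in range(i + 1):                # causal mask: keys j <= i
            k_chunk, k_off = divmod(j, chunk_size)
            if k_chunk == q_chunk:
                # Intra-chunk attention: ordinary relative distance inside one chunk.
                M[i, j] = q_off - k_off
            elif q_chunk - k_chunk == 1 and i - j <= local_window:
                # Successive-chunk attention: neighboring tokens across the chunk
                # boundary keep their true (small) distance, preserving locality.
                M[i, j] = i - j
            else:
                # Inter-chunk attention: a constant query position keeps every
                # long-range distance bounded below the pretraining window.
                M[i, j] = (c - 1) - k_off
    return M

# Toy example: 16 tokens, chunks of 6, locality window 4, c = 10.
print(dca_relative_positions(16, chunk_size=6, local_window=4, c=10))
```

Because every entry of the matrix stays below the pretraining window regardless of how large n grows, the model never sees a relative position outside the range it was trained on, which is what allows extrapolation without further training.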

Numerical Validation

The experimental results presented in this paper underscore the efficacy of DCA. For instance, the Llama2 70B model, when equipped with DCA, achieves a perplexity (PPL) of 5.59 with a context length of 100k tokens. This is a negligible increase from its baseline PPL, showcasing DCA's ability to handle long-range dependencies efficiently. This performance stands in stark contrast to training-free methods such as PI and NTK, which show considerable PPL inflation beyond context lengths of 8k tokens.

Practical and Theoretical Implications

Practical Implications: DCA provides a cost-effective solution for applications that require processing extensive text sequences, such as analyzing lengthy PDF documents, retaining long dialogue histories in conversational agents, or producing detailed summaries of large document collections. By circumventing the need for repeated, resource-intensive fine-tuning, DCA makes a strong case for practical deployment in real-world LLM applications.

Theoretical Implications: The introduction of chunk-based attention mechanisms with explicit intra-chunk and inter-chunk attention offers new insights into positional encoding and relative position matrix designs. This can stimulate further research into more refined attention mechanisms that can bridge the gap between local and global context comprehension in LLMs.

Future Directions

Given the promising results, future research might explore several avenues:

  1. Optimization of Chunk Sizes: Analyzing the impact of varying chunk sizes on different model architectures and datasets could yield even more optimized configurations.
  2. Hybrid Approaches: Combining DCA with other novel training-free approaches might further enhance the performance and scalability of LLMs.
  3. Application-Specific Tuning: Tailoring the DCA methodology to specific domains such as biomedical text mining or legal document analysis could significantly advance domain-specific LLM capabilities.

Conclusion

The Dual Chunk Attention method presented by the authors marks a significant advance in the field of LLMs by enabling training-free long-context scaling. With robust numerical results and practical evaluations, DCA stands out as a highly efficient tool for extending the context windows of LLMs. This work not only provides an immediate solution to existing limitations but also paves the way for future advancements in scalable language modeling. The open-sourcing of the code and data further enhances the potential for community engagement and iterative improvement in this critical area of machine learning research.

Authors (7)
  1. Chenxin An
  2. Fei Huang
  3. Jun Zhang
  4. Shansan Gong
  5. Xipeng Qiu
  6. Chang Zhou
  7. Lingpeng Kong
Citations (21)