LLM Maybe LongLM: Self-Extend LLM Context Window Without Tuning (2401.01325v3)

Published 2 Jan 2024 in cs.CL, cs.AI, and cs.LG

Abstract: It is well known that LLMs cannot generalize well to long contexts whose lengths are larger than the training sequence length. This poses challenges when employing LLMs for processing long input sequences during inference. In this work, we argue that LLMs themselves have inherent capabilities to handle long contexts without fine-tuning. To achieve this goal, we propose SelfExtend to extend the context window of LLMs by constructing bi-level attention information: the grouped attention and the neighbor attention. The grouped attention captures the dependencies among tokens that are far apart, while neighbor attention captures dependencies among adjacent tokens within a specified range. The two-level attentions are computed based on the original model's self-attention mechanism during inference. With minor code modification, our SelfExtend can effortlessly extend existing LLMs' context window without any fine-tuning. We conduct comprehensive experiments on multiple benchmarks and the results show that our SelfExtend can effectively extend existing LLMs' context window length. The code can be found at \url{https://github.com/datamllab/LongLM}.

Overview

This paper introduces Self-Extend, a method that lets LLMs process text sequences significantly longer than their original training limit. The approach requires no fine-tuning or additional training. The underlying premise is that LLMs, much like humans who can follow lengthy texts without having been trained on texts of that length, already possess an intrinsic capability to handle long contexts that has not been fully leveraged.

Methodology

LLMs often struggle with longer texts because the relative positional encodings become out-of-distribution (OOD) once the model encounters sequence lengths beyond its pretraining context window. To address this, Self-Extend maps unseen relative positions at inference time back onto positions encountered during training, much as humans keep only an approximate sense of how far apart distant parts of a text are.
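
To make the intuition concrete, here is a rough, hedged sketch of the arithmetic involved (the symbols are illustrative, not the paper's notation): with a pretraining context window $L_{\text{train}}$, an inference length $L_{\text{infer}}$, and a group size $G$, choosing

    $G \geq \lceil L_{\text{infer}} / L_{\text{train}} \rceil$

ensures that every coarsened relative distance satisfies $\lfloor d / G \rfloor \leq L_{\text{infer}} / G \leq L_{\text{train}}$, i.e. it stays inside the range seen during pretraining (ignoring the small offset introduced by the neighbor window described next).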

The remapping is realized through what the authors call "grouped attention": a floor operation merges positions into small groups, preserving the order of tokens while reducing the granularity of positional information. Combining this with the bi-level design, in which regular self-attention handles nearby token pairs and grouped attention handles distant pairs, lets Self-Extend model neighboring tokens precisely while still maintaining coherence over the entire text, as sketched below. The method is simple to implement, requiring only minor modifications to existing model code.
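
Below is a minimal, hedged sketch of this bi-level position mapping in Python. The neighbor window w and group size G are illustrative, assumed values rather than the paper's prescribed settings, and the released code at https://github.com/datamllab/LongLM is the authoritative implementation:

    # Hedged sketch of a SelfExtend-style bi-level relative-position mapping.
    # w (neighbor window) and G (group size) are illustrative, assumed values.
    w, G = 512, 8

    def self_extend_rel_pos(q_pos: int, k_pos: int) -> int:
        """Relative position used for a causal query/key pair (k_pos <= q_pos)."""
        d = q_pos - k_pos
        if d < w:
            # Neighbor attention: keep the exact relative distance.
            return d
        # Grouped attention: coarsen positions with a floor operation, then shift
        # the query side so the grouped region continues from the neighbor region.
        return (q_pos // G + (w - w // G)) - k_pos // G

    # Relative distances far beyond a typical 4k pretraining window fold back in range:
    for q, k in [(100, 90), (8000, 7900), (8000, 0), (16000, 0)]:
        print((q, k), "->", self_extend_rel_pos(q, k))

In a full implementation, this mapping decides which positional indices (e.g., RoPE positions) are applied before the neighbor and grouped attention pieces are merged, so the model never sees a relative position larger than those encountered during pretraining.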

Experimental Results

Self-Extend's ability to extend the context window is validated across a range of settings, where it keeps perplexity low and helps LLMs maintain performance on long inputs. The approach is compared against other context-window extension methods on tasks that require long-sequence understanding, including language modeling, synthetic long-context tasks, and real-world long-context tasks. Remarkably, Self-Extend often outperforms fine-tuning-based methods despite requiring no training and only a modest implementation effort.

Implications and Conclusion

The paper concludes by emphasizing the latent capacity of LLMs to handle longer contexts, as evidenced by Self-Extend's performance, and highlights the potential cost and efficiency gains since Self-Extend requires no additional training or fine-tuning. As future directions, the authors aim to improve efficiency with a FlashAttention-based implementation and to explore more sophisticated mapping strategies for even longer contexts. They also acknowledge current limitations: the extension is still finite, and the community lacks consensus on how to evaluate long-context tasks. Overall, Self-Extend offers a promising step toward fully unlocking the long-context processing abilities of LLMs.

Authors (8)
  1. Hongye Jin (15 papers)
  2. Xiaotian Han (46 papers)
  3. Jingfeng Yang (31 papers)
  4. Zhimeng Jiang (33 papers)
  5. Zirui Liu (58 papers)
  6. Chia-Yuan Chang (18 papers)
  7. Huiyuan Chen (43 papers)
  8. Xia Hu (186 papers)
Citations (68)