XL3M: A Training-free Framework for LLM Length Extension Based on Segment-wise Inference (2405.17755v1)
Abstract: The length generalization failure problem, namely that an LLM fails to generalize to texts longer than its maximum training length, greatly restricts the application of LLMs in scenarios with streaming long inputs. Existing methods that address this problem either incur substantial cost or introduce precision loss. In this paper, we empirically find that the accuracy of an LLM's prediction is highly correlated with its certainty. Based on this observation, we propose an efficient training-free framework, named XL3M (standing for extra-long LLM), which enables LLMs trained on short sequences to reason over extremely long sequences without any further training or fine-tuning. Under the XL3M framework, the input context is first decomposed into multiple short sub-contexts, where each sub-context contains an independent segment and a common "question", i.e., a few tokens taken from the end of the original context. XL3M then provides a method to measure the relevance between each segment and the "question", and constructs a concise key context by splicing all relevant segments in chronological order. The key context is used in place of the original context to complete the inference task. Evaluations on comprehensive benchmarks show the superiority of XL3M. Using our framework, a Llama2-7B model is able to reason over 20M-token sequences on an 8-card Huawei Ascend 910B NPU machine with 64 GB of memory per card.
- Shengnan Wang
- Youhui Bai
- Lin Zhang
- Pingyi Zhou
- Shixiong Zhao
- Gong Zhang
- Sen Wang
- Renhai Chen
- Hua Xu
- Hongwei Sun