
Two are better than one: Context window extension with multi-grained self-injection (2410.19318v1)

Published 25 Oct 2024 in cs.CL, cs.AI, and cs.LG

Abstract: The limited context window of contemporary LLMs remains a major barrier to their broader application across various domains. While continual pre-training on long-context data is a straightforward and effective solution, it incurs substantial costs in data acquisition and computational resources. To alleviate this issue, we propose SharedLLM, a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval. SharedLLM is composed of two short-context LLMs such as LLaMA-2, termed the upper model and the lower model. The lower model functions as a compressor while the upper model acts as a decoder. The upper model receives compressed, multi-grained context information from the lower model and performs context-aware modeling on the running text. Information transfer between the compressor and decoder occurs only at the lowest layers, avoiding long forward paths in the lower model and redundant cross-attention modules in the upper model. Based on this architecture, we introduce a specialized tree-style data structure to efficiently encode, store, and retrieve multi-grained contextual information for text chunks. This structure, combined with a search algorithm, enables rapid encoding and retrieval of relevant information from various levels of the tree based on the input query. This entire process, wherein the sender and receiver are derived from the same LLM layer, is referred to as self-injection.
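
To make the tree-style, query-aware retrieval described in the abstract concrete, below is a minimal Python sketch. It is an illustration based only on the abstract, not the authors' implementation: the names ChunkNode, build_tree, relevance, and retrieve are hypothetical, the lexical-overlap score stands in for whatever query-aware scoring SharedLLM actually uses, and the compressed_states field is a placeholder for the lower model's compressed representations that would be injected into the upper model's lowest layers.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ChunkNode:
    """One text span at a given granularity level of the context tree."""
    text: str
    level: int                                  # 0 = coarsest; deeper = finer granularity
    compressed_states: Optional[object] = None  # placeholder for lower-model compressed states
    children: List["ChunkNode"] = field(default_factory=list)


def build_tree(text: str, level: int = 0, max_depth: int = 2, fanout: int = 4) -> ChunkNode:
    """Recursively split a context chunk into progressively finer sub-chunks."""
    node = ChunkNode(text=text, level=level)
    if level < max_depth and len(text) > fanout:
        step = max(1, len(text) // fanout)
        for start in range(0, len(text), step):
            node.children.append(
                build_tree(text[start:start + step], level + 1, max_depth, fanout)
            )
    return node


def relevance(query: str, text: str) -> float:
    """Toy lexical-overlap score standing in for the paper's query-aware scoring."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / (len(q) or 1)


def retrieve(node: ChunkNode, query: str, threshold: float = 0.3) -> List[ChunkNode]:
    """Keep coarse nodes for weakly relevant spans; descend to finer nodes for relevant ones."""
    if not node.children or relevance(query, node.text) < threshold:
        return [node]                           # coarse, heavily compressed representation
    hits: List[ChunkNode] = []
    for child in node.children:
        hits.extend(retrieve(child, query, threshold))
    return hits


# Usage: the lower model would compress each selected node into `compressed_states`,
# which are then injected into the upper model's lowest layers (self-injection).
tree = build_tree("a long context document whose tokens far exceed the native window ...")
selected = retrieve(tree, "question about the document")
print([(node.level, node.text[:20]) for node in selected])
```

Under this reading of the abstract, spans with little bearing on the query stay at a coarse, heavily compressed level of the tree, while query-relevant spans are expanded to finer granularity, which is what "multi-grained" compression combined with "query-aware" retrieval appears to mean.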

