Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs (2404.10308v1)

Published 16 Apr 2024 in cs.LG and cs.AI

Abstract: LLMs have shown remarkable performance on a wide range of natural language processing tasks. A primary constraint they face, however, is the context limit, i.e., the maximum number of tokens they can process. Previous works have explored architectural changes and modifications to positional encoding to relax this constraint, but they often require expensive training or do not address the computational demands of self-attention. In this paper, we present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome these limitations. HOMER uses a divide-and-conquer algorithm that splits long inputs into manageable chunks. The chunks are then processed with a hierarchical strategy that merges adjacent chunks at progressively deeper transformer layers, and a token-reduction step precedes each merge to keep memory usage efficient. We also propose an optimized computational order that reduces the memory requirement to scale logarithmically with input length, making the method especially favorable for environments with tight memory restrictions. Our experiments demonstrate the proposed method's superior performance and memory efficiency, enabling broader use of LLMs in settings that require extended context. Code is available at https://github.com/alinlab/HOMER.
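
To make the divide-and-conquer idea concrete, the sketch below illustrates hierarchical merging of chunks with a token-reduction step before each merge. This is a minimal illustration, not the authors' implementation (see the linked repository); the NumPy placeholders, the L2-norm token-scoring heuristic, and the 50% keep ratio are assumptions made for the example.

```python
# Minimal sketch of hierarchical chunk merging with token reduction.
# NOT the HOMER implementation (see https://github.com/alinlab/HOMER);
# the scoring heuristic and layer processing below are illustrative placeholders.

import numpy as np

def reduce_tokens(chunk: np.ndarray, keep_ratio: float = 0.5) -> np.ndarray:
    """Drop the lowest-scoring tokens before a merge (placeholder: L2-norm score)."""
    scores = np.linalg.norm(chunk, axis=-1)          # one score per token
    k = max(1, int(len(chunk) * keep_ratio))
    keep = np.sort(np.argsort(scores)[-k:])          # keep top-k, preserve order
    return chunk[keep]

def process_layer(chunk: np.ndarray) -> np.ndarray:
    """Placeholder for running one or more transformer layers on a chunk."""
    return chunk  # identity here; a real system would apply attention/MLP blocks

def hierarchical_merge(chunks: list[np.ndarray]) -> np.ndarray:
    """Divide-and-conquer merge: process chunks, then merge adjacent chunks
    at progressively deeper levels until a single sequence remains."""
    level = [process_layer(c) for c in chunks]
    while len(level) > 1:
        merged = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                left = reduce_tokens(level[i])       # shrink before merging
                right = reduce_tokens(level[i + 1])
                merged.append(process_layer(np.concatenate([left, right], axis=0)))
            else:
                merged.append(level[i])              # odd chunk carries over
        level = merged
    return level[0]

# Toy usage: 8 chunks of 128 "token embeddings" with hidden dimension 64.
chunks = [np.random.randn(128, 64) for _ in range(8)]
out = hierarchical_merge(chunks)
print(out.shape)  # far fewer tokens than the original 8 * 128 after repeated reduction
```

Because each merge halves the number of active chunks while token reduction bounds the size of every merged sequence, the working set grows roughly logarithmically with the number of input chunks, which mirrors the memory behavior claimed in the abstract.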

Authors (7)
  1. Woomin Song (6 papers)
  2. Seunghyuk Oh (5 papers)
  3. Sangwoo Mo (20 papers)
  4. Jaehyung Kim (44 papers)
  5. Sukmin Yun (10 papers)
  6. Jung-Woo Ha (67 papers)
  7. Jinwoo Shin (196 papers)
Citations (8)