Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use (2312.04455v4)

Published 7 Dec 2023 in cs.CL, cs.AI, and cs.LG

Abstract: In this paper, we demonstrate that an inherent waveform pattern in the attention allocation of LLMs significantly affects their performance in tasks demanding a high degree of context awareness, such as using LLMs for tool use. Specifically, crucial information in the context may be overlooked by the model when it is positioned in the trough zone of the attention waveform, leading to decreased performance. To address this issue, we propose a novel inference method named Attention Buckets. It allows LLMs to process their input through multiple parallel processes, each using a distinct base angle for the rotary position embedding and thereby creating a unique attention waveform. By compensating for an attention trough in one process with an attention peak from another, our approach enhances the LLM's awareness of various contextual positions, mitigating the risk of overlooking crucial information. On the largest tool-use benchmark, our method elevates a 7B model to state-of-the-art performance, comparable to that of GPT-4. On other benchmarks and on RAG tasks, which also demand a thorough understanding of contextual content, Attention Buckets likewise delivers notable performance gains.

Understanding the Impact of Attention Allocation in LLMs

Context Awareness Challenge in LLMs

LLMs have become highly capable 'tool agents', able to invoke external tools and APIs to carry out complex tasks. An often-overlooked aspect of this capability, however, is the model's attention mechanism, specifically how attention is allocated across different parts of the context it processes. This paper examines how the model's attention pattern, which can exhibit a waveform over context positions, affects its performance when using tools. The crux of the issue is that essential information can be missed if it falls within what the paper terms an 'attention trough' of this waveform.
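
To make the mechanism concrete, the sketch below (not taken from the paper's code) shows plain RoPE with a configurable base: the base sets the rotation frequencies, so scoring the same query-key pair at increasing distances traces out a position-dependent curve whose shape changes with the base. The dimensions, random vectors, and candidate bases are illustrative assumptions.

```python
# Minimal RoPE sketch with a configurable base, to show that the base
# changes how attention scores vary with relative position.
import numpy as np

def rope_rotate(x, positions, base=10000.0):
    """Apply rotary position embedding to x of shape (n, dim) at the given positions."""
    dim = x.shape[-1]
    # One rotation frequency per pair of dimensions; a larger base rotates more slowly.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)          # (n, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = np.empty_like(x)
    rotated[..., 0::2] = x1 * cos - x2 * sin
    rotated[..., 1::2] = x1 * sin + x2 * cos
    return rotated

# Score a fixed query against the same key placed at different distances;
# the resulting curve oscillates, and its shape depends on the chosen base.
rng = np.random.default_rng(0)
dim, seq_len = 64, 256
q = rng.normal(size=(1, dim))
k = rng.normal(size=(1, dim))
for base in (10000.0, 15000.0, 25000.0):  # illustrative bases, not the paper's
    scores = [
        (rope_rotate(q, [0], base) @ rope_rotate(k, [p], base).T).item()
        for p in range(seq_len)
    ]
    print(f"base={base:>8.0f}  score range: {min(scores):+.2f} .. {max(scores):+.2f}")
```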

Attention Buckets: Parallel Processing Enhancement

To mitigate the risk of missing critical details in the context, the researchers introduce a method called Attention Buckets. The technique runs the LLM over the same input in several parallel processes, each using a different base angle for the Rotary Position Embedding (RoPE), which gives each run a distinct attention waveform. The pivotal idea is that the troughs in one run's attention waveform are compensated by the peaks of another, so every contextual position receives adequate attention in at least one run. The outputs of these parallel runs are then aggregated, combining the strengths of the varied attention allocations for a more comprehensive understanding of the context and more robust performance.
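
A highly simplified sketch of this parallel-run structure follows. The `run_with_base` stub stands in for a full model forward pass whose RoPE base has been overridden, and the pick-the-most-confident-run aggregation is an illustrative assumption rather than the paper's exact aggregation rule.

```python
# Sketch of Attention-Buckets-style inference: run the same prompt under
# several RoPE bases, then aggregate the candidate outputs.
from dataclasses import dataclass

@dataclass
class RunResult:
    base: float
    answer: str
    confidence: float  # higher = the model was more certain about this answer

def run_with_base(prompt: str, base: float) -> RunResult:
    # Stub: in a real system this would re-run LLM inference with RoPE base `base`.
    toy = {10000.0: ("call weather_api(city='Paris')", 0.61),
           18000.0: ("call weather_api(city='Paris')", 0.83),
           25000.0: ("call calendar_api()", 0.40)}
    answer, conf = toy[base]
    return RunResult(base, answer, conf)

def attention_buckets_style_inference(prompt: str, bases: list[float]) -> str:
    # The runs are independent, so they could be batched or executed in parallel.
    results = [run_with_base(prompt, b) for b in bases]
    best = max(results, key=lambda r: r.confidence)  # assumed aggregation rule
    return best.answer

print(attention_buckets_style_inference("Which tool should I call?",
                                        [10000.0, 18000.0, 25000.0]))
```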

State-of-the-Art Benchmarks Achieved

The proposed method was evaluated on a widely used tool-use benchmark, with notable results. Augmenting a 7-billion-parameter open-source model with Attention Buckets achieved state-of-the-art performance, matching that of the much larger GPT-4. Furthermore, when combined with various reasoning methods, it improved over the corresponding baselines without Attention Buckets. These results mark a significant step forward in the tool-use proficiency of open LLMs and motivate further research into their underlying context-awareness capabilities.

Broader Implications for Retrieval-Augmented Generation Tasks

Because Attention Buckets enhances context awareness, its usefulness extends beyond tool use. It also shows promise for open-domain question answering (ODQA), which likewise demands a high level of contextual comprehension. In experiments on popular ODQA benchmarks, Llama-2-7B augmented with Attention Buckets outperformed dedicated QA models. The chosen RoPE bases, and the search algorithm used to select them, also proved effective, suggesting wide applicability to tasks that depend on thorough use of the context.
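
The toy sketch below illustrates the kind of search this implies: greedily choosing a small set of bases whose attention peaks jointly cover all context positions. The synthetic waveform, the scoring criterion, and the candidate bases are placeholders assumed for illustration, not measurements from any real model or the paper's exact algorithm.

```python
# Toy greedy search over candidate RoPE bases: maximize the worst-covered
# position of the combined (upper-envelope) attention waveform.
import numpy as np

def synthetic_waveform(base: float, num_positions: int = 512) -> np.ndarray:
    # Placeholder for "attention received at each position under this base".
    pos = np.arange(num_positions)
    return 0.5 + 0.5 * np.cos(2 * np.pi * pos * (10000.0 / base) / 128.0)

def greedy_select_bases(candidates: list[float], k: int = 3) -> list[float]:
    """Greedily pick k bases so their combined waveform covers all positions well."""
    chosen: list[float] = []
    envelope = None
    for _ in range(k):
        def coverage(b: float) -> float:
            wave = synthetic_waveform(b)
            merged = wave if envelope is None else np.maximum(envelope, wave)
            return float(merged.min())  # attention at the worst-covered position
        best = max((b for b in candidates if b not in chosen), key=coverage)
        chosen.append(best)
        wave = synthetic_waveform(best)
        envelope = wave if envelope is None else np.maximum(envelope, wave)
    return chosen

print(greedy_select_bases([10000.0, 12000.0, 15000.0, 18000.0, 22000.0, 26000.0]))
```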

Authors (8)
  1. Yuhan Chen (39 papers)
  2. Ang Lv (19 papers)
  3. Ting-En Lin (28 papers)
  4. Changyu Chen (19 papers)
  5. Yuchuan Wu (33 papers)
  6. Fei Huang (408 papers)
  7. Yongbin Li (128 papers)
  8. Rui Yan (250 papers)
Citations (19)