A Comprehensive Evaluation of Tool-Assisted Generation Strategies (2310.10062v2)
Abstract: A growing area of research investigates augmenting LLMs with tools (e.g., search engines, calculators) to overcome their shortcomings (e.g., missing or incorrect knowledge, incorrect logical inferences). Various few-shot tool-usage strategies have been proposed. However, there is no systematic and fair comparison across different strategies, or between these strategies and strong baselines that do not leverage tools. We conduct an extensive empirical analysis, finding that (1) across various datasets, example difficulty levels, and models, strong no-tool baselines are competitive with tool-assisted strategies, implying that effectively using tools with in-context demonstrations is a difficult unsolved problem; (2) for knowledge-retrieval tasks, strategies that refine incorrect outputs with tools outperform strategies that retrieve relevant information ahead of or during generation; (3) tool-assisted strategies are expensive in the number of tokens they require to work -- incurring additional costs by orders of magnitude -- which does not translate into significant improvement in performance. Overall, our findings suggest that few-shot tool integration is still an open challenge, emphasizing the need for comprehensive evaluations of future strategies to accurately assess their benefits and costs.
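To make the abstract concrete, the sketch below illustrates one common shape of the few-shot tool-usage strategies being compared against no-tool baselines: generation pauses whenever the model emits a tool-call marker, the tool (here, a calculator) is executed, and its output is fed back into the prompt before generation resumes. This is a minimal illustrative sketch, not the paper's implementation; the `generate` function, the `CALC[...]` marker, and the loop structure are assumptions introduced here for exposition.

```python
import re

def run_with_calculator(generate, question, max_steps=5):
    """Interleaved tool use (illustrative sketch).

    `generate` is any hypothetical callable mapping a prompt string to a
    completion string (e.g., a wrapper around a hosted LLM). The model may
    write CALC[expression]; each such call is evaluated and its result is
    appended to the prompt, after which generation continues.
    """
    prompt = (
        "Answer the question. You may write CALC[expression] to use a "
        "calculator; its result will be given back to you.\n"
        f"Question: {question}\nAnswer:"
    )
    completion = ""
    for _ in range(max_steps):
        completion = generate(prompt)
        match = re.search(r"CALC\[([^\]]+)\]", completion)
        if match is None:
            # No (further) tool call: treat the completion as the final answer.
            return completion.strip()
        expr = match.group(1)
        try:
            # Toy calculator: evaluate the arithmetic expression in a bare namespace.
            result = str(eval(expr, {"__builtins__": {}}, {}))
        except Exception:
            result = "error"
        # Keep the generated text up to the tool call, inject the tool output,
        # and let the model continue from there on the next iteration.
        prompt += completion[: match.end()] + f" = {result}\n"
    return completion.strip()
```

A no-tool baseline, by contrast, would simply call `generate` once on the question (typically with chain-of-thought demonstrations); the paper's cost comparison stems from the extra prompt tokens and repeated model calls that loops like the one above require.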
Authors: Alon Jacovi, Avi Caciularu, Jonathan Herzig, Roee Aharoni, Bernd Bohnet, Mor Geva