ToolQA: A Dataset for LLM Question Answering with External Tools (2306.13304v1)

Published 23 Jun 2023 in cs.CL and cs.AI

Abstract: Large Language Models (LLMs) have demonstrated impressive performance in various NLP tasks, but they still suffer from challenges such as hallucination and weak numerical reasoning. To overcome these challenges, external tools can be used to enhance LLMs' question-answering abilities. However, current evaluation methods do not distinguish between questions that can be answered using LLMs' internal knowledge and those that require external information through tool use. To address this issue, we introduce a new dataset called ToolQA, which is designed to faithfully evaluate LLMs' ability to use external tools for question answering. Our development of ToolQA involved a scalable, automated process for dataset curation, along with 13 specialized tools designed for interaction with external knowledge to answer questions. Importantly, we strive to minimize the overlap between our benchmark data and LLMs' pre-training data, enabling a more precise evaluation of LLMs' tool-use reasoning abilities. We conducted an in-depth diagnosis of existing tool-use LLMs to highlight their strengths, weaknesses, and potential improvements. Our findings set a new benchmark for evaluating LLMs and suggest new directions for future advancements. Our data and code are freely available to the broader scientific community on GitHub.
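
The benchmark evaluates the agent pattern in which a model decides which external tool to call and then reasons over the tool's output. The sketch below is a minimal, hypothetical illustration of that dispatch loop, not ToolQA's actual code: the two tools and the hard-coded plan() stand-in for an LLM planner are assumptions made so the example runs without model access.

```python
from typing import Callable

# Registry of external tools the model may call. ToolQA ships 13 specialized
# tools; these two stand-ins just show the dispatch pattern.
TOOLS: dict[str, Callable[[str], str]] = {
    # Arithmetic stand-in for a math/calculator tool.
    "calculator": lambda expr: str(eval(expr, {"__builtins__": {}}, {})),
    # Tiny in-memory table standing in for a tabular-database tool.
    "database_lookup": lambda key: {"flight aa100 status": "on time"}.get(key, "not found"),
}

def plan(question: str) -> tuple[str, str]:
    """Stand-in for the LLM planner: pick a tool and its input.

    A real tool-use agent (e.g. ReAct-style prompting) would generate this
    step with the model; it is hard-coded here so the sketch is runnable.
    """
    if any(op in question for op in "+-*/"):
        expr = question.rstrip("?").split("is", 1)[-1].strip()
        return "calculator", expr
    return "database_lookup", question.rstrip("?").lower()

def answer(question: str) -> str:
    tool_name, tool_input = plan(question)      # reason: choose a tool
    observation = TOOLS[tool_name](tool_input)  # act: call the external tool
    return f"{tool_name} -> {observation}"      # a real agent would verbalize this

if __name__ == "__main__":
    print(answer("What is 17 * 24?"))      # calculator -> 408
    print(answer("Flight AA100 status?"))  # database_lookup -> on time
```

The point of ToolQA's construction is that answers like these should not be recoverable from parametric memory alone, so evaluating this loop measures tool-use reasoning rather than recall of pre-training data.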

Authors (5)
  1. Yuchen Zhuang
  2. Yue Yu
  3. Kuan Wang
  4. Haotian Sun
  5. Chao Zhang