TaskBench: Benchmarking Large Language Models for Task Automation (2311.18760v4)

Published 30 Nov 2023 in cs.CL and cs.AI

Abstract: In recent years, the remarkable progress of LLMs has sparked interest in task automation, which involves decomposing complex tasks described by user instructions into sub-tasks and invoking external tools to execute them, playing a central role in autonomous agents. However, there is a lack of systematic and standardized benchmarks to promote the development of LLMs in task automation. To address this, we introduce TaskBench, a comprehensive framework to evaluate the capability of LLMs in task automation. Specifically, task automation can be divided into three critical stages: task decomposition, tool selection, and parameter prediction. To tackle the complexities inherent in these stages, we introduce the concept of Tool Graph to represent decomposed tasks and adopt a back-instruct method to generate high-quality user instructions. We propose TaskEval, a multi-faceted evaluation methodology that assesses LLM performance across these three stages. Our approach combines automated construction with rigorous human verification, ensuring high consistency with human evaluation. Experimental results demonstrate that TaskBench effectively reflects the capabilities of various LLMs in task automation. It provides insights into model performance across different task complexities and domains, pushing the boundaries of what current models can achieve. TaskBench offers a scalable, adaptable, and reliable benchmark for advancing LLM-based autonomous agents.

TaskBench: Benchmarking LLMs for Task Automation

The paper introduces TaskBench, a benchmark designed to evaluate the capabilities of LLMs in the context of task automation. This work addresses the current gap in systematic benchmarking for LLM-driven task automation, focusing on evaluating these models across three critical dimensions: task decomposition, tool invocation, and parameter prediction.

Key Concepts and Contributions

  1. Task Automation Context: With advancements in LLMs, autonomous agents have shown promise in task automation. The paper frames this as three stages: decomposing a user's instruction into sub-tasks, invoking the correct tools for each sub-task, and predicting the parameters needed to execute them. Recognizing the complexity of this setting and the lack of standardized benchmarks, the authors present TaskBench.
  2. Tool Graph and Back-Instruct Method: A novel aspect of the paper is the Tool Graph (TG), in which tools are nodes and dependencies between tools are edges. From the TG, the authors derive user instructions with a 'back-instruct' methodology: they sample sub-graphs from the TG and synthesize user instructions backward from them, so that each instruction reflects a realistic multi-tool task (a toy sketch of this pipeline follows this list).
  3. TaskEval Metrics: To evaluate LLM capabilities, the authors introduce TaskEval, a suite of metrics addressing different facets of task automation. These metrics objectively measure how well LLMs perform task decomposition, tool invocation, and parameter prediction.
  4. Experimental Validation: The paper presents experimental validation across multiple LLMs, including GPT variations and several open-source models. Results confirm that TaskBench effectively captures the task automation capabilities of these models, offering insights into their strengths and areas for improvement.
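
A minimal Python sketch of the Tool Graph and back-instruct idea is shown below. The tool names, graph structure, and prompt wording are hypothetical illustrations under the paper's general description, not the authors' implementation or data.

```python
import random

# Hypothetical tool graph: nodes are tools, a directed edge means the output
# of one tool can serve as the input of another.
TOOL_GRAPH = {
    "object-detection":   ["image-captioning"],
    "image-captioning":   ["text-summarization", "text-to-speech"],
    "text-summarization": ["text-to-speech"],
    "text-to-speech":     [],
}

def sample_subgraph(graph, num_nodes=3, seed=None):
    """Randomly expand from a start tool to collect a small connected sub-graph."""
    rng = random.Random(seed)
    start = rng.choice(list(graph))
    nodes, frontier = {start}, [start]
    while frontier and len(nodes) < num_nodes:
        current = frontier.pop()
        for successor in graph[current]:
            if successor not in nodes and len(nodes) < num_nodes:
                nodes.add(successor)
                frontier.append(successor)
    # Keep only the edges whose endpoints both landed in the sample.
    edges = [(u, v) for u in nodes for v in graph[u] if v in nodes]
    return sorted(nodes), edges

def back_instruct_prompt(nodes, edges):
    """Build a prompt asking an LLM to write a user instruction whose
    solution requires exactly this sub-graph of tools ("back-instruct")."""
    deps = "; ".join(f"{u} -> {v}" for u, v in edges) or "none"
    return (
        "Write a realistic user request that can only be fulfilled by "
        f"using these tools: {', '.join(nodes)}. "
        f"Tool dependencies to respect: {deps}."
    )

nodes, edges = sample_subgraph(TOOL_GRAPH, num_nodes=3, seed=0)
print(back_instruct_prompt(nodes, edges))
```

Sampling a connected sub-graph first, and only then generating the instruction, is what keeps the synthesized prompts consistent with real dependency structure rather than arbitrary tool combinations.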

Experimental Findings

  • Task Decomposition: Models vary noticeably in how well they break a request into sub-tasks, with GPT-4 clearly ahead of the other models, an outcome likely tied to its stronger reasoning ability.
  • Tool Invocation Performance: Node prediction and edge prediction differ significantly, with edge prediction generally harder. This highlights the difficulty of inferring dependencies between sub-tasks (see the metric sketch after these bullets).
  • Parameter Prediction: Predicting the correct parameters for tool execution remains a demanding task, with GPT-4 showing the highest accuracy, thus underscoring the challenge LLMs face in understanding detailed, context-specific requirements.
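
To make the node/edge distinction concrete, the following sketch scores a predicted tool graph against a gold graph with set-based F1, in the spirit of TaskEval's tool-selection and dependency metrics; the exact scoring rules in the paper may differ, and all tool and parameter names here are illustrative.

```python
def f1(predicted, gold):
    """Set-based F1 between predicted and gold items."""
    predicted, gold = set(predicted), set(gold)
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(predicted), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Gold tool graph for one instruction vs. an LLM's prediction
# (tool names are illustrative, not taken from the benchmark).
gold_nodes = ["object-detection", "image-captioning", "text-to-speech"]
gold_edges = [("object-detection", "image-captioning"),
              ("image-captioning", "text-to-speech")]
pred_nodes = ["image-captioning", "text-to-speech"]
pred_edges = [("image-captioning", "text-to-speech")]

print(f"node F1: {f1(pred_nodes, gold_nodes):.2f}")  # tool selection
print(f"edge F1: {f1(pred_edges, gold_edges):.2f}")  # dependency structure

# Parameter prediction can be scored the same way over
# (tool, parameter name, value) triples.
gold_params = [("text-to-speech", "text", "caption of the uploaded image")]
pred_params = [("text-to-speech", "text", "caption of the uploaded image")]
print(f"parameter F1: {f1(pred_params, gold_params):.2f}")
```

Edge F1 is typically lower than node F1 because a correct edge requires both endpoints to be selected and their dependency direction to be inferred correctly.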

Implications and Future Directions

TaskBench is a comprehensive benchmarking framework that fills a crucial gap in the evaluation of LLMs for task automation. Its implications are several:

  • Practical Impact: By offering a method to evaluate LLM performance in complex, real-world task scenarios, TaskBench can guide developers in optimizing models for practical applications in autonomous systems.
  • Theoretical Insights: The approach can inspire further research into enhancing tool invocation strategies within LLMs, potentially leading to more nuanced models capable of intricate task executions.
  • Future Developments: The authors suggest extending TaskBench to cover more domains and refining the evaluation criteria, which could lead to even broader applicability and insights into LLM capabilities.

Conclusion

The TaskBench benchmark provides a structured and effective approach to assessing LLMs in task automation contexts. By dissecting and comprehensively evaluating how these models handle task decomposition, tool invocation, and parameter prediction, this work lays the groundwork for future innovations and improvements in the field of autonomous systems. With its novel data generation methodology and rigorous evaluation metrics, TaskBench is positioned as a significant contribution to AI research, particularly in enhancing the practical applicability of LLMs.

Authors (9)
  1. Yongliang Shen
  2. Kaitao Song
  3. Xu Tan
  4. Wenqi Zhang
  5. Kan Ren
  6. Siyu Yuan
  7. Weiming Lu
  8. Dongsheng Li
  9. Yueting Zhuang