Benchmarking Agentic Workflow Generation
The paper presents WorFBench, a benchmark designed to evaluate the agentic workflow generation capabilities of LLMs: decomposing a complex task into an executable, structured workflow, a critical step for reasoning and planning in real-world applications. Existing evaluation frameworks have either emphasized holistic end-to-end performance or covered only a narrow range of scenarios. WorFBench addresses these gaps with multi-faceted scenarios and workflows structured as graphs rather than simple sequences.
Key Contributions
- WorFBench Benchmark: A comprehensive agentic workflow benchmark covering diverse scenarios such as problem-solving, function calling, embodied planning, and open-grounded planning. The dataset provides a training set of 18,679 samples and a balanced test set of 2,146 samples, including held-out tasks for evaluating generalization. (A hypothetical workflow instance is sketched in the first code block after this list.)
- WorFEval Evaluation Protocol: A systematic evaluation protocol that uses subsequence and subgraph matching algorithms to quantitatively assess an LLM's workflow generation ability, yielding a more precise picture of each model's capabilities and limitations on complex workflows. (A scoring sketch in this spirit appears in the second code block after this list.)
- Evaluation Findings: Extensive evaluations across a range of LLMs reveal a significant gap between sequence planning and graph planning capabilities. Notably, even state-of-the-art models such as GPT-4 score considerably lower on graph-structured workflows than on linear ones, underscoring how challenging the generation of complex, structured workflows remains.
- Training and Generalization: Two open-source models are fine-tuned on the WorFBench training set to assess generalization. The findings suggest that while training improves performance on held-in tasks, generalization to held-out tasks, especially embodied ones, remains challenging. (A minimal fine-tuning sketch closes the code examples below.)
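To make the task concrete, here is a hypothetical workflow instance of the kind WorFBench evaluates: a task decomposed into subtask nodes plus dependency edges. The task, labels, and schema are illustrative only; the benchmark's actual serialization format may differ.

```python
# Hypothetical graph-structured workflow: nodes are subtasks, and an edge
# (u, v) means subtask v depends on subtask u. Labels and schema are
# illustrative, not WorFBench's actual format.
workflow = {
    "task": "Plan a weekend trip to Paris",
    "nodes": [                  # subtask i is nodes[i]
        "search flights",       # 0
        "search hotels",        # 1
        "compare total costs",  # 2
        "book flight",          # 3
        "book hotel",           # 4
    ],
    "edges": [              # (u, v): subtask v depends on subtask u
        (0, 2), (1, 2),     # the comparison needs both searches (parallelizable)
        (2, 3), (2, 4),     # both bookings wait on the comparison
    ],
}
```

A linear plan is the special case in which the edges form a single chain; the gap the paper measures is precisely between this special case and the general graph.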
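The next sketch illustrates the two-level scoring idea behind WorFEval: an F1 over the longest common subsequence of predicted and gold node chains (sequence level), and an F1 over dependency edges whose matched endpoints agree (graph level). It assumes nodes match by exact string equality; the paper's actual node-alignment and subgraph-matching algorithms are more involved.

```python
# Sketch of sequence- and graph-level workflow scoring in the spirit of
# WorFEval. Node matching is plain string equality for illustration.

def lcs(pred: list[str], gold: list[str]) -> list[tuple[int, int]]:
    """Longest common subsequence via DP; returns matched index pairs."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gold[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    pairs, i, j = [], m, n
    while i and j:  # backtrack to recover the matched node pairs
        if pred[i - 1] == gold[j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def f1(hits: int, n_pred: int, n_gold: int) -> float:
    if not hits:
        return 0.0
    p, r = hits / n_pred, hits / n_gold
    return 2 * p * r / (p + r)

def sequence_score(pred_nodes, gold_nodes) -> float:
    """F1 over the LCS: rewards correct subtasks in the correct order."""
    return f1(len(lcs(pred_nodes, gold_nodes)), len(pred_nodes), len(gold_nodes))

def graph_score(pred_nodes, pred_edges, gold_nodes, gold_edges) -> float:
    """F1 over dependency edges whose endpoints were matched by the LCS."""
    node_map = dict(lcs(pred_nodes, gold_nodes))  # pred index -> gold index
    mapped = {(node_map[u], node_map[v]) for u, v in pred_edges
              if u in node_map and v in node_map}
    return f1(len(mapped & set(gold_edges)), len(pred_edges), len(gold_edges))

# Example: the predicted plan omits the hotel search.
gold_n = ["search flights", "search hotels", "compare costs", "book flight"]
gold_e = [(0, 2), (1, 2), (2, 3)]
pred_n = ["search flights", "compare costs", "book flight"]
pred_e = [(0, 1), (1, 2)]
print(sequence_score(pred_n, gold_n))               # ~0.857
print(graph_score(pred_n, pred_e, gold_n, gold_e))  # 0.8
```

Separating the two scores is what exposes the paper's central finding: a model can order subtasks well (high sequence score) while still failing to capture parallel or branching dependencies (low graph score).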
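Finally, a minimal fine-tuning sketch using Hugging Face TRL, assuming the WorFBench training set has been flattened into text records. The model name, file name, and hyperparameters are placeholders, not the paper's actual setup, and the API assumes a recent TRL version.

```python
# Minimal supervised fine-tuning sketch with Hugging Face TRL (assumed
# recent version). "worfbench_train.jsonl" is a placeholder file assumed
# to hold {"text": ...} records pairing task instructions with gold
# workflows; model and hyperparameters are illustrative.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="worfbench_train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # placeholder open-source model
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="worfbench-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
)
trainer.train()
```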
Implications and Future Directions
This paper's contributions matter for both the theory and practice of AI agents. By providing a structured benchmark and evaluation protocol, it enables a more precise understanding of the planning capabilities of LLMs, and the results underscore the need to close the gap between linear and graph-structured workflow generation.
These insights could shape future work on designing agents capable of robust planning in diverse environments. The failure modes highlighted by the results also suggest integrating world knowledge or world models as a route to more grounded and effective agent plans.
Future research may explore iterative or interactive planning paradigms and methods for incorporating real-world knowledge into LLMs, directions that promise more practical and effective planning in real-world scenarios.
Overall, WorFBench represents a significant step forward in evaluating and improving the workflow generation capabilities of LLMs, setting the stage for future innovations in agent-driven AI applications.