Benchmarking Agentic Workflow Generation
The paper presents WorFBench, a benchmark designed to evaluate the agentic workflow generation capabilities of LLMs: decomposing a complex task into an executable, structured workflow, a critical step for reasoning and planning in real-world applications. Existing evaluation frameworks have either emphasized holistic end-to-end performance or covered only a narrow range of scenarios. WorFBench addresses these gaps with multi-faceted scenarios and workflows structured as graphs rather than simple sequences.
Key Contributions
- WorFBench Benchmark: A comprehensive agentic workflow benchmark covering diverse scenarios such as problem-solving, function calling, embodied planning, and open-grounded planning. The dataset provides a training set of 18,679 samples and a balanced test set of 2,146 samples, including held-out tasks for evaluating generalization. (A hypothetical workflow instance is sketched in the first code block after this list.)
- WorFEval Evaluation Protocol: A systematic evaluation protocol that uses subsequence and subgraph matching algorithms to quantitatively assess an LLM's workflow generation ability, yielding a more precise picture of each model's capabilities and limitations on complex workflows. (A scoring sketch in this spirit appears in the second code block after this list.)
- Evaluation Findings: Extensive evaluations across a range of LLMs reveal a significant gap between sequence planning and graph planning capabilities. Notably, even state-of-the-art models such as GPT-4 score considerably lower on graph-structured workflows than on linear ones, underscoring how challenging the generation of complex, structured workflows remains.
- Training and Generalization: Two open-source models are fine-tuned on the WorFBench training set to assess generalization. The findings suggest that while training improves performance on held-in tasks, generalization to held-out tasks, especially embodied ones, remains challenging. (A minimal fine-tuning sketch closes the code examples below.)
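To make the task concrete, here is a hypothetical workflow instance of the kind WorFBench evaluates: a task decomposed into subtask nodes plus dependency edges. The task, labels, and schema are illustrative only; the benchmark's actual serialization format may differ.

```python
# Hypothetical graph-structured workflow: nodes are subtasks, and an edge
# (u, v) means subtask v depends on subtask u. Labels and schema are
# illustrative, not WorFBench's actual format.
workflow = {
    "task": "Plan a weekend trip to Paris",
    "nodes": [                  # subtask i is nodes[i]
        "search flights",       # 0
        "search hotels",        # 1
        "compare total costs",  # 2
        "book flight",          # 3
        "book hotel",           # 4
    ],
    "edges": [              # (u, v): subtask v depends on subtask u
        (0, 2), (1, 2),     # the comparison needs both searches (parallelizable)
        (2, 3), (2, 4),     # both bookings wait on the comparison
    ],
}
```

A linear plan is the special case in which the edges form a single chain; the gap the paper measures is precisely between this special case and the general graph.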
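The next sketch illustrates the two-level scoring idea behind WorFEval: an F1 over the longest common subsequence of predicted and gold node chains (sequence level), and an F1 over dependency edges whose matched endpoints agree (graph level). It assumes nodes match by exact string equality; the paper's actual node-alignment and subgraph-matching algorithms are more involved.

```python
# Sketch of sequence- and graph-level workflow scoring in the spirit of
# WorFEval. Node matching is plain string equality for illustration.

def lcs(pred: list[str], gold: list[str]) -> list[tuple[int, int]]:
    """Longest common subsequence via DP; returns matched index pairs."""
    m, n = len(pred), len(gold)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = (dp[i][j] + 1 if pred[i] == gold[j]
                                else max(dp[i][j + 1], dp[i + 1][j]))
    pairs, i, j = [], m, n
    while i and j:  # backtrack to recover the matched node pairs
        if pred[i - 1] == gold[j - 1]:
            pairs.append((i - 1, j - 1)); i -= 1; j -= 1
        elif dp[i - 1][j] >= dp[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]

def f1(hits: int, n_pred: int, n_gold: int) -> float:
    if not hits:
        return 0.0
    p, r = hits / n_pred, hits / n_gold
    return 2 * p * r / (p + r)

def sequence_score(pred_nodes, gold_nodes) -> float:
    """F1 over the LCS: rewards correct subtasks in the correct order."""
    return f1(len(lcs(pred_nodes, gold_nodes)), len(pred_nodes), len(gold_nodes))

def graph_score(pred_nodes, pred_edges, gold_nodes, gold_edges) -> float:
    """F1 over dependency edges whose endpoints were matched by the LCS."""
    node_map = dict(lcs(pred_nodes, gold_nodes))  # pred index -> gold index
    mapped = {(node_map[u], node_map[v]) for u, v in pred_edges
              if u in node_map and v in node_map}
    return f1(len(mapped & set(gold_edges)), len(pred_edges), len(gold_edges))

# Example: the predicted plan omits the hotel search.
gold_n = ["search flights", "search hotels", "compare costs", "book flight"]
gold_e = [(0, 2), (1, 2), (2, 3)]
pred_n = ["search flights", "compare costs", "book flight"]
pred_e = [(0, 1), (1, 2)]
print(sequence_score(pred_n, gold_n))               # ~0.857
print(graph_score(pred_n, pred_e, gold_n, gold_e))  # 0.8
```

Separating the two scores is what exposes the paper's central finding: a model can order subtasks well (high sequence score) while still failing to capture parallel or branching dependencies (low graph score).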
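Finally, a minimal fine-tuning sketch using Hugging Face TRL, assuming the WorFBench training set has been flattened into text records. The model name, file name, and hyperparameters are placeholders, not the paper's actual setup, and the API assumes a recent TRL version.

```python
# Minimal supervised fine-tuning sketch with Hugging Face TRL (assumed
# recent version). "worfbench_train.jsonl" is a placeholder file assumed
# to hold {"text": ...} records pairing task instructions with gold
# workflows; model and hyperparameters are illustrative.
from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

train_ds = load_dataset("json", data_files="worfbench_train.jsonl", split="train")

trainer = SFTTrainer(
    model="meta-llama/Llama-2-7b-hf",  # placeholder open-source model
    train_dataset=train_ds,
    args=SFTConfig(
        output_dir="worfbench-sft",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        learning_rate=2e-5,
    ),
)
trainer.train()
```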
Implications and Future Directions
This paper's contributions matter for both the theory and practice of AI agents. By providing a structured benchmark and evaluation protocol, it enables a more precise understanding of the planning capabilities of LLMs, and the results underscore the need to close the gap between linear and graph-structured workflow generation.
These insights could shape future work on designing agents capable of robust planning in diverse environments. The failure modes highlighted by the results also suggest integrating world knowledge or world models as a route to more grounded and effective agent plans.
Future research may explore iterative or interactive planning paradigms and methods for incorporating real-world knowledge into LLMs, directions that promise more practical and effective planning in real-world scenarios.
Overall, WorFBench represents a significant step forward in evaluating and improving the workflow generation capabilities of LLMs, setting the stage for future innovations in agent-driven AI applications.