WorFBench: Multi-Domain LLM Workflow Benchmark
- WorFBench is a large-scale benchmark that assesses LLMs’ ability to decompose complex tasks into directed acyclic graph workflows across multiple planning domains.
- The evaluation protocol, WorFEval, employs both subgraph and chain matching techniques to precisely measure workflow structure and execution order.
- The benchmark facilitates robust comparisons of standard and compressed LLMs while highlighting the impact of fine-tuning and scaling on real-world planning tasks.
WorFBench is a large-scale, multi-domain benchmark for evaluating the ability of LLM agents to decompose complex natural language tasks into executable, graph-structured workflows. It aims to rigorously quantify LLM workflow planning capabilities beyond simple linear chains, covering a broad array of real-world planning and tool-use scenarios. WorFBench is accompanied by WorFEval, a systematic subgraph-matching and chain-matching protocol that enables precise, structure-aware evaluation. The dataset and evaluation suite provide both breadth (function-calling, structured problem solving, embodied interaction, and open-ended planning) and depth, with intricate dependency graphs and fine-grained annotation. WorFBench is widely used to benchmark both standard and compressed LLMs for agentic planning robustness, and it forms the workflow-generation track in the Agent Compression Benchmark (ACBench).
1. Motivation and Scope
WorFBench is designed in response to several limitations in existing workflow-planning and agentic LLM benchmarks. Prior frameworks typically restrict themselves to holistic end-to-end tool-use metrics or evaluate LLM plans only in narrowly defined domains using linear, chain-like workflows. Such limitations obscure critical aspects of complex task decomposition, such as parallelism, conditional branching, and multi-agent or environment interaction. WorFBench unifies four primary planning domains into a single Directed Acyclic Graph (DAG) workflow formalism: function-calling, classical reasoning (mathematics, commonsense, multimodal tasks), embodied interaction (including ALFWorld, WebShop, and shell/OS control), and open-grounded real-world planning (WikiHow). Each task in the benchmark is independently validated by GPT-4 synthesis and human annotation to ensure high semantic and structural fidelity (Qiao et al., 2024).
2. Dataset Construction and Graph Format
The WorFBench dataset comprises 18,679 training tasks and 2,146 test tasks, with an additional 723 held-out challenge ("seal") tasks for evaluating model generalization. Each example consists of:
- A natural-language goal statement.
- A list of available functions/APIs (when appropriate).
- A ground-truth workflow, represented as a DAG $G = (V, E)$, where each node in $V$ is a minimal, executable subtask and each edge in $E$ encodes a “must happen before” dependency.
Nodes are annotated with labels ("action" and arguments); edges are explicitly enumerated. The domain coverage ensures that DAGs exhibit substantial structural diversity, including serial plans, parallel branches, re-merged dependencies, and environment-conditioned decision points. Average workflow size is 4.17 nodes, with multiple topological orderings for each graph to accommodate the inherent variability in LLM output sequence.
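To make the role of multiple topological orderings concrete, the following is a minimal sketch that enumerates every linearization of a small workflow DAG. The node names and the function `all_topological_orders` are hypothetical illustrations, not part of the benchmark's tooling.

```python
def all_topological_orders(nodes, edges):
    """Yield every ordering of `nodes` that respects all (before, after) edges."""
    preds = {n: set() for n in nodes}
    for before, after in edges:
        preds[after].add(before)

    def extend(prefix, placed):
        if len(prefix) == len(nodes):
            yield list(prefix)
            return
        for n in nodes:
            # A node can be placed once all of its prerequisites are placed.
            if n not in placed and preds[n] <= placed:
                prefix.append(n)
                placed.add(n)
                yield from extend(prefix, placed)
                prefix.pop()
                placed.remove(n)

    yield from extend([], set())


# Hypothetical four-node workflow with one parallel branch.
nodes = ["find_recipe", "buy_ingredients", "preheat_oven", "bake"]
edges = [
    ("find_recipe", "buy_ingredients"),
    ("find_recipe", "preheat_oven"),
    ("buy_ingredients", "bake"),
    ("preheat_oven", "bake"),
]

orders = list(all_topological_orders(nodes, edges))
for order in orders:
    print(" -> ".join(order))
```

The two middle steps are independent, so this graph admits two valid orderings; a model that emits either one has produced a sequence consistent with the gold DAG.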
3. Evaluation Protocol: WorFEval
WorFEval is a graph and chain alignment protocol for evaluating predicted workflows against gold-standard graphs. The protocol involves:
- Bipartite node alignment: Nodes in the predicted and gold workflows are aligned one-to-one based on semantic similarity (Sentence-BERT cosine similarity above a threshold), yielding valid node matchings.
- Subsequence (Chain) Matching: For each possible topological order of the gold graph, the longest common subsequence between the predicted sequence and that order is found by computing a Longest Increasing Subsequence (LIS) over the matched node positions. Chain-level precision, recall, and F1-score are computed from the length of the best match.
- Subgraph Matching: Computes the Maximum Common Induced Subgraph (MCIS), giving analogous F1 for DAG structure quality.
This dual protocol captures both LLMs' ability to preserve correct action sequences and their facility with explicit, parallel graph dependencies (Qiao et al., 2024; 2505.19433).
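The chain-matching step can be sketched as follows. This is an illustrative approximation, not the reference implementation: it assumes the node alignment has already been computed and is passed in as a dictionary `match`. It also assumes the conventional normalization, in which precision divides the matched length by the number of predicted nodes and recall divides it by the number of gold nodes. All names are hypothetical.

```python
from bisect import bisect_left


def topological_orders(nodes, edges):
    """Yield every ordering of `nodes` consistent with the (before, after) edges."""
    preds = {n: set() for n in nodes}
    for before, after in edges:
        preds[after].add(before)

    def extend(prefix, placed):
        if len(prefix) == len(nodes):
            yield list(prefix)
            return
        for n in nodes:
            if n not in placed and preds[n] <= placed:
                prefix.append(n)
                placed.add(n)
                yield from extend(prefix, placed)
                prefix.pop()
                placed.remove(n)

    yield from extend([], set())


def lis_length(seq):
    """Length of the longest strictly increasing subsequence."""
    tails = []
    for x in seq:
        i = bisect_left(tails, x)
        if i == len(tails):
            tails.append(x)
        else:
            tails[i] = x
    return len(tails)


def chain_scores(pred_seq, gold_nodes, gold_edges, match):
    """Return (precision, recall, f1) for a predicted node sequence.

    `match` maps predicted nodes to gold nodes (assumed one-to-one).
    """
    best = 0
    for order in topological_orders(gold_nodes, gold_edges):
        pos = {n: i for i, n in enumerate(order)}
        # Positions, in this gold order, of the matched predicted nodes.
        indices = [pos[match[p]] for p in pred_seq if p in match]
        best = max(best, lis_length(indices))
    precision = best / len(pred_seq) if pred_seq else 0.0
    recall = best / len(gold_nodes) if gold_nodes else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


gold_nodes = ["find_recipe", "buy_ingredients", "preheat_oven", "bake"]
gold_edges = [
    ("find_recipe", "buy_ingredients"),
    ("find_recipe", "preheat_oven"),
    ("buy_ingredients", "bake"),
    ("preheat_oven", "bake"),
]
pred_seq = ["search", "heat", "shop", "serve"]  # "serve" has no gold match
match = {"search": "find_recipe", "heat": "preheat_oven", "shop": "buy_ingredients"}

precision, recall, f1 = chain_scores(pred_seq, gold_nodes, gold_edges, match)
print(precision, recall, f1)
```

Iterating over all gold orderings means the prediction is not penalized for choosing a different but equally valid linearization of parallel branches. Here three of the four predicted steps align with one gold ordering.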
4. Experimental Findings and Domain Results
Extensive multi-model evaluation highlights key trends:
- Gap between sequential and graph planning: for GPT-4, chain F1 exceeds graph F1 by a persistent ~15 percentage points, showing that linear planning is substantially easier than graph-structured planning.
- Scaling effects: 72B parameter models improve by ~6–11 pp over 7B models, yet some recent 7B models (InternLM-2.5-7B, Qwen-2-7B) outperform older, larger models due to data and training improvements.
- Domain breakdown: Function-calling, embodied, and open-grounded domains each reveal characteristic pathologies—for example, linear chains predominate in embodied sequences, while parallelism dominates function/tool settings.
- Fine-tuning: Fine-tuning sharply boosts held-in-task scores (e.g., for Qwen-2-7B+FT), but generalization to unseen task domains remains limited, with notably lower scores on embodied or hybrid tasks (Qiao et al., 2024).
5. Impact of Model Compression: Robustness Analysis
WorFBench is integral to the ACBench suite for evaluating LLM planning under compression. Studies show that:
- Quantization (AWQ, GPTQ) and unstructured pruning (Wanda, SparseGPT) applied post-training incur at most a 1–3% F1 drop on WorFBench, even as next-token LM metrics may degrade 10–20% (2505.19433).
- Structured pruning is catastrophic on large models without extensive fine-tuning, sharply degrading graph-generation performance.
- High-level graph planning persists under compression due to abstraction redundancy and tolerance for numeric drift; workflows depend primarily on skeleton sequence and function recognition.
Three additional analysis metrics are introduced:
- eRank: Entropy of singular values of weight/activation matrices; collapse in eRank correlates with workflow planning loss.
- Top-k Ranking Consistency: Jaccard overlap of top-$k$ logits in baseline vs. compressed model; drops associate with missing workflow steps.
- Energy Difference: Absolute shift in negative log-sum-exp (temperature 1) of logits; a large shift aligns with precision/recall drops in graph edges.
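The three metrics above can be sketched in a few lines of plain Python. This is a hedged illustration: the function names are hypothetical, the singular values are assumed to be precomputed, and `effective_rank` assumes the common definition of effective rank as the exponential of the Shannon entropy of the normalized singular values, which may differ in detail from the definition used in ACBench.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value distribution (assumed definition)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def topk_jaccard(logits_a, logits_b, k):
    """Jaccard overlap of the top-k token indices of two logit vectors (k >= 1)."""
    def top(logits):
        ranked = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
        return set(ranked[:k])

    a, b = top(logits_a), top(logits_b)
    return len(a & b) / len(a | b)


def energy(logits, temperature=1.0):
    """Negative temperature-scaled log-sum-exp, computed stably."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(x - m) for x in scaled)))


def energy_difference(logits_a, logits_b):
    """Absolute shift in energy between two logit vectors at temperature 1."""
    return abs(energy(logits_a) - energy(logits_b))


# Toy logit vectors standing in for a baseline and a compressed model.
baseline = [2.0, 1.0, 0.5, -1.0]
compressed = [1.9, 0.2, 0.9, -1.0]

print(effective_rank([1.0, 1.0, 1.0, 1.0]))   # uniform spectrum: full rank
print(effective_rank([1.0, 0.0, 0.0, 0.0]))   # collapsed spectrum: rank one
print(topk_jaccard(baseline, compressed, k=2))
print(energy_difference(baseline, compressed))
```

In this toy case the two models agree on the top token but disagree on the second, so the top-2 sets share one index out of three distinct indices.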
A summary of representative F1 scores is shown below:
| Model | Baseline F1 | AWQ (INT4) | GPTQ (INT4) | Wanda (Unstr) | SparseGPT (Unstr) | Max Change vs. Baseline |
|---|---|---|---|---|---|---|
| Qwen-2.5-7B | 0.44 | 0.46 | 0.44 | 0.46 | 0.46 | +0.02 |
| Qwen-2.5-32B | 0.72 | 0.71 | 0.69 | 0.46* | 0.46* | -0.03 |
| InternLM-2.5-7B | 0.43 | 0.45 | 0.43 | 0.46 | 0.46 | +0.03 |
(*Significant drops for structured pruning on large models.)
6. Task Format, Error Modalities, and Downstream Utility
Inputs consist of a natural-language goal and available tools (where relevant); outputs are a JSON graph with "nodes" and "edges," with each node labeled by action and arguments. Strict subgraph matching is used for evaluation, making the metric sensitive to even minor deviations.
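As an illustration of this output format, the sketch below parses a JSON workflow and checks that it is a well-formed DAG before any matching is attempted. Only the top-level "nodes" and "edges" keys come from the description above. The per-node fields (`id`, `action`, `arguments`), the edge encoding as `[source, target]` pairs, and the function `load_workflow` are assumptions made for the example.

```python
import json
from collections import deque

# Hypothetical model output; field names inside each node are assumed.
raw = """
{
  "nodes": [
    {"id": 1, "action": "search_flights", "arguments": {"to": "Paris"}},
    {"id": 2, "action": "search_hotels", "arguments": {"city": "Paris"}},
    {"id": 3, "action": "build_itinerary", "arguments": {}}
  ],
  "edges": [[1, 3], [2, 3]]
}
"""


def load_workflow(text):
    """Parse a workflow and raise ValueError if it is not a valid DAG."""
    data = json.loads(text)
    ids = [node["id"] for node in data["nodes"]]
    edges = [tuple(edge) for edge in data["edges"]]
    if len(set(ids)) != len(ids):
        raise ValueError("duplicate node id")
    for src, dst in edges:
        if src not in ids or dst not in ids:
            raise ValueError(f"edge ({src}, {dst}) references an unknown node")

    # Kahn's algorithm: if not every node can be removed, there is a cycle.
    indegree = {i: 0 for i in ids}
    successors = {i: [] for i in ids}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    queue = deque(i for i in ids if indegree[i] == 0)
    visited = 0
    while queue:
        current = queue.popleft()
        visited += 1
        for nxt in successors[current]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if visited != len(ids):
        raise ValueError("workflow contains a cycle")
    return data["nodes"], edges


workflow_nodes, workflow_edges = load_workflow(raw)
print(len(workflow_nodes), "nodes,", len(workflow_edges), "edges")
```

Rejecting malformed graphs up front separates format errors from genuine planning errors, which is useful when diagnosing the error modes listed next.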
Common error modes include:
- Missing intermediate nodes or edges (under-recall), e.g., due to quantization perturbing action priorities.
- Spurious dependencies (precision loss) from token sequence drift or format parsing errors under pruning.
Despite these error modes, even aggressively quantized or pruned models nearly always (>95%) produce correct DAGs, suggesting substantial robustness in high-level planning.
Downstream, workflows generated using WorFBench schemas:
- Increase accuracy and invocation success on tool use and embodied planning benchmarks (e.g., ALFWorld, StableToolBench), often outperforming baseline agents by 2–19 pp.
- Accelerate inference by enabling parallel execution along workflow critical paths (20–33% reduction in runtime).
- Reduce planning steps by anchoring model reasoning against a structured plan, mitigating drifting in long-form tasks (Qiao et al., 2024).
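The parallel-execution benefit can be illustrated with a simple level schedule. This is a sketch under strong assumptions: every node takes one unit of time, workers are unlimited, and the input graph is acyclic. The function `parallel_stages` and the node names are hypothetical, and the example does not reproduce the runtime figures reported above.

```python
def parallel_stages(nodes, edges):
    """Group nodes into stages; nodes in the same stage can run concurrently.

    Assumes the graph is acyclic. A node's stage is one more than the deepest
    stage among its prerequisites.
    """
    preds = {n: [] for n in nodes}
    for before, after in edges:
        preds[after].append(before)

    level = {}

    def depth(n):
        if n not in level:
            level[n] = 1 + max((depth(p) for p in preds[n]), default=0)
        return level[n]

    stages = {}
    for n in nodes:
        stages.setdefault(depth(n), []).append(n)
    return [stages[d] for d in sorted(stages)]


nodes = ["search_flights", "search_hotels", "check_weather", "build_itinerary"]
edges = [
    ("search_flights", "build_itinerary"),
    ("search_hotels", "build_itinerary"),
    ("check_weather", "build_itinerary"),
]

stages = parallel_stages(nodes, edges)
print(f"serial steps: {len(nodes)}, parallel stages: {len(stages)}")
for i, stage in enumerate(stages, start=1):
    print(i, stage)
```

The number of stages equals the length of the longest dependency chain, so under these assumptions the three independent lookups collapse into a single stage followed by the merge step.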
7. Limitations, Extensions, and Open Challenges
Open challenges and directions include:
- Generalization: While fine-tuned models excel on in-distribution tasks, substantial drops are observed on held-out and hybrid domains, underscoring the need for structurally generalizable planning architectures.
- DAG complexity: Graph size and degree directly impact model performance, revealing scaling bottlenecks not evident in chain-based evaluation.
- Evaluation sensitivity: Subgraph matching is robust to node order permutations but sensitive to spurious or missing dependencies; alternative or hierarchical metrics could provide complementary insights.
A plausible implication is that future workflow benchmarks may incorporate dynamic graphs, conditional loops, or collaborative agent flows, reflecting further real-world complexity.
References:
- "Benchmarking Agentic Workflow Generation" (Qiao et al., 2024)
- "Can Compressed LLMs Truly Act? An Empirical Evaluation of Agentic Capabilities in LLM Compression" (2505.19433)