WorkflowBench Dataset
- The WorkflowBench Dataset is a collection of diverse benchmarks that standardize workflow modeling, automation, and reproducibility for edge, scientific, and enterprise applications.
- It employs methodologies like hierarchical task abstraction and automated workflow generation to replicate real-world operational complexity.
- The benchmarks support performance evaluation using metrics such as resource utilization and makespan estimation, enabling actionable insights for optimization and anomaly detection.
The WorkflowBench Dataset refers to a collection of datasets and benchmarks central to the evaluation, orchestration, and reproducibility of complex workflows in computational, edge, enterprise, and business process settings. Multiple distinct datasets share the name “WorkflowBench” or closely related titles, each aiming to address different aspects of workflow modeling, automation, anomaly detection, reproducibility, process management, and performance evaluation.
1. Definition and Scope
WorkflowBench, as represented in the research corpus, is not a single dataset but an umbrella term applied to various benchmarks that systematize the modeling, evaluation, and reproducibility of workflows—sets of orchestrated tasks in fields such as edge computing, scientific computation, business process automation, and more. The scope of published WorkflowBench datasets includes:
- Workflow-based benchmarking for edge computing (2010.14027)
- Automated generation of scientific workflow benchmarks (2210.03170)
- Curated workflow execution datasets for anomaly detection (2306.09930)
- Large-scale collections for data-centric fine-tuning and hierarchical orchestration with LLMs (2411.05451)
- Reproducibility benchmarking across computational sciences (2504.08684)
- Benchmarks for natural language-to-business process automation (2505.11646)
All these benchmarks are unified by their emphasis on capturing realistic workflow logic, resource heterogeneity, domain diversity, and the operational challenges inherent in orchestrating and analyzing real-world workflow systems.
2. Dataset Construction and Representation
Workflow Orchestration and Hierarchical Modeling
Several WorkflowBench datasets emphasize a function-based or hierarchical abstraction. For instance, in edge computing, “functions” act as workflow units whose configuration (in YAML or code), distribution (across IoT, edge, and cloud tiers), data storage backend selection, and invocation logic are explicitly defined (2010.14027). This modular decomposition supports both simple pipelines and complex branching, one-to-many, or scheduled (cron) workflows.
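As a rough illustration of this function-based decomposition, the sketch below declares a three-stage pipeline as an in-memory Python structure; the field names, tiers, and storage backends are assumptions chosen for illustration, not the YAML schema used in 2010.14027.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical, simplified stand-in for the YAML-based function
# definitions described in 2010.14027; field names are illustrative.
@dataclass
class WorkflowFunction:
    name: str
    tier: str                      # placement tier: "iot", "edge", or "cloud"
    storage_backend: str           # e.g. an object store or key-value store
    schedule: Optional[str] = None # cron expression for scheduled triggers
    downstream: List[str] = field(default_factory=list)  # one-to-many fan-out

# A three-stage video-analytics pipeline spread across tiers.
pipeline = [
    WorkflowFunction("ingest_frames", tier="iot", storage_backend="local",
                     downstream=["detect_objects"]),
    WorkflowFunction("detect_objects", tier="edge", storage_backend="minio",
                     downstream=["recognize_faces"]),
    WorkflowFunction("recognize_faces", tier="cloud", storage_backend="s3"),
]

# Invocation logic reduces to following the downstream edges.
def successors(fn_name: str) -> List[str]:
    by_name = {f.name: f for f in pipeline}
    return by_name[fn_name].downstream

print(successors("ingest_frames"))  # ['detect_objects']
```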
For LLM orchestration, the dataset construction pipeline involves:
- Data collection from platforms such as Apple Shortcuts or RoutineHub
- Transcription to code-like abstractions (e.g., Python ASTs) using context-sensitive variable renaming and hierarchical task annotation (high-level queries, intermediate plans, low-level comments); see the sketch after this list
- Query expansion for coverage of diverse APIs and workflows using prompting with in-context examples (2411.05451)
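A minimal sketch of the transcription and renaming step, assuming a toy shortcut and a counter-based renaming scheme; the annotation keys and helper calls are illustrative, not the pipeline's actual tooling (2411.05451).

```python
import ast

# Hypothetical code-like transcription of a simple shortcut; the
# hierarchical annotations (query / plan levels) are kept alongside
# the code because ast.parse discards comments.
annotations = {
    "query": "Text my latest photo to Alice",
    "plan": ["fetch latest photo", "send it as a message"],
}
transcribed = (
    "photo = get_latest_photo()\n"
    "send_message(recipient='Alice', attachment=photo)\n"
)

tree = ast.parse(transcribed)

# Context-sensitive variable renaming: map assignment targets to
# stable names, then rewrite every occurrence for consistency.
rename = {}
for node in ast.walk(tree):
    if isinstance(node, ast.Assign):
        for target in node.targets:
            if isinstance(target, ast.Name) and target.id not in rename:
                rename[target.id] = f"var_{len(rename) + 1}"
for node in ast.walk(tree):
    if isinstance(node, ast.Name) and node.id in rename:
        node.id = rename[node.id]

print(annotations["query"])
print(ast.unparse(tree))  # var_1 = get_latest_photo(); send_message(..., attachment=var_1)
```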
Workflow Task Benchmarking
Automated generation methodologies synthesize workflow task graphs with arbitrary resource and dependency characteristics:
- Tasks are parameterized over CPU, memory, and I/O requirements, with dependency graphs derived from real scientific workflows (e.g., Montage) (2210.03170); a minimal generator sketch follows this list.
- Each node encodes both performance features and input/output data, allowing for high-fidelity benchmarking that reflects production-like heterogeneity and structure.
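A minimal sketch of such a generator, using a plain dictionary representation and a Montage-like fan-in shape; the resource ranges and field names are assumptions, not the actual tooling of 2210.03170.

```python
import random

def make_synthetic_workflow(width: int = 4, seed: int = 0) -> dict:
    """Generate a small Montage-like fan-in task graph in which every
    node carries CPU, memory, and I/O parameters plus dependencies."""
    rng = random.Random(seed)
    tasks = {}
    # One layer of independent "project" tasks feeding a single "merge" task.
    for i in range(width):
        tasks[f"project_{i}"] = {
            "cpu_seconds": rng.uniform(10, 120),   # compute demand
            "memory_mb": rng.choice([512, 1024, 2048]),
            "io_read_mb": rng.uniform(50, 500),    # input data volume
            "io_write_mb": rng.uniform(10, 100),   # output data volume
            "depends_on": [],
        }
    tasks["merge"] = {
        "cpu_seconds": rng.uniform(60, 300),
        "memory_mb": 4096,
        "io_read_mb": sum(t["io_write_mb"] for t in tasks.values()),
        "io_write_mb": rng.uniform(100, 200),
        "depends_on": [f"project_{i}" for i in range(width)],
    }
    return tasks

workflow = make_synthetic_workflow()
print(len(workflow), "tasks;", len(workflow["merge"]["depends_on"]), "dependencies into merge")
```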
Documentation and Standardization
In the context of reproducibility, each experiment is meticulously documented with:
- “Read-for-reproducibility” files specifying environment, dependencies, and execution instructions
- Explicit metadata annotation (languages used, experiment size, computational requirements); an illustrative record follows this list
- Standards for software artifact badging and packaging (2504.08684)
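An illustrative per-experiment metadata record of the kind these standards imply; the field names and values below are assumptions, not the schema of 2504.08684.

```python
# Hypothetical per-experiment metadata record; field names are
# illustrative, not the actual schema from 2504.08684.
experiment_metadata = {
    "id": "exp-042",
    "languages": ["Python", "C++"],
    "experiment_size": "medium",          # e.g. small / medium / large
    "compute_requirements": {"cpus": 16, "gpus": 0, "memory_gb": 64},
    "environment": "container image plus pinned dependency list",
    "execution_instructions": "see the read-for-reproducibility file",
    "artifact_badge": "Artifacts Evaluated - Reusable",  # ACM-style badging
    "doi": "10.5281/zenodo.XXXXXXX",      # persistent identifier (placeholder)
}
```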
This systematic approach underpins the dataset’s suitability as a universal benchmark for workflow management and reproducibility tools.
3. Benchmarking and Evaluation Methodologies
Performance Metrics
Evaluation across WorkflowBench datasets occurs at multiple granularities:
- Resource Utilization: CPU, memory, and I/O metrics at function or task granularity (often collected via Prometheus or workflow logs) (2010.14027)
- Workflow-level Metrics: Overall latency, throughput, and bottleneck identification
- Makespan Estimation: Macro-task models that estimate workflow completion time as a function of I/O bandwidth, CPU bandwidth, and parallelism (2210.03170); an illustrative macro-model is sketched after this list
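As a rough illustration only (the additive, level-by-level form and the symbols are assumptions, not the exact model of 2210.03170): for a workflow whose levels $l = 1, \dots, L$ carry aggregate compute work $W_l$ and data volume $D_l$, executed with parallelism $p$ on cores of compute bandwidth $C$ and an I/O subsystem of bandwidth $B_{IO}$, the makespan could be approximated as

$$ T_{\text{makespan}} \approx \sum_{l=1}^{L} \left( \frac{W_l}{p\,C} + \frac{D_l}{B_{IO}} \right). $$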
Automated and Outcome-centric Evaluation
- Agent and LLM-based workflow benchmarks apply automated, outcome-centric assessments: success is defined by reaching the correct final workflow state, independent of the exact sequence of tool calls (2405.00823); a minimal state-comparison check is sketched after this list.
- Precision, recall, F1, ECDF, and task-completion-based metrics are used for both documentation and process validation (2406.13264, 2411.05451).
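A minimal sketch of such an outcome-centric check, assuming the final workflow state can be summarized as a flat dictionary of observable effects; the representation and key names are assumptions, not any benchmark's actual harness.

```python
def outcome_success(final_state: dict, gold_state: dict) -> bool:
    """Outcome-centric success: the final state must contain every
    gold key with the expected value; the tool-call sequence that
    produced it is deliberately ignored."""
    return all(final_state.get(k) == v for k, v in gold_state.items())

# Two different tool-call orderings that converge on the same state
# both count as successes; a divergent final state does not.
gold = {"ticket_created": True, "assignee": "alice", "priority": "high"}
run_a = {"ticket_created": True, "assignee": "alice", "priority": "high", "log_lines": 12}
run_b = {"ticket_created": True, "assignee": "alice", "priority": "low"}

print(outcome_success(run_a, gold))  # True
print(outcome_success(run_b, gold))  # False (wrong priority)
```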
Anomaly Detection and Generalization
- Datasets tailored for anomaly detection incorporate injected perturbations (CPU/memory/disk throttling via cgroups) and comprehensive ground-truth labeling (2306.09930).
- Evaluation encompasses classical ML, GNNs, and deep learning, scored with metrics such as ROC-AUC and average precision, enabling measurement and comparison of method robustness on real execution graphs; a minimal scoring sketch follows this list.
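A minimal sketch of the metric computation over per-node anomaly scores, using scikit-learn; the labels and scores are synthetic stand-ins, not data from 2306.09930.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Synthetic ground truth (1 = perturbed task, e.g. CPU-throttled via
# cgroups) and model-assigned anomaly scores for the nodes of one DAG.
labels = np.array([0, 0, 1, 0, 1, 0, 0, 1])
scores = np.array([0.05, 0.20, 0.90, 0.10, 0.60, 0.30, 0.15, 0.40])

print("ROC-AUC:", roc_auc_score(labels, scores))
print("Average precision:", average_precision_score(labels, scores))
```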
4. Application Domains and Representative Use Cases
Edge and Distributed Systems
WorkflowBench delivers comprehensive benchmarks for end-to-end edge workload evaluation:
- Video analytics workflows (multi-stage streaming, detection, and recognition)
- IoT hub workflows (mass sensor data ingestion, model training, real-time querying)
- Workflow design optimization (e.g., where to execute CPU-intensive or latency-sensitive stages across IoT/edge/cloud tiers) (2010.14027)
Scientific Computation
Automated workflow generation supports:
- Reproducible, large-scale task-graph instantiation on supercomputing platforms (e.g., ORNL Summit)
- Benchmarking workflow system performance under varying load, task parallelism, data sizes, and platform heterogeneity (2210.03170)
Business Process Automation and Enterprise Applications
Diverse instantiations include:
- Robust process documentation, multi-modal task validation, and SOP generation at enterprise scale (2406.13264)
- NL-to-BPMN conversion: datasets enable mapping natural language requests to process artifacts via Python-based intermediate representations and subsequent BPMN code generation (2505.11646); a toy intermediate representation is sketched after this list
- LLM orchestration tasks, fine-tuned across thousands of APIs and categories, facilitating both in-distribution and generalizable process automation (2411.05451)
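A toy illustration of a Python-based intermediate representation rendered to a skeletal BPMN fragment; the class names and the minimal XML emitted here are assumptions, not the representation defined in 2505.11646.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical intermediate representation for a linear process.
@dataclass
class TaskStep:
    id: str
    name: str

@dataclass
class Process:
    id: str
    steps: List[TaskStep]

    def to_bpmn(self) -> str:
        """Render a skeletal BPMN process element (tasks only; events,
        gateways, and sequence flows are omitted for brevity)."""
        tasks = "\n".join(
            f'    <bpmn:task id="{s.id}" name="{s.name}"/>' for s in self.steps
        )
        return f'<bpmn:process id="{self.id}">\n{tasks}\n</bpmn:process>'

# "When an invoice arrives, verify it and then schedule the payment."
process = Process("invoice_handling", [
    TaskStep("t1", "Verify invoice"),
    TaskStep("t2", "Schedule payment"),
])
print(process.to_bpmn())
```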
Reproducibility Benchmarking
Curated experiments span multiple languages and complexity levels to benchmark the reproducibility of scientific results and the tools supporting such efforts, providing baseline success/failure rates for workflow reproduction across domains (2504.08684).
5. Data Access, Licensing, and Standardization
WorkflowBench datasets are typically made available under permissive licenses. For example:
- The anomaly detection benchmark is distributed under the MIT License, with all logs and parsing artifacts in public repositories (2306.09930).
- Large-scale LLM orchestration workflow datasets and the fine-tuned model derivatives are available as open-source packages (2411.05451).
- Explicit dataset versioning using persistent DOIs is adopted in reproducibility benchmarking (2504.08684).
Metadata and artifact badging (per ACM guidelines) are integral to enabling systematic and fair evaluation.
6. Limitations and Research Opportunities
The empirical results across WorkflowBench variants indicate persistent research challenges:
- Existing LLMs—even state-of-the-art—struggle with complex task planning, precise multi-stage orchestration, and step-level validation, as reflected by modest F1 scores and low task completion rates (2406.13264, 2411.05451).
- Anomaly detection methods face scalability and recall challenges on large DAGs, with room for innovations in high-dimensional, non-stationary graph analysis (2306.09930).
- Reproducibility tools reveal gaps in handling diverse configurations, undocumented dependencies, and environmental heterogeneity. Standardized benchmarking surfaces both strengths and practical limitations of current approaches (2504.08684).
A plausible implication is that further research is needed on hierarchical task modeling, robust API generalization, fine-grained process documentation, and hybrid statistical and deep learning approaches to anomaly detection.
7. Significance and Impact
The proliferation and standardization of WorkflowBench-style datasets mark a significant advancement for workflow system evaluation, LLM-based orchestration, scientific reproducibility, and enterprise process automation. These benchmarks enable:
- Objective, multi-dimensional evaluation of methods and systems
- Reproducible comparative studies across diverse operational domains
- Empirical grounding for theoretical advances in workflow scheduling, process mining, and anomaly detection
The multi-domain nature of WorkflowBench ensures its continued relevance as computational systems become more complex, heterogeneous, and interdependent across edge, cloud, scientific, enterprise, and automation environments.