AFlow: Automated Agentic Workflow Generation
- AFlow is an automated framework that generates, optimizes, and refines agentic workflows for multi-agent LLM systems by representing workflows as formal code-graphs and activity-on-vertex (AOV) DAGs.
- It employs evolutionary search techniques and modular design to integrate dynamic feedback, error tolerance, and real-time adjustments for enhanced efficiency and reduced execution cost.
- Empirical evaluations demonstrate that AFlow achieves significant performance improvements—up to 57% gains on complex tasks—across domains like multi-hop reasoning and analog circuit design.
AFlow defines a class of frameworks and algorithms for the fully automated generation, optimization, and dynamic refinement of agentic workflows in multi-agent LLM-based systems. It replaces manual design with continuous, feedback-driven, and code-represented workflow search, integrating modularity, dynamic error-tolerance, and simultaneous optimization over correctness, efficiency, and execution cost. AFlow’s approach is grounded in formal representations such as code-graphs or activity-on-vertex (AOV) DAGs, and is instantiated in a variety of domains including multi-hop reasoning, code generation, hardware design automation, and analog circuit sizing (Zhang et al., 14 Oct 2024, Niu et al., 14 Jan 2025, Wei et al., 30 Mar 2025, Ahmadzadeh et al., 5 Nov 2025).
1. Formal Workflow Representation
AFlow expresses workflows as a directed graph of nodes and edges, where each node is parameterized as $N = (M, P, \tau, F)$: an LLM model $M$, a prompt $P$, a sampling temperature $\tau$, and an output format $F$. Edges define control and data dependencies among nodes, encompassing linear sequences, conditional branches, and possibly loops. Operators serve as reusable, higher-level abstractions for common code or decision motifs (e.g., Generate, Review, Ensemble, Test, Programmer) (Zhang et al., 14 Oct 2024, Niu et al., 14 Jan 2025). The overall workflow graph is thus a member of the search space

$$\mathcal{S} = \bigl\{ (N, E, O) \;\big|\; N \subseteq \{ (M, P, \tau, F) : M \in \mathcal{M},\, P \in \mathcal{P},\, \tau \in [0,1],\, F \in \mathcal{F} \},\ E \subseteq N \times N,\ O \subseteq \mathcal{O} \bigr\}.$$
The workflow can alternatively be represented as an activity-on-vertex DAG for multi-agent systems, with subtasks, precedence relations, and agent assignments (Niu et al., 14 Jan 2025).
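A minimal Python sketch of how such a code-graph / AOV representation could be encoded follows; the `Node` and `Workflow` classes, their fields, and the topological-order check are illustrative assumptions rather than AFlow's published API.

```python
from dataclasses import dataclass, field

# Hypothetical sketch of the code-graph / AOV representation described above.
# Class names and fields are illustrative, not AFlow's actual interface.

@dataclass
class Node:
    name: str            # subtask identifier
    model: str           # LLM model M
    prompt: str          # prompt P
    temperature: float   # sampling temperature tau
    output_format: str   # output format F, e.g. "json" or "code"

@dataclass
class Workflow:
    nodes: dict = field(default_factory=dict)      # name -> Node
    edges: set = field(default_factory=set)        # (src, dst) precedence pairs
    operators: list = field(default_factory=list)  # reusable motifs, e.g. "Generate", "Review"

    def add_edge(self, src: str, dst: str) -> None:
        self.edges.add((src, dst))

    def topological_order(self) -> list:
        """Return an AOV-style execution order; raise if the graph is cyclic."""
        indeg = {n: 0 for n in self.nodes}
        for _, dst in self.edges:
            indeg[dst] += 1
        ready = [n for n, d in indeg.items() if d == 0]
        order = []
        while ready:
            n = ready.pop()
            order.append(n)
            for src, dst in self.edges:
                if src == n:
                    indeg[dst] -= 1
                    if indeg[dst] == 0:
                        ready.append(dst)
        if len(order) != len(self.nodes):
            raise ValueError("workflow graph contains a cycle; AOV DAGs must be acyclic")
        return order
```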
2. Automated Generation and Search
AFlow’s core innovation is to cast workflow generation and optimization as a discrete search (or evolutionary) problem over this graph-structured space, driven by objective metrics computed from real agent execution:
- Initialization: Generates an initial code-graph or AOV candidate using LLM-driven proposal mechanisms (prompted with task description and optional workflow exemplars) (Zhang et al., 14 Oct 2024, Niu et al., 14 Jan 2025).
- Expansion: Proposes a single node-, operator-, or code-level edit using an optimizer LLM, augmented with a tree-structured experience store in which each modification, its performance, and its cost are recorded (Zhang et al., 14 Oct 2024).
- Selection: Applies a soft mixed-probability policy over the top-$k$ best-scoring workflows, with hyperparameters $\lambda$ (exploration/exploitation trade-off) and $\alpha$ (score temperature).
- Evaluation: Runs each candidate workflow on a validation set, computes explicit task metrics (e.g., pass@1, F1), and tracks execution cost.
- Backpropagation: Stores outcomes so that future expansions and selections are biased towards empirically successful workflow motifs or edit paths.
This procedure iterates until early stopping, typically triggered when the validation metric shows no further improvement over a fixed number of consecutive rounds (Zhang et al., 14 Oct 2024).
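The sketch below illustrates this loop under stated assumptions: `propose_edit` stands in for the optimizer-LLM call and `evaluate` for validation-set scoring, and the mixed-probability selection formula is one plausible reading of the description above rather than the paper's exact expression.

```python
import math
import random

def select(candidates, scores, lambda_=0.2, alpha=5.0):
    """Soft mixed-probability selection over the top-k scored workflows (assumed form)."""
    weights = [math.exp(alpha * s) for s in scores]
    z = sum(weights)
    probs = [lambda_ / len(candidates) + (1 - lambda_) * w / z for w in weights]
    return random.choices(candidates, weights=probs, k=1)[0]

def search(initial_workflow, propose_edit, evaluate, rounds=20, patience=5, k=3):
    history = [(initial_workflow, evaluate(initial_workflow))]  # (workflow, score) pairs
    best, stall = history[0][1], 0
    for _ in range(rounds):
        top = sorted(history, key=lambda x: x[1], reverse=True)[:k]
        parent = select([w for w, _ in top], [s for _, s in top])
        child = propose_edit(parent, history)   # single node/operator/edge edit
        score = evaluate(child)                 # e.g. pass@1 or F1 on the validation set
        history.append((child, score))          # experience biases future expansions
        stall = 0 if score > best else stall + 1
        best = max(best, score)
        if stall >= patience:                   # early stopping on stagnation
            break
    return max(history, key=lambda x: x[1])[0]
```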
3. Dynamic Refinement and Feedback
AFlow supports real-time adjustment to the workflow structure during execution to adapt to failures, data gaps, or agent bottlenecks:
- Performance Feedback: All subtask outputs (status, data, agent performance) are centrally logged. If workflow execution fails for any reason, an LLM is prompted with the full current state and proposes remedial edits (e.g., insert bridging subtasks, reallocate agent responsibilities) (Niu et al., 14 Jan 2025).
- Rescoring and Reconciliation: Modified workflow graphs are rescored with the modularity metrics (parallelism $\bar{P}$ and dependency complexity $C$; see Section 4) and reconciled with the ongoing execution state.
- Locality of Correction: Because tasks are modularized, only directly dependent subtasks require retuning—preserving overall workflow stability and minimizing recomputation.
This design yields error tolerance: local failures trigger only local edits or dynamic agent cloning and reassignment, and explicit parent–child tracking in the workflow graph sharply limits cascading errors (Niu et al., 14 Jan 2025).
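A sketch of this execute-then-repair pattern, reusing the hypothetical `Workflow` class above; `run_subtask` and `propose_repair` are assumed stand-ins for agent execution and the repair-proposing LLM.

```python
def execute_with_refinement(workflow, run_subtask, propose_repair, max_repairs=3):
    """Run subtasks in dependency order; on failure, ask an LLM for a local fix."""
    state = {}                                  # central log: subtask name -> result
    for name in workflow.topological_order():   # a full implementation would recompute
        result = run_subtask(name, state)       # the order after structural edits
        repairs = 0
        while not result.get("ok", False) and repairs < max_repairs:
            # The repair LLM sees the full current state and proposes a local edit,
            # e.g. inserting a bridging subtask or reassigning the agent.
            apply_edit = propose_repair(workflow, name, state)
            apply_edit(workflow)                # only dependents of `name` need retuning
            result = run_subtask(name, state)
            repairs += 1
        state[name] = result
        if not result.get("ok", False):
            raise RuntimeError(f"subtask {name!r} failed after {repairs} local repairs")
    return state
```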
4. Modularity, Robustness, and Theoretical Guarantees
AFlow introduces explicit metrics and selection criteria to maximize modularity and robustness:
- Parallelism: $\bar{P} = \frac{1}{T}\sum_{t=1}^{T} |S_t|$, where $S_t$ is the set of concurrently executable nodes at step $t$ and $T$ is the workflow depth.
- Dependency Complexity: $C = \frac{1}{|V|}\sum_{i \in V} (d_i - \bar{d})^2$, with $d_i$ as the degree of node $i$ and $\bar{d}$ the mean degree. Theory states that adding excessive dependencies strictly decreases the expected number of successful subtask completions under random failures, supporting the focus on minimizing $C$ (Niu et al., 14 Jan 2025). Both metrics are sketched in code below.
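Under the reconstructed definitions above, both metrics reduce to simple graph computations. The sketch below reuses the hypothetical `Workflow` class from Section 1 and should be read as an illustration; the published formulas may differ in detail.

```python
from statistics import mean

def level_sets(workflow):
    """Group nodes into levels S_1..S_T of concurrently executable subtasks."""
    order, levels = workflow.topological_order(), {}
    for n in order:
        preds = [src for src, dst in workflow.edges if dst == n]
        levels[n] = 1 + max((levels[p] for p in preds), default=0)
    depth = max(levels.values())
    return [[n for n, lvl in levels.items() if lvl == t] for t in range(1, depth + 1)]

def parallelism(workflow):
    """Average number of concurrently executable nodes per step (reconstructed P-bar)."""
    S = level_sets(workflow)
    return sum(len(s) for s in S) / len(S)

def dependency_complexity(workflow):
    """Variance of node degrees (reconstructed C); lower is simpler and more robust."""
    degree = {n: 0 for n in workflow.nodes}
    for src, dst in workflow.edges:
        degree[src] += 1
        degree[dst] += 1
    d_bar = mean(degree.values())
    return mean((d - d_bar) ** 2 for d in degree.values())
```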
5. Empirical Performance and Cross-Domain Applicability
AFlow demonstrates significant gains across multi-agent code generation, math reasoning, QA, scientific code modernization, and analog circuit sizing:
- On standard benchmarks (e.g., HumanEval, MBPP, MATH, GSM8K, HotpotQA, DROP), AFlow delivers a 5.7 percentage point average improvement over strong manual and automated baselines, with particularly large gains (+57% relative) on complex tasks (Zhang et al., 14 Oct 2024).
- Modular, automatically searched workflows transfer robustly across Executor LLMs; smaller models equipped with optimized workflows can outperform or match larger models at a fraction (4.5%) of the inference cost (Zhang et al., 14 Oct 2024, Wei et al., 30 Mar 2025).
- Dynamic workflow refinement, as tested by masking outputs at runtime, is essential: in ablation, success rates drop to zero unless online updates are enabled (Niu et al., 14 Jan 2025).
- Specialized agentic workflows for domain tasks (e.g., analog circuit sizing, Fortran-to-Kokkos, Verilog generation) match or exceed expert-derived plans using modular agent roles and domain-specific operators (Ahmadzadeh et al., 5 Nov 2025, Gupta et al., 15 Sep 2025, Wei et al., 30 Mar 2025).
6. Comparison, Limitations, and Extensions
AFlow contrasts with manual workflow construction, static prompt templates, and ad-hoc multi-agent chains:
- It provides explicit search and optimization, not solely LLM zero-shot generation (Zhang et al., 14 Oct 2024, Liu et al., 24 May 2025).
- Fully automated operator discovery (as in A²Flow) replaces manual operator selection, boosting generalization and reducing the search space branching factor (Zhao et al., 23 Nov 2025).
- Robustness extensions (RobustFlow) address consistency under paraphrasing/noise in instruction, integrating preference optimization to enforce invariance (Xu et al., 26 Sep 2025).
- Meta-learning variants (AdaptFlow) enable bi-level adaptation: a general workflow initialization rapidly adapts to new tasks via LLM-generated textual feedback (Zhu et al., 11 Aug 2025).
Limitations include nontrivial per-iteration cost, reliance on explicit validation metrics, and limited transfer when operator abstractions or code-graph motifs do not match a new domain. Future work explores multi-modal workflows, learned surrogates for evaluation, automatic operator discovery, and hierarchical or meta-abstracted workflow modules (Zhang et al., 14 Oct 2024, Zhao et al., 23 Nov 2025, Zhu et al., 11 Aug 2025).
7. Broader Impact and Applications
AFlow frameworks enable the practical realization of scalable, interpretable, and self-correcting agentic workflows across scientific computing, automated software engineering, data pipeline orchestration, and beyond. By unifying executable-code representations, execution-based feedback, and high-level modularity, AFlow architectures generalize across domains and LLM backbones, and can be specialized for dynamic, high-stakes workflows in enterprise automation, scientific discovery, financial modeling, and hardware/software co-design (Zhang et al., 14 Oct 2024, Niu et al., 14 Jan 2025, Wei et al., 30 Mar 2025, Gupta et al., 15 Sep 2025, Ang et al., 19 Aug 2025).