- The paper introduces EvoFlow, an evolutionary computation framework that automatically discovers and optimizes diverse, cost-effective agentic workflows.
- EvoFlow frames workflow search as a multi-objective problem, balancing performance and cost using tag-based retrieval, crossover, mutation, and niching selection.
- Evaluations show EvoFlow outperforms SOTA baselines, achieving better performance than o1-preview at only 12.4% of its cost, while maintaining high performance and diversity across various tasks.
The paper introduces EvoFlow, a niching EA-based framework designed to automate the search for heterogeneous and complexity-adaptive agentic workflows. Addressing the limitations of existing agentic automation pipelines, which often lack LLM heterogeneity and focus on single-objective performance optimization, EvoFlow aims to optimize a diverse set of workflows that can provide customized and cost-effective solutions.
The core technical contributions of EvoFlow include:
- Tag-based retrieval to extract parent workflows from an agentic population.
- Crossover and mutation operations to evolve new workflows.
- Niching-based selection to maintain population diversity and quality.
EvoFlow frames the agentic search as a multi-objective optimization problem, balancing cost and performance to generate a Pareto-optimal set of workflows. The search space uses operator nodes, which are LLM-agent invoking nodes, as fundamental units. The workflow population is initialized by selecting and combining multiple operator nodes and continuously evolves by processing incoming queries.
The search space of EvoFlow is defined hierarchically, with the basic unit being the invoking node $I_i$, defined as:

$$I_i = (M_i, P_i, T_i), \quad P_i \in \mathcal{P}, \quad T_i \in [0, 1],$$

where

- $P_i$ represents the associated prompt, with $\mathcal{P}$ denoting the feasible prompt space,
- $T_i$ is the temperature parameter,
- $M_i = (|M_i|, C_i, L_i)$ represents an LLM instance from the feasible model pool $\mathcal{M} = \{M_1, \dots, M_M\}$, characterized by its model size $|M_i|$, token cost $C_i$, and inference delay $L_i$.
The operator node $O_j$ is represented by:

$$O_j = (\mathcal{I}_j, \xi_j), \quad \mathcal{I}_j = \{I_1, \dots, I_n\}, \quad \xi_j \subseteq \mathcal{I}_j \times \mathcal{I}_j,$$

where

- $\mathcal{I}_j$ is a subset of invoking nodes,
- $\xi_j$ signifies the connectivity relationships among the invoking nodes.
The overall agentic workflow $\mathcal{G}$ is defined as:

$$\mathcal{G} = (\mathcal{O}_S, \delta_a), \quad \mathcal{O}_S = \{O_1, \dots, O_m\}, \quad \delta_a \subseteq \mathcal{O}_S \times \mathcal{O}_S,$$

where

- $\mathcal{O}_S \subseteq \mathcal{O}$, $\mathcal{I}_S \subseteq \mathcal{I}$, and $m$ denotes the number of operator nodes in $\mathcal{G}$,
- $\delta_a$ denotes the inter-operator connections.
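The three-level hierarchy above (invoking node, operator node, workflow) can be sketched as plain Python data structures. All class and field names here are illustrative assumptions, not taken from the EvoFlow codebase:

```python
from dataclasses import dataclass

# Hypothetical mirror of the paper's hierarchy: invoking node I_i ->
# operator node O_j -> workflow G. Names are illustrative only.

@dataclass(frozen=True)
class LLMInstance:
    name: str          # e.g. "llama-3.1-70b"
    size_b: float      # model size |M_i| in billions of parameters
    token_cost: float  # token cost C_i (per-call proxy)
    latency: float     # inference delay L_i in seconds

@dataclass(frozen=True)
class InvokingNode:
    model: LLMInstance
    prompt: str        # P_i, drawn from the prompt space P
    temperature: float # T_i in [0, 1]

    def __post_init__(self):
        assert 0.0 <= self.temperature <= 1.0

@dataclass
class OperatorNode:
    invoking_nodes: list  # I_j, a subset of invoking nodes
    edges: set            # xi_j: index pairs into invoking_nodes

@dataclass
class Workflow:
    operators: list       # O_S = {O_1, ..., O_m}
    edges: set            # delta_a: index pairs into operators

    def cost(self) -> float:
        """Sum of per-node token costs, a crude proxy for c(G, T)."""
        return sum(inv.model.token_cost
                   for op in self.operators
                   for inv in op.invoking_nodes)
```

A workflow's cost then falls out of simple aggregation over its invoking nodes, which is what makes the cost objective cheap to evaluate during search.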
EvoFlow's optimization objective is multi-objective:

$$\mathcal{G}^* = \arg\max_{\mathcal{G} \in \mathcal{H}(\mathcal{O}, \delta_a)} \left[ u(\mathcal{G}, T),\, -c(\mathcal{G}, T) \right]^\top,$$

where

- $u(\cdot)$ evaluates task performance (utility) and $c(\cdot)$ evaluates the system cost,
- $\mathcal{G}^*$ represents the Pareto-optimal set balancing cost and performance,
- $\mathcal{H}(\mathcal{O}, \delta_a)$ denotes the operator-node-based search space for $\mathcal{G}$.
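To make the Pareto-optimal set concrete, here is a minimal non-dominated filter over (utility, cost) pairs. This is an illustrative sketch of the bi-objective selection criterion, not EvoFlow's actual implementation:

```python
def pareto_front(points):
    """Return the non-dominated subset under (maximize utility, minimize cost).

    `points` is a list of (utility, cost) tuples. A point dominates another
    if it is at least as good on both objectives and strictly better on one.
    """
    front = []
    for i, (u_i, c_i) in enumerate(points):
        dominated = any(
            (u_j >= u_i and c_j <= c_i) and (u_j > u_i or c_j < c_i)
            for j, (u_j, c_j) in enumerate(points) if j != i
        )
        if not dominated:
            front.append((u_i, c_i))
    return front
```

For example, `pareto_front([(0.9, 2.0), (0.8, 0.5), (0.7, 0.6), (0.9, 1.0)])` keeps only `(0.8, 0.5)` and `(0.9, 1.0)`: the cheap-but-weaker and expensive-but-stronger extremes survive, while dominated trade-offs are discarded.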
The methodology involves population initialization, tag-based retrieval, crossover, mutation, and niching-based selection. The initial population comprises a diverse set of workflows tagged with domain expertise. Tag-based retrieval selects relevant workflows as parents, and crossover and mutation generate offspring workflows. Niching selection maintains diversity and quality in the population.
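The tag-based retrieval step can be sketched as ranking the population by tag overlap with the incoming query. This is a simplifying stand-in: the function name and the raw set-overlap score are assumptions, and the actual system may well use embedding similarity rather than exact tag matching:

```python
def retrieve_parents(population, query_tags, k=2):
    """Rank workflows by tag overlap with the query and return the top-k ids.

    `population` maps a workflow id to its set of domain-expertise tags.
    Illustrative sketch only; not EvoFlow's retrieval implementation.
    """
    scored = sorted(
        population.items(),
        key=lambda item: len(item[1] & query_tags),  # overlap size as score
        reverse=True,
    )
    return [wf_id for wf_id, _ in scored[:k]]
```

Retrieval narrows the parent pool to workflows whose expertise matches the query's domain, so crossover and mutation start from relevant building blocks rather than random individuals.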
EvoFlow was evaluated across seven benchmarks, demonstrating its diversity, high performance, and economy. In heterogeneous settings, EvoFlow surpassed o1-preview at 12.4% of its inference cost using weaker open-source models such as LLaMa-3.1-70b and Qwen-2.5-72b. In homogeneous settings, EvoFlow outperformed state-of-the-art (SOTA) agentic workflows by 1.23%~29.86% in performance. Additionally, EvoFlow's training cost was one-third that of the SOTA baseline AFlow ($0.45 vs. $1.23) and its inference cost one-fifth ($0.51 vs. $2.62), while surpassing AFlow by 5.91% on the MATH benchmark.
The framework's evolutionary process operates on a query-by-query basis, continuously evolving, mutating, and niching-selecting workflows in response to incoming queries. This iterative process gradually produces a Pareto set of agentic workflows with varying complexity and performance.
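The query-by-query loop described above can be sketched end to end: retrieve parents by tag overlap, produce an offspring by variation, then apply niching selection so each tag-niche keeps only its fittest member. Everything here (the individual representation, `fitness`, `mutate`, and the per-niche rule) is a hypothetical simplification, not the paper's code:

```python
import random

def evolve_step(population, query_tags, fitness, mutate, k=2, capacity=10):
    """One query-driven evolution step under simplifying assumptions.

    Each individual is a dict with a "tags" set plus arbitrary payload;
    `fitness` and `mutate` are caller-supplied callables.
    """
    # 1. Tag-based retrieval: the k parents with greatest tag overlap.
    parents = sorted(population,
                     key=lambda ind: len(ind["tags"] & query_tags),
                     reverse=True)[:k]
    # 2. Variation: derive one offspring from a retrieved parent.
    offspring = mutate(random.choice(parents))
    extended = population + [offspring]
    # 3. Niching selection: keep the fittest individual per tag-niche,
    #    then truncate to `capacity` to bound the population size.
    best_per_niche = {}
    for ind in extended:
        niche = frozenset(ind["tags"])
        incumbent = best_per_niche.get(niche)
        if incumbent is None or fitness(ind) > fitness(incumbent):
            best_per_niche[niche] = ind
    survivors = sorted(best_per_niche.values(), key=fitness, reverse=True)
    return survivors[:capacity]
```

Because selection operates per niche rather than globally, weaker-but-different workflows survive alongside the strongest one, which is what keeps the population diverse enough to cover queries of varying complexity.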
Experiments were conducted on six public benchmarks covering math reasoning (GSM8K, MATH, and MultiArith), code generation (HumanEval and MBPP), and embodied tasks (ALFWorld). Baselines included manually designed workflows (CoT, ComplexCoT, SC, LLM-Debate, LLM-Blender, DyLAN, AgentVerse, and MacNet) and autonomous workflows (GPTSwarm, AutoAgents, ADAS, AgentSquare, and AFlow). The LLM backbones used were gpt-4o-mini-0718, llama-3.1-70b, Qwen-2-72b, Deepseek-V2.5, and Hermes-3-70b.
EvoFlow's performance was analyzed in homogeneous, heterogeneous, and cross-domain settings. Homogeneous analysis showed that EvoFlow outperformed existing hand-crafted or automated agentic workflows across six benchmarks. Heterogeneous analysis demonstrated that EvoFlow, through the collective assembly and evolution of open-source models, surpassed o1-preview by 2.7% on the MATH benchmark at merely 12.4% of o1-preview's overall cost. Cross-domain analysis indicated that EvoFlow successfully benefited from cross-domain training on MBPP, improving from 87.62% to 88.35%.
A cost analysis demonstrated EvoFlow's resource-friendly nature in terms of training/inference API (Application Programming Interface) costs and token consumption. The optimized heterogeneous population of EvoFlow forms a Pareto front, ranging from simple and inexpensive workflows to more complex workflows incorporating multi-agent debate.
Ablation studies on variants of EvoFlow (w/o tag, w/o LLM mutation, w/o prompt mutation, and w/o operator mutation) revealed that removing tag-based retrieval or LLM mutation consistently degrades performance. Sensitivity analysis on key parameters (number of selected parents K, number of tags per individual κ, and population size N) showed that both overly small and overly large values of K degrade performance, while increasing N consistently improves it.