
EvoFlow: Evolving Diverse Agentic Workflows On The Fly (2502.07373v1)

Published 11 Feb 2025 in cs.LG, cs.CL, cs.MA, and cs.NE

Abstract: The past two years have witnessed the evolution of LLM-based multi-agent systems from labor-intensive manual design to partial automation (e.g., prompt engineering, communication topology) and eventually to fully automated design. However, existing agentic automation pipelines often lack LLM heterogeneity and focus on single-objective performance optimization, limiting their potential to combine weaker models for more customized and cost-effective solutions. To address this challenge, we propose EvoFlow, a niching evolutionary algorithm-based framework to automatically search a population of heterogeneous and complexity-adaptive agentic workflows, rather than a single homogeneous, complex workflow. Technically, EvoFlow performs (1) tag-based retrieval to extract parent workflows from an agentic population, evolves new workflows through (2) crossover and (3) mutation, and employs (4) niching-based selection to maintain population diversity and quality. Extensive evaluations across seven benchmarks demonstrate that EvoFlow is: (I) diverse, evolving a population of workflows ranging from simple I/O tasks to complex multi-turn interactions; (II) high-performing, outperforming previous handcrafted and automated workflows by 1.23%–29.86%; (III) economical, surpassing the powerful o1-preview at 12.4% of its inference cost using weaker open-source models.

Summary

  • The paper introduces EvoFlow, an evolutionary computation framework that automatically discovers and optimizes diverse, cost-effective agentic workflows.
  • EvoFlow frames workflow search as a multi-objective problem, balancing performance and cost using tag-based retrieval, crossover, mutation, and niching selection.
  • Evaluations show EvoFlow outperforms SOTA baselines, surpassing o1-preview at just 12.4% of its inference cost, while maintaining high performance and diversity across a range of tasks.

The paper introduces EvoFlow, a niching EA-based framework designed to automate the search for heterogeneous and complexity-adaptive agentic workflows. Addressing the limitations of existing agentic automation pipelines, which often lack LLM heterogeneity and focus on single-objective performance optimization, EvoFlow aims to optimize a diverse set of workflows that can provide customized and cost-effective solutions.

The core technical contributions of EvoFlow include:

  • Tag-based retrieval to extract parent workflows from an agentic population.
  • Crossover and mutation operations to evolve new workflows.
  • Niching-based selection to maintain population diversity and quality.

EvoFlow frames the agentic search as a multi-objective optimization problem, balancing cost and performance to generate a Pareto-optimal set of workflows. The search space uses operator nodes, which are LLM-agent invoking nodes, as fundamental units. The workflow population is initialized by selecting and combining multiple operator nodes and continuously evolves by processing incoming queries.

The search space of EvoFlow is defined hierarchically, with the basic unit being the invoking node $I_i$, defined as:

$I_i = (M_i, P_i, T_i), \quad P_i \in P, \quad T_i \in [0, 1],$

where

  • $P_i$ represents the associated prompt, with $P$ denoting the feasible prompt space,
  • $T_i$ is the temperature parameter,
  • $M_i = (|M_i|, C_i, L_i)$ represents an LLM instance from the feasible model pool $M = \{M_1, \ldots, M_M\}$, characterized by its model size $|M_i|$, token cost $C_i$, and inference delay $L_i$.
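The invoking-node definition above can be sketched as a small data structure. This is an illustrative reading of the formalism, not the paper's actual implementation; the field names and the example model string are assumptions.

```python
from dataclasses import dataclass

@dataclass
class InvokingNode:
    """Sketch of an invoking node I_i = (M_i, P_i, T_i)."""
    model: str          # M_i: an LLM drawn from the feasible model pool M
    prompt: str         # P_i: the associated prompt, drawn from prompt space P
    temperature: float  # T_i: sampling temperature, constrained to [0, 1]

    def __post_init__(self):
        # Enforce the constraint T_i in [0, 1] from the definition.
        if not 0.0 <= self.temperature <= 1.0:
            raise ValueError("T_i must lie in [0, 1]")

# Example: a hypothetical math-solving node backed by an open-source model.
node = InvokingNode(model="llama-3.1-70b",
                    prompt="Solve the problem step by step.",
                    temperature=0.3)
```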

The operator node $O_j$ is represented by:

$O_j = (I_j, \xi_j), \quad I_j = \{I_1, \ldots, I_n\}, \quad \xi_j \subseteq I_j \times I_j,$

where

  • $I_j$ is a subset of invoking nodes,
  • $\xi_j$ signifies the connectivity relationship among the invoking nodes.

The overall agentic workflow $G$ is defined as:

$G = (O_S, \delta^a), \quad O_S = \{O_1, \ldots, O_m\}, \quad \delta^a \subseteq O_S \times O_S,$

where

  • $O_S \subseteq O$, $I_S \subseteq I$, and $m$ denotes the number of operator nodes in $G$,
  • $\delta^a$ denotes the inter-operator connections.
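The two-level structure (operator nodes as graphs of invoking nodes, workflows as graphs of operator nodes) can be sketched as nested directed graphs. This is a simplified reading of the definitions, with illustrative field names; edges are stored as index pairs.

```python
from dataclasses import dataclass, field

@dataclass
class OperatorNode:
    """Sketch of O_j = (I_j, xi_j): invoking nodes plus their connectivity."""
    invoking_nodes: list                      # I_j, a subset of invoking nodes
    edges: set = field(default_factory=set)   # xi_j ⊆ I_j × I_j, as index pairs

@dataclass
class Workflow:
    """Sketch of G = (O_S, delta^a): operator nodes plus inter-operator edges."""
    operators: list                               # O_S = {O_1, ..., O_m}
    inter_edges: set = field(default_factory=set) # delta^a ⊆ O_S × O_S

# Example: a two-operator workflow where a solver feeds a verifier.
wf = Workflow(
    operators=[OperatorNode(["cot_solver"]), OperatorNode(["verifier"])],
    inter_edges={(0, 1)},  # operator 0 -> operator 1
)
```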

EvoFlow's optimization objective is multi-objective:

$G^* = \arg\max_{G \in H(O, \delta^a)} \; [u(G, T), \, -c(G, T)]^T,$

where

  • $u(\cdot)$ evaluates task performance (utility) on the task set $T$,
  • $c(\cdot)$ evaluates the system cost,
  • $G^*$ represents the Pareto-optimal set balancing cost and performance,
  • $H(O, \delta^a)$ denotes the operator node-based search space for $G$.
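The bi-objective arg max above amounts to keeping the workflows that are non-dominated in (utility, cost). A minimal sketch of extracting that Pareto set, with illustrative workflow names and scores:

```python
def pareto_front(candidates):
    """candidates: list of (name, utility, cost) triples.
    Returns the entries not dominated by any other candidate, i.e. those
    maximizing [u(G, T), -c(G, T)] in the Pareto sense."""
    front = []
    for name, u, c in candidates:
        dominated = any(
            # another candidate is at least as good on both objectives
            (u2 >= u and c2 <= c)
            # ...and strictly better on at least one
            and (u2 > u or c2 < c)
            for _, u2, c2 in candidates
        )
        if not dominated:
            front.append((name, u, c))
    return front

# Hypothetical population: cheap I/O workflow, costly debate workflow,
# and one workflow dominated on both axes.
cands = [("simple_io", 0.70, 0.1), ("debate", 0.85, 0.9), ("weak", 0.60, 0.5)]
front = pareto_front(cands)  # "weak" is dominated by "simple_io"
```

Keeping the whole front, rather than a single best workflow, is what lets EvoFlow serve both cheap simple queries and expensive complex ones.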

The methodology involves population initialization, tag-based retrieval, crossover, mutation, and niching-based selection. The initial population comprises a diverse set of workflows tagged with domain expertise. Tag-based retrieval selects relevant workflows as parents, and crossover and mutation generate offspring workflows. Niching selection maintains diversity and quality in the population.
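The four-stage loop can be sketched per incoming query as follows. Every helper here is a simplified stand-in for the paper's operators (the tag-overlap retrieval, the single-point crossover, and the signature-based niching rule are all assumptions for illustration):

```python
import random

def retrieve_parents(population, query_tags, k=2):
    # Tag-based retrieval: rank workflows by tag overlap with the query.
    return sorted(population, key=lambda wf: -len(wf["tags"] & query_tags))[:k]

def crossover(p1, p2):
    # Combine operator sequences (and tags) of the two parents.
    return {"tags": p1["tags"] | p2["tags"],
            "ops": p1["ops"][:1] + p2["ops"][1:]}

def mutate(wf):
    # Operator mutation: occasionally swap in a different operator node.
    child = dict(wf, ops=list(wf["ops"]))
    if child["ops"] and random.random() < 0.5:
        child["ops"][random.randrange(len(child["ops"]))] = "debate"
    return child

def niching_select(population, size):
    # Niching: keep at most one individual per tag signature for diversity.
    seen, survivors = set(), []
    for wf in population:
        sig = frozenset(wf["tags"])
        if sig not in seen:
            seen.add(sig)
            survivors.append(wf)
    return survivors[:size]

population = [
    {"tags": {"math"}, "ops": ["cot"]},
    {"tags": {"code"}, "ops": ["io", "test"]},
]
parents = retrieve_parents(population, {"math"})
child = mutate(crossover(parents[0], parents[1]))
population = niching_select(population + [child], size=3)
```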

EvoFlow was evaluated across seven benchmarks, demonstrating its diversity, high performance, and economy. In heterogeneous settings, EvoFlow surpassed the performance of o1-preview at 12.4% of its inference cost using weaker open-source models such as LLaMa-3.1-70b and Qwen-2.5-72b. In homogeneous settings, EvoFlow outperformed SOTA (state-of-the-art) agentic workflows by 1.23%–29.86%. Additionally, EvoFlow incurred roughly one-third of the training cost of the SOTA baseline AFlow ($0.45 vs. $1.23) and one-fifth of its inference cost ($0.51 vs. $2.62), while surpassing AFlow by 5.91% on the MATH benchmark.

The framework's evolutionary process operates on a query-by-query basis, continuously evolving, mutating, and niching-selecting workflows in response to incoming queries. This iterative process gradually produces a Pareto set of agentic workflows with varying complexity and performance.

Experiments were conducted on six public benchmarks covering math reasoning (GSM8K, MATH, and MultiArith), code generation (HumanEval and MBPP), and embodied tasks (ALFWorld). Baselines included manually designed workflows (CoT, ComplexCoT, SC, LLM-Debate, LLM-Blender, DyLAN, AgentVerse, and MacNet) and autonomous workflows (GPTSwarm, AutoAgents, ADAS, AgentSquare, and AFlow). The LLM backbones used were gpt-4o-mini-0718, llama-3.1-70b, Qwen-2-72b, Deepseek-V2.5, and Hermes-3-70b.

EvoFlow's performance was analyzed in homogeneous, heterogeneous, and cross-domain settings. Homogeneous performance analysis showed that EvoFlow outperformed existing hand-crafted or automated agentic workflows across six benchmarks. Heterogeneous performance demonstrated that EvoFlow, through the collective assembly and evolution of open-source models, surpassed o1-preview by 2.7% on the MATH benchmark, with an overall cost merely 12.4% of that of o1-preview. Cross-domain performance analysis indicated that EvoFlow successfully benefited from cross-domain training on MBPP, improving from 87.62% to 88.35%.

A cost analysis demonstrated EvoFlow's resource-friendly nature in terms of training/inference API (Application Programming Interface) costs and token consumption. The optimized heterogeneous population of EvoFlow forms a Pareto front, ranging from simple and inexpensive workflows to more complex workflows incorporating multi-agent debate.

Ablation studies on variants of EvoFlow (w/o tag, w/o LLM mutation, w/o prompt mutation, and w/o operator mutation) revealed that removing tag-based retrieval or LLM mutation consistently leads to performance degradation. Sensitivity analysis on key parameters (the number of selected parents $K$, the number of tags per individual $\kappa$, and the population size $N$) showed that both too small and too large $K$ degrade performance, while increasing $N$ consistently improves it.
