SEW: Self-Evolving Agentic Workflows for Automated Code Generation

Published 24 May 2025 in cs.SE, cs.AI, and cs.CL | (2505.18646v1)

Abstract: LLMs have demonstrated effectiveness in code generation tasks. To enable LLMs to address more complex coding challenges, existing research has focused on crafting multi-agent systems with agentic workflows, where complex coding tasks are decomposed into sub-tasks, assigned to specialized agents. Despite their effectiveness, current approaches heavily rely on hand-crafted agentic workflows, with both agent topologies and prompts manually designed, which limits their ability to automatically adapt to different types of coding problems. To address these limitations and enable automated workflow design, we propose Self-Evolving Workflow (SEW), a novel self-evolving framework that automatically generates and optimises multi-agent workflows. Extensive experiments on three coding benchmark datasets, including the challenging LiveCodeBench, demonstrate that our SEW can automatically design agentic workflows and optimise them through self-evolution, bringing up to 33% improvement on LiveCodeBench compared to using the backbone LLM only. Furthermore, by investigating different representation schemes of workflow, we provide insights into the optimal way to encode workflow information with text.


Authors (5)

Summary

The paper presents a self-evolving workflow (SEW) framework that automates and optimizes multi-agent code generation processes.
It introduces two LLM-driven evolutionary operators, direct evolution (DE) and hyper-evolution (HE), to refine agent prompts alongside the workflow topology, achieving up to a 33% improvement on LiveCodeBench over the backbone LLM alone.
Experimental evaluations on MBPP, HumanEval, and LiveCodeBench show gains from both workflow evolution and agent prompt evolution.


This paper introduces Self-Evolving Workflow (SEW), a framework designed to automate the generation and optimization of multi-agent workflows for code generation tasks. The core idea is to leverage an evolutionary scheme that improves both the workflow topology and the prompts of individual agents, thereby enhancing the overall performance of the code generation process. The paper explores different workflow representation schemes and demonstrates the effectiveness of SEW on three coding benchmark datasets.

Key Components and Implementation

The SEW framework comprises three primary modules: workflow generation, workflow evolution, and agent evolution (Figure 1).

Figure 1: The overall framework of SEW, illustrating the workflow generation, workflow evolution, and agent evolution modules.

The workflow generation module creates initial workflows based on a task description and a template workflow. The paper explores five different representation schemes for these workflows: BPMN, CoRE, Python code, YAML, and pseudo-code. The workflow evolution module then refines the structure of the workflows using a novel evolutionary method. Finally, the agent evolution module enhances the prompts of each agent within the evolved workflow, employing direct evolution (DE) and hyper-evolution (HE) operators, driven by LLMs.

Evolutionary Operators

The DE operator, denoted as $\mathcal{F}(\cdot)$, and the HE operator, denoted as $\mathcal{H}(\cdot)$, are central to SEW. Both use an LLM to generate more effective task prompts by concatenating an evolutionary prompt with the initial task prompt. Direct Evolution (DE) applies a mutation prompt directly to an agent's prompt, while Hyper Evolution (HE) first modifies the mutation prompt and then applies the modified mutation prompt to the agent.

Figure 2: Direct Evolution and Hyper Evolution in SEW, showcasing the mutation prompts and their impact on agent prompts.

For instance, the first-order DE is represented as:

$a' \gets \mathcal{F}(a | \mathcal{T}_\text{mut})$

where $a$ is an agent, $a'$ is the agent with the modified prompt, and $\mathcal{T}_\text{mut}$ is the mutation prompt. The zero-order HE is represented as:

$a' \gets \mathcal{H}(a | \mathcal{H}(\mathcal{T}_\text{des} | \mathcal{T}_\text{think}))$

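where $\mathcal{T}_\text{think}$ denotes text descriptions of general cognitive heuristics. The inner $\mathcal{H}(\cdot)$ call produces the modified mutation prompt, which the outer call then applies to the agent.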
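The two operators can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Agent` dataclass, the `LLM` callable type, the function names `direct_evolve` and `hyper_evolve`, and the `echo_llm` stub are all hypothetical, and joining prompts with a blank line (including the order in which they are joined) is only one possible reading of "concatenating".

```python
from dataclasses import dataclass, replace
from typing import Callable

# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


@dataclass(frozen=True)
class Agent:
    name: str
    prompt: str


def direct_evolve(agent: Agent, t_mut: str, llm: LLM) -> Agent:
    """First-order DE: a' <- F(a | T_mut)."""
    # Assumption: the mutation prompt is prepended to the agent's prompt.
    new_prompt = llm(f"{t_mut}\n\n{agent.prompt}")
    return replace(agent, prompt=new_prompt)


def hyper_evolve(agent: Agent, t_des: str, t_think: str, llm: LLM) -> Agent:
    """Zero-order HE: a' <- H(a | H(T_des | T_think))."""
    # Inner call: produce a new mutation prompt from T_des and T_think.
    t_mut = llm(f"{t_think}\n\n{t_des}")
    # Outer call: apply the generated mutation prompt to the agent's prompt.
    new_prompt = llm(f"{t_mut}\n\n{agent.prompt}")
    return replace(agent, prompt=new_prompt)


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model so the sketch runs offline."""
    return f"<evolved>{prompt}</evolved>"
```

With the stub, `direct_evolve` makes one model call and `hyper_evolve` makes two, which mirrors the nesting in the HE formula. Replacing `echo_llm` with a real model client would be the only change needed to experiment with actual prompts.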

Workflow Representation Schemes

The paper investigates five textual representation schemes for workflows: BPMN, CoRE, Python, YAML, and pseudo-code. BPMN is a graphical modeling language that specifies the execution order of activities; here it is encoded as text like the other schemes. CoRE integrates natural language programming, pseudo-code, and flow-based programming. Python and YAML are commonly used for defining and managing agentic workflows due to their flexibility and readability. Pseudo-code provides a high-level representation that is easy for both humans and machines to understand.
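To make the idea of "the same workflow, different textual encodings" concrete, the sketch below holds one small workflow as Python data and renders it in two simplified text forms. Everything here is an assumption made for illustration: the `Workflow` dataclass, the agent names `parser` and `coder`, and the output formats are hypothetical and are not the encodings used in the paper. The YAML-like renderer does no escaping and is not a full YAML serializer.

```python
from dataclasses import dataclass, field


@dataclass
class Workflow:
    agents: dict[str, str]  # agent name -> prompt
    edges: list[tuple[str, str]] = field(default_factory=list)  # (source, target)


def to_yaml_like(wf: Workflow) -> str:
    """Render the workflow as simple YAML-style text (no escaping)."""
    lines = ["agents:"]
    for name, prompt in wf.agents.items():
        lines.append(f'  {name}: "{prompt}"')
    lines.append("edges:")
    for src, dst in wf.edges:
        lines.append(f"  - [{src}, {dst}]")
    return "\n".join(lines)


def to_pseudo_code(wf: Workflow) -> str:
    """Render the workflow edges as numbered pseudo-code steps."""
    steps = [
        f"step {i}: pass output of {src} to {dst}"
        for i, (src, dst) in enumerate(wf.edges, start=1)
    ]
    return "\n".join(steps)


wf = Workflow(
    agents={"parser": "Parse the task.", "coder": "Write the code."},
    edges=[("parser", "coder")],
)
print(to_yaml_like(wf))
print(to_pseudo_code(wf))
```

An LLM asked to generate or mutate a workflow sees only the rendered text, so the choice of encoding determines how easy it is to produce something that is still well-formed after editing.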

Figure 3: A workflow depicted in BPMN and CoRE, highlighting the structural differences between the two schemes.

Experimental Results and Analysis

The experimental evaluation of SEW was conducted on three benchmark datasets: MBPP, HumanEval, and LiveCodeBench. The results indicate that SEW consistently improves workflow performance through self-evolution. Specifically, SEW achieves up to a 33% improvement on LiveCodeBench compared to using the backbone LLM only.

Impact of Workflow and Agent Evolution

The paper analyzes the impact of the workflow evolution and agent evolution modules on the performance of code generation. By comparing workflows generated with and without agent evolution, the results demonstrate that the agent evolution module significantly enhances the effectiveness of the workflows. For example, on the LiveCodeBench dataset, the "task parsing workflow" achieves a pass@1 score of 42.3% with only workflow evolution, but this increases to 50.9% when both workflow and agent evolution are employed.
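As a quick check on the size of this gain (simple arithmetic on the two pass@1 figures quoted above, not a number reported separately in the summary):

$50.9\% - 42.3\% = 8.6 \text{ percentage points}, \qquad \frac{50.9 - 42.3}{42.3} \approx 20.3\%$

So adding agent evolution to this workflow corresponds to an absolute gain of 8.6 points, or roughly a 20% relative improvement over workflow evolution alone.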

Comparison of Agentic Evolution Strategies

The study compares the effectiveness of the DE and HE strategies. The findings suggest that HE is more robust than DE because it is less sensitive to variations in mutation prompts across tasks. While DE can achieve higher peak performance on certain metrics, HE provides a more balanced and reliable performance profile, making it the better choice when consistency matters.

Figure 4: Performance comparison of Code Rewriting and Task Parsing Workflows under different agent evolution strategies on the LiveCodeBench (LCB) dataset, highlighting the variance in performance between HE and DE.

Logical and Generation Successful Rates

The paper introduces two metrics, the Logical Successful Rate (LSR) and the Generation Successful Rate (GSR), to measure the validity of the generated workflows. The LSR is the probability that a generated workflow is logically valid, while the GSR is the probability that the workflow's output is executable Python code. The experimental results show that BPMN and Python achieve the highest LSR, while CoRE achieves the best GSR, which the paper takes as evidence that CoRE is the most effective scheme for workflow representation and optimization.
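Both metrics are success fractions over a set of generated items, which can be sketched as below. This is an illustrative approximation, not the paper's evaluation code: the helper names are hypothetical, and checking that text parses with `ast.parse` is a weaker stand-in for "executable", since code can parse and still fail at run time. An LSR computation would reuse `success_rate` with a workflow-level validity check, which the summary does not specify.

```python
import ast
from typing import Callable, Sequence


def is_parsable_python(source: str) -> bool:
    """Proxy for 'executable Python': the text parses without a syntax error."""
    try:
        ast.parse(source)
        return True
    except (SyntaxError, ValueError):
        return False


def success_rate(items: Sequence[str], is_valid: Callable[[str], bool]) -> float:
    """Fraction of items accepted by the validity check (0.0 for empty input)."""
    if not items:
        return 0.0
    return sum(1 for item in items if is_valid(item)) / len(items)


# Two made-up workflow outputs: one valid function, one with a syntax error.
outputs = [
    "def add_one(x):\n    return x + 1\n",
    "def add_one(x) return x + 1",
]
gsr_estimate = success_rate(outputs, is_parsable_python)
print(gsr_estimate)
```

A stricter check could run each output in a sandbox, at the cost of needing isolation and timeouts.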

Conclusion and Future Directions

The paper concludes that SEW offers a promising approach for agentic workflow optimization, reducing the reliance on manual workflow design and prompt engineering while improving adaptability and efficiency. The framework's ability to self-evolve and optimize workflows holds significant potential for automating complex code generation tasks.

Limitations and Future Work

The paper acknowledges several limitations: it is unknown how well SEW generalizes to other AI-driven tasks, workflow execution is subject to constraints, and results depend on the capabilities of the underlying LLM. Future work will focus on addressing these limitations to extend SEW's applicability to a broader range of tasks.
