Semantic Evolution over Populations for LLM-Guided Automated Program Repair

Published 2 Apr 2026 in cs.SE | (2604.02134v1)

Abstract: LLMs have recently shown strong potential for automated program repair (APR), particularly through iterative refinement that generates and improves candidate patches. However, state-of-the-art iterative refinement LLM-based APR approaches cannot fully address challenges, including maintaining useful diversity among repair hypotheses, identifying semantically related repair families, composing complementary partial fixes, exploiting structured failure information, and escaping structurally flawed search regions. In this paper, we propose a Population-Based Semantic Evolution framework for APR iterative refinement, called EvolRepair, that formulates LLM-based APR as a semantic evolutionary algorithm. EvolRepair reformulates the search paradigm of classic genetic algorithm for APR, but replaces its syntax-based operators with semantics-aware components powered by LLMs and structured execution feedback. Candidate repairs are organized into behaviorally coherent groups, enabling the algorithm to preserve diversity, reason over repair families, and synthesize stronger candidates by recombining complementary repair insights across the population. By leveraging structured failure patterns to guide search direction, EvolRepair can both refine promising repair strategies and shift toward alternative abstractions when necessary. Our experiments show that EvolRepair substantially improves repair effectiveness over existing LLM-based APR approaches.


Authors (4)

Summary

The paper introduces a novel population-based semantic evolutionary approach (EvolRepair) that leverages LLMs for automated program repair by composing distributed partial fixes.
It employs behavioral grouping and semantic mutation, enabling adaptive search shifts that overcome the premature convergence seen in iterative LLM-based repair methods.
Empirical results demonstrate consistent gains over baselines, with pass@1 reaching up to 96.63% and high combination rates when synthesizing complete repairs from partial ones.

Population-Based Semantic Evolution for LLM-Guided Automated Program Repair

Introduction and Motivation

Automated Program Repair (APR) has undergone a paradigm shift following advances in LLM-based approaches. Traditional systems, both search- or heuristic-driven (e.g., GenProg, SPR, Prophet) and semantics-based (e.g., SemFix, Angelix), have struggled with patch overfitting and limited semantic generalization. Deep learning approaches relaxed the restrictions on patch space diversity, but their efficacy remained constrained by dataset coverage and edit space limitations. The emergence of LLMs reframed APR as a conditional code generation problem, leading to agents and iterative refinement-based frameworks (exemplified by REx and ChatRepair).

Empirical analysis reveals that state-of-the-art iterative LLM-based APR approaches frequently converge prematurely to semantically narrow, locally optimal fix families: they optimize within a limited repair abstraction and fail to compose partial fixes that are distributed across the candidate pool. This myopic search makes it hard to move to the alternative repair abstractions needed to pass semantically difficult test cases. The failure to combine distributed correct logical components motivates the proposed methodology.

EvolRepair: Semantic Evolutionary Approach

EvolRepair recasts the entire APR problem as a semantic evolutionary algorithm over a population of LLM-synthesized candidates. In contrast to classical evolutionary algorithm (EA) approaches, which rely on syntactic mutation and crossover, EvolRepair uses the LLM as a semantic mutation/recombination oracle and structures the search around behavioral signatures rather than syntax. Each evolutionary operator (mutation, recombination, selection) is designed to preserve semantic diversity and to systematically compose complementary repairs.

The search maintains a population of candidates grouped by behavioral execution similarity (using Jaccard overlap of passed test sets), enabling principled exploitation and exploration of diverse repair abstractions. Behavioral grouping, cross-group sampling, and population-level crossover facilitate the aggregation of distributed partial fixes into more globally correct candidates.
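The grouping step can be illustrated with a minimal sketch. It assumes each candidate is summarized by the set of test identifiers it passes and uses a simple greedy assignment against each group's first member; the `threshold` value, the greedy strategy, and all names here are illustrative assumptions, not details confirmed by the paper.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap of two passed-test sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_behavior(
    passed: Dict[str, FrozenSet[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedy grouping (an assumed strategy): a candidate joins the first group
    whose representative (its first member) overlaps at or above the threshold;
    otherwise it starts a new group."""
    groups: List[List[str]] = []
    for cand, tests in passed.items():
        for group in groups:
            if jaccard(tests, passed[group[0]]) >= threshold:
                group.append(cand)
                break
        else:
            groups.append([cand])
    return groups


# Hypothetical candidates and the tests each one passes.
passed_tests = {
    "c1": frozenset({"t1", "t2"}),
    "c2": frozenset({"t1", "t2", "t3"}),
    "c3": frozenset({"t4"}),
}
print(group_by_behavior(passed_tests))
```

Under these assumptions, `c1` and `c2` share a group because their passed-test sets overlap heavily, while `c3` forms its own group even though all three candidates might look similar syntactically.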

Figure 2: EvolRepair: A schematic overview of the population-based evolutionary search with semantic mutation/recombination and behavioral grouping.

Behavioral Grouping and Population Operations

The design of EvolRepair enables:

Behavioral grouping of candidates based on passed test subsets, exposing semantic regimes and decoupling the search from syntactic proximity.
Population-level recombination via LLM prompting over candidate pools, enabling direct composition of complementary partial repairs (as opposed to classical two-parent syntactic crossover).
Semantic mutation using test-case level failure feedback, allowing the LLM to generate contextually targeted refinements.
Adaptive search shifts between repair abstractions when the population is trapped in an unproductive local regime, enabling efficient escape mechanisms.

Figure 1: Example of semantic crossover combining partial repairs into a fully correct fix; colored code regions indicate inheritance from distinct parents.

The crossover operator, in particular, receives a pool of candidates (behaviorally grouped) along with their test performances and is prompted to synthesize a child that maximizes collective passed behavior, explicitly encouraging the propagation of all parent-specific correct logic.
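As a rough illustration of how such a crossover request might be assembled, the sketch below builds a prompt from a pool of parents and their per-test outcomes. The prompt wording, the function name `build_crossover_prompt`, and the data layout are all hypothetical; the paper's actual prompt is not reproduced here, and no LLM is called.

```python
from typing import Dict, List, Set


def build_crossover_prompt(
    parents: Dict[str, str],
    passed: Dict[str, Set[str]],
    all_tests: List[str],
) -> str:
    """Assemble a population-level crossover prompt (hypothetical format)."""
    sections = []
    for name, code in parents.items():
        ok = [t for t in all_tests if t in passed[name]]
        failed = [t for t in all_tests if t not in passed[name]]
        sections.append(
            f"Candidate {name}\n"
            f"Passes: {', '.join(ok) or 'none'}\n"
            f"Fails: {', '.join(failed) or 'none'}\n"
            f"Code:\n{code}"
        )
    # Tests that no parent passes cannot be inherited; they need new logic.
    covered = set().union(*(passed[name] for name in parents))
    uncovered = [t for t in all_tests if t not in covered]
    header = (
        "Synthesize one repaired program that passes every test passed by "
        "at least one candidate below. Keep the logic that makes each "
        "candidate succeed on the tests only it passes.\n"
        f"No candidate passes: {', '.join(uncovered) or 'none'}"
    )
    return header + "\n\n" + "\n\n".join(sections)


# Hypothetical two-parent pool with complementary behavior.
prompt = build_crossover_prompt(
    parents={"p1": "def f(x): return x + 1", "p2": "def f(x): return abs(x)"},
    passed={"p1": {"t1"}, "p2": {"t2"}},
    all_tests=["t1", "t2", "t3"],
)
print(prompt)
```

Listing passed and failed tests per parent is one way to expose which behavior each candidate contributes, so the model can be asked to keep all of it rather than favor the single fittest parent.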

Empirical Evaluation

The evaluation benchmarks EvolRepair against competitive LLM-based APR baselines, including REx and ChatRepair, across diverse backbone models (Llama 3.3 70B, Kimi K2, DeepSeek V3.1) on a large-scale decontaminated bug dataset derived from LiveCodeBench and curated with SWE-Synth. The key metrics are pass@k, average pass rate (which the evaluation also abbreviates as APR), test case coverage, and compositional effectiveness for partial repairs.
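For readers unfamiliar with pass@k, the sketch below implements the unbiased estimator that is widely used in code generation evaluation: with n samples of which c are correct, pass@k = 1 - C(n-c, k) / C(n, k). Whether the paper uses exactly this estimator is an assumption; the function name and argument checks are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples with c correct ones.

    Assumed to match the paper's metric; the source does not spell out
    the estimator.
    """
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


# With k = 1 the estimate reduces to the fraction of correct samples, c / n.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=10, c=3, k=5))
```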

Overall Repair Performance

EvolRepair consistently outperforms iterative refinement baselines across all settings. With Llama 3.3, EvolRepair achieves 43.25% pass@1 versus 38.34% for REx and 34.36% for ChatRepair, gains of 4.91 and 8.89 percentage points respectively; with DeepSeek V3.1 it reaches 96.63% pass@1, indicating that the gains hold for weaker and stronger backbones alike. The improvements persist under different prompting-cost and runtime budgets. The results suggest that population-level evolutionary search over semantic regimes provides a better exploration-exploitation balance than single-trajectory refinement pipelines.

Behavioral Coverage and Partial Repair Composition

Partial repair effectiveness is measured by average best pass rate across candidates and by cumulative test case coverage of all candidates (TCC). EvolRepair exhibits a higher average partial fix quality and a significantly smaller gap between collective coverage and the best single candidate, demonstrating substantial capability to consolidate distributed partial repairs into strong solutions.
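The relationship between these quantities can be made concrete with a small sketch for a single bug. It assumes the best pass rate is the largest fraction of tests passed by any one candidate and that TCC is the fraction of tests passed by at least one candidate; these readings, and the names used, are assumptions drawn from the description above rather than the paper's formal definitions.

```python
from typing import Dict, Set


def coverage_summary(passed: Dict[str, Set[str]], n_tests: int) -> Dict[str, float]:
    """Best single-candidate pass rate, cumulative coverage (TCC), and their gap."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    best = max((len(tests) for tests in passed.values()), default=0) / n_tests
    tcc = len(set().union(*passed.values())) / n_tests
    # A large gap means correct behavior is scattered across candidates
    # and has not yet been consolidated into one program.
    return {"best_pass_rate": best, "tcc": tcc, "gap": tcc - best}


# Hypothetical pool: together the candidates pass 3 of 4 tests,
# but the best single candidate passes only 2.
summary = coverage_summary(
    {"c1": {"t1", "t2"}, "c2": {"t3"}, "c3": {"t1", "t3"}}, n_tests=4
)
print(summary)  # {'best_pass_rate': 0.5, 'tcc': 0.75, 'gap': 0.25}
```

A gap of zero would mean the best candidate already passes everything the pool passes collectively, which is the outcome successful recombination aims for.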

Figure 5: APR progress on a representative mini-Size-Subarray instance; only EvolRepair achieves full correctness by combining distributed partial fixes across iterations.

Semantic Crossover Analysis

Quantitative assessment of crossover effectiveness uses a “Combination Rate” metric, requiring that a child preserves unique behavioral contributions from each parent. EvolRepair’s recombination operator achieves substantial combination rates (>60% on Llama 3.3, >35% on DeepSeek), empirically verifying its ability to compose distributed correct behaviors into globally correct repairs.
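One possible reading of this metric is sketched below. It assumes a parent's unique contribution is the set of tests that it passes and no other parent passes, and that a child counts as a combination when it retains at least one such test from every contributing parent (or all of them, with `strict=True`). Both the definition and the names are assumptions made for illustration; the paper's exact criterion may differ.

```python
from typing import List, Sequence, Set, Tuple


def unique_contributions(parents: Sequence[Set[str]]) -> List[Set[str]]:
    """For each parent, the tests it passes that no other parent passes."""
    result = []
    for i, tests in enumerate(parents):
        others = set().union(*(p for j, p in enumerate(parents) if j != i))
        result.append(tests - others)
    return result


def is_combination(child: Set[str], parents: Sequence[Set[str]], strict: bool = False) -> bool:
    """True if the child keeps unique behavior from every contributing parent."""
    uniques = [u for u in unique_contributions(parents) if u]
    if len(uniques) < 2:
        return False  # fewer than two parents offer anything complementary
    if strict:
        return all(u <= child for u in uniques)
    return all(u & child for u in uniques)


def combination_rate(events: Sequence[Tuple[Set[str], Sequence[Set[str]]]]) -> float:
    """Fraction of (child, parents) crossover events that are combinations."""
    if not events:
        return 0.0
    return sum(is_combination(child, parents) for child, parents in events) / len(events)


# Hypothetical crossover events: the first child merges both parents'
# unique tests, the second only inherits from the first parent.
parents = [{"t1", "t2"}, {"t2", "t3"}]
events = [({"t1", "t2", "t3"}, parents), ({"t1", "t2"}, parents)]
print(combination_rate(events))  # 0.5
```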

Component Analysis and Hyperparameter Robustness

A comprehensive ablation study demonstrates that both crossover and behavioral grouping are critical; random or fitness-only grouping, or restricting recombination to pairwise syntax-based strategies, degrades performance. Mutational guidance through test feedback is indispensable. Sensitivity experiments indicate system robustness to moderate perturbation of evolutionary parameters, confirming the effectiveness/efficiency tradeoff of the proposed default configuration.

Implications and Future Directions

EvolRepair fundamentally elevates the unit of search and recombination from syntax-oriented, trajectory-level refinement to semantically structured, population-level repair. This exposes powerful avenues for integrating deeper forms of semantic candidate abstraction, program analysis, symbolic verification, and cross-task transfer for future APR systems. By admitting population-level partial fix composition, EvolRepair points toward more robust, less overfit patch generation regimes, with potential for seamless integration into agentic software engineering frameworks incorporating tool usage and repository-level planning.

Current limitations include reliance on test-suite coverage for behavioral signatures and evaluation on program-level (single-function) repair. Extensions to project-scale settings and integration with external correctness oracles represent important future steps.

Conclusion

EvolRepair introduces a population-based semantic evolutionary strategy for LLM-guided automated program repair, moving beyond local iterative refinement to enable systematic exploitation and recombination of semantically diverse candidate repairs. Through behavioral grouping, semantic population recombination, and adaptive search, EvolRepair achieves consistently stronger empirical results than the iterative refinement baselines and enables direct composition of distributed partial fixes, offering design principles for future LLM-based program repair methods.
