- The paper introduces a novel population-based semantic evolutionary approach (EvolRepair) that leverages LLMs for automated program repair by composing distributed partial fixes.
- It employs behavioral grouping and semantic mutation, enabling adaptive search shifts that overcome the premature convergence seen in iterative LLM-based repair methods.
- Empirical results demonstrate consistent performance gains over baselines, reaching up to 96.63% pass@1 (with DeepSeek V3.1) and high combination rates in synthesizing comprehensive repairs.
Population-Based Semantic Evolution for LLM-Guided Automated Program Repair
Introduction and Motivation
Automated Program Repair (APR) has shifted significantly following advances in LLM-based approaches. Traditional systems, both search- and heuristic-driven (e.g., GenProg, SPR, Prophet) and semantics-based (e.g., SemFix, Angelix), have struggled with patch overfitting and a lack of semantic generalization. Deep learning approaches relaxed the restrictions on patch space diversity, but their efficacy has remained constrained by dataset coverage and edit space limitations. The emergence of LLMs reframed APR as a conditional code generation problem, leading to agents and iterative refinement frameworks (exemplified by REx and ChatRepair).
Empirical analysis reveals that state-of-the-art iterative LLM-based APR methods frequently converge prematurely to semantically narrow, locally optimal fix families: they optimize within a limited repair abstraction and fail to compose the distributed partial fixes already present in the candidate pool. This myopic search prevents a shift to the alternative repair abstractions needed to solve semantically difficult test cases. The failure to combine distributed, individually correct logical components motivates the proposed methodology.
EvolRepair: Semantic Evolutionary Approach
EvolRepair recasts the APR problem as a population-based semantic evolutionary search over LLM-synthesized candidates. In contrast to classical evolutionary algorithm (EA) approaches, which rely on syntactic mutation and crossover, EvolRepair uses the LLM as a semantic mutation and recombination oracle and structures the search around behavioral signatures rather than syntax. Each evolutionary operator (mutation, recombination, selection) is tailored to maximize semantic diversity and systematically compose complementary repairs.
The search maintains a population of candidates grouped by behavioral execution similarity (using Jaccard overlap of passed test sets), enabling principled exploitation and exploration of diverse repair abstractions. Behavioral grouping, cross-group sampling, and population-level crossover facilitate the aggregation of distributed partial fixes into more globally correct candidates.
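The grouping step can be illustrated with a minimal sketch. It assumes each candidate is summarized by the set of test IDs it passes and uses a simple greedy rule with a hypothetical `threshold` parameter; the summary does not specify the actual clustering procedure, so this is one plausible reading rather than EvolRepair's implementation.

```python
from typing import Dict, FrozenSet, List


def jaccard(a: FrozenSet[str], b: FrozenSet[str]) -> float:
    """Jaccard overlap |a & b| / |a | b|; two empty sets count as identical."""
    union = a | b
    return 1.0 if not union else len(a & b) / len(union)


def group_by_behavior(
    passed: Dict[str, FrozenSet[str]], threshold: float = 0.5
) -> List[List[str]]:
    """Greedy grouping (an assumption, not the paper's stated procedure).

    A candidate joins the first group whose representative (its first member)
    has Jaccard overlap >= threshold; otherwise it starts a new group.
    """
    groups: List[List[str]] = []
    for cand, tests in passed.items():
        for group in groups:
            if jaccard(tests, passed[group[0]]) >= threshold:
                group.append(cand)
                break
        else:
            groups.append([cand])
    return groups


# Hypothetical candidates and the tests each one passes.
passed = {
    "c1": frozenset({"t1", "t2"}),
    "c2": frozenset({"t1", "t2", "t3"}),
    "c3": frozenset({"t4"}),
    "c4": frozenset(),
}
groups = group_by_behavior(passed, threshold=0.5)
print(groups)  # [['c1', 'c2'], ['c3'], ['c4']]
```

Here `c1` and `c2` share two of three tests (overlap 2/3) and land in the same group, while `c3` and `c4` behave differently enough to form their own groups.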
Figure 2: EvolRepair: A schematic overview of the population-based evolutionary search with semantic mutation/recombination and behavioral grouping.
Behavioral Grouping and Population Operations
The design of EvolRepair enables:
The crossover operator, in particular, receives a pool of candidates (behaviorally grouped) along with their test performances and is prompted to synthesize a child that maximizes collective passed behavior, explicitly encouraging the propagation of all parent-specific correct logic.
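As a rough illustration of how such a crossover request might be assembled, the sketch below samples one parent from each of several behavioral groups and builds a prompt listing each parent's passed and failed tests. The function names, the prompt wording, and the sampling rule are all hypothetical; the actual prompt and LLM call used by EvolRepair are not given in this summary.

```python
import random
from typing import Dict, List, Set


def sample_parents(groups: List[List[str]], k: int, rng: random.Random) -> List[str]:
    """Cross-group sampling: one candidate from each of up to k distinct groups."""
    chosen_groups = rng.sample(groups, min(k, len(groups)))
    return [rng.choice(group) for group in chosen_groups]


def build_crossover_prompt(
    parents: List[str],
    code: Dict[str, str],
    passed: Dict[str, Set[str]],
    all_tests: List[str],
) -> str:
    """Assemble a hypothetical crossover prompt from parents and their test results."""
    parts = [
        "Combine the candidate repairs below into one program that passes "
        "every test that any of them passes."
    ]
    for name in parents:
        ok = sorted(passed[name])
        failed = sorted(set(all_tests) - passed[name])
        parts.append(
            f"Candidate {name}\n"
            f"Passes: {', '.join(ok) or 'none'}\n"
            f"Fails: {', '.join(failed) or 'none'}\n"
            f"{code[name]}"
        )
    # The child should cover the union of the parents' passed tests.
    target = sorted(set().union(*(passed[p] for p in parents)))
    parts.append(f"Target tests to pass: {', '.join(target)}")
    return "\n\n".join(parts)


# Hypothetical two-group population.
groups = [["a"], ["b"]]
code = {"a": "def f(x): return x + 1", "b": "def f(x): return abs(x)"}
passed = {"a": {"t1", "t2"}, "b": {"t2", "t3"}}
parents = sample_parents(groups, k=2, rng=random.Random(0))
prompt = build_crossover_prompt(parents, code, passed, ["t1", "t2", "t3", "t4"])
print(prompt)
```

The resulting string would then be sent to whichever LLM backend is in use; stating the union of passed tests as an explicit target is one way to encourage the child to keep each parent's correct logic.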
Empirical Evaluation
The evaluation protocol benchmarks EvolRepair against competitive LLM-based APR baselines, including REx and ChatRepair, across diverse backbone models (Llama 3.3 70B, Kimi K2, DeepSeek V3.1) on a large-scale decontaminated bug dataset derived from LiveCodeBench and curated with SWE-Synth. The key metrics are pass@k, average pass rate (which the evaluation also abbreviates as APR), test case coverage, and compositional effectiveness for partial repairs.
EvolRepair consistently outperforms iterative refinement baselines across all settings. With Llama 3.3, EvolRepair reaches 43.25% pass@1 versus 38.34% for REx and 34.36% for ChatRepair; with DeepSeek V3.1, it reaches 96.63% pass@1. The gains hold irrespective of model strength and are robust to differences in prompting cost and runtime budget. The results indicate that population-level evolutionary search over semantic regimes provides a better exploration-exploitation balance than single-trajectory refinement pipelines.
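For readers unfamiliar with pass@k, the sketch below shows the unbiased estimator commonly used in code generation evaluation, 1 - C(n-c, k) / C(n, k), where n samples are drawn per problem and c of them are correct. Whether the evaluation summarized here uses exactly this estimator is an assumption; the summary does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n samples of which c are correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n, is correct.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: 10 samples, 3 correct.
p1 = pass_at_k(n=10, c=3, k=1)
p5 = pass_at_k(n=10, c=3, k=5)
print(round(p1, 4), round(p5, 4))
```

For k = 1 the estimator reduces to the fraction of correct samples, c / n.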
Behavioral Coverage and Partial Repair Composition
Partial repair effectiveness is measured by average best pass rate across candidates and by cumulative test case coverage of all candidates (TCC). EvolRepair exhibits a higher average partial fix quality and a significantly smaller gap between collective coverage and the best single candidate, demonstrating substantial capability to consolidate distributed partial repairs into strong solutions.
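The relationship between these quantities can be made concrete with a small sketch. It assumes TCC is the fraction of tests passed by at least one candidate and that the gap is TCC minus the best single candidate's pass rate; the function and key names are illustrative, not taken from the paper.

```python
from typing import Dict, Set


def coverage_stats(passed: Dict[str, Set[str]], all_tests: Set[str]) -> Dict[str, float]:
    """Compare collective coverage (TCC) with the best single candidate."""
    total = len(all_tests)
    if total == 0:
        raise ValueError("all_tests must be non-empty")
    # Best pass rate achieved by any single candidate.
    best = max((len(p & all_tests) for p in passed.values()), default=0) / total
    # Tests passed by at least one candidate in the pool.
    covered = set().union(*passed.values()) & all_tests
    tcc = len(covered) / total
    return {"best_pass_rate": best, "tcc": tcc, "gap": tcc - best}


# Hypothetical pool: no single candidate passes more than 2 of 4 tests,
# but together the candidates cover 3 of 4.
stats = coverage_stats(
    {"c1": {"t1", "t2"}, "c2": {"t2", "t3"}, "c3": {"t3"}},
    all_tests={"t1", "t2", "t3", "t4"},
)
print(stats)  # {'best_pass_rate': 0.5, 'tcc': 0.75, 'gap': 0.25}
```

A large gap means correct behavior is scattered across candidates without being consolidated; a small gap means the best candidate already captures most of what the pool collectively gets right.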
Figure 5: APR progress on a representative mini-Size-Subarray instance; only EvolRepair achieves full correctness by combining distributed partial fixes across iterations.
Semantic Crossover Analysis
Quantitative assessment of crossover effectiveness uses a “Combination Rate” metric, requiring that a child preserves unique behavioral contributions from each parent. EvolRepair’s recombination operator achieves substantial combination rates (>60% on Llama 3.3, >35% on DeepSeek), empirically verifying its ability to compose distributed correct behaviors into globally correct repairs.
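One strict way to read this metric is sketched below. It assumes a parent's unique contribution is the set of tests it passes that no other parent passes, and that a crossover counts as a combination only if every parent has a non-empty unique contribution and the child passes all of those tests. The paper's exact definition may differ, so the names and the rule here are assumptions.

```python
from typing import List, Set, Tuple


def unique_contributions(parents: List[Set[str]]) -> List[Set[str]]:
    """For each parent, the tests it passes that no other parent passes."""
    result = []
    for i, p in enumerate(parents):
        others = set().union(*(q for j, q in enumerate(parents) if j != i))
        result.append(p - others)
    return result


def is_combination(child: Set[str], parents: List[Set[str]]) -> bool:
    """True if the child keeps every parent's (non-empty) unique contribution."""
    uniques = unique_contributions(parents)
    return all(len(u) > 0 and u <= child for u in uniques)


def combination_rate(events: List[Tuple[Set[str], List[Set[str]]]]) -> float:
    """Fraction of crossover events (child, parents) that count as combinations."""
    if not events:
        return 0.0
    return sum(is_combination(child, parents) for child, parents in events) / len(events)


# Hypothetical crossover events over the same two parents.
parents = [{"t1", "t2"}, {"t2", "t3"}]
events = [
    ({"t1", "t2", "t3"}, parents),  # keeps t1 (from parent 0) and t3 (from parent 1)
    ({"t1", "t2"}, parents),        # loses t3, so not a combination
]
rate = combination_rate(events)
print(rate)  # 0.5
```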
Component Analysis and Hyperparameter Robustness
A comprehensive ablation study demonstrates that both crossover and behavioral grouping are critical: random or fitness-only grouping, or restricting recombination to pairwise syntax-based strategies, degrades performance. Mutation guided by test feedback is likewise indispensable. Sensitivity experiments indicate that the system is robust to moderate perturbation of the evolutionary parameters, supporting the proposed default configuration as a reasonable effectiveness/efficiency tradeoff.
Implications and Future Directions
EvolRepair moves the unit of search and recombination from syntax-oriented, trajectory-level refinement to semantically structured, population-level repair. This opens avenues for integrating deeper forms of semantic candidate abstraction, program analysis, symbolic verification, and cross-task transfer in future APR systems. By supporting population-level composition of partial fixes, EvolRepair points toward more robust, less overfit patch generation, with potential for integration into agentic software engineering frameworks that incorporate tool usage and repository-level planning.
Current limitations include reliance on test-suite coverage for behavioral signatures and evaluation on program-level (single-function) repair. Extensions to project-scale settings and integration with external correctness oracles represent important future steps.
Conclusion
EvolRepair introduces a population-based semantic evolutionary strategy for LLM-Guided Automated Program Repair, moving beyond local iterative refinement to enable systematic exploitation and recombination of semantically diverse candidate repairs. Through behavioral grouping, semantic population recombination, and adaptive search, EvolRepair achieves consistently stronger empirical results and enables direct composition of distributed partial fixes, advancing the state of LLM-based APR and providing foundational principles for future program repair methodologies.