Compiler-Grounded Self-Repair
- Compiler-grounded self-repair is a methodology that integrates compiler diagnostics into an automated repair loop to detect, localize, and fix static errors.
- It employs modular pipelines, sequence models, reinforcement learning, and symbolic systems to generate reliable fixes in both code and formal artifacts.
- This approach enhances CI/CD efficiency, programming education, and formal verification by delivering higher repair accuracy and reduced debugging time.
Compiler-grounded self-repair refers to automated methods for detecting, localizing, and repairing static errors in code or formal artifacts (such as theorems) by leveraging explicit, structured feedback emitted by compilers or analyzers. This paradigm grounds the entire repair loop in compiler semantics—using error codes, diagnostics, and sometimes partial parse or proof states—rather than relying solely on code context, tests, or synthetic error models. Recent advances encompass supervised, self-supervised, and reinforcement learning approaches, often integrating LLMs, discriminative classifiers, or other symbolic systems. Compiler-grounded self-repair is now central to efforts in programming education, developer tooling, CI/CD automation, and formal verification, underlining its practical and theoretical relevance.
1. Fundamental Principles and Architectural Patterns
Compiler-grounded self-repair systems systematically integrate compiler feedback at all key stages: error localization, context extraction, patch generation, validation, and iterative repair. The pipeline generally consists of:
- Error detection via compilation attempt: The compiler identifies locations and descriptions of errors, often providing error codes, line numbers, or symbolic information (e.g., goals and hypotheses in proof assistants, or partial ASTs in programming languages).
- Input representation: Systems encode both the erroneous artifact and the diagnostic context, combining raw code or proof state, surrounding context (sliding window or entire file), and structured compiler logs or error records. Some approaches normalize identifiers or error message tokens to reduce vocabulary and improve generalization (Li et al., 2022).
- Repair synthesis: Methods range from modular, discriminative edit-rankers (Chhatbar et al., 2020) to sequence-to-sequence neural models jointly predicting localization and fix (Li et al., 2022), graph neural networks encoding symbol correspondences (Yasunaga et al., 2020), and programmatic RL agents operating in patch-action space (Sun et al., 19 Sep 2025).
- Validation: Repaired artifacts are recompiled or rechecked to ensure static correctness; some frameworks also use semantic proxies (LLM-as-a-Judge) to reward or filter patches that maintain behavioral or syntactic intent (Sun et al., 19 Sep 2025, Zhang et al., 19 Apr 2026).
- Iterative or multi-round repair: If the repair is incomplete or leads to new errors, the process loops, consuming fresh diagnostics on each iteration until success or a time/resource budget is exhausted (Li et al., 2022, Fu et al., 15 Oct 2025).
Modularity—separating localization, classification, and edit-application—is a recurring design motif shown to improve efficiency and interpretability (Chhatbar et al., 2020). Most contemporary systems maintain a closed repair loop, always grounded in observable compiler feedback.
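The closed loop described in this section can be sketched in a few lines. The example below is a minimal illustration, not the implementation of any cited system: it uses Python's built-in `compile()` as a stand-in compiler, and the names `Diagnostic`, `check`, `add_missing_colon`, and `repair_loop` are hypothetical. The rule-based `add_missing_colon` merely stands in for a learned repair model.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Diagnostic:
    """Structured error record extracted from the compiler (hypothetical shape)."""
    line: int      # 1-based line number reported by the compiler
    message: str   # human-readable diagnostic text


def check(source: str) -> Optional[Diagnostic]:
    """Compile the source; return None on success, else the first diagnostic."""
    try:
        compile(source, "<candidate>", "exec")
        return None
    except SyntaxError as err:
        return Diagnostic(line=err.lineno or 1, message=err.msg or "")


def add_missing_colon(source: str, diag: Diagnostic) -> str:
    """Toy repair policy: append ':' to a block header on the flagged line."""
    lines = source.split("\n")
    idx = diag.line - 1
    if 0 <= idx < len(lines):
        stripped = lines[idx].rstrip()
        is_header = stripped.lstrip().startswith(("def ", "if ", "for ", "while "))
        if is_header and not stripped.endswith(":"):
            lines[idx] = stripped + ":"
    return "\n".join(lines)


def repair_loop(
    source: str,
    propose: Callable[[str, Diagnostic], str],
    max_rounds: int = 5,
) -> Tuple[str, bool]:
    """Recompile after every patch and feed fresh diagnostics to the next round."""
    for _ in range(max_rounds):
        diag = check(source)
        if diag is None:
            return source, True          # validation passed
        candidate = propose(source, diag)
        if candidate == source:
            break                        # the policy has no further edit to offer
        source = candidate
    return source, check(source) is None
```

Each round consumes only the first diagnostic, so a program with two missing colons is repaired in two rounds; the `max_rounds` budget plays the role of the time or resource limit mentioned above.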
2. Algorithmic Frameworks and Representative Systems
A range of algorithmic techniques instantiate the compiler-grounded paradigm:
- Modular Discriminative Pipelines: MACER splits the repair process into four stages (repair line localization, repair class identification, edit localization, and deterministic patch application), implemented with hierarchical classifiers and decision trees. Abstracted tokens and bigram representations from partial ASTs, combined with compiler error codes, drive efficient, high-accuracy repairs. MACER achieves a repair accuracy of 56.6% on the DeepFix C dataset, outpacing generative and RL baselines while reducing training time by 2–800× and prediction latency by 2–4× (Chhatbar et al., 2020).
- Sequence Models with Diagnostic Prompts: TransRepair and DrRepair use Transformer and graph neural networks, respectively, with composite input encoding that fuses code, local context, and compiler diagnostics, applying joint localization and pointer-generator mechanisms to predict and repair the error (Li et al., 2022, Yasunaga et al., 2020). DrRepair’s program-feedback graph builds symbol-level edges from code to error message, yielding 68.2% full-repair on DeepFix; ablations confirm the necessity of explicit diagnostic feedback (Yasunaga et al., 2020).
- Reinforcement Learning and Reward Engineering: CCrepairBench frames repair as an MDP over code tokens, using a hybrid reward R_total = S_judge + S_compile, where S_judge is supplied by a large LLM-as-a-Judge assessing patch semantic fidelity, and S_compile is a binary reward for successful compilation. This RL scheme avoids degenerate “trivial deletion” solutions, improves Genuine Fix Rate by 8.7 points over SFT, and compresses the model size needed for strong performance (Sun et al., 19 Sep 2025).
- Self-Supervised and Iterative Co-Training: Break-It-Fix-It (BIFI) trains a breaker to generate realistic errors and a fixer to learn from both real and synthetic error pairs, filtering every intermediate artifact through a compiler critic C(x). Because every retained example has been verified by the critic, training requires no labeled error-fix pairs; BIFI achieves 90.5% repair on GitHub-Python and 71.7% on DeepFix without labels (Yasunaga et al., 2021).
- LLM-Based CI Repair in Industry: In industrial settings, automated pipelines (“Shadow Job”) capture logs, error locations, and historical fix exemplars as LLM prompts. Iterative application and validation yield up to 63% error resolution and 83% “reasonable” patch rate in embedded C/C++ CI, with >60% of successful cases resolved within 8 minutes—hours faster than human debugging (Fu et al., 15 Oct 2025).
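The hybrid reward R_total = S_judge + S_compile can be illustrated with a small sketch. Everything here is an assumption made for illustration: `overlap_judge` is a crude character-overlap proxy standing in for the LLM-as-a-Judge, its [0, 1] range is assumed rather than taken from the source, and Python's built-in `compile()` stands in for the C++ compiler.

```python
import difflib
from typing import Callable


def s_compile(patch: str) -> float:
    """Binary compilation reward: 1.0 if the patch compiles, else 0.0."""
    try:
        compile(patch, "<patch>", "exec")
        return 1.0
    except SyntaxError:
        return 0.0


def overlap_judge(broken: str, patch: str) -> float:
    """Stand-in for an LLM judge: similarity of the patch to the original, in [0, 1]."""
    return difflib.SequenceMatcher(None, broken, patch).ratio()


def total_reward(
    broken: str,
    patch: str,
    judge: Callable[[str, str], float] = overlap_judge,
) -> float:
    """R_total = S_judge + S_compile."""
    return judge(broken, patch) + s_compile(patch)
```

With this proxy, an empty patch compiles but shares nothing with the original program, so it earns only the compilation term. A patch that restores one missing parenthesis earns nearly the full judge score as well, which is how the combined reward discourages the trivial-deletion shortcut.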
The following table recaps key quantitative results for major paradigms:
| Method/System | Core Approach | Repair/Success Rate | Data/Domain |
|---|---|---|---|
| MACER (Chhatbar et al., 2020) | Modular discriminative pipeline | 56.6% (DeepFix), 80.5% (single-line) | C/Clang, student programs |
| DrRepair (Yasunaga et al., 2020) | Graph + diagnostics | 68.2% (DeepFix full repair) | C/C++, DeepFix |
| TransRepair (Li et al., 2022) | Transformer | 82.8% (full repair, TRACER) | C, synthesized errors |
| CCrepairBench (Sun et al., 19 Sep 2025) | RL, hybrid reward | 81.9% compilation, 70.8% genuine fix | C++, CCrepairBench |
| BIFI (Yasunaga et al., 2021) | Self-supervised/EM | 90.5% (GitHub-Python), 71.7% (C) | Python, C, unlabeled |
| Industrial LLM+CI (Fu et al., 15 Oct 2025) | LLM loop | 63% (resolution), 83% (reasonable patches) | Embedded C/C++, CI |
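The critic-filtering step in the Break-It-Fix-It scheme described above can be sketched as follows. This is a toy illustration under stated assumptions: `critic` uses Python's built-in `compile()`, while `toy_fixer` and `toy_breaker` are hypothetical rule-based stand-ins for the learned fixer and breaker models.

```python
from typing import Callable, List, Tuple


def critic(source: str) -> bool:
    """Compiler critic C(x): True when the source compiles."""
    try:
        compile(source, "<example>", "exec")
        return True
    except SyntaxError:
        return False


def verified_fixer_pairs(
    bad_examples: List[str], fixer: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Keep (bad, fixed) only when the critic rejects the input and accepts the output."""
    pairs = []
    for bad in bad_examples:
        fixed = fixer(bad)
        if not critic(bad) and critic(fixed):
            pairs.append((bad, fixed))
    return pairs


def verified_breaker_pairs(
    good_examples: List[str], breaker: Callable[[str], str]
) -> List[Tuple[str, str]]:
    """Keep (broken, good) only when the critic accepts the input and rejects the output."""
    pairs = []
    for good in good_examples:
        broken = breaker(good)
        if critic(good) and not critic(broken):
            pairs.append((broken, good))
    return pairs


def toy_fixer(source: str) -> str:
    """Hypothetical fixer: close one unbalanced parenthesis."""
    return source + ")" if source.count("(") > source.count(")") else source


def toy_breaker(source: str) -> str:
    """Hypothetical breaker: drop the last closing parenthesis, if any."""
    idx = source.rfind(")")
    return source if idx < 0 else source[:idx] + source[idx + 1:]
```

Pairs that survive both filters are guaranteed by construction to map critic-rejected code to critic-accepted code, so they can be added to the training set without human labels.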
3. Data Synthesis, Error Taxonomy, and Benchmarking
Compiler-grounded repair research is data-intensive; constructing realistic benchmarks and coverage-driven error taxonomies underpins reproducible evaluation:
- Synthetic Error Generation: Many systems artificially corrupt correct examples with errors matching distributions empirically observed in student, CI, or OSS datasets. Sample perturbations: bracket/semicolon removal, ID-type swaps, keyword/integer replacement, and deletion. Synthetic corpora, e.g., 1.8M variants in TransRepair, follow distributions derived from StackOverflow and DeepFix analytics (Li et al., 2022, Yasunaga et al., 2020).
- Error Taxonomies: Manual and empirical surveys yield an error taxonomy for C: structure errors (21.3%), statement errors (51.5%), variable declaration (21.4%), type mismatch (2.2%), and identifier misuse (3.6%). Category-specific results highlight systematic weaknesses; semantic and cross-file/config errors persist as challenges.
- Real-World CI and OSS Pipelines: Frameworks like PhantomRun (Fu et al., 23 Feb 2026) reconstruct the exact CI environment (Docker/Make/CMake), parse logs into structured error records, and mine historical fix diffs, enabling LLM repair with up to 45% pass rate across four OSS embedded system projects and robust per-category rates (65% pass for environment errors; just 32% for hardware dependency errors).
- Semantic and Human-Proxy Evaluation: To guard against “compile-only” patches, recent RL and LLM-based frameworks score or filter candidates via semantic proxies (LLM-as-a-Judge), matching or exceeding expert inter-rater reliability (Sun et al., 19 Sep 2025). Metrics include Genuine Fix Rate, CodeBLEU, CrystalBLEU, and human plausibility ratings.
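Synthetic error generation of the kind listed above can be sketched as a weighted choice among perturbation operators. The operator set, the `weights` argument, and all function names below are illustrative assumptions; in particular, any weights passed in are placeholders, not the distributions measured in the cited work.

```python
import random
from typing import Callable, Dict, Tuple


def drop_char(source: str, targets: str, rng: random.Random) -> str:
    """Delete one randomly chosen occurrence of any character in `targets`."""
    positions = [i for i, ch in enumerate(source) if ch in targets]
    if not positions:
        return source                      # nothing to perturb
    i = rng.choice(positions)
    return source[:i] + source[i + 1:]


# Two of the perturbation families named in the text.
PERTURBATIONS: Dict[str, Callable[[str, random.Random], str]] = {
    "semicolon_removal": lambda src, rng: drop_char(src, ";", rng),
    "bracket_removal": lambda src, rng: drop_char(src, "(){}[]", rng),
}


def corrupt(
    source: str, weights: Dict[str, float], rng: random.Random
) -> Tuple[str, str]:
    """Sample a perturbation by weight; return (perturbation name, corrupted source)."""
    names = list(weights)
    name = rng.choices(names, weights=[weights[n] for n in names], k=1)[0]
    return name, PERTURBATIONS[name](source, rng)
```

A call such as `corrupt(c_source, {"semicolon_removal": 0.6, "bracket_removal": 0.4}, random.Random(0))` yields one (label, broken program) pair; pairing the broken program with the untouched original gives a supervised training example. In a compiler-grounded setting, each corrupted sample would additionally be compiled to confirm that it really fails and to collect its diagnostics.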
4. Advances in Control Machinery and Efficient Repair Loops
Recent systems make compiler-feedback repair more robust and more token- and latency-efficient by combining asynchronous execution, rollback, and checkpointing:
- Hydra (Du et al., 14 May 2026) introduces an asynchronous generator–incremental checker split. The LLM generates code tokens without per-token blocking. As the compiler verifies code at semantic boundaries (e.g., after statement parse), error events inform the repair policy, which can then trigger rollback and targeted regeneration from minimal prefixes by leveraging process checkpointing (fork/COW of the Clang parser state). This mechanism reduces repair latency by 71% and token usage by 70% compared to post-hoc or constrained decoding approaches, with near-100% static correction for C/C++ and TypeScript generation tasks.
- Search and Policy Engines: Advanced policies (Bayesian-update, group contingency minimization) are deployed to optimize expected token cost and generator/checker latency under probabilistic root-cause uncertainty. Rollback points and restoration from cached checkpoints minimize unnecessary recomputation and feedback lag in large code generations (Du et al., 14 May 2026).
- CI and Industrial Pipelines: LLM-based shadow jobs and prompt construction schemes are tightly integrated with log parsing and data extraction, with adaptive prompt sizes/options balancing locality of error context against sufficient corrective information (Fu et al., 15 Oct 2025, Fu et al., 23 Feb 2026).
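The rollback-and-regenerate idea can be illustrated with a deliberately simplified, synchronous sketch. It is not Hydra's mechanism: there is no asynchronous generator and no fork/COW parser checkpoint, Python's built-in `compile()` stands in for the incremental checker, and the code assumes every generated unit is a single-line top-level statement. The names `prefix_ok`, `generate_with_rollback`, and `propose` are hypothetical.

```python
from typing import Callable, List, Tuple


def prefix_ok(statements: List[str]) -> bool:
    """Incremental check: does the prefix generated so far still compile?"""
    try:
        compile("\n".join(statements) + "\n", "<prefix>", "exec")
        return True
    except SyntaxError:
        return False


def generate_with_rollback(
    propose: Callable[[int, int], str],
    n_statements: int,
    max_retries: int = 3,
) -> Tuple[List[str], int]:
    """Generate statement by statement, keeping the verified prefix as a checkpoint.

    `propose(position, attempt)` stands in for the generator. When a candidate
    fails the check, only that statement is regenerated; the accepted prefix is
    never recomputed.
    """
    accepted: List[str] = []   # checkpoint: last prefix known to compile
    regenerated = 0            # number of rejected candidates
    for position in range(n_statements):
        for attempt in range(max_retries):
            candidate = propose(position, attempt)
            if prefix_ok(accepted + [candidate]):
                accepted.append(candidate)   # advance the checkpoint
                break
            regenerated += 1                 # roll back to the checkpoint and retry
        else:
            raise RuntimeError(f"no valid statement at position {position}")
    return accepted, regenerated
```

The saving comes from the size of the regenerated span: a failure at statement k costs one new statement rather than a full regeneration of statements 1 through k, which is the intuition behind the token and latency reductions reported above.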
5. Extensions to Formal Proofs and Neuro-Symbolic Systems
The same principles have been carried beyond compile-error repair in conventional programs:
- Formal Proof Repair: APRIL (Wang et al., 3 Feb 2026) targets Lean4 proof failures, pairing erroneous proofs with full Lean JSON diagnostics, local goal/hypothesis state, and human-readable explanations, to train LLMs for single-shot, diagnostic-grounded proof repair. Fine-tuned 8B models achieve up to 34.6% pass@1 repair on systematically mutated Lean proofs, with auxiliary natural-language diagnosis benefiting both automated and human-in-the-loop workflows.
- Neuro-Symbolic and Security Repair: SynthFix (Zhang et al., 19 Apr 2026) integrates compiler-grounded symbolic rewards (AST similarity, CFG overlap, static analyzer signals) into a PPO-based RL repair loop, adaptively choosing between standard SFT and RL based on symbolic summaries of code complexity. Relative to non-adaptive baselines, the router-enabled hybrid achieves up to 18% improvement in CodeBLEU/CrystalBLEU and 32% in exact match on JavaScript and C vulnerability benchmarks.
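To make the proof-repair setting concrete, the following Lean 4 fragment shows the kind of failure and fix involved. It is a hand-written illustration, not an item from the APRIL dataset; the theorem name is hypothetical, and the exact wording of the checker's diagnostic varies across Lean versions, so it is not quoted here.

```lean
-- Broken attempt, kept as a comment so the file still checks:
--   theorem zero_add_example (n : Nat) : 0 + n = n := by rfl
-- `rfl` fails because `Nat.add` recurses on its second argument,
-- so `0 + n` does not reduce to `n` when `n` is a variable.

-- Repaired proof: appeal to the library lemma instead of definitional unfolding.
theorem zero_add_example (n : Nat) : 0 + n = n := by
  exact Nat.zero_add n
```

A diagnostic-grounded repair model would receive the failed proof together with the checker's error and the open goal `0 + n = n`, and would be expected to produce a replacement along these lines.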
6. Challenges, Limitations, and Open Directions
While compiler-grounded self-repair reports strong results across diverse datasets and error types, several challenges persist:
- Cross-File and Complex Semantic Errors: Most compiler-grounded loops remain file- or line-local; cross-file dependencies, configuration errors (especially hardware or toolchain-specific), and semantically intricate bugs yield lower pass rates and remain an open research frontier (Fu et al., 15 Oct 2025, Fu et al., 23 Feb 2026).
- Zero-Shot Repair and Rare Error Classes: Modular discriminative/classifier systems struggle with classes never seen at training time. Generative or search-based extensions are suggested as remedy (Chhatbar et al., 2020).
- Evaluation Beyond Compilation: Ensuring semantic equivalence, not merely compilability, is critical; hybrid reward architectures and human/LLM adjudication form the consensus solution but are resource-intensive (Sun et al., 19 Sep 2025).
- Scalability to Large Systems: Industrial and OSS build environments are heterogeneous; abstracting CI adaptation and maintaining reproducibility across dynamic toolchains and platforms is nontrivial (Fu et al., 23 Feb 2026).
- Integration with Downstream Dynamic Testing: Systems are being extended to jointly repair dynamic test failures using analogous policy and checkpoint mechanics (e.g., Hydra’s integration of unit-test failure as an error node) (Du et al., 14 May 2026).
Potential extensions include granular AST/graph-guided representations, end-to-end compiler/IDE plugin integration, neuro-symbolic and RL search expansions, and broadening to new domains (proof engineering, configuration languages, domain-specific languages).
7. Impact and Research Landscape
Compiler-grounded self-repair has redefined the landscape of automated program and proof repair. By directly leveraging the semantics and structure encoded in compiler diagnostics and artifacts, these systems combine efficiency, accuracy, and interpretability. Their integration with CI/CD pipelines, embedded tools, proof assistants, and security analyzers is reshaping developer productivity, programming pedagogy, verification, and large-scale software maintenance. Current research is converging on modular, neuro-symbolic, and RL-enriched architectures, equipped to exploit detailed feedback and large code bases, and setting the agenda for autonomous, context-aware self-healing software (Chhatbar et al., 2020, Yasunaga et al., 2021, Li et al., 2022, Sun et al., 19 Sep 2025, Fu et al., 15 Oct 2025, Wang et al., 3 Feb 2026, Zhang et al., 19 Apr 2026, Du et al., 14 May 2026, Fu et al., 23 Feb 2026).