Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications

Published 19 Apr 2026 in cs.SE and cs.AI | (2604.17464v1)

Abstract: The transition from neural machine translation to agentic workflows has revolutionized Automated Program Repair (APR). However, existing agents, despite their advanced reasoning capabilities, frequently suffer from the "Intent Gap": the misalignment between the generated patch and the developer's original intent. Current solutions relying on natural language summaries or adversarial sampling often fail to provide the deterministic constraints required for surgical repairs. In this paper, we introduce Prometheus, a novel framework that bridges this gap by prioritizing Specification Inference over code generation. We employ Behavior-Driven Development (BDD) as an executable contract, utilizing a multi-agent architecture to reverse-engineer Gherkin specifications from runtime failure reports. To resolve the "Hallucination of Intent," we propose a Requirement Quality Assurance (RQA) Loop, a mechanism that leverages ground-truth code as a proxy oracle to validate inferred specifications. We evaluated Prometheus on 680 defects from the Defects4J benchmark. The results are transformative: our framework achieved a total correct patch rate of 93.97% (639/680). More significantly, it demonstrated a Rescue Rate of 74.4%, successfully repairing 119 complex bugs that a strong blind agent failed to resolve. Qualitative analysis reveals that explicit intent guides agents away from structurally invasive over-engineering toward precise, minimal corrections. Our findings suggest that the future of APR lies not in larger models, but in the capability to align code with verified, Executable Specifications, whether pre-existing or reverse-engineered.


Authors (2)

Summary

The paper introduces a multi-agent framework that infers and validates executable BDD specifications to closely align program repairs with developer intent.
The methodology employs distinct roles—Architect, Engineer, and Fixer—to diagnose failures, validate intent, and apply targeted patches.
Empirical results show a 93.97% correct patch rate and 74.4% rescue rate, demonstrating significant improvement over state-of-the-art APR techniques.

Executable Specification-Guided Agentic Program Repair with Prometheus

Introduction

The paper "Project Prometheus: Bridging the Intent Gap in Agentic Program Repair via Reverse-Engineered Executable Specifications" (2604.17464) presents Prometheus, a multi-agent framework designed to overcome the persistent "Intent Gap" present in current Automated Program Repair (APR) pipelines. Contrasting with approaches that depend primarily on natural language summaries or test-driven optimization, Prometheus foregrounds the inference and rigorous validation of Behavior-Driven Development (BDD) executable specifications as the central artifact for aligning repair agents with developer intent.

Motivation and Problem Analysis

The proliferation of agentic workflows in APR, powered by LLMs and multi-agent collaboration, has enabled greater semantic reasoning and context awareness in code repair tasks. Nonetheless, these systems are fundamentally limited by ambiguity in intent extraction—relying on proxies such as natural language function summaries, which are inherently descriptive rather than prescriptive, or on dynamically generated adversarial test cases, which lack determinism and contractual rigor. This frequently leads to hallucinated or "Berserker-style" fixes: structurally invasive, contextually unfocused patches that diverge from the intended resolution.

The Prometheus framework is motivated by the hypothesis that agentic code repair can only attain robust minimality and precision through tightly constrained, executable requirements. The authors formalize the "Intent-Behavior Mirroring Effect," wherein the scope and structural invasiveness of an agent's code modifications reflect the ambiguity or precision of the input requirement. As such, explicit, executable specifications are posited as a prerequisite for "Surgical-style" minimal repair.

Framework Overview

Prometheus introduces a modular, role-specialized agentic workflow:

Figure 1: The Prometheus architecture delineating the Architect, Engineer, and Fixer roles across specification inference, validation, and repair.

Architect (Gemini-3.0-Pro): Given a runtime failure report and a buggy codebase, this agent performs fault localization, structured root cause analysis, and synthesizes the developer's missing intent as a BDD (Gherkin) specification.
Engineer: Executes an RQA (Requirement Quality Assurance) Loop, using ground-truth code as a proxy oracle to validate that the synthesized BDD scenario captures the specific defect: the scenario must fail on the buggy code and pass on the fixed code.
Fixer (Qwen-3.0-Coder): Guided by the validated BDD scenario, applies a "specification-first" patch, limited to the primary suspicious file.

This design decouples intent inference, validation, and patch generation, so that the effect of specification precision can be measured separately from fault localization and code search.
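The hand-off between the three roles can be sketched as a small pipeline. This is an illustrative sketch only, not the authors' implementation: the type names (`FailureReport`, `Spec`), the role signatures, and the `repair` function are all hypothetical, and the agents are plain callables standing in for LLM-backed components. For simplicity the sketch stops when the Engineer rejects a specification, whereas the loop described in the paper may retry inference.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class FailureReport:
    failing_test: str
    message: str


@dataclass
class Spec:
    suspicious_file: str  # primary file the Architect localized the fault to
    gherkin: str          # inferred BDD scenario text


# Hypothetical role signatures (assumptions made for this sketch).
Architect = Callable[[FailureReport, Dict[str, str]], Spec]  # report + sources -> spec
Engineer = Callable[[Spec], bool]                            # spec -> validated?
Fixer = Callable[[Spec, str], str]                           # spec + file text -> patched text


def repair(
    report: FailureReport,
    sources: Dict[str, str],
    architect: Architect,
    engineer: Engineer,
    fixer: Fixer,
) -> Optional[Dict[str, str]]:
    """Run inference, validation, and patching as three separate stages."""
    spec = architect(report, sources)
    if not engineer(spec):
        return None  # unvalidated intent never reaches the Fixer
    patched = dict(sources)  # copy, so the input mapping is left untouched
    # The Fixer sees and rewrites only the primary suspicious file.
    patched[spec.suspicious_file] = fixer(spec, sources[spec.suspicious_file])
    return patched
```

Keeping each role behind its own signature is what allows one stage to be swapped or ablated (for example, running the Fixer without a specification) without touching the others.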

Methodology

Specification Inference and Validation

The framework innovates by tasking the Architect with structured, two-phase cognitive inference: (1) root cause diagnosis via analysis of code and test failures, and (2) formalization of requirements in the Gherkin DSL. The use of BDD scenarios as the format for executable intent ensures that requirements are both cognitively parsable for humans and directly testable as contracts.

A critical advance is the RQA Loop: the Engineer validates specifications by ensuring they fail on buggy code and pass on developer-verified, fixed code (when available). This sandwich verification operationalizes the concept of "intent correctness" in a deterministic manner.
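The sandwich check can be illustrated with a toy example. The sketch below is a minimal illustration under stated assumptions, not the paper's Engineer: the `clamp` functions, both scenarios, and the `rqa_check` helper are hypothetical, and a scenario is modeled as a plain Python predicate rather than an executed Gherkin feature (the Gherkin text appears only as a comment).

```python
from dataclasses import dataclass
from typing import Callable

Impl = Callable[[int, int], int]    # implementation under test
Scenario = Callable[[Impl], bool]   # True when the scenario passes


@dataclass(frozen=True)
class RqaVerdict:
    fails_on_buggy: bool
    passes_on_fixed: bool

    @property
    def accepted(self) -> bool:
        # A scenario captures the defect only if both conditions hold.
        return self.fails_on_buggy and self.passes_on_fixed


def rqa_check(scenario: Scenario, buggy: Impl, fixed: Impl) -> RqaVerdict:
    """Run the scenario against both versions and record the two outcomes."""
    return RqaVerdict(
        fails_on_buggy=not scenario(buggy),
        passes_on_fixed=scenario(fixed),
    )


def buggy_clamp(value: int, upper: int) -> int:
    return value  # defect: the upper bound is ignored


def fixed_clamp(value: int, upper: int) -> int:
    return min(value, upper)


# Scenario: a value above the upper bound is clamped
#   Given an upper bound of 10
#   When the value 15 is clamped
#   Then the result is 10
def precise_scenario(clamp: Impl) -> bool:
    return clamp(15, 10) == 10


# A scenario that never exercises the defect passes on both versions.
def vague_scenario(clamp: Impl) -> bool:
    return clamp(3, 10) == 3
```

Here `rqa_check(precise_scenario, buggy_clamp, fixed_clamp)` is accepted, while `vague_scenario` is rejected because it already passes on the buggy code and so says nothing about the defect.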

Controlled Repair Application

To rigorously evaluate impact, the Fixer's scope is artificially constrained to a single file. This eliminates confounds from multi-file reasoning, attributing empirical results solely to the presence or absence of executable specifications.
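One way such a single-file constraint could be enforced is to inspect the file headers of a candidate patch before applying it. The sketch below is an assumption about how a scope guard might look, not a description of the paper's tooling: `touched_files` and `within_scope` are hypothetical helpers, the patch is assumed to be a unified diff with git-style `a/` and `b/` prefixes, and header detection is a heuristic (a removed line beginning with `-- ` followed by an added line beginning with `++ ` would be misread as a header).

```python
from typing import Set


def touched_files(unified_diff: str) -> Set[str]:
    """Collect file paths named in '---' / '+++' header pairs of a unified diff."""
    files: Set[str] = set()
    lines = unified_diff.splitlines()
    for prev, line in zip(lines, lines[1:]):
        # A file header is a '--- old' line immediately followed by '+++ new'.
        if prev.startswith("--- ") and line.startswith("+++ "):
            for header in (prev, line):
                path = header[4:].split("\t")[0].strip()  # drop optional timestamp
                if path == "/dev/null":
                    continue  # marks a created or deleted file, not a real path
                if path.startswith(("a/", "b/")):
                    path = path[2:]  # strip the git-style prefix
                files.add(path)
    return files


def within_scope(unified_diff: str, allowed_file: str) -> bool:
    """True when the patch touches no file other than the allowed one."""
    return touched_files(unified_diff) <= {allowed_file}
```

A patch that fails this check would be discarded before evaluation, which keeps any measured difference attributable to the specification rather than to a wider edit scope.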

Empirical Evaluation

Prometheus is empirically evaluated on 680 non-trivial defects from 16 popular Java projects (Defects4J v3.0.1), omitting only Closure due to prohibitive build and environment requirements.

Quantitative highlights:

Correct Patch Rate: 93.97% (639/680 defects), a substantial improvement over prior SOTA.
Rescue Rate: 74.4%—explicit BDD intent enabled successful repair of 119/160 bugs intractable to a strong "blind" Fixer baseline.
Comparative advantage: On "hard" bugs, Prometheus outperformed TSAPR and RepairAgent by 4.4× in absolute fixes.
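The headline figures can be cross-checked with a few lines of arithmetic. The counts below are taken from the text; the variable names are illustrative, and the reading that the blind baseline solved the remaining 520 defects (and that Prometheus retained all of them) is an inference from the numbers, not a statement from the source.

```python
total_defects = 680
correct_patches = 639
baseline_failures = 160   # bugs the "blind" Fixer baseline could not repair
rescued = 119             # of those, repaired once a BDD specification was supplied

correct_patch_rate = round(100 * correct_patches / total_defects, 2)
rescue_rate = round(100 * rescued / baseline_failures, 1)

# Inference, not stated in the source: defects the blind baseline solved alone.
baseline_correct = total_defects - baseline_failures

# The reported total equals baseline successes plus rescued bugs.
totals_consistent = (baseline_correct + rescued == correct_patches)

print(correct_patch_rate, rescue_rate, baseline_correct, totals_consistent)
```

Under that reading, the blind baseline alone would sit at 520/680, so the specification accounts for the gap between that figure and the reported 639/680.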

Notably, the overhead for intent inference (Architect) constituted only 6.4% of computational resource expenditure. The validation phase (Engineer) dominated, primarily due to test harness adaptation for legacy and non-standard environments—a tractable challenge in production deployments with persistent suites.

Qualitative Analysis

Prometheus demonstrates that explicit BDD constraints not only increase repair yield but also systematically suppress hallucinated or over-generalized code. Several phenomena are observed:

Reduction of hallucination: E.g., in Chart-13, BDD-rooted requirements averted spurious variable introduction, focusing agent logic on the required context.
Avoidance of "shortcut" repairs: In Chart-7, explicit scenario coverage enforced proper conditional logic, preventing broad regressive fixes.
Beyond ground truth: Enlightened agents sometimes deliver robustness exceeding developer patches, as observed in JxPath-6.

Additionally, the architecture exposes a new class of failure—"Cognitive Dissonance Hallucination"—when the agent is specification-constrained beyond the physical scope of its permitted patch region, stressing the need for unrestricted, multi-file repair.

Discussion and Architectural Implications

The experimental evidence substantiates the central claim: the bottleneck in advanced APR is now the formulation and verification of precise intent, not search or synthesis capability per se. BDD-guided repair contracts are shown to stabilize the reasoning of advanced code agents, inducing minimal, semantically aligned changes and robustly suppressing both overfitting and underfitting phenomena.

Limiting factors include the challenge of capturing implicit behavior (e.g., wire protocols invisible at API level) and the cost of "digital archaeology" necessary for dynamic environment configuration in legacy codebases.

Prometheus provides compelling evidence that scaling agentic repair requires a "specification-first" paradigm, focusing research on effective requirement mining, validation, and integration with multi-agent reasoning architectures.

Future Directions

The authors propose extensions decoupling fault localization (introducing a "Detective" agent), scaling towards system-level multi-file repair, and deeper integration with Agent Skills architectures for persistent, self-evolving specification artifacts. This anticipates a future in which software repositories evolve as "living documentation" with agentically verified, executable specs at their core.

Conclusion

Prometheus repositions APR as primarily an executable intent inference problem. By employing multi-agent workflows for reverse engineering and validation of BDD specifications, it delivers both dramatic repair performance and qualitative gains in patch minimality and semantic alignment. The central implication is direct: future research should prioritize architectures and models that learn to extract and validate developer intent, as opposed to brute-force code generation.

Prometheus substantiates that strong AI-driven repair is not a function of model scale, but of alignment with precise, executable intent. This shifts the research focus in APR and agentic systems at large from "how to synthesize code" to "how to specify and verify intent."
