SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?

Published 28 Apr 2026 in cs.SE and cs.AI | (2604.25737v1)

Abstract: Instructed code editing is a significant challenge for LLMs. On the EditBench benchmark, 39 of 40 evaluated models obtain a task success rate (TSR) below 60 percent, highlighting a gap between general code generation and the ability to perform instruction-driven editing under executable test constraints. To address this, we propose SAFEdit, a multi-agent framework for instructed code editing that decomposes the editing process into specialized roles to improve reliability and reduce unintended code changes. A Planner Agent produces an explicit, visibility-aware edit plan, an Editor Agent applies minimal, literal code modifications, and a Verifier Agent executes real test runs. When tests fail, SAFEdit uses a Failure Abstraction Layer (FAL) to transform raw test logs into structured diagnostic feedback, which is fed back to the Editor to support iterative refinement. We compare SAFEdit against both prior single-model results reported for EditBench and an implemented ReAct single-agent baseline under the same evaluation conditions. We used EditBench to evaluate SAFEdit on 445 code editing instances in five languages (English, Polish, Spanish, Chinese, and Russian) under varying spatial context variants. SAFEdit achieved 68.6 percent TSR, outperforming the single-model baseline by 3.8 percentage points and the ReAct single-agent baseline by 8.6 percentage points. The iterative refinement loop was found to contribute 17.4 percentage points to SAFEdit's overall success rate. SAFEdit's automated error analysis further indicates a reduction in instruction-level hallucinations compared to single-agent approaches, providing an additional framework component for interpreting failures beyond pass or fail outcomes.

Abstract PDF Upgrade to Chat

Authors (7)

Summary

The paper presents a multi-agent system, SAFEdit, that decomposes code editing into planning, editing, and verification to enhance reliability.
SAFEdit achieves a task success rate boost by up to 17.4 percentage points and eliminates regression errors through iterative correction cycles.
Experimental results demonstrate better multilingual performance and robustness compared to single-agent approaches like ReAct and conventional LLM pipelines.

Structured Multi-Agent Decomposition for Reliable Instructed Code Editing: The SAFEdit Framework

Introduction

The paper "SAFEdit: Does Multi-Agent Decomposition Resolve the Reliability Challenges of Instructed Code Editing?" (2604.25737) provides a rigorous exploration of reliability in LLM-based instructed code editing, with a focus on architectural decomposition. Prior studies demonstrated persistent limitations in LLMs on the EditBench benchmark, with most models achieving task success rates (TSR) below 60% under executable test constraints. SAFEdit is introduced as a multi-agent system for code editing, instantiated via CrewAI, which splits the task into planning, editing, and verification roles. The central hypothesis is that explicit decomposition, together with iterative, execution-grounded refinement, addresses inherent reliability limitations of single-model and single-agent pipelines.

Figure 1: Overview of the SAFEdit framework, detailing Planner, Editor, and Verifier agents operating in an iterative refinement loop with structured feedback and error taxonomy.

Methodology: Agent Design and Iterative Pipeline

Task and Input Representation

Instructed code editing requires transforming existing code according to natural-language instructions and passing associated unit test suites. The paper operationalizes EditBench’s spatial context variants—CODE ONLY, HIGHLIGHT, HIGHLIGHT + CURSOR POSITION—to assess the influence of localization cues on editing performance. The dataset comprises 445 filtered, validated editing tasks spanning English, Polish, Spanish, Chinese, and Russian.

SAFEdit Agent Roles

Planner Agent: Receives instructions and visible code, producing structured edit plans with semantic intent, target location, explicit modification actions, and constraints, while adhering strictly to the information relevant for the selected visibility variant.
Editor Agent: Given the edit plan, applies minimal, literal changes to the original code, preserving invariants and formatting. Edits are restricted to instructed fragments, minimizing collateral modifications.
Verifier Agent: Executes the proposed code using real unit tests in a sandbox, generating pass/fail signals and granular logs.

When tests fail, a deterministic Failure Abstraction Layer (FAL) produces structured diagnostic feedback—mapping exceptions, expected/actual values, and repair suggestions, enabling targeted iterative refinement. SAFEdit permits up to three correction cycles, balancing improvement against computational cost.

Experimental Setup and Baselines

Baseline Comparison

SAFEdit is benchmarked against:

EditBench Single-Model Baselines: Zero-shot LLMs interact via prompt-response editing.
ReAct Baseline: Single-agent iterative code editing via the reasoning–acting paradigm, instantiated with the same GPT-4.1 backbone for parity. Both operate under identical iteration budgets and test infrastructure.

Evaluation Metrics

Task Success Rate (TSR): Fraction of tasks where the edited code passes all unit tests.
Iteration Efficiency: Average refinement steps per task.
First-Try Success Rate: Proportion resolved in the initial edit.
Failure Taxonomy Distribution: Classification of errors—Instruction Hallucination (IH), Implementation Gap (IG), Regression Error (RE), Context Misalignment (CM)—using LLM-based analysis for interpretability.

Results: Quantitative and Qualitative Performance

Aggregate Performance

SAFEdit achieved 68.6% TSR under the HIGHLIGHT context, outperforming ReAct by +8.6pp and the strongest EditBench leaderboard model (claude-sonnet-4) by +3.8pp, all using identical backbone models.

Figure 2: Comparative TSR under HIGHLIGHT visibility for SAFEdit, ReAct, and EditBench leaderboard baselines.

Multilingual Robustness and Variant Sensitivity

Across languages, SAFEdit consistently outperformed ReAct (improvements from +5.7pp to +12.4pp), suggesting the benefits are not language-specific. Performance was stable across context variants in English, with minor regressions in some non-English settings when introducing fine-grained spatial cues. Notably, SAFEdit’s planner abstraction mitigated context-sensitivity, promoting robust localization and intent extraction.

Refinement contributed an average of +17.4pp to final TSR. About half of tasks were solved on the first pass, with most remaining failures resolved within one or two additional correction cycles. The structured feedback loop via the FAL prevented over-editing and guided targeted repairs.

Failure Taxonomy Insights

Failure modes for SAFEdit and ReAct diverged. ReAct exhibited dominant IG errors (failure to implement the correct change), with non-trivial regression errors in several languages. SAFEdit distributed failures across IH and IG, entirely eliminating regression errors (0.0%) in all languages and minimizing context misalignment. This demonstrates qualitative improvements beyond aggregate TSR.

Figure 3: Stacked error distributions for failed editing tasks by root-cause taxonomy categories across evaluated languages (SAFEdit and ReAct).

Discussion

Structured decomposition in SAFEdit yielded measurable improvements over single-agent approaches, with robustness to varying context granularity and language. Architectural separation enabled each agent to specialize—reducing instruction hallucination, over-editing, and regression errors. Iterative correction was computationally efficient, leveraging deterministic structured feedback versus raw log parsing, with minimal LLM overhead.

The empirical results establish that reliable code editing is not a mere function of LLM scale, but critically depends on explicit process structure, role separation, and execution-grounded verification. The findings have implications for agentic system design and benchmarking: multi-agent orchestration offers interpretable error signals, reduces collateral damage, and promotes reproducible editing outcomes.

The taxonomy-based analysis revealed that structural improvements reshape—not just reduce—failure profiles, highlighting the necessity of qualitative evaluation metrics alongside aggregate pass/fail rates.

Implications and Future Directions

SAFEdit demonstrates the efficacy of multi-agent decomposition for instructed editing under real-world constraints, establishing a new reliability bar for EditBench-style tasks. Practical adoption of such frameworks may require consideration of cost and latency, though average iteration counts were low.

Future research avenues include:

Extending agentic editing to broader, repository-level scenarios, integrating more complex propagation and dependency management.
Evaluating architectural decomposition across diverse backbone models and longer correction horizons.
Enhancing taxonomy with human-in-the-loop evaluations for richer, more granular failure interpretability.

Conclusion

The SAFEdit framework illustrates that multi-agent decomposition, iterative structured refinement, and explicit feedback mechanisms substantially enhance reliability in instructed code editing. Achieving consistent gains in both aggregate and qualitative metrics, SAFEdit sets a precedent for building trustworthy LLM-powered developer tools where task-localization, execution validation, and regression avoidance are critical. Architectural design, not just backbone capability, is shown to be a key determinant of real-world editing effectiveness.

Markdown Report Issue