Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions

Published 17 Apr 2026 in cs.CV | (2604.15917v1)

Abstract: Instruction-guided image editing has advanced substantially with recent generative models, yet it still fails to produce reliable results across many seemingly simple cases. We observe that a large portion of these failures stem not from insufficient model capacity, but from poorly formulated editing tasks, such as those involving small targets, implicit spatial relations, or under-specified instructions. In this work, we frame image editing failures as a task formulation problem and propose an adaptive task reformulation framework that improves editing performance without modifying the underlying model. Our key idea is to transform the original image-instruction pair into a sequence of operations that are dynamically determined and executed by an MLLM agent through analysis, routing, reformulation, and feedback-driven refinement. Experiments on multiple benchmarks, including ImgEdit, PICA, and RePlan, across diverse editing backbones such as Qwen Image Edit and Nano Banana, show consistent improvements, with especially large gains on challenging cases. These results suggest that task reformulation is a critical but underexplored factor, and that substantial gains can be achieved by better matching editing tasks to the effective operating regime of existing models.

Authors (9)

Summary

The paper shows that adaptive task reformulation reduces editing failures by aligning task specifications with the model’s strengths.
It details a modular framework that integrates query profiling, routing, and agentic multi-step planning for robust image editing.
Experimental results demonstrate significant improvements in target isolation and spatial consistency over traditional direct editing methods.

Adaptive Task Reformulation for Robust Instruction-Guided Image Editing

Introduction and Motivation

Instruction-guided image editing remains a critical capability in multimodal generative models, enabling direct manipulation of images conditioned on natural language requests. Despite rapid progress, state-of-the-art image editing models frequently fail in cases involving small targets, implicit spatial relations, or under-specified instructions. The fundamental observation presented in "Making Image Editing Easier via Adaptive Task Reformulation with Agentic Executions" (2604.15917) is that a significant fraction of these failures originate not from insufficient model expressivity, but rather from a misalignment between the task specification and the model’s effective operating regime.

The central claim is that performance deficits in existing editing pipelines can often be attributed to suboptimal input task formulations. Reformulating the image-instruction pair into structured subproblems aligned with model strengths improves editing reliability substantially, even when the generative backbone remains unmodified. This shifts the emphasis from scaling model capacity toward context-aware task presentation, with broad implications for practical multimodal system deployment.

Framework Overview

The Adaptive Task Reformulation (ATR) framework centers on agentic execution: a multimodal LLM (MLLM)-based agent performs dynamic, multi-step planning, guided by a routing mechanism informed by structured query profiling.

Given an input image and instruction, the process decomposes as follows:

Query Profiling: Extraction of semantic targets, spatial dependencies, and instruction ambiguity via an edit profiler.
Routing: Classification of each task into one of three reformulation strategies: direct/rewrite execution, spatial decoupling, or localized editing.
Agentic Planning: Multi-step rollout using custom toolchains (e.g., instruction rewriting, target segmentation, localized cropping, smart pasting), with continuous feedback and fallback safety for robustness.
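The profiling and routing steps above can be pictured with a small rule-based sketch. This is only an illustration under stated assumptions: the source describes routing as performed by an MLLM agent, whereas the code below substitutes fixed rules, and the names `EditProfile`, `choose_route`, `should_rewrite`, and the `small_target_threshold` value are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    # The three reformulation strategies named in the text.
    DIRECT_OR_REWRITE = "direct/rewrite execution"
    SPATIAL_DECOUPLING = "spatial decoupling"
    LOCALIZED_EDITING = "localized editing"


@dataclass
class EditProfile:
    """Hypothetical profiler output; field names are illustrative only."""
    target_area_ratio: float      # target area / image area, in [0, 1]
    has_spatial_dependency: bool  # e.g. the edit moves or removes an object
    is_ambiguous: bool            # the instruction under-specifies the target


def choose_route(profile: EditProfile, small_target_threshold: float = 0.05) -> Route:
    """Rule-based stand-in for the routing step (assumed rules, not the paper's)."""
    if profile.has_spatial_dependency:
        return Route.SPATIAL_DECOUPLING
    if profile.target_area_ratio < small_target_threshold:
        return Route.LOCALIZED_EDITING
    return Route.DIRECT_OR_REWRITE


def should_rewrite(profile: EditProfile) -> bool:
    """Within the direct/rewrite route, rewrite only when the instruction is ambiguous."""
    return profile.is_ambiguous


if __name__ == "__main__":
    tiny_target = EditProfile(target_area_ratio=0.01,
                              has_spatial_dependency=False,
                              is_ambiguous=False)
    print(choose_route(tiny_target).value)
```

A real profiler would presumably emit richer structure than three fields; the point of the sketch is only that routing consumes a structured profile rather than the raw instruction.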

This modular architecture allows ATR to dynamically adapt execution strategy to the latent structure of each editing request, optimizing instruction formulation, region focus, and composition mechanics according to perceived failure risk.

Figure 1: Overview of the ATR framework, illustrating the pipeline from query profiling through reformulation routing to agentic multi-stage execution.

Pilot Study and Failure Typology

Through a comprehensive analysis of common editing failures on real-world benchmarks (e.g., PICA, ImgEdit), the paper establishes a clear empirical taxonomy. Categories such as target ambiguity, local entanglement, structural dependency, and scene-wide consistency mismatch were systematically shown to benefit from distinct reformulation interventions. Notably, for each identified failure type, at least one reformulation strategy (instruction rewriting, cropping, or spatial disentanglement) reliably outperforms direct mapping.

This motivates the ATR design: task-aware routing, rather than a single fixed editing pipeline, widens the range of requests that existing editing models handle reliably.

Route-Conditioned Agentic Execution

Reformulation Strategies

Direct/Rewrite Execution: Applied to well-grounded or only semantically ambiguous queries. The agent decides between direct editing and instruction rewriting for clarity before model invocation.
Spatial Decoupling: Decomposes complex transformations involving strong physical or structural coupling (e.g., relocation, object removal with context preservation) into sequential sub-tasks: segmentation, region erasure, content-aware completion, and spatial recomposition.
Localized Editing: Addresses manipulation of small or weakly grounded targets by dynamically cropping image regions to maximize signal-to-noise. Edits are conducted at high effective resolution, and the results are subsequently re-integrated with context-aware blending.
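To make the localized editing route concrete, the following sketch shows the crop geometry only: growing a target bounding box by a margin, clamping it to the image, and computing how much the crop is enlarged when resized for editing. The function names, the `margin` default, and the `working_size` of 1024 are assumptions chosen for illustration; the paper's actual cropping and blending tools are not specified in this summary.

```python
def expand_crop_box(bbox, image_size, margin=0.5):
    """Grow a target box by `margin` (a fraction of its width and height)
    on each side, clamped to the image bounds.

    bbox is (x0, y0, x1, y1) in pixels; image_size is (width, height).
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    pad_x = int((x1 - x0) * margin)
    pad_y = int((y1 - y0) * margin)
    return (
        max(0, x0 - pad_x),
        max(0, y0 - pad_y),
        min(width, x1 + pad_x),
        min(height, y1 + pad_y),
    )


def upscale_factor(crop_box, working_size=1024):
    """Enlargement applied if the crop's longer side is resized to `working_size`."""
    x0, y0, x1, y1 = crop_box
    return working_size / max(x1 - x0, y1 - y0)


if __name__ == "__main__":
    # A 60x40 px target in a 1920x1080 image.
    crop = expand_crop_box((900, 500, 960, 540), (1920, 1080))
    print(crop)                  # crop box to edit in isolation
    print(upscale_factor(crop))  # how much larger the target appears to the editor
```

In this example the target spans 60 of 1920 pixels in the full image but half the width of the crop, which is the sense in which cropping raises the effective resolution of the edit. The top-left corner of the crop box is also the offset at which the edited crop would be pasted back.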

Each route employs route-specific toolchains (SAM-based segmentation, offset calculators, instruction rewriting modules) and bounded fallback mechanisms, forming a closed-loop system where intermediate outputs are continually evaluated for correctness prior to termination.
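The closed-loop behavior with bounded fallback can be sketched as a verify-then-accept loop. This is a minimal sketch, assuming each route attempt and the verifier are plain callables; `run_with_fallback` and its parameters are hypothetical names, and the real system presumably uses the MLLM agent as the verifier.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def run_with_fallback(
    attempts: Sequence[Callable[[], T]],
    verify: Callable[[T], bool],
    fallback: Callable[[], T],
) -> T:
    """Try each attempt in order and return the first result that passes `verify`.

    The loop is bounded by the number of attempts; if none passes,
    the result of `fallback` (e.g. a plain direct edit) is returned unverified.
    """
    for attempt in attempts:
        result = attempt()
        if verify(result):
            return result
    return fallback()


if __name__ == "__main__":
    # Strings stand in for edited images in this toy run.
    outcome = run_with_fallback(
        attempts=[lambda: "blurry edit", lambda: "clean edit"],
        verify=lambda image: image == "clean edit",
        fallback=lambda: "direct edit",
    )
    print(outcome)
```

Passing a finite sequence of attempts keeps the retry budget explicit, so a verifier that never accepts cannot stall the pipeline.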

Figure 3: Detailed execution flow for instruction rewriting (Route A2), highlighting sequential clarification and action steps.

Figure 5: Execution flow for spatial decoupling (Route B), illustrating region segmentation, target relocation, and region completion.
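The ordering of sub-tasks in spatial decoupling (segment, erase, complete, recompose) can be illustrated on a toy grid. This is a deliberately simplified sketch: the mask is given rather than produced by a segmentation model, and a constant `background` fill stands in for content-aware completion. The function `relocate` is a hypothetical name, not a tool from the paper.

```python
def relocate(grid, mask, offset, background):
    """Move the masked cells of a 2D grid by offset (dx, dy).

    grid and mask are lists of rows of equal shape; mask cells are booleans.
    Vacated cells are filled with `background`, a placeholder for
    content-aware completion. Cells moved outside the grid are dropped.
    """
    height, width = len(grid), len(grid[0])
    dx, dy = offset
    out = [row[:] for row in grid]

    # Steps 1-2: erase the object and fill the hole it leaves behind.
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                out[y][x] = background

    # Step 3: recompose the object at its new position.
    for y in range(height):
        for x in range(width):
            if mask[y][x]:
                new_y, new_x = y + dy, x + dx
                if 0 <= new_y < height and 0 <= new_x < width:
                    out[new_y][new_x] = grid[y][x]
    return out


if __name__ == "__main__":
    image = [[0, 0, 0],
             [7, 0, 0]]
    object_mask = [[False, False, False],
                   [True, False, False]]
    for row in relocate(image, object_mask, offset=(2, -1), background=0):
        print(row)
```

Erasing before pasting matters when the source and destination regions overlap: pasting first would let the erase step wipe out part of the relocated object.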

Experimental Results

Comprehensive evaluation on ImgEdit, PICA, and RePlan benchmarks demonstrates that ATR delivers systematic improvements across model architectures and task categories. The gains are especially pronounced in challenging scenarios identified by the pilot study.

Quantitative: On ImgEdit-Hard, Qwen-Edit-ATR scores 4.13 versus 3.57 for the direct-edit baseline, a gain of 0.56. ATR-enhanced lightweight models (e.g., Nano Banana-ATR) approach or even marginally outperform their much larger Pro variants.
Qualitative: ATR consistently prevents common failure modes such as target disappearance, localization error, and context hallucination. It excels at fine-grained target isolation, preservation of global structure, and precise compositional edits.

Figure 6: Qualitative results on ImgEdit, with ATR exhibiting superior target isolation and attribute editing compared to direct editing baselines.

Figure 7: Results on PICA; ATR circumvents failures such as spatial translation error and object erasure, achieving pixel-level structural decoupling.

Additional supplementary figures further support generalization across diverse edit types, scene complexities, and model sizes.

Figure 8: Additional qualitative results on PICA demonstrating accurate region isolation and integration.

Figure 2: Additional qualitative results on RePlan showing refined instruction grounding and spatial reference resolution.

Analysis and Ablations

Ablation studies confirm the critical contributions of each ATR module:

Routing and Context Awareness: Incremental introduction of semantic rewriting, spatial logic, and local cropping each produces tangible performance gains; context-conditioned routing achieves the highest composite scores.
Agentic Pipeline Robustness: Closed-loop execution with terminal verification and fallback mechanisms is essential, as it minimizes the performance gap to oracle-controlled pipelines.

The detailed toolchain definitions reveal how explicit rel1000 coordinate systems, deterministic instruction rewriting, and compositional blending mechanics stabilize execution and limit degenerate transformations.
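The summary does not define "rel1000". One plausible reading, and it is only an assumption here, is a relative coordinate system in which box coordinates are normalized to the range 0 to 1000 along each image axis, independent of resolution. Under that assumption, conversion to and from pixel coordinates might look like the sketch below; both function names are hypothetical.

```python
def rel1000_to_pixels(box, image_size):
    """Convert an (x0, y0, x1, y1) box from assumed 0-1000 relative
    coordinates to pixel coordinates for an image of (width, height)."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        round(x0 * width / 1000),
        round(y0 * height / 1000),
        round(x1 * width / 1000),
        round(y1 * height / 1000),
    )


def pixels_to_rel1000(box, image_size):
    """Inverse conversion: pixel coordinates to assumed 0-1000 relative coordinates."""
    x0, y0, x1, y1 = box
    width, height = image_size
    return (
        round(x0 * 1000 / width),
        round(y0 * 1000 / height),
        round(x1 * 1000 / width),
        round(y1 * 1000 / height),
    )


if __name__ == "__main__":
    pixel_box = rel1000_to_pixels((250, 500, 750, 1000), (1920, 1080))
    print(pixel_box)
    print(pixels_to_rel1000(pixel_box, (1920, 1080)))
```

A resolution-independent convention of this kind would let the agent hand the same box to tools that operate on crops or resized copies of the image, which may be part of why the source credits it with stabilizing execution.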

Limitations and Future Work

Despite substantial progress, the ATR paradigm exposes fundamental challenges in balancing local precision and global attribute consistency, particularly for tasks demanding simultaneous fine-grained manipulation and holistic style preservation. Cases in which perfect local edits introduce global visual mismatches (e.g., color space violations, style inconsistencies) remain unsolved by current agentic routing. Addressing these limitations will require future advances in joint local-global reasoning, possibly via cross-modal context propagation or dual-level agent hierarchies.

Figure 4: Illustration of framework limitations, such as local-global style conflict during editing in highly structured scenes.

Theoretical and Practical Implications

The introduction of ATR reframes image editing reliability as an inference-time adaptation problem, rather than a parameter scaling challenge. By demonstrating that strategic task reformulation can enable smaller models to challenge large-scale strong baselines, the work suggests a practical path toward efficient, robust editing pipelines in real-world deployments—especially pertinent for cloud-edge and resource-constrained settings. Furthermore, this underscores the value of MLLM-based agents as inference-time planners, seamlessly integrating multimodal toolchains under dynamic, context-sensitive policies.

Conclusion

The ATR framework provides strong evidence that adaptive task reformulation is a key—yet previously underexploited—lever for improving editing performance across challenging scenarios. By decoupling instruction and execution logic and leveraging agentic rollouts informed by structured profiling, ATR significantly narrows the remaining gap between generative model capacity and real-world editing reliability. Future AI systems for multimodal content manipulation will likely integrate similar adaptive, agent-driven design principles to further advance the robustness and fidelity of user-guided generation.