FineEdit: Fine-Grained Image Edit with Bounding Box Guidance

Published 13 Apr 2026 in cs.CV | (2604.10954v1)

Abstract: Diffusion-based image editing models have achieved significant progress in real world applications. However, conventional models typically rely on natural language prompts, which often lack the precision required to localize target objects. Consequently, these models struggle to maintain background consistency due to their global image regeneration paradigm. Recognizing that visual cues provide an intuitive means for users to highlight specific areas of interest, we utilize bounding boxes as guidance to explicitly define the editing target. This approach ensures that the diffusion model can accurately localize the target while preserving background consistency. To achieve this, we propose FineEdit, a multi-level bounding box injection method that enables the model to utilize spatial conditions more effectively. To support this high precision guidance, we present FineEdit-1.2M, a large scale, fine-grained dataset comprising 1.2 million image editing pairs with precise bounding box annotations. Furthermore, we construct a comprehensive benchmark, termed FineEdit-Bench, which includes 1,000 images across 10 subjects to effectively evaluate region based editing capabilities. Evaluations on FineEdit-Bench demonstrate that our model significantly outperforms state-of-the-art open-source models (e.g., Qwen-Image-Edit and LongCat-Image-Edit) in instruction compliance and background preservation. Further assessments on open benchmarks (GEdit and ImgEdit Bench) confirm its superior generalization and robustness.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces a novel diffusion-based image editing framework that integrates bounding box guidance via a two-stage multi-level architecture.
The method combines early and deep fusion techniques to inject clear spatial priors, enhancing both edit localization and background integrity.
Reinforcement learning with decoupled rewards for ROI and background enables high-fidelity, robust edits validated by superior quantitative and qualitative benchmarks.

FineEdit: A Multi-Level Architecture for Fine-Grained Diffusion-Based Image Editing with Bounding Box Guidance

Introduction and Motivation

FineEdit (2604.10954) addresses fundamental limitations of current diffusion-based image editing models, particularly the inability of text-only guidance to achieve precise spatial localization and background consistency in complex multi-object scenes. Existing paradigms, including mask-based and global structural priors, either lack fine control over edit regions or suffer from boundary leakage and background artifacts. FineEdit integrates explicit bounding box guidance via a two-stage multi-level architecture, unifying token-level fusion in both shallow and deep transformer layers, and augments learning with a decoupled region/background reward during post-training. The resulting system enables precise, high-fidelity, and robust controllable image editing.

Figure 1: Overview of FineEdit framework, which includes two training stages: (a) Pre-training stage establishes multi-level spatial priors using early and deep fusion. (b) Post-training stage applies reinforcement learning with a novel decoupled reward function.

Architecture and Training

Multi-Level Spatial Condition Injection

FineEdit adapts the MM-DiT backbone with two orthogonal fusion mechanisms:

Early Fusion: Encodes bounding box masks at the input space and concatenates masked source latents with noisy target latents along the channel dimension. This provides strong spatial priors explicitly localizing the edit area.
Deep Fusion: Employs a side-adapter transformer that injects features extracted from the masked source image into intermediate transformer layers, enhancing alignment and accelerating convergence while maintaining the integrity of both ROI and the background manifold.

Ablation demonstrates that exclusive reliance on either strategy is suboptimal; only the joint application realizes both convergence and optimal quantitative/qualitative results.

Reinforcement Learning with Decoupled Rewards

FineEdit introduces a post-training RL stage with a decoupled reward:

Foreground (ROI) Reward: Assesses semantic compliance within the bounding box using Qwen3-VL to estimate instruction adherence.
Background Reward: Utilizes a fusion of model-based (e.g., semantic similarity) and pixel-level metrics (e.g., PSNR), measuring fidelity outside the ROI.
This design overcomes the reward signal entanglement inherent in global VLM evaluation—enabling a controlled tradeoff between localized edit fidelity and global context preservation. The ablation validates superior background consistency and increased ROI accuracy compared to global reward RL.

Unified Visual Instruction Paradigm

FineEdit utilizes a unified interface for visual instructions, meaning the same underlying architecture handles:

Local Edits: Arbitrary object addition, replacement, or removal within specified boxes.
Global Edits: Applying edits over the entire image when the bounding box encompasses the whole spatial domain.
Hybrid Edits: Simultaneous multi-object edits using multiple bounding box cues.
Figure 2: Unified visual instructions for diverse editing settings: localized style transfer, single-bbox composite editing, and multi-bbox composite editing.

A key capability—demonstrated in extensive qualitative examples—is bounded style transfer, where stylistic operations are effectively restricted to an object or region without undesired global spillover.

Dataset and Benchmark Construction

FineEdit-1.2M Dataset

FineEdit-1.2M is a 1.2M pair dataset with strict filtering, bounding box annotation, and human-in-the-loop/automated multi-stage validation:

Data Sources: Combinations of LAION-Aesthetic-6M, Megalith-10M, diverse perception datasets, and internal images.
U-GAF Pipeline: Integrates automated object localization (Co-DETR), bounding box filtering, prompt curation (InternVL3), rigorous edit synthesis (Flux1-kontext-dev, Qwen-Image-Edit), and a multi-stage filtering regime using IoU, RGB entropy, VLM scoring, and EditScore validation.
Figure 3: The U-GAF Pipeline integrates data curation, annotation, edit synthesis, and refinement for dataset construction.
Evaluation Dataset: FineEdit-1k—1,000 images across broad categories, manually validated with pixel-differencing and realignment between pre- and post-edit bounding boxes.
Figure 4: Dataset analysis showing distributions and human evaluation accuracy—demonstrating precise annotation and broad diversity.

Experimental Results

Quantitative Benchmarks

FineEdit establishes SOTA across broad image editing benchmarks, outperforming Qwen-Image-Edit, LongCat-Image-Edit, Step1X, BAGEL, OmniGen2, and FireRed-Image-Edit. Evaluations show:

FineEdit-1k Bench: PSNR 34.4, SSIM 0.91, LPIPS 0.08 (background); CLIP 0.27, Prompt Compliance (PC) 4.67, Visual Naturalness (VN) 4.71, Physical Detail Integrity (PDI) 4.69 (region).
ImgEdit-Bench, GEdit-Bench: Consistently highest or near-highest human-rated compliance, naturalness, detail, and out-of-box retention across all edit types.
Human Evaluation: Blind preference trials decisively select FineEdit over baseline SOTA architectures.
Figure 5: Comparison on FineEdit-1k Evaluation Bench—FineEdit achieves superior scores on all metrics.

Figure 6: Comparison of human evaluation win rates between FineEdit and competing methods.

Qualitative and Robustness Analyses

FineEdit delivers artifact-free, semantically accurate, and visually natural edits under all settings, including complex multi-object, style transfer, and text edits. Robustness analyses validate insensitivity to bounding box scale variation and tolerate moderate spatial perturbation in box definition.

Figure 7: Qualitative Results Across Diverse Editing Tasks, illustrating the flexibility and high fidelity of FineEdit.

Figure 8: Robustness to Various Bounding Box Scales and Ratios, maintaining edit quality from small to large regions.

Figure 9: Robustness to Bounding Box Perturbations, highlighting resilience to imprecise user input.

Ablation Study

Ablation experiments systematically validate:

Multi-Level Fusion Superiority: Early or deep fusion mechanisms alone yield suboptimal localization or convergence; only their combination achieves top performance across both background and ROI metrics.
Decoupled RL Reward: Isolating ROI and background rewards directly improves fine-grained edit accuracy and prevents background collapse observed under global reward RL.

Theoretical and Practical Implications

FineEdit substantiates a paradigm shift in controllable diffusion image editing. Its architecture demonstrates that explicit spatial priors, injected at both low and deep representational levels and paired with decoupled localized reward optimization, are the keys to high-precision and robust controllable region editing. This finding generalizes to any editing system where fine synchronization between instruction, spatial prior, and latent generation is required.

Practically, FineEdit enables real-world deployment in scenarios that demand reliable object manipulation without background drift—relevant in professional graphics, AR/VR content authoring, commercial image editing, and interactive human-in-the-loop creative systems.

Limitations and Future Work

While FineEdit advances spatial control, it relies on rectangular bounding box cues; it cannot handle free-form, irregular, or fine-shape specified edits. Extending the interface to include point, scribble, and polygonal inputs, while maintaining training efficiency and edit fidelity, is a key next step. Future architectures may generalize the multi-level spatial condition framework to arbitrary visual prompts.

Conclusion

FineEdit introduces a cohesive architectural and dataset innovation, solving key challenges in fine-grained, background-preserving, and controllable diffusion-based image editing. Its technical contributions—including multi-level spatial instruction injection, a decoupled RL reward strategy, and the accompanying high-precision dataset and benchmarks—yield performance improvements substantiated by both quantitative and qualitative evaluation. The methodology and analysis inform continued progress toward reliable and flexible visual instruction editing systems in large-scale vision models.

Markdown Report Issue