This paper introduces DCEdit, a novel approach for text-guided image editing with Diffusion Transformer (DiT) models such as FLUX [10, 22]. The primary challenge it addresses is precise semantic localization and editing: modifying specific parts of an image according to a text prompt while preserving the background and unchanged elements. Existing methods often struggle to accurately identify the target semantic regions, especially with complex prompts and high-resolution images. DiT models show better text-image alignment than earlier UNet models, but their raw cross-attention maps still suffer from incomplete activations and semantic entanglement (irrelevant regions activated because related concepts are coupled in the prompt).
To tackle these issues, DCEdit proposes two main components:
- Precise Semantic Localization (PSL): This strategy refines the cross-attention maps ($A_{V\to T}$) extracted from the Multi-Modal DiT (MM-DiT) layers of a DiT model.
- It leverages the visual self-attention map ($A_{V\to V}$), which captures affinities between image tokens, to complete segmented regions in the cross-attention map, addressing internal holes and unclear boundaries.
- It uses the inverse of the textual self-attention map ($M_T^{-1}$) to disentangle semantics. The textual self-attention map ($A_{T\to T}$) reflects how different text tokens (concepts) are coupled; its inverse helps mitigate the activation of irrelevant image regions caused by this coupling.
The final refined attention map $M$ for a specific semantic token is calculated as:

$$M = \mathrm{norm}\big(M_V \cdot \mathrm{Select}\big[A_{V\to T} \cdot M_T^{-1}\big]\big)$$

where $M_V$ and $M_T$ are the fused visual and textual self-attention maps, $\mathrm{Select}[\cdot]$ extracts the map for the target text token, and $\mathrm{norm}(\cdot)$ normalizes the result. This refined map $M$ serves as a precise regional cue.
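The refinement above can be sketched as a few matrix operations. This is a minimal illustration, not the paper's implementation: the attention maps are stand-in numpy arrays, `Select` is assumed to pick the column of the target text token, and `norm` is assumed to be min-max normalization to [0, 1].

```python
import numpy as np

def refine_semantic_map(A_vt, M_v, M_t, token_idx):
    """Sketch of PSL refinement: M = norm(M_V . Select[A_{V->T} . M_T^{-1}]).

    A_vt: (N_img, N_txt) cross-attention (image-to-text) map.
    M_v:  (N_img, N_img) fused visual self-attention map.
    M_t:  (N_txt, N_txt) fused textual self-attention map (assumed invertible).
    token_idx: index of the target text token (e.g. from diff(Ps, Pt)).
    """
    # Disentangle semantics: undo the coupling between text concepts by
    # multiplying with the inverse of the textual self-attention map.
    disentangled = A_vt @ np.linalg.inv(M_t)           # (N_img, N_txt)
    # Select the column for the target semantic token.
    selected = disentangled[:, token_idx]              # (N_img,)
    # Complete the region using visual token affinities (fills holes).
    completed = M_v @ selected                         # (N_img,)
    # Min-max normalize to [0, 1] (assumed form of norm(.)).
    return (completed - completed.min()) / (completed.max() - completed.min() + 1e-8)
```

The resulting vector can be reshaped to the latent's spatial grid and used as the continuous score map $M$.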
- Dual-Level Control (DLC): This mechanism uses the refined map M from PSL to guide the image generation process during editing, applying control at two levels in a plug-and-play manner without retraining or tuning the base DiT model.
Feature-Level Control: Instead of directly swapping features from the source-image inversion process into the editing (sampling) process (which can suppress edits), DLC uses a soft fusion approach guided by the continuous score map $M$. For the final $r$ layers of the DiT during the sampling/editing process:

$$\hat{V}^l = M \odot V^l_t + (1 - M) \odot V^l_s$$

Here, $V^l_t$ are the projected value features from the target editing step, $V^l_s$ are the corresponding features stored during the source-image inversion, and $\odot$ is element-wise multiplication. This selectively integrates source features in non-edited regions while preserving target features in edited regions.
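The soft fusion is a single broadcasted blend. A minimal sketch, assuming value features flattened to (tokens, channels) and the score map broadcast over channels:

```python
import numpy as np

def soft_fuse_values(V_t, V_s, M):
    """Feature-level control: V_hat = M * V_t + (1 - M) * V_s.

    V_t: (N_img, d) value features at the current editing (target) step.
    V_s: (N_img, d) value features stored during source-image inversion.
    M:   (N_img,) continuous PSL score map in [0, 1].
    """
    M = M[:, None]                      # (N_img, 1), broadcast over channels
    # High M (edited region) keeps target features; low M restores source features.
    return M * V_t + (1.0 - M) * V_s
```

Because $M$ is continuous rather than binary, the blend degrades gracefully near region boundaries instead of producing hard seams.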
Latent-Level Control: To further enhance background preservation, especially given imperfections in inversion methods such as rectified flow, DLC applies latent blending. It uses a binarized version of the refined map, $M_\lambda$ (thresholded at the $\lambda$-th percentile), to combine latents from the inversion ($Z^{inv}_{t_{i-1}}$) and sampling ($Z^{s}_{t_{i-1}}$) processes during the early diffusion steps:

$$\hat{Z}_{t_{i-1}} = M_\lambda \odot Z^{s}_{t_{i-1}} + (1 - M_\lambda) \odot Z^{inv}_{t_{i-1}}$$

This enforces consistency with the source image in the background regions identified by $(1 - M_\lambda)$.
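The latent blend can be sketched the same way; here the percentile thresholding that produces $M_\lambda$ is made explicit (the exact thresholding convention is an assumption for illustration):

```python
import numpy as np

def blend_latents(Z_s, Z_inv, M, lam_percentile):
    """Latent-level control with a binarized mask M_lambda.

    Z_s:   (N_img, c) latents from the sampling (editing) trajectory.
    Z_inv: (N_img, c) latents stored from the inversion trajectory.
    M:     (N_img,) continuous PSL score map.
    lam_percentile: e.g. 80 keeps the top 20% of scores as the editable region.
    """
    thresh = np.percentile(M, lam_percentile)
    M_bin = (M >= thresh).astype(M.dtype)[:, None]   # binarized M_lambda
    # Edited region keeps sampling latents; background reverts to inversion latents.
    return M_bin * Z_s + (1.0 - M_bin) * Z_inv
```

Applying this only in the early steps locks the background to the source trajectory while still letting later steps refine details freely.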
The overall editing pipeline involves:
- Inverting the source image $I_s$ with its prompt $P_s$ via a diffusion inversion method (such as rectified flow) to obtain the initial noise $Z_{t_K}$. Intermediate features $V^l_s$ and latents $Z^{inv}_{t_i}$ are stored.
- Applying PSL during the first inversion step to compute the refined map $M$ for the semantic difference between $P_s$ and the target prompt $P_t$.
- Generating the edited image by sampling from $Z_{t_K}$ with the target prompt $P_t$, guided by the DLC mechanism (feature-level soft fusion and latent-level blending) using $M$ and $M_\lambda$.
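The overall loop can be sketched as follows. All model calls here are hypothetical stubs (`invert_step`, `sample_step`) standing in for the real rectified-flow inversion and DiT sampling; the mask and step counts are dummies chosen only to make the control flow concrete.

```python
import numpy as np

# Hypothetical stand-ins for the real model calls; names are illustrative only.
def invert_step(z, prompt):   # one rectified-flow inversion step (stub)
    return z * 0.99

def sample_step(z, prompt):   # one sampling (editing) step (stub)
    return z * 1.01

def dcedit_pipeline(z0, P_s, P_t, K=4, blend_steps=2):
    """High-level sketch: invert with P_s, then sample with P_t, blending
    stored inversion latents back in during the first few sampling steps."""
    # 1) Inversion: walk the source latent toward noise, storing the trajectory.
    z, inv_latents = z0, []
    for _ in range(K):
        z = invert_step(z, P_s)
        inv_latents.append(z)
    # PSL would compute the refined map M at the first inversion step;
    # here a dummy checkerboard mask stands in for the binarized M_lambda.
    M_bin = (np.arange(z0.size).reshape(z0.shape) % 2).astype(float)
    # 2) Sampling from the inverted noise with the target prompt.
    for i in range(K):
        z = sample_step(z, P_t)
        if i < blend_steps:  # latent-level control in the early steps only
            z = M_bin * z + (1.0 - M_bin) * inv_latents[K - 1 - i]
    return z
```

Feature-level control would additionally hook the value projections of the last $r$ DiT layers inside `sample_step`, which the stub omits.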
To evaluate DiT-based editing methods effectively, the authors introduce the RW-800 benchmark. Compared to previous benchmarks like PIE-Bench [18], RW-800 features:
- Higher-resolution images (1K+ vs. 512×512).
- Longer, more descriptive text prompts (avg. 23 words vs. fewer than 12).
- Exclusively real-world images with complex content.
- A new "text editing" task category alongside 9 other common editing types (object change, color change, etc.).
- Manually refined masks for quantitative evaluation.
Experiments on PIE-Bench and RW-800 demonstrate that DCEdit, when applied to DiT-based methods like RF-Edit [51] and FireFlow [9], significantly improves performance. It enhances background preservation (lower MSE, higher PSNR/SSIM) and structural consistency while simultaneously improving or maintaining editing quality (higher CLIP similarity), outperforming both UNet-based and baseline DiT-based editing methods. Ablation studies confirm the effectiveness of both PSL components (VSA and TSA refinement) and both DLC levels (feature and latent control). The PSL strategy is shown to produce more accurate semantic localization maps compared to baseline cross-attention from FLUX, SD-1.5, and SD-XL.
Implementation Considerations:
- DCEdit is designed as a training-free, plug-and-play module for existing DiT-based editing methods, particularly those using rectified flow like RF-Edit and FireFlow, built upon models like FLUX.
- It relies on extracting attention maps ($A_{V\to T}$, $A_{V\to V}$, $A_{T\to T}$) from the MM-DiT layers during the inversion process (specifically the first step, for PSL).
- The Dual-Level Control operates during the sampling (editing) phase. Feature control requires storing value embeddings ($V^l_s$) from the last $r$ layers during inversion. Latent control requires storing intermediate latents ($Z^{inv}_{t_i}$) from inversion.
- Key hyperparameters include the number of layers $r$ for feature control and the percentile threshold $\lambda$ for binarizing the mask in latent control. The paper uses $r=1$ or $r=3$ and applies latent blending only for the first 3 or 5 steps.
- The computational overhead is minimal as it primarily involves attention map manipulation and blending operations, adding little cost compared to the diffusion model inference itself.
- It requires identifying the differential words between the source and target prompts, diff($P_s$, $P_t$), to select the correct semantic map from PSL.
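One simple way to obtain such differential words is a set difference over prompt tokens. This toy helper is an assumption for illustration; the paper does not specify how diff($P_s$, $P_t$) is computed:

```python
def diff_words(P_s, P_t):
    """Toy diff(Ps, Pt): return target-prompt words absent from the source prompt.

    A real implementation would need tokenizer-aware matching and handling of
    multi-word concepts; this sketch only compares lowercased whitespace tokens.
    """
    src = set(P_s.lower().split())
    return [w for w in P_t.split() if w.lower() not in src]
```

The index of each returned word in the target prompt's token sequence is then used by `Select[.]` to pick the corresponding cross-attention column.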