This paper introduces DCEdit, a novel approach for text-guided image editing with Diffusion Transformer (DiT) models such as FLUX [10, 22]. The primary challenge it addresses is precise semantic localization and editing: modifying specific parts of an image according to a text prompt while preserving the background and unchanged elements. Existing methods often struggle to accurately identify the target semantic regions, especially with complex prompts and high-resolution images. DiT models show better text-image alignment than earlier UNet models, but their raw cross-attention maps still suffer from incomplete activations and semantic entanglement (irrelevant regions activated because related concepts are coupled in the prompt).
To tackle these issues, DCEdit proposes two main components:
- Precise Semantic Localization (PSL): This strategy refines the cross-attention maps ($A_{V\to T}$) extracted from the Multi-Modal DiT (MM-DiT) layers of a DiT model.
- It leverages the visual self-attention map ($A_{V\to V}$), which captures affinities between image tokens, to complete segmented regions in the cross-attention map, addressing internal holes and unclear boundaries.
- It uses the inverse of the textual self-attention map ($M_T^{-1}$) to disentangle semantics. The textual self-attention map ($A_{T\to T}$) reflects how different text tokens (concepts) are coupled; its inverse helps mitigate the activation of irrelevant image regions caused by this coupling.
The final refined attention map $M$ for a specific semantic token is calculated as:

$$M = \mathrm{norm}\big(M_V \cdot \mathrm{Select}\big[A_{V\to T} \cdot M_T^{-1}\big]\big)$$

where $M_V$ and $M_T$ are the fused visual and textual self-attention maps, $\mathrm{Select}[\cdot]$ extracts the map for the target text token, and $\mathrm{norm}(\cdot)$ normalizes the result. This refined map $M$ serves as a precise regional cue.
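The refinement above can be sketched as a few matrix operations. This is a minimal illustration, not the paper's implementation: the attention maps are stand-in numpy arrays, `Select` is assumed to pick the column of the target text token, and `norm` is assumed to be min-max normalization to [0, 1].

```python
import numpy as np

def refine_semantic_map(A_vt, M_v, M_t, token_idx):
    """Sketch of PSL refinement: M = norm(M_V . Select[A_{V->T} . M_T^{-1}]).

    A_vt: (N_img, N_txt) cross-attention (image-to-text) map.
    M_v:  (N_img, N_img) fused visual self-attention map.
    M_t:  (N_txt, N_txt) fused textual self-attention map (assumed invertible).
    token_idx: index of the target text token (e.g. from diff(Ps, Pt)).
    """
    # Disentangle semantics: undo the coupling between text concepts by
    # multiplying with the inverse of the textual self-attention map.
    disentangled = A_vt @ np.linalg.inv(M_t)           # (N_img, N_txt)
    # Select the column for the target semantic token.
    selected = disentangled[:, token_idx]              # (N_img,)
    # Complete the region using visual token affinities (fills holes).
    completed = M_v @ selected                         # (N_img,)
    # Min-max normalize to [0, 1] (assumed form of norm(.)).
    return (completed - completed.min()) / (completed.max() - completed.min() + 1e-8)
```

The resulting vector can be reshaped to the latent's spatial grid and used as the continuous score map $M$.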
- Dual-Level Control (DLC): This mechanism uses the refined map M from PSL to guide the image generation process during editing, applying control at two levels in a plug-and-play manner without retraining or tuning the base DiT model.
Feature-Level Control: Instead of directly swapping features from the source-image inversion process into the editing (sampling) process (which can suppress edits), DLC uses a soft fusion approach guided by the continuous score map $M$. For the final $r$ layers of the DiT during the sampling/editing process:

$$\hat{V}^l = M \odot V^l_t + (1 - M) \odot V^l_s$$

Here, $V^l_t$ are the projected value features from the target editing step, $V^l_s$ are the corresponding features stored during the source-image inversion, and $\odot$ is element-wise multiplication. This selectively integrates source features in non-edited regions while preserving target features in edited regions.
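The soft fusion is a single broadcasted blend. A minimal sketch, assuming value features flattened to (tokens, channels) and the score map broadcast over channels:

```python
import numpy as np

def soft_fuse_values(V_t, V_s, M):
    """Feature-level control: V_hat = M * V_t + (1 - M) * V_s.

    V_t: (N_img, d) value features at the current editing (target) step.
    V_s: (N_img, d) value features stored during source-image inversion.
    M:   (N_img,) continuous PSL score map in [0, 1].
    """
    M = M[:, None]                      # (N_img, 1), broadcast over channels
    # High M (edited region) keeps target features; low M restores source features.
    return M * V_t + (1.0 - M) * V_s
```

Because $M$ is continuous rather than binary, the blend degrades gracefully near region boundaries instead of producing hard seams.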
Latent-Level Control: To further enhance background preservation, especially given imperfections in inversion methods such as rectified flow, DLC applies latent blending. It uses a binarized version of the refined map, $M_\lambda$ (thresholded at the $\lambda$-th percentile), to combine latents from the inversion ($Z^{inv}_{t_{i-1}}$) and sampling ($Z^{s}_{t_{i-1}}$) processes during the early diffusion steps:

$$\hat{Z}_{t_{i-1}} = M_\lambda \odot Z^{s}_{t_{i-1}} + (1 - M_\lambda) \odot Z^{inv}_{t_{i-1}}$$

This enforces consistency with the source image in the background regions identified by $(1 - M_\lambda)$.
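The latent blend can be sketched the same way; here the percentile thresholding that produces $M_\lambda$ is made explicit (the exact thresholding convention is an assumption for illustration):

```python
import numpy as np

def blend_latents(Z_s, Z_inv, M, lam_percentile):
    """Latent-level control with a binarized mask M_lambda.

    Z_s:   (N_img, c) latents from the sampling (editing) trajectory.
    Z_inv: (N_img, c) latents stored from the inversion trajectory.
    M:     (N_img,) continuous PSL score map.
    lam_percentile: e.g. 80 keeps the top 20% of scores as the editable region.
    """
    thresh = np.percentile(M, lam_percentile)
    M_bin = (M >= thresh).astype(M.dtype)[:, None]   # binarized M_lambda
    # Edited region keeps sampling latents; background reverts to inversion latents.
    return M_bin * Z_s + (1.0 - M_bin) * Z_inv
```

Applying this only in the early steps locks the background to the source trajectory while still letting later steps refine details freely.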
The overall editing pipeline involves:
- Inverting the source image $I_s$ with its prompt $P_s$ via a diffusion inversion method (such as rectified flow) to obtain the initial noise $Z_{t_K}$. Intermediate features $V^l_s$ and latents $Z^{inv}_{t_i}$ are stored.
- Applying PSL during the first inversion step to compute the refined map $M$ for the semantic difference between $P_s$ and the target prompt $P_t$.
- Generating the edited image by sampling from $Z_{t_K}$ with the target prompt $P_t$, guided by the DLC mechanism (feature-level soft fusion and latent-level blending) using $M$ and $M_\lambda$.
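The overall loop can be sketched as follows. All model calls here are hypothetical stubs (`invert_step`, `sample_step`) standing in for the real rectified-flow inversion and DiT sampling; the mask and step counts are dummies chosen only to make the control flow concrete.

```python
import numpy as np

# Hypothetical stand-ins for the real model calls; names are illustrative only.
def invert_step(z, prompt):   # one rectified-flow inversion step (stub)
    return z * 0.99

def sample_step(z, prompt):   # one sampling (editing) step (stub)
    return z * 1.01

def dcedit_pipeline(z0, P_s, P_t, K=4, blend_steps=2):
    """High-level sketch: invert with P_s, then sample with P_t, blending
    stored inversion latents back in during the first few sampling steps."""
    # 1) Inversion: walk the source latent toward noise, storing the trajectory.
    z, inv_latents = z0, []
    for _ in range(K):
        z = invert_step(z, P_s)
        inv_latents.append(z)
    # PSL would compute the refined map M at the first inversion step;
    # here a dummy checkerboard mask stands in for the binarized M_lambda.
    M_bin = (np.arange(z0.size).reshape(z0.shape) % 2).astype(float)
    # 2) Sampling from the inverted noise with the target prompt.
    for i in range(K):
        z = sample_step(z, P_t)
        if i < blend_steps:  # latent-level control in the early steps only
            z = M_bin * z + (1.0 - M_bin) * inv_latents[K - 1 - i]
    return z
```

Feature-level control would additionally hook the value projections of the last $r$ DiT layers inside `sample_step`, which the stub omits.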
To evaluate DiT-based editing methods effectively, the authors introduce the RW-800 benchmark. Compared to previous benchmarks like PIE-Bench [18], RW-800 features:
- Higher-resolution images (1K+ vs. 512×512).
- Longer, more descriptive text prompts (avg. 23 words vs. fewer than 12).
- Exclusively real-world images with complex content.
- A new "text editing" task category alongside 9 other common editing types (object change, color change, etc.).
- Manually refined masks for quantitative evaluation.
Experiments on PIE-Bench and RW-800 demonstrate that DCEdit, when applied to DiT-based methods like RF-Edit [51] and FireFlow [9], significantly improves performance. It enhances background preservation (lower MSE, higher PSNR/SSIM) and structural consistency while simultaneously improving or maintaining editing quality (higher CLIP similarity), outperforming both UNet-based and baseline DiT-based editing methods. Ablation studies confirm the effectiveness of both PSL components (VSA and TSA refinement) and both DLC levels (feature and latent control). The PSL strategy is shown to produce more accurate semantic localization maps compared to baseline cross-attention from FLUX, SD-1.5, and SD-XL.
Implementation Considerations:
- DCEdit is designed as a training-free, plug-and-play module for existing DiT-based editing methods, particularly those using rectified flow like RF-Edit and FireFlow, built upon models like FLUX.
- It relies on extracting attention maps ($A_{V\to T}$, $A_{V\to V}$, $A_{T\to T}$) from the MM-DiT layers during the inversion process (specifically the first step, for PSL).
- The Dual-Level Control operates during the sampling (editing) phase. Feature control requires storing value embeddings ($V^l_s$) from the last $r$ layers during inversion. Latent control requires storing intermediate latents ($Z^{inv}_{t_i}$) from inversion.
- Key hyperparameters include the number of layers $r$ for feature control and the percentile threshold $\lambda$ for binarizing the mask in latent control. The paper uses $r=1$ or $r=3$ and applies latent blending only for the first 3 or 5 steps.
- The computational overhead is minimal as it primarily involves attention map manipulation and blending operations, adding little cost compared to the diffusion model inference itself.
- It requires identifying the differential words between the source and target prompts, diff($P_s$, $P_t$), to select the correct semantic map from PSL.
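One simple way to obtain such differential words is a set difference over prompt tokens. This toy helper is an assumption for illustration; the paper does not specify how diff($P_s$, $P_t$) is computed:

```python
def diff_words(P_s, P_t):
    """Toy diff(Ps, Pt): return target-prompt words absent from the source prompt.

    A real implementation would need tokenizer-aware matching and handling of
    multi-word concepts; this sketch only compares lowercased whitespace tokens.
    """
    src = set(P_s.lower().split())
    return [w for w in P_t.split() if w.lower() not in src]
```

The index of each returned word in the target prompt's token sequence is then used by `Select[.]` to pick the corresponding cross-attention column.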