LocInv: Localization-aware Inversion for Text-Guided Image Editing (2405.01496v1)

Published 2 May 2024 in cs.CV

Abstract: Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively.The code will be released at https://github.com/wangkai930418/DPL

References (65)

Authors (4)

Chuanming Tang (9 papers)
Kai Wang (624 papers)
Fei Yang (110 papers)
Joost van de Weijer (133 papers)

Citations (2)

View on Semantic Scholar

Summary

This paper introduces LocInv (Localization-aware Inversion), a method designed to improve text-guided image editing using diffusion models like Stable Diffusion. The core problem addressed is cross-attention leakage, where existing editing techniques inadvertently modify regions outside the intended target area because the model's attention mechanism doesn't perfectly align object concepts in the text prompt with the correct spatial regions in the image. This is particularly challenging in images with multiple objects.

To combat this, LocInv incorporates localization priors – specifically, segmentation masks or bounding boxes – during the DDIM inversion process. These priors, which can be obtained from foundation models like SAM or Grounding DINO, guide the refinement of cross-attention maps.

The key mechanism involves dynamic prompt learning, where the token embeddings corresponding to noun words in the text prompt are updated at each timestep of the denoising process. This update is guided by optimization losses designed to align the cross-attention maps ( $A_t$ ) with the provided localization priors ( $S_t$ ):

Similarity Loss ( $\mathcal{L}_{sim}$ ): Encourages high cosine similarity between the attention map of a noun token and its corresponding localization prior.

$\mathcal{L}_{sim} = \sum_{i=1}^{K} \big[1-\mathrm{cos} (A_t^{v_t^i},S_t^{v_t^i}) \big ]$
Overlapping Loss ( $\mathcal{L}_{ovl}$ ): Maximizes the portion of the attention map that falls within the localization prior region.

$\mathcal{L}_{ovl} = 1- \frac{\sum_{i=1}^{K} A_t^{v_t^i} \cdot S_t^{v_t^i}}{\sum_{i=1}^{K} A_t^{v_t^i}}$

These losses are combined ( $\mathcal{L} = \lambda_{sim} \mathcal{L}_{sim} + \lambda_{ovl} \mathcal{L}_{ovl}$ ) and optimized iteratively for the noun tokens ( $v_t^k$ ) at each timestep $t$ . To prevent overfitting and manage the gradual accumulation of errors, the optimization uses a gradual threshold mechanism ( $TH_t = \beta \cdot \exp(-t/\alpha)$ ), ensuring losses reach predefined, decreasing thresholds over time.

Furthermore, LocInv addresses a common limitation in attribute editing (e.g., changing an object's color or material). It introduces an Adjective Binding Loss ( $\mathcal{L}_{adj}$ ). Using a parser like Spacy to identify adjective-noun pairs ( $a_t^i, v_t^i$ ), this loss encourages the attention map of the adjective to align with the attention map of its corresponding noun:

$\mathcal{L}_{adj} = \sum_{i=1}^{K} \big[1-\mathrm{cos} (A_t^{v_t^i},A_t^{a_t^i}) \big]$

This loss is added to the total loss when attribute editing is required.

To ensure the original image can still be reconstructed accurately after inversion, LocInv integrates Null-Text Inversion (NTI), optimizing null-text embeddings ( $\varnothing_t$ ) at each step alongside the dynamic noun/adjective tokens. The final output of the LocInv process is the initial noise vector ( $\tilde{z}_T$ ), the set of optimized dynamic tokens ( $\{V_t\}_1^T$ ), and the optimized null-text embeddings ( $\{\varnothing_t\}_1^T$ ). These are then used with an editing method like Prompt-to-Prompt (P2P) for the actual image manipulation.

Experiments were conducted on a COCO-edit subset derived from MS-COCO, comparing LocInv (using both segmentation and detection priors) against methods like NTI, DPL, PnP, DiffEdit, MasaCtrl, pix2pix-zero, and fine-tuning/inpainting approaches. LocInv demonstrated superior performance in quantitative metrics (LPIPS, SSIM, PSNR, DINO-Sim, background preservation) and qualitative results, especially for multi-object scenes and attribute editing tasks (Word-Swap, Attribute-Edit). Ablation studies confirmed the effectiveness of the proposed losses and hyperparameters. User studies also indicated a preference for LocInv's editing quality and background preservation compared to other non-finetuning methods.

The main contribution is a method that significantly reduces cross-attention leakage by leveraging readily available localization priors, leading to more precise text-guided image editing without needing model fine-tuning, and enabling effective attribute modification.

PDF Markdown

Tweets

https://twitter.com/CSVisionPapers/status/1786613224969760887

https://twitter.com/realmofresearch/status/1786812463826059676

LocInv: Localization-aware Inversion for Text-Guided Image Editing (2405.01496v1)

Summary

Related Papers

Tweets