LocInv: Localization-aware Inversion for Text-Guided Image Editing (2405.01496v1)

Published 2 May 2024 in cs.CV

Abstract: Large-scale Text-to-Image (T2I) diffusion models demonstrate significant generation capabilities based on textual prompts. Based on the T2I diffusion models, text-guided image editing research aims to empower users to manipulate generated images by altering the text prompts. However, existing image editing techniques are prone to editing over unintentional regions that are beyond the intended target area, primarily due to inaccuracies in cross-attention maps. To address this problem, we propose Localization-aware Inversion (LocInv), which exploits segmentation maps or bounding boxes as extra localization priors to refine the cross-attention maps in the denoising phases of the diffusion process. Through the dynamic updating of tokens corresponding to noun words in the textual input, we are compelling the cross-attention maps to closely align with the correct noun and adjective words in the text prompt. Based on this technique, we achieve fine-grained image editing over particular objects while preventing undesired changes to other regions. Our method LocInv, based on the publicly available Stable Diffusion, is extensively evaluated on a subset of the COCO dataset, and consistently obtains superior results both quantitatively and qualitatively. The code will be released at https://github.com/wangkai930418/DPL

This paper introduces LocInv (Localization-aware Inversion), a method designed to improve text-guided image editing using diffusion models like Stable Diffusion. The core problem addressed is cross-attention leakage, where existing editing techniques inadvertently modify regions outside the intended target area because the model's attention mechanism doesn't perfectly align object concepts in the text prompt with the correct spatial regions in the image. This is particularly challenging in images with multiple objects.

To combat this, LocInv incorporates localization priors – specifically, segmentation masks or bounding boxes – during the DDIM inversion process. These priors, which can be obtained from foundation models like SAM or Grounding DINO, guide the refinement of cross-attention maps.
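
As a concrete illustration, the sketch below shows one way such a prior could be prepared: a bounding box is rasterized into a binary mask and the mask is downsampled to the spatial resolution of the cross-attention maps. This is a minimal PyTorch sketch with illustrative function and argument names (e.g. `box_to_mask`, `prior_to_attn_res`), not the paper's released code.

```python
import torch
import torch.nn.functional as F

def box_to_mask(box_xyxy, height, width):
    """Rasterize a bounding box (x0, y0, x1, y1) into a binary mask of shape (H, W)."""
    x0, y0, x1, y1 = (int(round(v)) for v in box_xyxy)
    mask = torch.zeros(height, width)
    mask[y0:y1, x0:x1] = 1.0
    return mask

def prior_to_attn_res(mask, attn_res=16):
    """Downsample a binary prior (H, W) to the cross-attention resolution
    (e.g. 16x16 for Stable Diffusion's UNet) and flatten it so it can be
    compared element-wise with a flattened attention map."""
    mask = mask.float()[None, None]                            # (1, 1, H, W)
    small = F.interpolate(mask, size=(attn_res, attn_res), mode="nearest")
    return small.reshape(-1)                                   # (attn_res * attn_res,)
```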

The key mechanism involves dynamic prompt learning, where the token embeddings corresponding to noun words in the text prompt are updated at each timestep of the denoising process. This update is guided by optimization losses designed to align the cross-attention maps ($A_t$) with the provided localization priors ($S_t$):

  1. Similarity Loss ($\mathcal{L}_{sim}$): Encourages high cosine similarity between the attention map of a noun token and its corresponding localization prior.

    $$\mathcal{L}_{sim} = \sum_{i=1}^{K} \big[ 1 - \mathrm{cos}(A_t^{v_t^i}, S_t^{v_t^i}) \big]$$

  2. Overlapping Loss ($\mathcal{L}_{ovl}$): Maximizes the portion of the attention map that falls within the localization prior region.

    $$\mathcal{L}_{ovl} = 1 - \frac{\sum_{i=1}^{K} A_t^{v_t^i} \cdot S_t^{v_t^i}}{\sum_{i=1}^{K} A_t^{v_t^i}}$$

These losses are combined as $\mathcal{L} = \lambda_{sim} \mathcal{L}_{sim} + \lambda_{ovl} \mathcal{L}_{ovl}$ and optimized iteratively for the noun tokens ($v_t^k$) at each timestep $t$. To prevent overfitting and manage the gradual accumulation of errors, the optimization uses a gradual threshold mechanism ($TH_t = \beta \cdot \exp(-t/\alpha)$), ensuring the losses reach predefined, decreasing thresholds over time.
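
A minimal PyTorch sketch of these two losses and the threshold schedule is shown below, assuming each attention map and prior has already been flattened to the same resolution; the names and defaults are illustrative, not taken from the released code.

```python
import math
import torch
import torch.nn.functional as F

def locinv_loss(attn_maps, priors, lambda_sim=1.0, lambda_ovl=1.0):
    """Combined alignment loss over the K noun tokens.

    attn_maps: list of K flattened cross-attention maps A_t^{v_t^i}
    priors:    list of K flattened localization priors  S_t^{v_t^i}
    """
    # L_sim: 1 - cosine similarity, summed over the noun tokens
    l_sim = sum(1.0 - F.cosine_similarity(a, s, dim=0)
                for a, s in zip(attn_maps, priors))
    # L_ovl: 1 - (attention mass inside the prior) / (total attention mass)
    inside = sum((a * s).sum() for a, s in zip(attn_maps, priors))
    total = sum(a.sum() for a in attn_maps)
    l_ovl = 1.0 - inside / (total + 1e-8)
    return lambda_sim * l_sim + lambda_ovl * l_ovl

def gradual_threshold(t, alpha, beta):
    """TH_t = beta * exp(-t / alpha): the loss target used to decide when
    to stop the inner token-update iterations at timestep t."""
    return beta * math.exp(-t / alpha)
```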

Furthermore, LocInv addresses a common limitation in attribute editing (e.g., changing an object's color or material). It introduces an Adjective Binding Loss ($\mathcal{L}_{adj}$). Using a parser such as spaCy to identify adjective-noun pairs ($a_t^i$, $v_t^i$), this loss encourages the attention map of the adjective to align with the attention map of its corresponding noun:

$$\mathcal{L}_{adj} = \sum_{i=1}^{K} \big[ 1 - \mathrm{cos}(A_t^{v_t^i}, A_t^{a_t^i}) \big]$$

This loss is added to the total loss when attribute editing is required.
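
Under the same conventions as the sketch above, the adjective binding term could be written as follows (illustrative only):

```python
import torch.nn.functional as F

def adjective_binding_loss(noun_attn, adj_attn):
    """L_adj: sum over parsed adjective-noun pairs of
    1 - cos(A_t^{v_t^i}, A_t^{a_t^i}), pushing each adjective's attention
    map toward that of the noun it modifies."""
    return sum(1.0 - F.cosine_similarity(v, a, dim=0)
               for v, a in zip(noun_attn, adj_attn))
```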

To ensure the original image can still be reconstructed accurately after inversion, LocInv integrates Null-Text Inversion (NTI), optimizing null-text embeddings ($\varnothing_t$) at each step alongside the dynamic noun/adjective tokens. The final output of the LocInv process is the initial noise vector ($\tilde{z}_T$), the set of optimized dynamic tokens ($\{V_t\}_1^T$), and the optimized null-text embeddings ($\{\varnothing_t\}_1^T$). These are then used with an editing method such as Prompt-to-Prompt (P2P) for the actual image manipulation.
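
Putting these pieces together, the overall loop might be organized roughly as below. This is only a schematic sketch that reuses `locinv_loss` and `gradual_threshold` from the earlier sketch; `get_cross_attention`, `optimize_null_text`, and `denoise_step` are hypothetical placeholders for the diffusion-pipeline plumbing, and the hyperparameter values are assumptions rather than the paper's settings.

```python
import torch

def locinv_inversion(z_T, prompt_embeds, priors, num_steps=50,
                     inner_iters=20, alpha=25.0, beta=0.3, lr=1e-2):
    """Schematic LocInv loop: refine the noun-token embeddings against the
    localization priors at every denoising step, run an NTI-style null-text
    optimization, and record both for later editing (e.g. with P2P)."""
    z = z_T                                        # \tilde{z}_T from DDIM inversion
    tokens = prompt_embeds.clone().requires_grad_(True)
    optimizer = torch.optim.Adam([tokens], lr=lr)
    dynamic_tokens, null_embeds = [], []

    for t in reversed(range(num_steps)):           # denoising timesteps
        target = gradual_threshold(t, alpha, beta)
        for _ in range(inner_iters):               # inner token-update iterations
            attn = get_cross_attention(z, tokens, t)      # hypothetical helper
            loss = locinv_loss(attn["nouns"], priors)     # flattened noun maps
            if loss.item() < target:               # stop once the threshold is met
                break
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        null_t = optimize_null_text(z, tokens, t)          # hypothetical NTI step
        dynamic_tokens.append(tokens.detach().clone())
        null_embeds.append(null_t)
        z = denoise_step(z, tokens, null_t, t)             # hypothetical helper

    return z, dynamic_tokens, null_embeds
```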

Experiments were conducted on a COCO-edit subset derived from MS-COCO, comparing LocInv (using both segmentation and detection priors) against methods like NTI, DPL, PnP, DiffEdit, MasaCtrl, pix2pix-zero, and fine-tuning/inpainting approaches. LocInv demonstrated superior performance in quantitative metrics (LPIPS, SSIM, PSNR, DINO-Sim, background preservation) and qualitative results, especially for multi-object scenes and attribute editing tasks (Word-Swap, Attribute-Edit). Ablation studies confirmed the effectiveness of the proposed losses and hyperparameters. User studies also indicated a preference for LocInv's editing quality and background preservation compared to other non-finetuning methods.

The main contribution is a method that significantly reduces cross-attention leakage by leveraging readily available localization priors, leading to more precise text-guided image editing without needing model fine-tuning, and enabling effective attribute modification.

References (65)
  1. Blended diffusion for text-driven editing of natural images. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18208–18218, 2022.
  2. Blended latent diffusion. ACM Transactions on Graphics (TOG), 42(4):1–11, 2023.
  3. Text2live: Text-driven layered image and video editing. In European Conference on Computer Vision, pages 707–723. Springer, 2022.
  4. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  5. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. Proceedings of the International Conference on Computer Vision, 2023.
  6. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021.
  7. Muse: Text-to-image generation via masked generative transformers. International Conference on Machine Learning, 2023.
  8. Fec: Three finetuning-free methods to enhance consistency for real image editing. arXiv preprint arXiv:2309.14934, 2023.
  9. Diffusiondet: Diffusion model for object detection. arXiv preprint arXiv:2211.09788, 2022.
  10. Ilvr: Conditioning method for denoising diffusion probabilistic models. In Proceedings of the International Conference on Computer Vision, pages 14347–14356. IEEE, 2021.
  11. Diffedit: Diffusion-based semantic image editing with mask guidance. In The Eleventh International Conference on Learning Representations, 2023.
  12. Open-vocabulary panoptic segmentation with maskclip. International Conference on Machine Learning, 2023.
  13. Make-a-scene: Scene-based text-to-image generation with human priors. In European Conference on Computer Vision, pages 89–106. Springer, 2022.
  14. Scaling open-vocabulary image segmentation with image-level labels. In European Conference on Computer Vision, pages 540–557. Springer, 2022.
  15. Gradpaint: Gradient-guided inpainting with diffusion models. arXiv preprint arXiv:2309.09614, 2023.
  16. Improving negative-prompt inversion via proximal guidance. arXiv preprint arXiv:2306.05414, 2023.
  17. Delta denoising score. arXiv preprint arXiv:2304.07090, 2023a.
  18. Prompt-to-prompt image editing with cross attention control. International Conference on Learning Representations, 2023b.
  19. Clipscore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7514–7528, 2021.
  20. Classifier-free diffusion guidance. NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2022.
  21. Improving sample quality of diffusion models using self-attention guidance. Proceedings of the International Conference on Computer Vision, 2023.
  22. spacy 2: Natural language understanding with bloom embeddings, convolutional neural networks and incremental parsing. To appear, 7(1):411–420, 2017.
  23. Kv inversion: Kv embeddings learning for text-conditioned real image action editing. arXiv preprint arXiv:2309.16608, 2023.
  24. Direct inversion: Boosting diffusion-based editing with 3 lines of code. arXiv preprint arXiv:2310.01506, 2023.
  25. Imagic: Text-based real image editing with diffusion models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  26. Diffusionclip: Text-guided diffusion models for robust image manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2426–2435, 2022.
  27. Segment anything. Proceedings of the International Conference on Computer Vision, 2023.
  28. Diffusion-based image translation using disentangled style and content representation. In The Eleventh International Conference on Learning Representations, 2023.
  29. Stylediffusion: Prompt-embedding inversion for text-based editing, 2023.
  30. Microsoft coco: Common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  31. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499, 2023.
  32. Image segmentation using text and image prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7086–7096, 2022.
  33. Repaint: Inpainting using denoising diffusion probabilistic models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11461–11471, 2022.
  34. SDEdit: Guided image synthesis and editing with stochastic differential equations. In International Conference on Learning Representations, 2022.
  35. Midjourney.com. Midjourney. https://www.midjourney.com, 2022.
  36. Negative-prompt inversion: Fast image inversion for editing with text-guided diffusion models. arXiv preprint arXiv:2305.16807, 2023.
  37. Null-text inversion for editing real images using guided diffusion models. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  38. GLIDE: Towards photorealistic image generation and editing with text-guided diffusion models. In Proceedings of the 39th International Conference on Machine Learning, pages 16784–16804. PMLR, 2022.
  39. Zero-shot image-to-image translation. Proceedings of the ACM SIGGRAPH Conference on Computer Graphics, 2023.
  40. Localizing object-level shape variations with text-to-image diffusion models. Proceedings of the International Conference on Computer Vision, 2023.
  41. Ld-znet: A latent diffusion approach for text-based image segmentation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4157–4168, 2023.
  42. Sdxl: improving latent diffusion models for high-resolution image synthesis. arXiv preprint arXiv:2307.01952, 2023.
  43. Zero-shot text-to-image generation. In International Conference on Machine Learning, pages 8821–8831. PMLR, 2021.
  44. Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125, 2022.
  45. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10684–10695, 2022.
  46. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer, 2015.
  47. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 2022.
  48. Deepfloyd-if. https://github.com/deep-floyd/IF, 2023.
  49. Denoising diffusion implicit models. In International Conference on Learning Representations, 2021.
  50. Iterinv: Iterative inversion for pixel-level t2i models. NeurIPS 2023 Workshop on Diffusion Models, 2023.
  51. Splicing vit features for semantic appearance transfer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10748–10757, 2022.
  52. Plug-and-play diffusion features for text-driven image-to-image translation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023.
  53. Dynamic prompt learning: Addressing cross-attention leakage for text-based image editing. Advances in Neural Information Processing Systems, 2023a.
  54. Mdp: A generalized framework for text-guided image editing by manipulating the diffusion path, 2023b.
  55. Multiscale structural similarity for image quality assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers, 2003, pages 1398–1402. IEEE, 2003.
  56. Detecting everything in the open world: Towards universal object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11433–11443, 2023c.
  57. Diffusion-based network for unsupervised landmark detection. Knowledge-Based Systems, page 111627, 2024.
  58. Groupvit: Semantic segmentation emerges from text supervision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18134–18144, 2022.
  59. Open-vocabulary panoptic segmentation with text-to-image diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2955–2966, 2023.
  60. Inpaint anything: Segment anything meets image inpainting. arXiv preprint arXiv:2304.06790, 2023.
  61. The unreasonable effectiveness of deep features as a perceptual metric. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 586–595, 2018.
  62. Forgedit: Text guided image editing via learning and forgetting. arXiv preprint arXiv:2309.10556, 2023.
  63. Shifted diffusion for text-to-image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10157–10166, 2023.
  64. Generalized decoding for pixel, image, and language. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15116–15127, 2023a.
  65. Segment everything everywhere all at once. Advances in Neural Information Processing Systems, 2023b.
Authors (4)
  1. Chuanming Tang (9 papers)
  2. Kai Wang (624 papers)
  3. Fei Yang (110 papers)
  4. Joost van de Weijer (133 papers)
Citations (2)