MAG-Edit: Localized Image Editing in Complex Scenarios via Mask-Based Attention-Adjusted Guidance (2312.11396v2)
Abstract: Recent diffusion-based image editing approaches have exhibited impressive editing capabilities in images with simple compositions. However, localized editing in complex scenarios has not been well studied in the literature, despite growing real-world demand. Existing mask-based inpainting methods fall short of retaining the underlying structure within the edit region. Meanwhile, mask-free attention-based methods often exhibit editing leakage and misalignment in more complex compositions. In this work, we develop MAG-Edit, a training-free, inference-stage optimization method that enables localized image editing in complex scenarios. In particular, MAG-Edit optimizes the noise latent feature in diffusion models by maximizing two mask-based cross-attention constraints on the edit token, which in turn gradually enhances local alignment with the desired prompt. Extensive quantitative and qualitative experiments demonstrate the effectiveness of our method in achieving both text alignment and structure preservation for localized editing within complex scenarios.
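The abstract describes an inference-stage optimization that updates the noise latent by maximizing two mask-based cross-attention constraints on the edit token. Below is a minimal PyTorch sketch of what one such guidance step might look like; the concrete form of the two constraints, the dummy attention computed from the latent, the mask, the token index, and the step size are all illustrative assumptions, not the paper's actual formulation.

```python
# Minimal, illustrative sketch (not the authors' released code). The exact form
# of the two mask-based cross-attention constraints is an assumption; the dummy
# "attention" derived from the latent only stands in for a real U-Net.
import torch

def mask_based_constraints(attn, mask, edit_idx):
    """attn: (B, H*W, T) cross-attention probabilities over prompt tokens.
    mask: (H, W) binary mask of the edit region.
    Returns two losses that, when minimized, push the edit token's attention
    to (1) concentrate inside the mask and (2) dominate other tokens there."""
    m = mask.reshape(1, -1, 1).float()                       # (1, H*W, 1)
    edit_attn = attn[..., edit_idx:edit_idx + 1]             # (B, H*W, 1)

    # Constraint 1 (assumed): fraction of the edit token's attention inside the mask.
    ratio_loss = 1.0 - (edit_attn * m).sum() / (edit_attn.sum() + 1e-8)

    # Constraint 2 (assumed): the edit token's share of attention within the mask.
    inside = (attn * m).sum(dim=1)                           # (B, T)
    dominance_loss = 1.0 - (inside[:, edit_idx] / (inside.sum(dim=-1) + 1e-8)).mean()
    return ratio_loss, dominance_loss

# Toy usage: one gradient step on a noise latent. In a real pipeline the
# attention maps would come from the denoising U-Net at the current timestep.
latent = torch.randn(1, 4, 64, 64, requires_grad=True)
proj = torch.randn(4, 77)                                    # stand-in text projection
attn = torch.softmax(latent.flatten(2).transpose(1, 2) @ proj, dim=-1)  # (1, 4096, 77)

mask = torch.zeros(64, 64)
mask[20:44, 20:44] = 1.0                                     # hypothetical edit region
ratio_l, dom_l = mask_based_constraints(attn, mask, edit_idx=5)
loss = ratio_l + dom_l
loss.backward()

with torch.no_grad():
    latent -= 0.1 * latent.grad                              # guidance step on the latent
```

In the actual method the gradient would flow through the diffusion U-Net's cross-attention layers at each denoising step; the toy projection above only illustrates the shape of the objective and the latent update.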
Authors: Qi Mao, Lan Chen, Yuchao Gu, Zhen Fang, Mike Zheng Shou