Improving Tuning-Free Real Image Editing with Proximal Guidance (2306.05414v3)
Abstract: DDIM inversion has revealed the remarkable potential of real image editing within diffusion-based methods. However, the accuracy of DDIM reconstruction degrades as larger classifier-free guidance (CFG) scales are used for enhanced editing. Null-text inversion (NTI) optimizes null embeddings to align the reconstruction and inversion trajectories under large CFG scales, enabling real image editing with cross-attention control. Negative-prompt inversion (NPI) further offers a training-free, closed-form approximation of NTI. However, NPI may introduce artifacts and remains constrained by DDIM reconstruction quality. To overcome these limitations, we propose proximal guidance and incorporate it into NPI with cross-attention control. We enhance NPI with a regularization term and reconstruction guidance, which reduce artifacts while preserving its training-free nature. Additionally, we extend these concepts to mutual self-attention control, enabling geometry and layout alterations in the editing process. Our method provides an efficient and straightforward approach that effectively addresses real image editing tasks with minimal computational overhead.
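To make the core idea concrete, here is a minimal NumPy sketch of how a proximal operator can regularize classifier-free guidance: the conditional/unconditional noise difference is passed through a thresholding step before being scaled, which suppresses small, potentially artifact-inducing components. The function names and the specific choice of soft-thresholding (the proximal operator of the L1 penalty) are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of lam * ||.||_1: shrinks entries toward zero
    and zeroes out those with magnitude below lam."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def proximal_cfg(eps_uncond, eps_cond, guidance_scale, lam):
    """Classifier-free guidance with a proximal step on the noise
    difference. With lam = 0 this reduces to standard CFG:
        eps = eps_uncond + w * (eps_cond - eps_uncond)."""
    diff = eps_cond - eps_uncond
    return eps_uncond + guidance_scale * soft_threshold(diff, lam)
```

In an NPI-style editing loop, `eps_uncond` would be the noise prediction conditioned on the source-prompt embedding (used in place of an optimized null embedding), and `lam` controls how aggressively small guidance components are filtered out.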