UniTune: Text-Driven Image Editing by Fine Tuning a Diffusion Model on a Single Image (2210.09477v4)
Abstract: Text-driven image generation methods have recently shown impressive results, allowing casual users to generate high-quality images from textual descriptions. However, comparable capabilities for editing existing images remain out of reach. Text-driven image editing methods usually require edit masks, struggle with edits that demand significant visual changes, and cannot easily preserve specific details of the edited portion. In this paper we observe that image-generation models can be converted into image-editing models simply by fine-tuning them on a single image. We also show that initializing the stochastic sampler with a noised version of the base image before sampling, and interpolating relevant details from the base image after sampling, further improve the quality of the edit. Combining these observations, we propose UniTune, a novel image editing method. UniTune takes as input an arbitrary image and a textual edit description, and carries out the edit while maintaining high fidelity to the input image. It requires no additional inputs, such as masks or sketches, and can perform multiple edits on the same image without retraining. We test our method with the Imagen model across a range of use cases and demonstrate that it is broadly applicable, performing a surprisingly wide range of expressive editing operations, including edits requiring significant visual changes that were previously impossible.
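The abstract outlines a two-phase recipe: first fine-tune the text-to-image model on the single base image, then sample an edit starting from a noised copy of that image under strong text guidance. Below is a minimal sketch of that flow, assuming a hypothetical text-conditioned diffusion model exposing `num_timesteps`, `q_sample`, `loss`, `denoise_step`, and `prev_sample`; these method names, the rare token, and all hyperparameters are illustrative assumptions, not the authors' actual Imagen interface, and the paper's post-sampling detail-interpolation step is omitted for brevity.

```python
# Minimal sketch of the UniTune flow described in the abstract, assuming a
# generic text-conditioned diffusion model. The interface (num_timesteps,
# q_sample, loss, denoise_step, prev_sample), the rare token, and all
# hyperparameters are illustrative assumptions, not the authors' exact API.
import torch

RARE_TOKEN = "beikkpic"  # arbitrary rare string bound to the base image

def finetune_on_single_image(model, base_image, steps=128, lr=1e-5):
    """Fine-tune the model so that conditioning on RARE_TOKEN
    reconstructs `base_image` (the paper's central observation)."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        t = torch.randint(0, model.num_timesteps, (1,))
        noise = torch.randn_like(base_image)
        x_t = model.q_sample(base_image, t, noise)  # forward-noised image
        loss = model.loss(x_t, t, prompt=RARE_TOKEN, target=noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return model

@torch.no_grad()
def edit(model, base_image, edit_prompt, start_frac=0.85, guidance=16.0):
    """Sample an edit: initialize the sampler from a noised copy of the
    base image (as the abstract describes) and denoise under the combined
    prompt with a strong classifier-free guidance weight."""
    prompt = f"{RARE_TOKEN} {edit_prompt}"
    t0 = int(start_frac * model.num_timesteps)
    x = model.q_sample(base_image, torch.tensor([t0]),
                       torch.randn_like(base_image))  # noised init
    for t in reversed(range(t0)):
        eps_cond = model.denoise_step(x, t, prompt=prompt)
        eps_uncond = model.denoise_step(x, t, prompt="")
        eps = eps_uncond + guidance * (eps_cond - eps_uncond)  # CFG
        x = model.prev_sample(x, t, eps)  # one reverse-diffusion step
    return x
```

Binding the base image to a rare token during fine-tuning is what allows the same tuned checkpoint to serve multiple edits: each new edit reuses the token with a different textual description, with no retraining, matching the abstract's claim.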
Authors: Dani Valevski, Matan Kalman, Eyal Molad, Eyal Segalis, Yossi Matias, Yaniv Leviathan