TiNO-Edit: Timestep and Noise Optimization for Robust Diffusion-Based Image Editing (2404.11120v1)

Published 17 Apr 2024 in cs.CV

Abstract: Despite many attempts to leverage pre-trained text-to-image (T2I) models like Stable Diffusion (SD) for controllable image editing, producing good, predictable results remains a challenge. Previous approaches have focused either on fine-tuning pre-trained T2I models on specific datasets to generate certain kinds of images (e.g., with a specific object or person), or on optimizing the weights, text prompts, and/or learned features for each input image in an attempt to coax the image generator to produce the desired result. However, these approaches all have shortcomings and fail to produce good results in a predictable and controllable manner. To address this problem, we present TiNO-Edit, an SD-based method that focuses on optimizing the noise patterns and diffusion timesteps during editing, something previously unexplored in the literature. With this simple change, we are able to generate results that both better align with the original images and reflect the desired edits. Furthermore, we propose a set of new loss functions that operate in the latent domain of SD, greatly speeding up the optimization when compared to prior approaches, which operate in the pixel domain. Our method can be easily applied to variations of SD, including Textual Inversion and DreamBooth, that encode new concepts and incorporate them into the edited results. We present a host of image-editing capabilities enabled by our approach. Our code is publicly available at https://github.com/SherryXTChen/TiNO-Edit.
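The core idea, optimizing the initial noise pattern and the diffusion timestep rather than model weights or prompts, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration and not the authors' implementation: the U-Net, source latent, and prompt embedding are stand-in stubs for the pre-trained Stable Diffusion components, the noise schedule is a simple cosine approximation rather than SD's discrete beta schedule, and only a latent-space preservation loss is shown (the paper's actual latent-domain objectives also include edit-alignment terms).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

class StubUNet(nn.Module):
    """Tiny conv net standing in for SD's noise-prediction U-Net."""
    def __init__(self, ch: int = 4):
        super().__init__()
        self.net = nn.Conv2d(ch, ch, kernel_size=3, padding=1)

    def forward(self, z_t, t, text_emb):
        # Real SD conditions on the timestep and text embedding; the stub ignores them.
        return self.net(z_t)

unet = StubUNet()
z0_src = torch.randn(1, 4, 64, 64)   # VAE latent of the input image (assumed precomputed)
text_emb = torch.randn(1, 77, 768)   # prompt embedding (assumed precomputed)

# Optimization variables: the noise pattern and a continuous surrogate timestep.
noise = torch.randn_like(z0_src).requires_grad_(True)
t_logit = torch.tensor(0.0, requires_grad=True)   # sigmoid maps it into (0, 1)
opt = torch.optim.Adam([noise, t_logit], lr=1e-2)

def alpha_bar(t):
    # Simple cosine schedule; SD itself uses a discrete beta schedule.
    return torch.cos(t * torch.pi / 2) ** 2

for step in range(200):
    t = torch.sigmoid(t_logit)
    ab = alpha_bar(t)
    z_t = ab.sqrt() * z0_src + (1.0 - ab).sqrt() * noise  # forward-diffuse the source latent
    eps = unet(z_t, t, text_emb)                          # predicted noise
    z0_hat = (z_t - (1.0 - ab).sqrt() * eps) / ab.sqrt()  # one-step x0 estimate (DDIM-style)

    # Latent-domain loss: keep the edit close to the source latent. A real
    # objective would add terms pulling z0_hat toward the edit prompt.
    loss = F.mse_loss(z0_hat, z0_src)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the loss is computed directly on latents, each optimization step avoids decoding back to pixel space through the VAE, which is the source of the speedup over pixel-domain losses that the abstract describes.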

Authors (8)
  1. Sherry X. Chen (4 papers)
  2. Yaron Vaxman (2 papers)
  3. Elad Ben Baruch (3 papers)
  4. David Asulin (2 papers)
  5. Aviad Moreshet (3 papers)
  6. Kuo-Chin Lien (6 papers)
  7. Misha Sra (37 papers)
  8. Pradeep Sen (23 papers)
Citations (2)
