
Diffusion Model-Based Image Editing: A Survey (2402.17525v2)

Published 27 Feb 2024 in cs.CV

Abstract: Denoising diffusion models have emerged as a powerful tool for various image generation and editing tasks, facilitating the synthesis of visual content in an unconditional or input-conditional manner. The core idea behind them is learning to reverse the process of gradually adding noise to images, allowing them to generate high-quality samples from a complex distribution. In this survey, we provide an exhaustive overview of existing methods using diffusion models for image editing, covering both theoretical and practical aspects in the field. We delve into a thorough analysis and categorization of these works from multiple perspectives, including learning strategies, user-input conditions, and the array of specific editing tasks that can be accomplished. In addition, we pay special attention to image inpainting and outpainting, and explore both earlier traditional context-driven and current multimodal conditional methods, offering a comprehensive analysis of their methodologies. To further evaluate the performance of text-guided image editing algorithms, we propose a systematic benchmark, EditEval, featuring an innovative metric, LMM Score. Finally, we address current limitations and envision some potential directions for future research. The accompanying repository is released at https://github.com/SiatMMLab/Awesome-Diffusion-Model-Based-Image-Editing-Methods.
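
For readers less familiar with the mechanism the abstract refers to, the forward (noising) process and the standard noise-prediction objective can be stated compactly; this is the generic DDPM formulation from the diffusion literature, not anything specific to this survey.

```latex
% Forward (noising) process with variance schedule \beta_t and \bar{\alpha}_t = \prod_{s=1}^{t} (1 - \beta_s):
q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\ \sqrt{1-\beta_t}\, x_{t-1},\ \beta_t \mathbf{I}\right),
\qquad
q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1-\bar{\alpha}_t)\mathbf{I}\right)

% Training objective: a network \epsilon_\theta(x_t, t) learns to predict the added noise,
% which is what "learning to reverse the process of gradually adding noise" amounts to in practice:
\mathcal{L}_{\mathrm{simple}} = \mathbb{E}_{x_0,\ \epsilon \sim \mathcal{N}(0,\mathbf{I}),\ t}
\left[ \left\| \epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon,\ t\right) \right\|^2 \right]
```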

An Overview of Diffusion Model-Based Image Editing: Methodologies and Future Directions

Rapid advances in denoising diffusion models have driven significant progress in image editing, a crucial subdomain of AI-generated content (AIGC). The paper "Diffusion Model-Based Image Editing: A Survey" presents a comprehensive examination of the role diffusion models play in enabling complex image editing tasks. It categorizes existing methodologies and also addresses the challenges and potential directions for future work in this active research area.

The authors classify diffusion model-based image editing methods into three prominent categories according to their learning strategies: training-based approaches, testing-time finetuning approaches, and training- and finetuning-free methods. Training-based approaches are further subdivided into domain-specific editing methods that rely on CLIP guidance, cycling regularization, projection and interpolation, or classifier guidance to strengthen a model's capabilities within a target domain. These methods are particularly beneficial for tasks such as semantic and stylistic editing, where generating nuanced artistic styles or performing unpaired image-to-image translation is required.
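
As a concrete illustration of the classifier-guidance idea mentioned above, the sketch below shows how a noise prediction can be nudged by the gradient of a classifier's log-probability, in the spirit of Dhariwal and Nichol. It is a minimal PyTorch-style sketch, not code from any surveyed method; `unet`, `classifier`, and `alpha_bar` are hypothetical stand-ins.

```python
import torch

def classifier_guided_eps(unet, classifier, x_t, t, y, alpha_bar, scale=3.0):
    """One classifier-guided noise prediction (sketch).

    Assumed placeholders: unet(x, t) -> predicted noise, classifier(x, t) -> class
    logits, alpha_bar -> 1-D tensor of cumulative products of (1 - beta).
    """
    eps = unet(x_t, t)                                         # unconditional noise estimate
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(len(y)), y]          # log p(y | x_t) per sample
        grad = torch.autograd.grad(selected.sum(), x_in)[0]
    # Shift the noise estimate along the classifier gradient; `scale` sets guidance strength.
    coef = torch.sqrt(1.0 - alpha_bar[t]).view(-1, 1, 1, 1)
    return eps - scale * coef * grad
```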

Testing-time finetuning methods offer precise control over image edits by finetuning specific layers or embeddings of a model for each input. Approaches such as denoising-model finetuning, embedding adjustment, latent-variable optimization, and hybrid finetuning show how fine-grained edits can be achieved with relatively modest adaptation, although the per-image optimization they require adds latency compared with finetuning-free alternatives.
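
To make the idea of test-time embedding adjustment concrete, here is a heavily simplified sketch in the spirit of textual-inversion-style finetuning: a single text embedding is optimized against the standard noise-prediction loss on one image while the denoiser stays frozen. All helper names (`unet`, `vae_encode`, `encode_prompt_with_token`) and the embedding width are assumptions, not an actual library API.

```python
import torch

def optimize_token_embedding(unet, vae_encode, encode_prompt_with_token,
                             image, alpha_bar, steps=500, lr=5e-3):
    """Toy test-time embedding adjustment: learn one pseudo-word embedding so the
    frozen denoiser reconstructs `image` when that token appears in the prompt."""
    latent0 = vae_encode(image)                               # clean latent of the target image
    token = torch.randn(768, requires_grad=True)              # learnable pseudo-word embedding (width assumed)
    opt = torch.optim.Adam([token], lr=lr)                    # only the embedding is optimized

    for _ in range(steps):
        t = torch.randint(0, len(alpha_bar), (1,))
        noise = torch.randn_like(latent0)
        a = alpha_bar[t].view(-1, 1, 1, 1)
        noisy = a.sqrt() * latent0 + (1 - a).sqrt() * noise   # forward-diffuse the latent
        cond = encode_prompt_with_token("a photo of <token>", token)
        loss = torch.nn.functional.mse_loss(unet(noisy, t, cond), noise)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return token
```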

In contrast, training- and finetuning-free methods leverage the inherent properties of pretrained diffusion models, relying on techniques such as carefully constructed prompts, modified inversion and sampling procedures, or mask guidance to achieve the desired alterations without any retraining. These methods highlight the versatility and practicality of diffusion models in real-world settings.
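
The "inversion" step mentioned above usually means running the deterministic DDIM update in reverse so that a real image maps to a noise latent that regenerates it. Below is a bare-bones sketch of that loop, assuming `unet(x, t, cond)` returns predicted noise and `alpha_bar` holds cumulative alphas (both placeholders).

```python
import torch

@torch.no_grad()
def ddim_invert(unet, x0, cond, alpha_bar):
    """Deterministic DDIM inversion sketch: walk a clean (latent) image x0 up the
    noise schedule so that forward DDIM sampling approximately reproduces it."""
    x = x0
    for t in range(len(alpha_bar) - 1):                          # move from step t to t+1
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = unet(x, torch.tensor([t]), cond)                   # noise estimate at the current step
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # implied clean latent
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps  # deterministic step toward more noise
    return x
```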

The paper places significant emphasis on image inpainting and outpainting, relating earlier context-driven methods to contemporary multimodal conditional approaches that use text, segmentation maps, or reference images for guidance. The latter, in particular, illustrate how pretrained diffusion models can be finetuned to handle complex compositional tasks with enhanced precision, underscoring the models' adaptability.
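
A common training-free trick behind many context-driven inpainting methods is to blend known and generated regions at every denoising step, as in RePaint. The snippet below sketches only that blending step; `sample_prev` and `forward_diffuse` are assumed helpers for a single denoising step and for noising the clean image to the matching level.

```python
def masked_blend_step(x_t, t, image, mask, sample_prev, forward_diffuse):
    """One RePaint-style blending step for training-free inpainting (sketch).

    `mask` is 1 where content should be regenerated and 0 where the original
    image should be kept."""
    x_prev_generated = sample_prev(x_t, t)            # model's proposal for the whole frame at t-1
    x_prev_known = forward_diffuse(image, t - 1)      # known pixels, noised to the matching level
    # Composite: trust the generator only inside the hole, keep the rest on the
    # forward trajectory of the original image.
    return mask * x_prev_generated + (1 - mask) * x_prev_known
```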

Evaluation of these methodologies is supported by EditEval, a benchmark introduced in the paper for assessing text-guided diffusion-based image editing. It features LMM Score, a novel metric designed to quantify editing performance across tasks, reinforcing the importance of standardized evaluation for advancing research in the field.
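
LMM Score itself is defined in the paper; for orientation, the snippet below shows the kind of automatic text-image alignment check (a CLIP cosine similarity) that such benchmarks typically complement. It uses the Hugging Face `transformers` CLIP classes, and `image` is assumed to be a PIL image; treat it as an illustrative baseline, not the paper's metric.

```python
import torch
from transformers import CLIPModel, CLIPProcessor

def clip_alignment_score(image, prompt, model_name="openai/clip-vit-base-patch32"):
    """Cosine similarity between an edited image and its target prompt in CLIP space,
    a common automatic proxy for text-guided editing quality."""
    model = CLIPModel.from_pretrained(model_name)
    processor = CLIPProcessor.from_pretrained(model_name)
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```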

Despite recent progress, the field faces several challenges, including the need for fewer-step inference, more efficient model architectures, and better handling of complex object structure, lighting, and shadows. Robustness remains an ongoing concern, with methods often struggling to maintain consistency across diverse scenarios. The authors advocate for evaluation metrics that go beyond traditional user studies, suggesting directions that involve large multimodal models for more comprehensive assessment.

In conclusion, the survey highlights the substantial potential and transformative impact of diffusion models in image editing. By offering a detailed exploration of the existing methodologies and pinpointing areas necessitating further research, it sets the stage for future advancements that promise to enhance the fidelity and versatility of image editing technologies in the AIGC domain.

Authors (10)
  1. Yi Huang
  2. Jiancheng Huang
  3. Yifan Liu
  4. Mingfu Yan
  5. Jiaxi Lv
  6. Jianzhuang Liu
  7. Wei Xiong
  8. He Zhang
  9. Shifeng Chen
  10. Liangliang Cao