Controllable Generation with Text-to-Image Diffusion Models: A Survey (2403.04279v1)

Published 7 Mar 2024 in cs.CV

Abstract: In the rapidly advancing realm of visual generation, diffusion models have revolutionized the landscape, marking a significant shift in capabilities with their impressive text-guided generative functions. However, relying solely on text for conditioning these models does not fully cater to the varied and complex requirements of different applications and scenarios. Acknowledging this shortfall, a variety of studies aim to control pre-trained text-to-image (T2I) models to support novel conditions. In this survey, we undertake a thorough review of the literature on controllable generation with T2I diffusion models, covering both the theoretical foundations and practical advancements in this domain. Our review begins with a brief introduction to the basics of denoising diffusion probabilistic models (DDPMs) and widely used T2I diffusion models. We then reveal the controlling mechanisms of diffusion models, theoretically analyzing how novel conditions are introduced into the denoising process for conditional generation. Additionally, we offer a detailed overview of research in this area, organizing it into distinct categories from the condition perspective: generation with specific conditions, generation with multiple conditions, and universal controllable generation. For an exhaustive list of the controllable generation literature surveyed, please refer to our curated repository at https://github.com/PRIV-Creation/Awesome-Controllable-T2I-Diffusion-Models.

Overview of "Controllable Generation with Text-to-Image Diffusion Models: A Survey"

The paper "Controllable Generation with Text-to-Image Diffusion Models: A Survey" addresses the evolving landscape of text-guided visual generation through diffusion models. Recognizing the limitations of relying solely on text conditions, the authors present a comprehensive review of literature focusing on control mechanisms that accommodate novel conditions beyond text prompts.

Theoretical Foundations

The survey begins with an introduction to denoising diffusion probabilistic models (DDPMs) and their foundational role in generating high-quality images from noise. The text-to-image diffusion models discussed include GLIDE, Imagen, DALL·E 2, Latent Diffusion Models (LDM), and Stable Diffusion, each characterized by distinct architectures and training datasets. These models form the basis for exploring various condition-control mechanisms within diffusion models.
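
For orientation, the core DDPM relations that the surveyed T2I models build on can be summarized compactly in the standard ε-prediction notation; this is a summary sketch rather than the paper's exact formulation.

```latex
% Forward (noising) process and its closed form
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \sqrt{1-\beta_t}\,x_{t-1},\, \beta_t I\big), \qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\, \sqrt{\bar{\alpha}_t}\,x_0,\, (1-\bar{\alpha}_t) I\big), \qquad
\bar{\alpha}_t = \textstyle\prod_{s=1}^{t}(1-\beta_s)

% Training objective: predict the injected noise, optionally conditioned on a text embedding c
\mathcal{L} = \mathbb{E}_{x_0,\, \epsilon \sim \mathcal{N}(0,I),\, t}\big[\,\|\epsilon - \epsilon_\theta(x_t, t, c)\|_2^2\,\big], \qquad
x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon

% Classifier-free guidance applied by most T2I models at sampling time
\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \varnothing) + w\big(\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \varnothing)\big)
```

Text-to-image models implement ε_θ as a (latent-space) U-Net or transformer conditioned on text embeddings; the controllable-generation methods discussed below differ mainly in how additional conditions enter ε_θ or the sampling update.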

Mechanisms for Controllable Generation

Controlling text-to-image models with novel conditions is a central theme, and the authors outline two primary mechanisms: conditional score prediction and condition-guided score estimation.

  1. Conditional Score Prediction incorporates novel conditions directly into the generative model's noise (score) prediction, through model-based, tuning-based, or training-free approaches. In each case the condition is injected into the denoising network so that it steers generation from within the sampling process.
  2. Condition-Guided Score Estimation leverages an additional model to estimate a condition-dependent gradient from the intermediate latents and adds it to the predicted score during sampling, steering generation toward the condition without retraining the underlying T2I model (the sketch after this list contrasts the two mechanisms).
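
To make the distinction concrete, the minimal PyTorch sketch below contrasts the two mechanisms on a single denoising step. The tiny networks are stand-ins for a real text-conditioned denoiser and an auxiliary condition predictor (assumptions made purely for illustration), and the guidance term follows the generic classifier-guidance-style gradient rather than any specific method from the survey.

```python
import torch
import torch.nn as nn

class TinyDenoiser(nn.Module):
    """Stand-in for epsilon_theta(x_t, t, c): the condition enters the network itself."""
    def __init__(self, dim=16, cond_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 64), nn.SiLU(), nn.Linear(64, dim))

    def forward(self, x_t, t, c):
        return self.net(torch.cat([x_t, t.float().unsqueeze(-1), c], dim=-1))

class TinyConditionPredictor(nn.Module):
    """Stand-in for log p_phi(c | x_t): scores how well a noisy latent matches the condition."""
    def __init__(self, dim=16, cond_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + cond_dim, 64), nn.SiLU(), nn.Linear(64, 1))

    def forward(self, x_t, c):
        return self.net(torch.cat([x_t, c], dim=-1)).squeeze(-1)

def conditional_score_prediction(denoiser, x_t, t, c):
    # Mechanism 1: feed the condition directly into the denoiser
    # (model-based, tuning-based, and training-free variants all end up here).
    return denoiser(x_t, t, c)

def condition_guided_score_estimation(denoiser, predictor, x_t, t, c, null_c, scale=2.0):
    # Mechanism 2: keep the denoiser as-is and steer sampling with the gradient
    # of an auxiliary model's log-likelihood with respect to the latent.
    eps_uncond = denoiser(x_t, t, null_c)
    x_req = x_t.detach().requires_grad_(True)
    log_p = predictor(x_req, c).sum()
    grad = torch.autograd.grad(log_p, x_req)[0]
    # Subtracting the gradient of log p(c|x_t) from the noise prediction pushes x_t toward c.
    return eps_uncond - scale * grad

if __name__ == "__main__":
    x_t = torch.randn(4, 16)                 # noisy latents
    t = torch.full((4,), 500)                # diffusion timestep
    c = torch.randn(4, 8)                    # embedded novel condition (layout, sketch, identity, ...)
    null_c = torch.zeros(4, 8)               # "no condition" embedding
    denoiser, predictor = TinyDenoiser(), TinyConditionPredictor()
    print(conditional_score_prediction(denoiser, x_t, t, c).shape)
    print(condition_guided_score_estimation(denoiser, predictor, x_t, t, c, null_c).shape)
```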

Categorization of Conditional Generation Methods

The survey categorizes approaches by the type of novel condition they target:

  1. Personalization: Covers subject-, person-, style-, interaction-, image-, and distribution-driven generation. By tuning model parameters, learning new concept embeddings, or using model-based score prediction, these methods produce outputs that reflect the unique subjects or styles in reference images (a minimal sketch of embedding-based personalization follows this list).
  2. Spatial Control: Explores methods using spatial conditions like layouts and masks to achieve structure-driven generation, utilizing both conditional score prediction and guided score estimation.
  3. Advanced Text-Conditioned Generation: Tackles challenges of textual alignment and multilingual generation by refining attention mechanisms or integrating multilingual models.
  4. In-Context and Brain/Sound-Guided Generation: Extends conditioning beyond text and visual cues, using in-context examples as well as brain activity and sound signals to guide generation.
  5. Text Rendering: Enhances the capability of models to generate visually coherent text within images, leveraging text encoders and training adjustments.
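
As an illustration of the personalization category, the sketch below optimizes a single new concept embedding against the standard noise-prediction loss while the generative model stays frozen, in the spirit of textual-inversion-style methods. All modules, the noise schedule, and the data are toy stand-ins chosen for brevity, not a faithful reproduction of any surveyed method.

```python
import torch
import torch.nn as nn

class FrozenTinyDenoiser(nn.Module):
    """Stand-in for a pretrained, frozen T2I denoiser epsilon_theta(x_t, t, c)."""
    def __init__(self, dim=16, cond_dim=8):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1 + cond_dim, 64), nn.SiLU(), nn.Linear(64, dim))
        for p in self.parameters():
            p.requires_grad_(False)  # pretrained weights are kept fixed

    def forward(self, x_t, t, c):
        return self.net(torch.cat([x_t, t.float().unsqueeze(-1), c], dim=-1))

denoiser = FrozenTinyDenoiser()
concept_embedding = nn.Parameter(torch.randn(8) * 0.01)  # the only trainable parameter ("S*")
optimizer = torch.optim.Adam([concept_embedding], lr=5e-3)
reference_latents = torch.randn(8, 16)                   # stand-in for encoded reference images

for step in range(200):
    x0 = reference_latents[torch.randint(0, 8, (4,))]
    t = torch.randint(1, 1000, (4,))
    noise = torch.randn_like(x0)
    alpha_bar = torch.cos(t.float() / 1000 * torch.pi / 2).pow(2).unsqueeze(-1)  # toy schedule
    x_t = alpha_bar.sqrt() * x0 + (1 - alpha_bar).sqrt() * noise
    c = concept_embedding.expand(4, -1)                  # "a photo of S*" reduced to one embedding
    loss = (denoiser(x_t, t, c) - noise).pow(2).mean()   # standard epsilon-prediction loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Tuning-based alternatives instead update (a subset of) the denoiser's weights, for example with low-rank adapters, while encoder-based methods train an image encoder to produce such embeddings in a single forward pass.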

Frameworks for Multiple Conditions and Universal Control

For scenarios requiring the integration of multiple conditions, the paper reviews methods based on joint training, continual learning, weight fusion, attention-based integration, and guidance composition; a minimal sketch of guidance composition follows. The authors also discuss frameworks for universal control, covering approaches that accommodate varied conditions through generalized conditional score prediction or condition-guided score estimation.
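
A common building block behind guidance composition is to sum per-condition classifier-free-guidance deltas at sampling time. The sketch below shows that pattern under the simplifying assumption that the denoiser accepts an embedded condition with the same shape as its output; it illustrates the general idea rather than any particular surveyed method.

```python
import torch

def composed_guidance(denoiser, x_t, t, conditions, weights, null_cond):
    """Combine several conditions by summing their CFG-style deltas around one unconditional pass."""
    eps_uncond = denoiser(x_t, t, null_cond)
    eps = eps_uncond.clone()
    for c, w in zip(conditions, weights):
        eps = eps + w * (denoiser(x_t, t, c) - eps_uncond)  # each condition adds its own scaled delta
    return eps

if __name__ == "__main__":
    toy = lambda x, t, c: 0.1 * x + c                      # stand-in denoiser with matching shapes
    x_t = torch.randn(2, 16)
    t = torch.full((2,), 100)
    conds = [torch.randn(2, 16), torch.randn(2, 16)]       # e.g., a text embedding and a layout embedding
    eps = composed_guidance(toy, x_t, t, conds, weights=[3.0, 1.5], null_cond=torch.zeros(2, 16))
    print(eps.shape)
```

Per-condition weights trade off how strongly each signal steers the sample, which is why such methods typically expose them as user-tunable guidance scales.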

Implications and Future Directions

This survey underscores the transformative potential of controllable generation techniques, not only enhancing image synthesis capabilities but also broadening applications in personalization, image manipulation, and 3D reconstruction. It points toward adaptable generative models that align with multifaceted user requirements across diverse domains.

Overall, the paper provides a structured and detailed examination of the state-of-the-art in controllable text-to-image generation, offering valuable insights and a foundation for future advancements in artificial intelligence-driven content creation.

Authors (4)
  1. Pu Cao (10 papers)
  2. Feng Zhou (195 papers)
  3. Qing Song (23 papers)
  4. Lu Yang (82 papers)
Citations (20)