LLMs Meet Multimodal Generation and Editing: A Survey (2405.19334v2)

Published 29 May 2024 in cs.AI, cs.CL, cs.CV, cs.MM, and cs.SD

Abstract: With the recent advancement in LLMs, there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal LLMs (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

  289. M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 142–13 153.
  290. A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8821–8831.
  291. H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich, “Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 619–12 629.
  292. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  293. P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” arXiv preprint arXiv:2106.10689, 2021.
  294. T. Shen, J. Gao, K. Yin, M.-Y. Liu, and S. Fidler, “Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 6087–6101, 2021.
  295. M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre et al., “Objaverse-xl: A universe of 10m+ 3d objects,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  296. J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 901–10 911.
  297. A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” arXiv preprint arXiv:2212.08751, 2022.
  298. J. Lei, Y. Zhang, K. Jia et al., “Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 923–30 936, 2022.
  299. Y. Ma, X. Zhang, X. Sun, J. Ji, H. Wang, G. Jiang, W. Zhuang, and R. Ji, “X-mesh: Towards fast and accurate text-driven 3d stylization via dynamic textual guidance,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2749–2760.
  300. L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  301. A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 867–876.
  302. F. Yin, X. Chen, C. Zhang, B. Jiang, Z. Zhao, J. Fan, G. Yu, T. Li, and T. Chen, “Shapegpt: 3d shape generation with a unified multi-modal language model,” arXiv preprint arXiv:2311.17618, 2023.
  303. G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in European Conference on Computer Vision.   Springer, 2022, pp. 358–374.
  304. B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  305. Z. Zhou and S. Tulsiani, “Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction,” in CVPR, 2023.
  306. Z. Wan, D. Paschalidou, I. Huang, H. Liu, B. Shen, X. Xiang, J. Liao, and L. Guibas, “Cad: Photorealistic 3d generation via adversarial distillation,” arXiv preprint arXiv:2312.06663, 2023.
  307. B. Yang, W. Dong, L. Ma, W. Hu, X. Liu, Z. Cui, and Y. Ma, “Dreamspace: Dreaming your room space with text-driven panoramic texture propagation,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR).   IEEE, 2024, pp. 650–660.
  308. G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 663–12 673.
  309. O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski, “Noise-free score distillation,” 2023.
  310. M. Armandpour, H. Zheng, A. Sadeghian, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” arXiv preprint arXiv:2304.04968, 2023.
  311. L. Zhou, A. Shih, C. Meng, and S. Ermon, “Dreampropeller: Supercharge text-to-3d generation with parallel sampling,” arXiv preprint arXiv:2311.17082, 2023.
  312. C. Yu, G. Lu, Y. Zeng, J. Sun, X. Liang, H. Li, Z. Xu, S. Xu, W. Zhang, and H. Xu, “Towards high-fidelity text-guided 3d face generation and manipulation using only images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 326–15 337.
  313. C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, and C. Shen, “Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation,” arXiv preprint arXiv:2305.19012, 2023.
  314. T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen et al., “Rodin: A generative model for sculpting 3d digital avatars using diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4563–4573.
  315. S. Aneja, J. Thies, A. Dai, and M. Nießner, “Clipface: Text-guided editing of textured 3d morphable models,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
  316. M. Wu, H. Zhu, L. Huang, Y. Zhuang, Y. Lu, and X. Cao, “High-fidelity 3d face generation from natural language descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4521–4530.
  317. T. Liao, H. Yi, Y. Xiu, J. Tang, Y. Huang, J. Thies, and M. J. Black, “Tada! text to animatable digital avatars,” arXiv preprint arXiv:2308.10899, 2023.
  318. S. Huang, Z. Yang, L. Li, Y. Yang, and J. Jia, “Avatarfusion: Zero-shot generation of clothing-decoupled 3d avatars using 2d diffusion,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5734–5745.
  319. X. Han, Y. Cao, K. Han, X. Zhu, J. Deng, Y.-Z. Song, T. Xiang, and K.-Y. K. Wong, “Headsculpt: Crafting 3d head avatars with text,” arXiv preprint arXiv:2306.03038, 2023.
  320. Y. Cao, Y.-P. Cao, K. Han, Y. Shan, and K.-Y. K. Wong, “Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models,” arXiv preprint arXiv:2304.00916, 2023.
  321. H. Zhang, B. Chen, H. Yang, L. Qu, X. Wang, L. Chen, C. Long, F. Zhu, K. Du, and M. Zheng, “Avatarverse: High-quality & stable 3d avatar creation from text and pose,” arXiv preprint arXiv:2308.03610, 2023.
  322. L. Zhang, Q. Qiu, H. Lin, Q. Zhang, C. Shi, W. Yang, Y. Shi, S. Yang, L. Xu, and J. Yu, “Dreamface: Progressive generation of animatable 3d faces under text guidance,” arXiv preprint arXiv:2304.03117, 2023.
  323. F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,” arXiv preprint arXiv:2205.08535, 2022.
  324. N. Kolotouros, T. Alldieck, A. Zanfir, E. G. Bazavan, M. Fieraru, and C. Sminchisescu, “Dreamhuman: Animatable 3d avatars from text,” arXiv preprint arXiv:2306.09329, 2023.
  325. X. Huang, R. Shao, Q. Zhang, H. Zhang, Y. Feng, Y. Liu, and Q. Wang, “Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation,” arXiv preprint arXiv:2310.01406, 2023.
  326. Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, and X. Cao, “Avatarbooth: High-quality and customizable 3d human avatar generation,” arXiv preprint arXiv:2306.09864, 2023.
  327. D. Wang, H. Meng, Z. Cai, Z. Shao, Q. Liu, L. Wang, M. Fan, Y. Shan, X. Zhan, and Z. Wang, “Headevolver: Text to head avatars via locally learnable mesh deformation,” arXiv preprint arXiv:2403.09326, 2024.
  328. H. Liu, X. Wang, Z. Wan, Y. Shen, Y. Song, J. Liao, and Q. Chen, “Headartist: Text-conditioned 3d head generation with self score distillation,” arXiv preprint arXiv:2312.07539, 2023.
  329. Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023.
  330. Y. Kant, Z. Wu, M. Vasilkovsky, G. Qian, J. Ren, R. A. Guler, B. Ghanem, S. Tulyakov, I. Gilitschenski, and A. Siarohin, “Spad: Spatially aware multiview diffusers,” arXiv preprint arXiv:2402.05235, 2024.
  331. Z. Liu, Y. Li, Y. Lin, X. Yu, S. Peng, Y.-P. Cao, X. Qi, X. Huang, D. Liang, and W. Ouyang, “Unidream: Unifying diffusion priors for relightable text-to-3d generation,” 2023.
  332. L. Qiu, G. Chen, X. Gu, Q. zuo, M. Xu, Y. Wu, W. Yuan, Z. Dong, L. Bo, and X. Han, “Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d,” arXiv preprint arXiv:2311.16918, 2023.
  333. J. Li, H. Tan, K. Zhang, Z. Xu, F. Luan, Y. Xu, Y. Hong, K. Sunkavalli, G. Shakhnarovich, and S. Bi, “Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model,” arXiv preprint arXiv:2311.06214, 2023.
  334. J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu, “Lgm: Large multi-view gaussian model for high-resolution 3d content creation,” arXiv preprint arXiv:2402.05054, 2024.
  335. X. Yinghao, S. Zifan, Y. Wang, C. Hansheng, Y. Ceyuan, P. Sida, S. Yujun, and W. Gordon, “Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation,” 2024.
  336. H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” arXiv preprint arXiv:2305.02463, 2023.
  337. Z. Hu, A. Iscen, A. Jain, T. Kipf, Y. Yue, D. A. Ross, C. Schmid, and A. Fathi, “Scenecraft: An llm agent for synthesizing 3d scene as blender code,” arXiv preprint arXiv:2403.01248, 2024.
  338. R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” arXiv preprint arXiv:2308.16911, 2023.
  339. Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” arXiv preprint arXiv:2307.12981, 2023.
  340. O. Gordon, O. Avrahami, and D. Lischinski, “Blended-nerf: Zero-shot object generation and blending in existing neural radiance fields,” arXiv preprint arXiv:2306.12760, 2023.
  341. W. Gao, N. Aigerman, T. Groueix, V. Kim, and R. Hanocka, “Textdeformer: Geometry manipulation using text guidance,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
  342. C. Bao, Y. Zhang, B. Yang, T. Fan, Z. Yang, H. Bao, G. Zhang, and Z. Cui, “Sine: Semantic-driven image-based nerf editing with prior-guided editing field,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 919–20 929.
  343. A. Mikaeili, O. Perel, M. Safaee, D. Cohen-Or, and A. Mahdavi-Amiri, “Sked: Sketch-guided text-based 3d editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 14 607–14 619.
  344. J. Zhuang, C. Wang, L. Lin, L. Liu, and G. Li, “Dreameditor: Text-driven 3d scene editing with neural fields,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10.
  345. A. Haque, M. Tancik, A. A. Efros, A. Holynski, and A. Kanazawa, “Instruct-nerf2nerf: Editing 3d scenes with instructions,” arXiv preprint arXiv:2303.12789, 2023.
  346. D. Decatur, I. Lang, K. Aberman, and R. Hanocka, “3d paintbrush: Local stylization of 3d shapes with cascaded score distillation,” arXiv preprint arXiv:2311.09571, 2023.
  347. G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-or, and A. H. Bermano, “Human motion diffusion model,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=SJ1kSyO2jwu
  348. R. Chen, Y. Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” arXiv preprint arXiv:2303.13873, 2023.
  349. Z. Pan, J. Lu, X. Zhu, and L. Zhang, “Enhancing high-resolution 3d generation through pixel-wise gradient clipping,” in International Conference on Learning Representations (ICLR), 2024.
  350. G. Qian, J. Cao, A. Siarohin, Y. Kant, C. Wang, M. Vasilkovsky, H.-Y. Lee, Y. Fang, I. Skorokhodov, P. Zhuang et al., “Atom: Amortized text-to-mesh using 2d diffusion,” arXiv preprint arXiv:2402.00867, 2024.
  351. Z. Wu, P. Zhou, X. Yi, X. Yuan, and H. Zhang, “Consistent3d: Towards consistent high-fidelity text-to-3d generation with deterministic sampling prior,” arXiv preprint arXiv:2401.09050, 2024.
  352. T. Huang, Y. Zeng, Z. Zhang, W. Xu, H. Xu, S. Xu, R. W. Lau, and W. Zuo, “Dreamcontrol: Control-based text-to-3d generation with 3d self-prior,” arXiv preprint arXiv:2312.06439, 2023.
  353. Y. Chen, C. Zhang, X. Yang, Z. Cai, G. Yu, L. Yang, and G. Lin, “It3d: Improved text-to-3d generation with explicit view synthesis,” 2023.
  354. M. Zhao, C. Zhao, X. Liang, L. Li, Z. Zhao, Z. Hu, C. Fan, and X. Yu, “Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion prior,” arXiv preprint arXiv:2308.13223, 2023.
  355. Z. Chen, F. Wang, and H. Liu, “Text-to-3d using gaussian splatting,” arXiv preprint arXiv:2309.16585, 2023.
  356. Y. Ma, Y. Fan, J. Ji, H. Wang, X. Sun, G. Jiang, A. Shu, and R. Ji, “X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation,” arXiv preprint arXiv:2312.00085, 2023.
  357. J. Wu, X. Gao, X. Liu, Z. Shen, C. Zhao, H. Feng, J. Liu, and E. Ding, “Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3202–3211.
  358. X. Yang, Y. Chen, C. Chen, C. Zhang, Y. Xu, X. Yang, F. Liu, and G. Lin, “Learn to optimize denoising scores for 3d generation: A unified and improved diffusion prior on nerf and 3d gaussian splatting,” arXiv preprint arXiv:2312.04820, 2023.
  359. F. Liu, D. Wu, Y. Wei, Y. Rao, and Y. Duan, “Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior,” 2023.
  360. Y. Lin, R. Clark, and P. Torr, “Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion,” arXiv preprint arXiv:2403.17237, 2024.
  361. Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, C. Callison-Burch, M. Yatskar, A. Kembhavi, and C. Clark, “Holodeck: Language guided generation of 3d embodied ai environments,” arXiv preprint arXiv:2312.09067, 2023.
  362. H. Song, S. Choi, H. Do, C. Lee, and T. Kim, “Blending-nerf: Text-driven localized editing in neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14 383–14 393.
  363. R. He, S. Huang, X. Nie, T. Hui, L. Liu, J. Dai, J. Han, G. Li, and S. Liu, “Customize your nerf: Adaptive source driven 3d scene editing via local-global iterative training,” arXiv preprint arXiv:2312.01663, 2023.
  364. X. Zeng, X. Chen, Z. Qi, W. Liu, Z. Zhao, Z. Wang, B. FU, Y. Liu, and G. Yu, “Paint3d: Paint anything 3d with lighting-less texture diffusion models,” 2023.
  365. E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging.” in ISMIR.   Citeseer, 2009, pp. 387–392.
  366. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  367. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 776–780.
  368. C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the maestro dataset,” arXiv preprint arXiv:1810.12247, 2018.
  369. H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
  370. D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The mtg-jamendo dataset for automatic music tagging.”   ICML, 2019.
  371. J. Kahn, M. Riviere, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7669–7673.
  372. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 721–725.
  373. B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6182–6186.
  374. W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: a 50,000 hours asr corpus with punctuation casing and context,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 10 991–10 995.
  375. J. Zhan, J. Dai, J. Ye, Y. Zhou, D. Zhang, Z. Liu, X. Zhang, R. Yuan, G. Zhang, L. Li et al., “Anygpt: Unified multimodal llm with discrete sequence modeling,” arXiv preprint arXiv:2402.12226, 2024.
  376. H. Hao, L. Zhou, S. Liu, J. Li, S. Hu, R. Wang, and F. Wei, “Boosting large language model for speech synthesis: An empirical study,” arXiv preprint arXiv:2401.00246, 2023.
  377. J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi, “Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action,” arXiv preprint arXiv:2312.17172, 2023.
  378. S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
  379. J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2023, pp. 1–8.
  380. W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for asr,” arXiv preprint arXiv:2309.13963, 2023.
  381. S. Wang, C.-H. H. Yang, J. Wu, and C. Zhang, “Can whisper perform speech-based in-context learning,” arXiv preprint arXiv:2309.07081, 2023.
  382. Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
  383. H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  384. R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
  385. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning.   PMLR, 2023, pp. 28 492–28 518.
  386. S. Kakouros, J. Šimko, M. Vainio, and A. Suni, “Investigating the utility of surprisal from large language models for speech synthesis prosody,” arXiv preprint arXiv:2306.09814, 2023.
  387. Y. Gong, A. Rouditchenko, A. H. Liu, D. Harwath, L. Karlinsky, H. Kuehne, and J. Glass, “Contrastive audio-visual masked autoencoder,” arXiv preprint arXiv:2210.07839, 2022.
  388. Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos, “Musilingo: Bridging music and text with pre-trained language models for music captioning and query response,” arXiv preprint arXiv:2309.08730, 2023.
  389. Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” arXiv preprint arXiv:2306.00107, 2023.
  390. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  391. R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  392. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  393. Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
  394. Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” arXiv preprint arXiv:2307.16789, 2023.
  395. Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and L. Sun, “Toolalpaca: Generalized tool learning for language models with 3000 simulated cases,” arXiv preprint arXiv:2306.05301, 2023.
  396. T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023.
  397. R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: Teaching large language model to use tools via self-instruction,” in Advances in Neural Information Processing Systems, 2023.
  398. N. Farn and R. Shin, “Tooltalk: Evaluating tool-usage in a conversation setting,” arXiv preprint arXiv:2311.10775, 2023.
  399. S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings,” arXiv preprint arXiv:2305.11554, 2023.
  400. C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, R. Krishna, and T. Pfister, “Tool documentation enables zero-shot tool-usage with large language models,” arXiv preprint arXiv:2308.00675, 2023.
  401. J. Ruan, Y. Chen, B. Zhang, Z. Xu, T. Bao, G. Du, S. Shi, H. Mao, X. Zeng, and R. Zhao, “Tptu: Task planning and tool usage of large language model-based ai agents,” arXiv preprint arXiv:2308.03427, 2023.
  402. Z. Liu, Z. Lai, Z. Gao, E. Cui, Z. Li, X. Zhu, L. Lu, Q. Chen, Y. Qiao, J. Dai, and W. Wang, “Controlllm: Augment language models with tools by searching on graphs,” arXiv preprint arXiv:2310.17796, 2023.
  403. A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented language models,” arXiv preprint arXiv:2205.12255, 2022.
  404. J. Zhang, “Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt,” arXiv preprint arXiv:2304.11116, 2023.
  405. Y. Zhuang, X. Chen, T. Yu, S. Mitra, V. Bursztyn, R. A. Rossi, S. Sarkhel, and C. Zhang, “Toolchain*: Efficient action space navigation in large language models with a* search,” arXiv preprint arXiv:2310.13227, 2023.
  406. Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “Critic: Large language models can self-correct with tool-interactive critiquing,” arXiv preprint arXiv:2305.11738, 2023.
  407. Q. Jin, Y. Yang, Q. Chen, and Z. Lu, “Genegpt: Augmenting large language models with domain tools for improved access to biomedical information,” ArXiv, 2023.
  408. B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro, “Art: Automatic multi-step reasoning and tool-use for large language models,” arXiv preprint arXiv:2303.09014, 2023.
  409. Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen et al., “Tora: A tool-integrated reasoning agent for mathematical problem solving,” arXiv preprint arXiv:2309.17452, 2023.
  410. Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, and S. Li, “Restgpt: Connecting large language models with real-world applications via restful apis,” arXiv preprint arXiv:2306.06624, 2023.
  411. S. Qiao, H. Gui, H. Chen, and N. Zhang, “Making language models better tool learners with execution feedback,” arXiv preprint arXiv:2305.13068, 2023.
  412. K. Zhang, H. Chen, L. Li, and W. Wang, “Syntax error-free and generalizable tool use for llms via finite-state decoding,” arXiv preprint arXiv:2310.07075, 2023.
  413. C. Li, H. Chen, M. Yan, W. Shen, H. Xu, Z. Wu, Z. Zhang, W. Zhou, Y. Chen, C. Cheng et al., “Modelscope-agent: Building your customizable agent system with open-source large language models,” arXiv preprint arXiv:2309.00986, 2023.
  414. T. Gupta and A. Kembhavi, “Visual programming: Compositional visual reasoning without training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 953–14 962.
  415. C. Wang, W. Luo, Q. Chen, H. Mai, J. Guo, S. Dong, X. M. Xuan, Z. Li, L. Ma, and S. Gao, “Mllm-tool: A multimodal large language model for tool agent learning,” arXiv preprint arXiv:2401.10727, 2024.
  416. D. Surís, S. Menon, and C. Vondrick, “Vipergpt: Visual inference via python execution for reasoning,” Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023.
  417. Z. Gao, Y. Du, X. Zhang, X. Ma, W. Han, S.-C. Zhu, and Q. Li, “Clova: A closed-loop visual assistant with tool usage and update,” arXiv preprint arXiv:2312.10908, 2023.
  418. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
  419. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
  420. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  421. S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022.
  422. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.
  423. S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, “Peft: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
  424. J. Chen, X. Li, X. Ye, C. Li, Z. Fan, and H. Zhao, “Idea-2-3d: Collaborative lmm agents enable 3d model generation from interleaved multimodal inputs,” arXiv preprint arXiv:2404.04363, 2024.
  425. E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing nlp,” arXiv preprint arXiv:1908.07125, 2019.
  426. X. Fu, Z. Wang, S. Li, R. K. Gupta, N. Mireshghallah, T. Berg-Kirkpatrick, and E. Fernandes, “Misusing tools in large language models with visual adversarial examples,” arXiv preprint arXiv:2310.03185, 2023.
  427. L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adversarial images can control generative models at runtime,” arXiv preprint arXiv:2309.00236, 2023.
  428. A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
  429. E. Jones, A. Dragan, A. Raghunathan, and J. Steinhardt, “Automatically auditing large language models via discrete optimization,” in International Conference on Machine Learning.   PMLR, 2023, pp. 15 307–15 329.
  430. P. Żelasko, S. Joshi, Y. Shao, J. Villalba, J. Trmal, N. Dehak, and S. Khudanpur, “Adversarial attacks and defenses for speech recognition systems,” arXiv preprint arXiv:2103.17122, 2021.
  431. Z. Chen, L. Xie, S. Pang, Y. He, and Q. Tian, “Appending adversarial frames for universal video attack,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3199–3208.
  432. H. Liu, W. Zhou, D. Chen, H. Fang, H. Bian, K. Liu, W. Zhang, and N. Yu, “Coherent adversarial deepfake video generation,” Signal Processing, vol. 203, p. 108790, 2023.
  433. S.-Y. Lo and V. M. Patel, “Defending against multiple and unforeseen adversarial videos,” IEEE Transactions on Image Processing, vol. 31, pp. 962–973, 2021.
  434. H. J. Lee and Y. M. Ro, “Defending video recognition model against adversarial perturbations via defense patterns,” IEEE Transactions on Dependable and Secure Computing, 2023.
  435. Y. Wu, X. Li, Y. Liu, P. Zhou, and L. Sun, “Jailbreaking gpt-4v via self-adversarial attacks with system prompts,” arXiv preprint arXiv:2311.09127, 2023.
  436. Y. Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu, “Defending chatgpt against jailbreak attack via self-reminders,” Nature Machine Intelligence, vol. 5, no. 12, pp. 1486–1496, 2023.
  437. Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, and Y. Liu, “Prompt injection attack against llm-integrated applications,” arXiv preprint arXiv:2306.05499, 2023.
  438. F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” arXiv preprint arXiv:2211.09527, 2022.
  439. N. Carlini, M. Jagielski, C. A. Choquette-Choo, D. Paleka, W. Pearce, H. Anderson, A. Terzis, K. Thomas, and F. Tramèr, “Poisoning web-scale training datasets is practical,” arXiv preprint arXiv:2302.10149, 2023.
  440. R. Jia and P. Liang, “Adversarial examples for evaluating reading comprehension systems,” arXiv preprint arXiv:1707.07328, 2017.
  441. M.-H. Van and X. Wu, “Detecting and correcting hate speech in multimodal memes with large visual language model,” arXiv preprint arXiv:2311.06737, 2023.
  442. Z. Wei, Y. Wang, and Y. Wang, “Jailbreak and guard aligned language models with only few in-context demonstrations,” arXiv preprint arXiv:2310.06387, 2023.
  443. A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,” arXiv preprint arXiv:2310.03684, 2023.
  444. R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent guard: a safety framework for text-to-image generation,” arXiv preprint arXiv:2404.08031, 2024.
  445. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  446. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  447. R. Pi, T. Han, W. Xiong, J. Zhang, R. Liu, R. Pan, and T. Zhang, “Strengthening multimodal large language model with bootstrapped preference optimization,” arXiv preprint arXiv:2403.08730, 2024.
  448. X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li, “Better aligning text-to-image models with human preference,” arXiv preprint arXiv:2303.14420, 2023.
  449. H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang, “Raft: Reward ranked finetuning for generative foundation model alignment,” arXiv preprint arXiv:2304.06767, 2023.
  450. P. Korshunov and S. Marcel, “Deepfakes: A new threat to face recognition? assessment and detection. arxiv 2018,” arXiv preprint arXiv:1812.08685.
  451. Y. Mirsky and W. Lee, “The creation and detection of deepfakes: A survey,” ACM computing surveys (CSUR), vol. 54, no. 1, pp. 1–41, 2021.
  452. M. Masood, M. Nawaz, K. M. Malik, A. Javed, A. Irtaza, and H. Malik, “Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward,” Applied intelligence, vol. 53, no. 4, pp. 3974–4026, 2023.
  453. L. Verdoliva, “Media forensics and deepfakes: an overview,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 910–932, 2020.
  454. D. Wodajo, S. Atnafu, and Z. Akhtar, “Deepfake video detection using generative convolutional vision transformer,” arXiv preprint arXiv:2307.07036, 2023.
  455. D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” arXiv preprint arXiv:2102.11126, 2021.
  456. S. Hussain, P. Neekhara, M. Jere, F. Koushanfar, and J. McAuley, “Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3348–3357.
  457. W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer, “Detecting pretraining data from large language models,” arXiv preprint arXiv:2310.16789, 2023.
  458. S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, and A. Madry, “Trak: Attributing model behavior at scale,” arXiv preprint arXiv:2303.14186, 2023.
  459. Z. Wang, C. Chen, Y. Zeng, L. Lyu, and S. Ma, “Where did i come from? origin attribution of ai-generated images,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  460. J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in International Conference on Machine Learning.   PMLR, 2023, pp. 17 061–17 084.
  461. Y. Cui, J. Ren, H. Xu, P. He, H. Liu, L. Sun, and J. Tang, “Diffusionshield: A watermark for copyright protection against generative diffusion models,” arXiv preprint arXiv:2306.04642, 2023.
  462. P. Fernandez, G. Couairon, H. Jégou, M. Douze, and T. Furon, “The stable signature: Rooting watermarks in latent diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 466–22 477.
  463. Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang, “Safetybench: Evaluating the safety of large language models with multiple choice questions,” arXiv preprint arXiv:2309.07045, 2023.
  464. H. Lin, Z. Luo, B. Wang, R. Yang, and J. Ma, “Goat-bench: Safety insights to large multimodal models through meme-based social abuse,” arXiv preprint arXiv:2401.01523, 2024.
  465. X. Wang, X. Yi, H. Jiang, S. Zhou, Z. Wei, and X. Xie, “Tovilag: Your visual-language generative model is also an evildoer,” arXiv preprint arXiv:2312.11523, 2023.
  466. Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,” arXiv preprint arXiv:2311.05608, 2023.
  467. X. Liu, Y. Zhu, Y. Lan, C. Yang, and Y. Qiao, “Query-relevant images jailbreak large multi-modal models,” arXiv preprint arXiv:2311.17600, 2023.
  468. “midjourney,” https://www.midjourney.com/home.
  469. “Stability ai,” https://stability.ai/.
  470. “Gpt-4,” https://openai.com/gpt-4.
  471. “Dalle-2,” https://openai.com/dall-e-2.
  472. “Openai,” https://openai.com.
  473. “Pika labs,” https://www.pika.art/.
  474. “Gen2,” https://research.runwayml.com/gen2.
  475. “heygen,” https://app.heygen.com/home.
  476. “Azure ai-services: text-to-speech,” https://azure.microsoft.com/zh-cn/products/ai-services/text-to-speech.
  477. “descript,” https://www.descript.com/.
  478. “Suno ai,” https://suno-ai.org/.
  479. “Stability ai: Stable audio,” https://stability.ai/stable-audio.
  480. “Musicfx,” https://aitestkitchen.withgoogle.com/tools/music-fx.
  481. “tuneflow,” https://www.tuneflow.com/.
  482. “deepmusic,” https://www.deepmusic.fun/.
  483. “meta,” https://about.meta.com/.
  484. “Epic games’ metahuman creator,” https://www.unrealengine.com/en-US/metahuman.
  485. “Luma ai,” https://lumalabs.ai/.
  486. “Adobe,” https://www.adobe.com/.
  487. “Kaedim3d,” https://www.kaedim3d.com/.
  488. “Wonder studio,” https://wonderdynamics.com/.
  489. A. Avetisyan, C. Xie, H. Howard-Jenkins, T.-Y. Yang, S. Aroudj, S. Patra, F. Zhang, D. Frost, L. Holland, C. Orme, J. Engel, E. Miller, R. Newcombe, and V. Balntas, “Scenescript: Reconstructing scenes with an autoregressive structured language model,” 2024.
  490. “google,” https://www.google.com/.
  491. “tencent,” https://www.tencent.com/.
  492. Y. He, S. Yang, H. Chen, X. Cun, M. Xia, Y. Zhang, X. Wang, R. He, Q. Chen, and Y. Shan, “Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models,” in The Twelfth International Conference on Learning Representations, 2023.
  493. L. Guo, Y. He, H. Chen, M. Xia, X. Cun, Y. Wang, S. Huang, Y. Zhang, X. Wang, Q. Chen et al., “Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation,” arXiv preprint arXiv:2402.10491, 2024.
  494. Y. Xu, T. Park, R. Zhang, Y. Zhou, E. Shechtman, F. Liu, J.-B. Huang, and D. Liu, “Videogigagan: Towards detail-rich video super-resolution,” arXiv preprint arXiv:2404.12388, 2024.
  495. S. Zhou, P. Yang, J. Wang, Y. Luo, and C. C. Loy, “Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution,” arXiv preprint arXiv:2312.06640, 2023.
  496. R. S. Roman, Y. Adi, A. Deleforge, R. Serizel, G. Synnaeve, and A. Défossez, “From discrete tokens to high-fidelity audio using multi-band diffusion,” arXiv preprint arXiv:2308.02560, 2023.
  497. Y. Yao, P. Li, B. Chen, and A. Wang, “Jen-1 composer: A unified framework for high-fidelity multi-track music generation,” arXiv preprint arXiv:2310.19180, 2023.
  498. M. Ding, W. Zheng, W. Hong, and J. Tang, “Cogview2: Faster and better text-to-image generation via hierarchical transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 890–16 902, 2022.
  499. J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu et al., “Pixart-α𝛼\alphaitalic_α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” arXiv preprint arXiv:2310.00426, 2023.
  500. Y. Zhang, Y. Wei, X. Lin, Z. Hui, P. Ren, X. Xie, X. Ji, and W. Zuo, “Videoelevator: Elevating video generation quality with versatile text-to-image diffusion models,” arXiv preprint arXiv:2403.05438, 2024.
  501. R. Henschel, L. Khachatryan, D. Hayrapetyan, H. Poghosyan, V. Tadevosyan, Z. Wang, S. Navasardyan, and H. Shi, “Streamingt2v: Consistent, dynamic, and extendable long video generation from text,” arXiv preprint arXiv:2403.14773, 2024.
  502. R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, and I. Kemelmacher-Shlizerman, “Stylesdf: High-resolution 3d-consistent image and geometry generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 503–13 513.
  503. X. Huang, W. Li, J. Hu, H. Chen, and Y. Wang, “Refsr-nerf: Towards high fidelity and super resolution view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8244–8253.
  504. F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023.
  505. J. Yoo, S. Kim, D. Lee, C. Kim, and S. Hong, “Towards end-to-end generative modeling of long videos with memory-efficient bidirectional transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 888–22 897.
  506. L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint, and refine: Steerable long-term music audio generation and editing via content-based controls,” 2024.
  507. L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
  508. M. Zhao, R. Wang, F. Bao, C. Li, and J. Zhu, “Controlvideo: Adding conditional control for one shot text-to-video editing,” arXiv preprint arXiv:2305.17098, 2023.
  509. R. Liu, D. Garrette, C. Saharia, W. Chan, A. Roberts, S. Narang, I. Blok, R. Mical, M. Norouzi, and N. Constant, “Character-aware models improve visual text rendering,” arXiv preprint arXiv:2212.10562, 2022.
  510. J. Ma, M. Zhao, C. Chen, R. Wang, D. Niu, H. Lu, and X. Lin, “Glyphdraw: Learning to draw chinese characters in image synthesis models coherently,” arXiv preprint arXiv:2303.17870, 2023.
  511. C. Chen, X. Yang, F. Yang, C. Feng, Z. Fu, C.-S. Foo, G. Lin, and F. Liu, “Sculpt3d: Multi-view consistent text-to-3d generation with sparse 3d prior,” arXiv preprint arXiv:2403.09140, 2024.
  512. S. Woo, B. Park, H. Go, J.-Y. Kim, and C. Kim, “Harmonyview: Harmonizing consistency and diversity in one-image-to-3d,” arXiv preprint arXiv:2312.15980, 2023.
  513. J. Ye, P. Wang, K. Li, Y. Shi, and H. Wang, “Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models,” arXiv preprint arXiv:2310.03020, 2023.
  514. Q. Zuo, X. Gu, L. Qiu, Y. Dong, Z. Zhao, W. Yuan, R. Peng, S. Zhu, Z. Dong, L. Bo et al., “Videomv: Consistent multi-view generation based on large video generative model,” arXiv preprint arXiv:2403.12010, 2024.
  515. P. Wang, S. Wang, J. Lin, S. Bai, X. Zhou, J. Zhou, X. Wang, and C. Zhou, “One-peace: Exploring one general representation model toward unlimited modalities,” arXiv preprint arXiv:2305.11172, 2023.
  516. C. Boletsis, A. Lie, O. Prillard, K. Husby, and J. Li, “The invizar project: Augmented reality visualization for non-destructive testing data from jacket platforms,” 2023.
  517. S. Chen, H. Li, Q. Wang, Z. Zhao, M. Sun, X. Zhu, and J. Liu, “Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  518. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
  519. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
  520. B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.
  521. T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 318–30 332, 2022.
  522. Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev, “Low-bit quantization of neural networks for efficient inference,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).   IEEE, 2019, pp. 3009–3018.
  523. X. Wu, C. Li, R. Y. Aminabadi, Z. Yao, and Y. He, “Understanding int4 quantization for language models: latency speedup, composability, and failure cases,” in International Conference on Machine Learning.   PMLR, 2023, pp. 37 524–37 539.
  524. Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 3403–3417.
  525. D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
  526. H. Liu, W. Yan, M. Zaharia, and P. Abbeel, “World model on million-length video and language with ringattention,” arXiv preprint arXiv:2402.08268, 2024.
  527. Y. LeCun, “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,” Open Review, vol. 62, no. 1, 2022.
  528. C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, “Uniworld: Autonomous driving pre-training via world models,” arXiv preprint arXiv:2308.07234, 2023.
  529. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse domains through world models,” arXiv preprint arXiv:2301.04104, 2023.
Authors (16)
  1. Yingqing He
  2. Zhaoyang Liu
  3. Jingye Chen
  4. Zeyue Tian
  5. Hongyu Liu
  6. Xiaowei Chi
  7. Runtao Liu
  8. Ruibin Yuan
  9. Yazhou Xing
  10. Wenhai Wang
  11. Jifeng Dai
  12. Yong Zhang
  13. Wei Xue
  14. Qifeng Liu
  15. Yike Guo
  16. Qifeng Chen