LLMs Meet Multimodal Generation and Editing: A Survey (2405.19334v2)

Published 29 May 2024 in cs.AI, cs.CL, cs.CV, cs.MM, and cs.SD

Abstract: With the recent advancement in LLMs, there is a growing interest in combining LLMs with multimodal learning. Previous surveys of multimodal LLMs (MLLMs) mainly focus on multimodal understanding. This survey elaborates on multimodal generation and editing across various domains, comprising image, video, 3D, and audio. Specifically, we summarize the notable advancements with milestone works in these fields and categorize these studies into LLM-based and CLIP/T5-based methods. Then, we summarize the various roles of LLMs in multimodal generation and exhaustively investigate the critical technical components behind these methods and the multimodal datasets utilized in these studies. Additionally, we dig into tool-augmented multimodal agents that can leverage existing generative models for human-computer interaction. Lastly, we discuss the advancements in the generative AI safety field, investigate emerging applications, and discuss future prospects. Our work provides a systematic and insightful overview of multimodal generation and processing, which is expected to advance the development of Artificial Intelligence for Generative Content (AIGC) and world models. A curated list of all related papers can be found at https://github.com/YingqingHe/Awesome-LLMs-meet-Multimodal-Generation

  289. M. Deitke, D. Schwenk, J. Salvador, L. Weihs, O. Michel, E. VanderBilt, L. Schmidt, K. Ehsani, A. Kembhavi, and A. Farhadi, “Objaverse: A universe of annotated 3d objects,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 13 142–13 153.
  290. A. Ramesh, M. Pavlov, G. Goh, S. Gray, C. Voss, A. Radford, M. Chen, and I. Sutskever, “Zero-shot text-to-image generation,” in International Conference on Machine Learning.   PMLR, 2021, pp. 8821–8831.
  291. H. Wang, X. Du, J. Li, R. A. Yeh, and G. Shakhnarovich, “Score jacobian chaining: Lifting pretrained 2d diffusion models for 3d generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 619–12 629.
  292. B. Mildenhall, P. P. Srinivasan, M. Tancik, J. T. Barron, R. Ramamoorthi, and R. Ng, “Nerf: Representing scenes as neural radiance fields for view synthesis,” Communications of the ACM, vol. 65, no. 1, pp. 99–106, 2021.
  293. P. Wang, L. Liu, Y. Liu, C. Theobalt, T. Komura, and W. Wang, “Neus: Learning neural implicit surfaces by volume rendering for multi-view reconstruction,” arXiv preprint arXiv:2106.10689, 2021.
  294. T. Shen, J. Gao, K. Yin, M.-Y. Liu, and S. Fidler, “Deep marching tetrahedra: a hybrid representation for high-resolution 3d shape synthesis,” Advances in Neural Information Processing Systems, vol. 34, pp. 6087–6101, 2021.
  295. M. Deitke, R. Liu, M. Wallingford, H. Ngo, O. Michel, A. Kusupati, A. Fan, C. Laforte, V. Voleti, S. Y. Gadre et al., “Objaverse-xl: A universe of 10m+ 3d objects,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  296. J. Reizenstein, R. Shapovalov, P. Henzler, L. Sbordone, P. Labatut, and D. Novotny, “Common objects in 3d: Large-scale learning and evaluation of real-life 3d category reconstruction,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2021, pp. 10 901–10 911.
  297. A. Nichol, H. Jun, P. Dhariwal, P. Mishkin, and M. Chen, “Point-e: A system for generating 3d point clouds from complex prompts,” arXiv preprint arXiv:2212.08751, 2022.
  298. J. Lei, Y. Zhang, K. Jia et al., “Tango: Text-driven photorealistic and robust 3d stylization via lighting decomposition,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 923–30 936, 2022.
  299. Y. Ma, X. Zhang, X. Sun, J. Ji, H. Wang, G. Jiang, W. Zhuang, and R. Ji, “X-mesh: Towards fast and accurate text-driven 3d stylization via dynamic textual guidance,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 2749–2760.
  300. L. Dinh, J. Sohl-Dickstein, and S. Bengio, “Density estimation using real nvp,” arXiv preprint arXiv:1605.08803, 2016.
  301. A. Jain, B. Mildenhall, J. T. Barron, P. Abbeel, and B. Poole, “Zero-shot text-guided object generation with dream fields,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 867–876.
  302. F. Yin, X. Chen, C. Zhang, B. Jiang, Z. Zhao, J. Fan, G. Yu, T. Li, and T. Chen, “Shapegpt: 3d shape generation with a unified multi-modal language model,” arXiv preprint arXiv:2311.17618, 2023.
  303. G. Tevet, B. Gordon, A. Hertz, A. H. Bermano, and D. Cohen-Or, “Motionclip: Exposing human motion generation to clip space,” in European Conference on Computer Vision.   Springer, 2022, pp. 358–374.
  304. B. Jiang, X. Chen, W. Liu, J. Yu, G. Yu, and T. Chen, “Motiongpt: Human motion as a foreign language,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  305. Z. Zhou and S. Tulsiani, “Sparsefusion: Distilling view-conditioned diffusion for 3d reconstruction,” in CVPR, 2023.
  306. Z. Wan, D. Paschalidou, I. Huang, H. Liu, B. Shen, X. Xiang, J. Liao, and L. Guibas, “Cad: Photorealistic 3d generation via adversarial distillation,” arXiv preprint arXiv:2312.06663, 2023.
  307. B. Yang, W. Dong, L. Ma, W. Hu, X. Liu, Z. Cui, and Y. Ma, “Dreamspace: Dreaming your room space with text-driven panoramic texture propagation,” in 2024 IEEE Conference Virtual Reality and 3D User Interfaces (VR).   IEEE, 2024, pp. 650–660.
  308. G. Metzer, E. Richardson, O. Patashnik, R. Giryes, and D. Cohen-Or, “Latent-nerf for shape-guided generation of 3d shapes and textures,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 12 663–12 673.
  309. O. Katzir, O. Patashnik, D. Cohen-Or, and D. Lischinski, “Noise-free score distillation,” 2023.
  310. M. Armandpour, H. Zheng, A. Sadeghian, A. Sadeghian, and M. Zhou, “Re-imagine the negative prompt algorithm: Transform 2d diffusion into 3d, alleviate janus problem and beyond,” arXiv preprint arXiv:2304.04968, 2023.
  311. L. Zhou, A. Shih, C. Meng, and S. Ermon, “Dreampropeller: Supercharge text-to-3d generation with parallel sampling,” arXiv preprint arXiv:2311.17082, 2023.
  312. C. Yu, G. Lu, Y. Zeng, J. Sun, X. Liang, H. Li, Z. Xu, S. Xu, W. Zhang, and H. Xu, “Towards high-fidelity text-guided 3d face generation and manipulation using only images,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 15 326–15 337.
  313. C. Zhang, Y. Chen, Y. Fu, Z. Zhou, G. Yu, B. Wang, B. Fu, T. Chen, G. Lin, and C. Shen, “Styleavatar3d: Leveraging image-text diffusion models for high-fidelity 3d avatar generation,” arXiv preprint arXiv:2305.19012, 2023.
  314. T. Wang, B. Zhang, T. Zhang, S. Gu, J. Bao, T. Baltrusaitis, J. Shen, D. Chen, F. Wen, Q. Chen et al., “Rodin: A generative model for sculpting 3d digital avatars using diffusion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4563–4573.
  315. S. Aneja, J. Thies, A. Dai, and M. Nießner, “Clipface: Text-guided editing of textured 3d morphable models,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
  316. M. Wu, H. Zhu, L. Huang, Y. Zhuang, Y. Lu, and X. Cao, “High-fidelity 3d face generation from natural language descriptions,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 4521–4530.
  317. T. Liao, H. Yi, Y. Xiu, J. Tang, Y. Huang, J. Thies, and M. J. Black, “Tada! text to animatable digital avatars,” arXiv preprint arXiv:2308.10899, 2023.
  318. S. Huang, Z. Yang, L. Li, Y. Yang, and J. Jia, “Avatarfusion: Zero-shot generation of clothing-decoupled 3d avatars using 2d diffusion,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 5734–5745.
  319. X. Han, Y. Cao, K. Han, X. Zhu, J. Deng, Y.-Z. Song, T. Xiang, and K.-Y. K. Wong, “Headsculpt: Crafting 3d head avatars with text,” arXiv preprint arXiv:2306.03038, 2023.
  320. Y. Cao, Y.-P. Cao, K. Han, Y. Shan, and K.-Y. K. Wong, “Dreamavatar: Text-and-shape guided 3d human avatar generation via diffusion models,” arXiv preprint arXiv:2304.00916, 2023.
  321. H. Zhang, B. Chen, H. Yang, L. Qu, X. Wang, L. Chen, C. Long, F. Zhu, K. Du, and M. Zheng, “Avatarverse: High-quality & stable 3d avatar creation from text and pose,” arXiv preprint arXiv:2308.03610, 2023.
  322. L. Zhang, Q. Qiu, H. Lin, Q. Zhang, C. Shi, W. Yang, Y. Shi, S. Yang, L. Xu, and J. Yu, “Dreamface: Progressive generation of animatable 3d faces under text guidance,” arXiv preprint arXiv:2304.03117, 2023.
  323. F. Hong, M. Zhang, L. Pan, Z. Cai, L. Yang, and Z. Liu, “Avatarclip: Zero-shot text-driven generation and animation of 3d avatars,” arXiv preprint arXiv:2205.08535, 2022.
  324. N. Kolotouros, T. Alldieck, A. Zanfir, E. G. Bazavan, M. Fieraru, and C. Sminchisescu, “Dreamhuman: Animatable 3d avatars from text,” arXiv preprint arXiv:2306.09329, 2023.
  325. X. Huang, R. Shao, Q. Zhang, H. Zhang, Y. Feng, Y. Liu, and Q. Wang, “Humannorm: Learning normal diffusion model for high-quality and realistic 3d human generation,” arXiv preprint arXiv:2310.01406, 2023.
  326. Y. Zeng, Y. Lu, X. Ji, Y. Yao, H. Zhu, and X. Cao, “Avatarbooth: High-quality and customizable 3d human avatar generation,” arXiv preprint arXiv:2306.09864, 2023.
  327. D. Wang, H. Meng, Z. Cai, Z. Shao, Q. Liu, L. Wang, M. Fan, Y. Shan, X. Zhan, and Z. Wang, “Headevolver: Text to head avatars via locally learnable mesh deformation,” arXiv preprint arXiv:2403.09326, 2024.
  328. H. Liu, X. Wang, Z. Wan, Y. Shen, Y. Song, J. Liao, and Q. Chen, “Headartist: Text-conditioned 3d head generation with self score distillation,” arXiv preprint arXiv:2312.07539, 2023.
  329. Y. Shi, P. Wang, J. Ye, M. Long, K. Li, and X. Yang, “Mvdream: Multi-view diffusion for 3d generation,” arXiv preprint arXiv:2308.16512, 2023.
  330. Y. Kant, Z. Wu, M. Vasilkovsky, G. Qian, J. Ren, R. A. Guler, B. Ghanem, S. Tulyakov, I. Gilitschenski, and A. Siarohin, “Spad: Spatially aware multiview diffusers,” arXiv preprint arXiv:2402.05235, 2024.
  331. Z. Liu, Y. Li, Y. Lin, X. Yu, S. Peng, Y.-P. Cao, X. Qi, X. Huang, D. Liang, and W. Ouyang, “Unidream: Unifying diffusion priors for relightable text-to-3d generation,” 2023.
  332. L. Qiu, G. Chen, X. Gu, Q. zuo, M. Xu, Y. Wu, W. Yuan, Z. Dong, L. Bo, and X. Han, “Richdreamer: A generalizable normal-depth diffusion model for detail richness in text-to-3d,” arXiv preprint arXiv:2311.16918, 2023.
  333. J. Li, H. Tan, K. Zhang, Z. Xu, F. Luan, Y. Xu, Y. Hong, K. Sunkavalli, G. Shakhnarovich, and S. Bi, “Instant3d: Fast text-to-3d with sparse-view generation and large reconstruction model,” arXiv preprint arXiv:2311.06214, 2023.
  334. J. Tang, Z. Chen, X. Chen, T. Wang, G. Zeng, and Z. Liu, “Lgm: Large multi-view gaussian model for high-resolution 3d content creation,” arXiv preprint arXiv:2402.05054, 2024.
  335. X. Yinghao, S. Zifan, Y. Wang, C. Hansheng, Y. Ceyuan, P. Sida, S. Yujun, and W. Gordon, “Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation,” 2024.
  336. H. Jun and A. Nichol, “Shap-e: Generating conditional 3d implicit functions,” arXiv preprint arXiv:2305.02463, 2023.
  337. Z. Hu, A. Iscen, A. Jain, T. Kipf, Y. Yue, D. A. Ross, C. Schmid, and A. Fathi, “Scenecraft: An llm agent for synthesizing 3d scene as blender code,” arXiv preprint arXiv:2403.01248, 2024.
  338. R. Xu, X. Wang, T. Wang, Y. Chen, J. Pang, and D. Lin, “Pointllm: Empowering large language models to understand point clouds,” arXiv preprint arXiv:2308.16911, 2023.
  339. Y. Hong, H. Zhen, P. Chen, S. Zheng, Y. Du, Z. Chen, and C. Gan, “3d-llm: Injecting the 3d world into large language models,” arXiv preprint arXiv:2307.12981, 2023.
  340. O. Gordon, O. Avrahami, and D. Lischinski, “Blended-nerf: Zero-shot object generation and blending in existing neural radiance fields,” arXiv preprint arXiv:2306.12760, 2023.
  341. W. Gao, N. Aigerman, T. Groueix, V. Kim, and R. Hanocka, “Textdeformer: Geometry manipulation using text guidance,” in ACM SIGGRAPH 2023 Conference Proceedings, 2023, pp. 1–11.
  342. C. Bao, Y. Zhang, B. Yang, T. Fan, Z. Yang, H. Bao, G. Zhang, and Z. Cui, “Sine: Semantic-driven image-based nerf editing with prior-guided editing field,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 20 919–20 929.
  343. A. Mikaeili, O. Perel, M. Safaee, D. Cohen-Or, and A. Mahdavi-Amiri, “Sked: Sketch-guided text-based 3d editing,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2023, pp. 14 607–14 619.
  344. J. Zhuang, C. Wang, L. Lin, L. Liu, and G. Li, “Dreameditor: Text-driven 3d scene editing with neural fields,” in SIGGRAPH Asia 2023 Conference Papers, 2023, pp. 1–10.
  345. A. Haque, M. Tancik, A. A. Efros, A. Holynski, and A. Kanazawa, “Instruct-nerf2nerf: Editing 3d scenes with instructions,” arXiv preprint arXiv:2303.12789, 2023.
  346. D. Decatur, I. Lang, K. Aberman, and R. Hanocka, “3d paintbrush: Local stylization of 3d shapes with cascaded score distillation,” arXiv preprint arXiv:2311.09571, 2023.
  347. G. Tevet, S. Raab, B. Gordon, Y. Shafir, D. Cohen-or, and A. H. Bermano, “Human motion diffusion model,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=SJ1kSyO2jwu
  348. R. Chen, Y. Chen, N. Jiao, and K. Jia, “Fantasia3d: Disentangling geometry and appearance for high-quality text-to-3d content creation,” arXiv preprint arXiv:2303.13873, 2023.
  349. Z. Pan, J. Lu, X. Zhu, and L. Zhang, “Enhancing high-resolution 3d generation through pixel-wise gradient clipping,” in International Conference on Learning Representations (ICLR), 2024.
  350. G. Qian, J. Cao, A. Siarohin, Y. Kant, C. Wang, M. Vasilkovsky, H.-Y. Lee, Y. Fang, I. Skorokhodov, P. Zhuang et al., “Atom: Amortized text-to-mesh using 2d diffusion,” arXiv preprint arXiv:2402.00867, 2024.
  351. Z. Wu, P. Zhou, X. Yi, X. Yuan, and H. Zhang, “Consistent3d: Towards consistent high-fidelity text-to-3d generation with deterministic sampling prior,” arXiv preprint arXiv:2401.09050, 2024.
  352. T. Huang, Y. Zeng, Z. Zhang, W. Xu, H. Xu, S. Xu, R. W. Lau, and W. Zuo, “Dreamcontrol: Control-based text-to-3d generation with 3d self-prior,” arXiv preprint arXiv:2312.06439, 2023.
  353. Y. Chen, C. Zhang, X. Yang, Z. Cai, G. Yu, L. Yang, and G. Lin, “It3d: Improved text-to-3d generation with explicit view synthesis,” 2023.
  354. M. Zhao, C. Zhao, X. Liang, L. Li, Z. Zhao, Z. Hu, C. Fan, and X. Yu, “Efficientdreamer: High-fidelity and robust 3d creation via orthogonal-view diffusion prior,” arXiv preprint arXiv:2308.13223, 2023.
  355. Z. Chen, F. Wang, and H. Liu, “Text-to-3d using gaussian splatting,” arXiv preprint arXiv:2309.16585, 2023.
  356. Y. Ma, Y. Fan, J. Ji, H. Wang, X. Sun, G. Jiang, A. Shu, and R. Ji, “X-dreamer: Creating high-quality 3d content by bridging the domain gap between text-to-2d and text-to-3d generation,” arXiv preprint arXiv:2312.00085, 2023.
  357. J. Wu, X. Gao, X. Liu, Z. Shen, C. Zhao, H. Feng, J. Liu, and E. Ding, “Hd-fusion: Detailed text-to-3d generation leveraging multiple noise estimation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 3202–3211.
  358. X. Yang, Y. Chen, C. Chen, C. Zhang, Y. Xu, X. Yang, F. Liu, and G. Lin, “Learn to optimize denoising scores for 3d generation: A unified and improved diffusion prior on nerf and 3d gaussian splatting,” arXiv preprint arXiv:2312.04820, 2023.
  359. F. Liu, D. Wu, Y. Wei, Y. Rao, and Y. Duan, “Sherpa3d: Boosting high-fidelity text-to-3d generation via coarse 3d prior,” 2023.
  360. Y. Lin, R. Clark, and P. Torr, “Dreampolisher: Towards high-quality text-to-3d generation via geometric diffusion,” arXiv preprint arXiv:2403.17237, 2024.
  361. Y. Yang, F.-Y. Sun, L. Weihs, E. VanderBilt, A. Herrasti, W. Han, J. Wu, N. Haber, R. Krishna, L. Liu, C. Callison-Burch, M. Yatskar, A. Kembhavi, and C. Clark, “Holodeck: Language guided generation of 3d embodied ai environments,” arXiv preprint arXiv:2312.09067, 2023.
  362. H. Song, S. Choi, H. Do, C. Lee, and T. Kim, “Blending-nerf: Text-driven localized editing in neural radiance fields,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 14 383–14 393.
  363. R. He, S. Huang, X. Nie, T. Hui, L. Liu, J. Dai, J. Han, G. Li, and S. Liu, “Customize your nerf: Adaptive source driven 3d scene editing via local-global iterative training,” arXiv preprint arXiv:2312.01663, 2023.
  364. X. Zeng, X. Chen, Z. Qi, W. Liu, Z. Zhao, Z. Wang, B. FU, Y. Liu, and G. Yu, “Paint3d: Paint anything 3d with lighting-less texture diffusion models,” 2023.
  365. E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging.” in ISMIR.   Citeseer, 2009, pp. 387–392.
  366. V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an asr corpus based on public domain audio books,” in 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2015, pp. 5206–5210.
  367. J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in 2017 IEEE international conference on acoustics, speech and signal processing (ICASSP).   IEEE, 2017, pp. 776–780.
  368. C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factorized piano music modeling and generation with the maestro dataset,” arXiv preprint arXiv:1810.12247, 2018.
  369. H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “Libritts: A corpus derived from librispeech for text-to-speech,” arXiv preprint arXiv:1904.02882, 2019.
  370. D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The mtg-jamendo dataset for automatic music tagging.”   ICML, 2019.
  371. J. Kahn, M. Riviere, W. Zheng, E. Kharitonov, Q. Xu, P.-E. Mazaré, J. Karadayi, V. Liptchinsky, R. Collobert, C. Fuegen et al., “Libri-light: A benchmark for asr with limited or no supervision,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 7669–7673.
  372. H. Chen, W. Xie, A. Vedaldi, and A. Zisserman, “Vggsound: A large-scale audio-visual dataset,” in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2020, pp. 721–725.
  373. B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng et al., “Wenetspeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2022, pp. 6182–6186.
  374. W. Kang, X. Yang, Z. Yao, F. Kuang, Y. Yang, L. Guo, L. Lin, and D. Povey, “Libriheavy: a 50,000 hours asr corpus with punctuation casing and context,” in ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).   IEEE, 2024, pp. 10 991–10 995.
  375. J. Zhan, J. Dai, J. Ye, Y. Zhou, D. Zhang, Z. Liu, X. Zhang, R. Yuan, G. Zhang, L. Li et al., “Anygpt: Unified multimodal llm with discrete sequence modeling,” arXiv preprint arXiv:2402.12226, 2024.
  376. H. Hao, L. Zhou, S. Liu, J. Li, S. Hu, R. Wang, and F. Wei, “Boosting large language model for speech synthesis: An empirical study,” arXiv preprint arXiv:2401.00246, 2023.
  377. J. Lu, C. Clark, S. Lee, Z. Zhang, S. Khosla, R. Marten, D. Hoiem, and A. Kembhavi, “Unified-io 2: Scaling autoregressive multimodal models with vision, language, audio, and action,” arXiv preprint arXiv:2312.17172, 2023.
  378. S. Bubeck, V. Chandrasekaran, R. Eldan, J. Gehrke, E. Horvitz, E. Kamar, P. Lee, Y. T. Lee, Y. Li, S. Lundberg et al., “Sparks of artificial general intelligence: Early experiments with gpt-4,” arXiv preprint arXiv:2303.12712, 2023.
  379. J. Wu, Y. Gaur, Z. Chen, L. Zhou, Y. Zhu, T. Wang, J. Li, S. Liu, B. Ren, L. Liu et al., “On decoder-only architecture for speech-to-text and large language model integration,” in 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU).   IEEE, 2023, pp. 1–8.
  380. W. Yu, C. Tang, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Connecting speech encoder and large language model for asr,” arXiv preprint arXiv:2309.13963, 2023.
  381. S. Wang, C.-H. H. Yang, J. Wu, and C. Zhang, “Can whisper perform speech-based in-context learning,” arXiv preprint arXiv:2309.07081, 2023.
  382. Y. Gong, Y.-A. Chung, and J. Glass, “Ast: Audio spectrogram transformer,” arXiv preprint arXiv:2104.01778, 2021.
  383. H. W. Chung, L. Hou, S. Longpre, B. Zoph, Y. Tay, W. Fedus, Y. Li, X. Wang, M. Dehghani, S. Brahma et al., “Scaling instruction-finetuned language models,” arXiv preprint arXiv:2210.11416, 2022.
  384. R. Anil, A. M. Dai, O. Firat, M. Johnson, D. Lepikhin, A. Passos, S. Shakeri, E. Taropa, P. Bailey, Z. Chen et al., “Palm 2 technical report,” arXiv preprint arXiv:2305.10403, 2023.
  385. A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning.   PMLR, 2023, pp. 28 492–28 518.
  386. S. Kakouros, J. Šimko, M. Vainio, and A. Suni, “Investigating the utility of surprisal from large language models for speech synthesis prosody,” arXiv preprint arXiv:2306.09814, 2023.
  387. Y. Gong, A. Rouditchenko, A. H. Liu, D. Harwath, L. Karlinsky, H. Kuehne, and J. Glass, “Contrastive audio-visual masked autoencoder,” arXiv preprint arXiv:2210.07839, 2022.
  388. Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos, “Musilingo: Bridging music and text with pre-trained language models for music captioning and query response,” arXiv preprint arXiv:2309.08730, 2023.
  389. Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Lin, A. Ragni, E. Benetos, N. Gyenge et al., “Mert: Acoustic music understanding model with large-scale self-supervised training,” arXiv preprint arXiv:2306.00107, 2023.
  390. A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
  391. R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved rvqgan,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  392. J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” in 2009 IEEE conference on computer vision and pattern recognition.   Ieee, 2009, pp. 248–255.
  393. Y. Wang, Y. Kordi, S. Mishra, A. Liu, N. A. Smith, D. Khashabi, and H. Hajishirzi, “Self-instruct: Aligning language models with self-generated instructions,” arXiv preprint arXiv:2212.10560, 2022.
  394. Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” arXiv preprint arXiv:2307.16789, 2023.
  395. Q. Tang, Z. Deng, H. Lin, X. Han, Q. Liang, and L. Sun, “Toolalpaca: Generalized tool learning for language models with 3000 simulated cases,” arXiv preprint arXiv:2306.05301, 2023.
  396. T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: Language models can teach themselves to use tools,” arXiv preprint arXiv:2302.04761, 2023.
  397. R. Yang, L. Song, Y. Li, S. Zhao, Y. Ge, X. Li, and Y. Shan, “Gpt4tools: Teaching large language model to use tools via self-instruction,” in Advances in Neural Information Processing Systems, 2023.
  398. N. Farn and R. Shin, “Tooltalk: Evaluating tool-usage in a conversation setting,” arXiv preprint arXiv:2311.10775, 2023.
  399. S. Hao, T. Liu, Z. Wang, and Z. Hu, “Toolkengpt: Augmenting frozen language models with massive tools via tool embeddings,” arXiv preprint arXiv:2305.11554, 2023.
  400. C.-Y. Hsieh, S.-A. Chen, C.-L. Li, Y. Fujii, A. Ratner, C.-Y. Lee, R. Krishna, and T. Pfister, “Tool documentation enables zero-shot tool-usage with large language models,” arXiv preprint arXiv:2308.00675, 2023.
  401. J. Ruan, Y. Chen, B. Zhang, Z. Xu, T. Bao, G. Du, S. Shi, H. Mao, X. Zeng, and R. Zhao, “Tptu: Task planning and tool usage of large language model-based ai agents,” arXiv preprint arXiv:2308.03427, 2023.
  402. Z. Liu, Z. Lai, Z. Gao, E. Cui, Z. Li, X. Zhu, L. Lu, Q. Chen, Y. Qiao, J. Dai, and W. Wang, “Controlllm: Augment language models with tools by searching on graphs,” arXiv preprint arXiv:2310.17796, 2023.
  403. A. Parisi, Y. Zhao, and N. Fiedel, “Talm: Tool augmented language models,” arXiv preprint arXiv:2205.12255, 2022.
  404. J. Zhang, “Graph-toolformer: To empower llms with graph reasoning ability via prompt augmented by chatgpt,” arXiv preprint arXiv:2304.11116, 2023.
  405. Y. Zhuang, X. Chen, T. Yu, S. Mitra, V. Bursztyn, R. A. Rossi, S. Sarkhel, and C. Zhang, “Toolchain*: Efficient action space navigation in large language models with a* search,” arXiv preprint arXiv:2310.13227, 2023.
  406. Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen, “Critic: Large language models can self-correct with tool-interactive critiquing,” arXiv preprint arXiv:2305.11738, 2023.
  407. Q. Jin, Y. Yang, Q. Chen, and Z. Lu, “Genegpt: Augmenting large language models with domain tools for improved access to biomedical information,” ArXiv, 2023.
  408. B. Paranjape, S. Lundberg, S. Singh, H. Hajishirzi, L. Zettlemoyer, and M. T. Ribeiro, “Art: Automatic multi-step reasoning and tool-use for large language models,” arXiv preprint arXiv:2303.09014, 2023.
  409. Z. Gou, Z. Shao, Y. Gong, Y. Yang, M. Huang, N. Duan, W. Chen et al., “Tora: A tool-integrated reasoning agent for mathematical problem solving,” arXiv preprint arXiv:2309.17452, 2023.
  410. Y. Song, W. Xiong, D. Zhu, C. Li, K. Wang, Y. Tian, and S. Li, “Restgpt: Connecting large language models with real-world applications via restful apis,” arXiv preprint arXiv:2306.06624, 2023.
  411. S. Qiao, H. Gui, H. Chen, and N. Zhang, “Making language models better tool learners with execution feedback,” arXiv preprint arXiv:2305.13068, 2023.
  412. K. Zhang, H. Chen, L. Li, and W. Wang, “Syntax error-free and generalizable tool use for llms via finite-state decoding,” arXiv preprint arXiv:2310.07075, 2023.
  413. C. Li, H. Chen, M. Yan, W. Shen, H. Xu, Z. Wu, Z. Zhang, W. Zhou, Y. Chen, C. Cheng et al., “Modelscope-agent: Building your customizable agent system with open-source large language models,” arXiv preprint arXiv:2309.00986, 2023.
  414. T. Gupta and A. Kembhavi, “Visual programming: Compositional visual reasoning without training,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 14 953–14 962.
  415. C. Wang, W. Luo, Q. Chen, H. Mai, J. Guo, S. Dong, X. M. Xuan, Z. Li, L. Ma, and S. Gao, “Mllm-tool: A multimodal large language model for tool agent learning,” arXiv preprint arXiv:2401.10727, 2024.
  416. D. Surís, S. Menon, and C. Vondrick, “Vipergpt: Visual inference via python execution for reasoning,” Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023.
  417. Z. Gao, Y. Du, X. Zhang, X. Ma, W. Han, S.-C. Zhu, and Q. Li, “Clova: A closed-loop visual assistant with tool usage and update,” arXiv preprint arXiv:2312.10908, 2023.
  418. S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023.
  419. J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V. Le, D. Zhou et al., “Chain-of-thought prompting elicits reasoning in large language models,” Advances in Neural Information Processing Systems, vol. 35, pp. 24 824–24 837, 2022.
  420. A. Kirillov, E. Mintun, N. Ravi, H. Mao, C. Rolland, L. Gustafson, T. Xiao, S. Whitehead, A. C. Berg, W.-Y. Lo et al., “Segment anything,” arXiv preprint arXiv:2304.02643, 2023.
  421. S. Zhang, S. Roller, N. Goyal, M. Artetxe, M. Chen, S. Chen, C. Dewan, M. Diab, X. Li, X. V. Lin, T. Mihaylov, M. Ott, S. Shleifer, K. Shuster, D. Simig, P. S. Koura, A. Sridhar, T. Wang, and L. Zettlemoyer, “Opt: Open pre-trained transformer language models,” 2022.
  422. T. Dettmers, A. Pagnoni, A. Holtzman, and L. Zettlemoyer, “Qlora: Efficient finetuning of quantized llms,” arXiv preprint arXiv:2305.14314, 2023.
  423. S. Mangrulkar, S. Gugger, L. Debut, Y. Belkada, S. Paul, and B. Bossan, “Peft: State-of-the-art parameter-efficient fine-tuning methods,” https://github.com/huggingface/peft, 2022.
  424. J. Chen, X. Li, X. Ye, C. Li, Z. Fan, and H. Zhao, “Idea-2-3d: Collaborative lmm agents enable 3d model generation from interleaved multimodal inputs,” arXiv preprint arXiv:2404.04363, 2024.
  425. E. Wallace, S. Feng, N. Kandpal, M. Gardner, and S. Singh, “Universal adversarial triggers for attacking and analyzing nlp,” arXiv preprint arXiv:1908.07125, 2019.
  426. X. Fu, Z. Wang, S. Li, R. K. Gupta, N. Mireshghallah, T. Berg-Kirkpatrick, and E. Fernandes, “Misusing tools in large language models with visual adversarial examples,” arXiv preprint arXiv:2310.03185, 2023.
  427. L. Bailey, E. Ong, S. Russell, and S. Emmons, “Image hijacks: Adversarial images can control generative models at runtime,” arXiv preprint arXiv:2309.00236, 2023.
  428. A. Zou, Z. Wang, J. Z. Kolter, and M. Fredrikson, “Universal and transferable adversarial attacks on aligned language models,” arXiv preprint arXiv:2307.15043, 2023.
  429. E. Jones, A. Dragan, A. Raghunathan, and J. Steinhardt, “Automatically auditing large language models via discrete optimization,” in International Conference on Machine Learning.   PMLR, 2023, pp. 15 307–15 329.
  430. P. Żelasko, S. Joshi, Y. Shao, J. Villalba, J. Trmal, N. Dehak, and S. Khudanpur, “Adversarial attacks and defenses for speech recognition systems,” arXiv preprint arXiv:2103.17122, 2021.
  431. Z. Chen, L. Xie, S. Pang, Y. He, and Q. Tian, “Appending adversarial frames for universal video attack,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3199–3208.
  432. H. Liu, W. Zhou, D. Chen, H. Fang, H. Bian, K. Liu, W. Zhang, and N. Yu, “Coherent adversarial deepfake video generation,” Signal Processing, vol. 203, p. 108790, 2023.
  433. S.-Y. Lo and V. M. Patel, “Defending against multiple and unforeseen adversarial videos,” IEEE Transactions on Image Processing, vol. 31, pp. 962–973, 2021.
  434. H. J. Lee and Y. M. Ro, “Defending video recognition model against adversarial perturbations via defense patterns,” IEEE Transactions on Dependable and Secure Computing, 2023.
  435. Y. Wu, X. Li, Y. Liu, P. Zhou, and L. Sun, “Jailbreaking gpt-4v via self-adversarial attacks with system prompts,” arXiv preprint arXiv:2311.09127, 2023.
  436. Y. Xie, J. Yi, J. Shao, J. Curl, L. Lyu, Q. Chen, X. Xie, and F. Wu, “Defending chatgpt against jailbreak attack via self-reminders,” Nature Machine Intelligence, vol. 5, no. 12, pp. 1486–1496, 2023.
  437. Y. Liu, G. Deng, Y. Li, K. Wang, T. Zhang, Y. Liu, H. Wang, Y. Zheng, and Y. Liu, “Prompt injection attack against llm-integrated applications,” arXiv preprint arXiv:2306.05499, 2023.
  438. F. Perez and I. Ribeiro, “Ignore previous prompt: Attack techniques for language models,” arXiv preprint arXiv:2211.09527, 2022.
  439. N. Carlini, M. Jagielski, C. A. Choquette-Choo, D. Paleka, W. Pearce, H. Anderson, A. Terzis, K. Thomas, and F. Tramèr, “Poisoning web-scale training datasets is practical,” arXiv preprint arXiv:2302.10149, 2023.
  440. R. Jia and P. Liang, “Adversarial examples for evaluating reading comprehension systems,” arXiv preprint arXiv:1707.07328, 2017.
  441. M.-H. Van and X. Wu, “Detecting and correcting hate speech in multimodal memes with large visual language model,” arXiv preprint arXiv:2311.06737, 2023.
  442. Z. Wei, Y. Wang, and Y. Wang, “Jailbreak and guard aligned language models with only few in-context demonstrations,” arXiv preprint arXiv:2310.06387, 2023.
  443. A. Robey, E. Wong, H. Hassani, and G. J. Pappas, “Smoothllm: Defending large language models against jailbreaking attacks,” arXiv preprint arXiv:2310.03684, 2023.
  444. R. Liu, A. Khakzar, J. Gu, Q. Chen, P. Torr, and F. Pizzati, “Latent guard: a safety framework for text-to-image generation,” arXiv preprint arXiv:2404.08031, 2024.
  445. J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal policy optimization algorithms,” arXiv preprint arXiv:1707.06347, 2017.
  446. R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn, “Direct preference optimization: Your language model is secretly a reward model,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  447. R. Pi, T. Han, W. Xiong, J. Zhang, R. Liu, R. Pan, and T. Zhang, “Strengthening multimodal large language model with bootstrapped preference optimization,” arXiv preprint arXiv:2403.08730, 2024.
  448. X. Wu, K. Sun, F. Zhu, R. Zhao, and H. Li, “Better aligning text-to-image models with human preference,” arXiv preprint arXiv:2303.14420, 2023.
  449. H. Dong, W. Xiong, D. Goyal, R. Pan, S. Diao, J. Zhang, K. Shum, and T. Zhang, “Raft: Reward ranked finetuning for generative foundation model alignment,” arXiv preprint arXiv:2304.06767, 2023.
  450. P. Korshunov and S. Marcel, “Deepfakes: A new threat to face recognition? assessment and detection. arxiv 2018,” arXiv preprint arXiv:1812.08685.
  451. Y. Mirsky and W. Lee, “The creation and detection of deepfakes: A survey,” ACM computing surveys (CSUR), vol. 54, no. 1, pp. 1–41, 2021.
  452. M. Masood, M. Nawaz, K. M. Malik, A. Javed, A. Irtaza, and H. Malik, “Deepfakes generation and detection: State-of-the-art, open challenges, countermeasures, and way forward,” Applied intelligence, vol. 53, no. 4, pp. 3974–4026, 2023.
  453. L. Verdoliva, “Media forensics and deepfakes: an overview,” IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 910–932, 2020.
  454. D. Wodajo, S. Atnafu, and Z. Akhtar, “Deepfake video detection using generative convolutional vision transformer,” arXiv preprint arXiv:2307.07036, 2023.
  455. D. Wodajo and S. Atnafu, “Deepfake video detection using convolutional vision transformer,” arXiv preprint arXiv:2102.11126, 2021.
  456. S. Hussain, P. Neekhara, M. Jere, F. Koushanfar, and J. McAuley, “Adversarial deepfakes: Evaluating vulnerability of deepfake detectors to adversarial examples,” in Proceedings of the IEEE/CVF winter conference on applications of computer vision, 2021, pp. 3348–3357.
  457. W. Shi, A. Ajith, M. Xia, Y. Huang, D. Liu, T. Blevins, D. Chen, and L. Zettlemoyer, “Detecting pretraining data from large language models,” arXiv preprint arXiv:2310.16789, 2023.
  458. S. M. Park, K. Georgiev, A. Ilyas, G. Leclerc, and A. Madry, “Trak: Attributing model behavior at scale,” arXiv preprint arXiv:2303.14186, 2023.
  459. Z. Wang, C. Chen, Y. Zeng, L. Lyu, and S. Ma, “Where did i come from? origin attribution of ai-generated images,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  460. J. Kirchenbauer, J. Geiping, Y. Wen, J. Katz, I. Miers, and T. Goldstein, “A watermark for large language models,” in International Conference on Machine Learning.   PMLR, 2023, pp. 17 061–17 084.
  461. Y. Cui, J. Ren, H. Xu, P. He, H. Liu, L. Sun, and J. Tang, “Diffusionshield: A watermark for copyright protection against generative diffusion models,” arXiv preprint arXiv:2306.04642, 2023.
  462. P. Fernandez, G. Couairon, H. Jégou, M. Douze, and T. Furon, “The stable signature: Rooting watermarks in latent diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 22 466–22 477.
  463. Z. Zhang, L. Lei, L. Wu, R. Sun, Y. Huang, C. Long, X. Liu, X. Lei, J. Tang, and M. Huang, “Safetybench: Evaluating the safety of large language models with multiple choice questions,” arXiv preprint arXiv:2309.07045, 2023.
  464. H. Lin, Z. Luo, B. Wang, R. Yang, and J. Ma, “Goat-bench: Safety insights to large multimodal models through meme-based social abuse,” arXiv preprint arXiv:2401.01523, 2024.
  465. X. Wang, X. Yi, H. Jiang, S. Zhou, Z. Wei, and X. Xie, “Tovilag: Your visual-language generative model is also an evildoer,” arXiv preprint arXiv:2312.11523, 2023.
  466. Y. Gong, D. Ran, J. Liu, C. Wang, T. Cong, A. Wang, S. Duan, and X. Wang, “Figstep: Jailbreaking large vision-language models via typographic visual prompts,” arXiv preprint arXiv:2311.05608, 2023.
  467. X. Liu, Y. Zhu, Y. Lan, C. Yang, and Y. Qiao, “Query-relevant images jailbreak large multi-modal models,” arXiv preprint arXiv:2311.17600, 2023.
  468. “midjourney,” https://www.midjourney.com/home.
  469. “Stability ai,” https://stability.ai/.
  470. “Gpt-4,” https://openai.com/gpt-4.
  471. “Dalle-2,” https://openai.com/dall-e-2.
  472. “Openai,” https://openai.com.
  473. “Pika labs,” https://www.pika.art/.
  474. “Gen2,” https://research.runwayml.com/gen2.
  475. “heygen,” https://app.heygen.com/home.
  476. “Azure ai-services: text-to-speech,” https://azure.microsoft.com/zh-cn/products/ai-services/text-to-speech.
  477. “descript,” https://www.descript.com/.
  478. “Suno ai,” https://suno-ai.org/.
  479. “Stability ai: Stable audio,” https://stability.ai/stable-audio.
  480. “Musicfx,” https://aitestkitchen.withgoogle.com/tools/music-fx.
  481. “tuneflow,” https://www.tuneflow.com/.
  482. “deepmusic,” https://www.deepmusic.fun/.
  483. “meta,” https://about.meta.com/.
  484. “Epic games’ metahuman creator,” https://www.unrealengine.com/en-US/metahuman.
  485. “Luma ai,” https://lumalabs.ai/.
  486. “Adobe,” https://www.adobe.com/.
  487. “Kaedim3d,” https://www.kaedim3d.com/.
  488. “Wonder studio,” https://wonderdynamics.com/.
  489. A. Avetisyan, C. Xie, H. Howard-Jenkins, T.-Y. Yang, S. Aroudj, S. Patra, F. Zhang, D. Frost, L. Holland, C. Orme, J. Engel, E. Miller, R. Newcombe, and V. Balntas, “Scenescript: Reconstructing scenes with an autoregressive structured language model,” 2024.
  490. “google,” https://www.google.com/.
  491. “tencent,” https://www.tencent.com/.
  492. Y. He, S. Yang, H. Chen, X. Cun, M. Xia, Y. Zhang, X. Wang, R. He, Q. Chen, and Y. Shan, “Scalecrafter: Tuning-free higher-resolution visual generation with diffusion models,” in The Twelfth International Conference on Learning Representations, 2023.
  493. L. Guo, Y. He, H. Chen, M. Xia, X. Cun, Y. Wang, S. Huang, Y. Zhang, X. Wang, Q. Chen et al., “Make a cheap scaling: A self-cascade diffusion model for higher-resolution adaptation,” arXiv preprint arXiv:2402.10491, 2024.
  494. Y. Xu, T. Park, R. Zhang, Y. Zhou, E. Shechtman, F. Liu, J.-B. Huang, and D. Liu, “Videogigagan: Towards detail-rich video super-resolution,” arXiv preprint arXiv:2404.12388, 2024.
  495. S. Zhou, P. Yang, J. Wang, Y. Luo, and C. C. Loy, “Upscale-a-video: Temporal-consistent diffusion model for real-world video super-resolution,” arXiv preprint arXiv:2312.06640, 2023.
  496. R. S. Roman, Y. Adi, A. Deleforge, R. Serizel, G. Synnaeve, and A. Défossez, “From discrete tokens to high-fidelity audio using multi-band diffusion,” arXiv preprint arXiv:2308.02560, 2023.
  497. Y. Yao, P. Li, B. Chen, and A. Wang, “Jen-1 composer: A unified framework for high-fidelity multi-track music generation,” arXiv preprint arXiv:2310.19180, 2023.
  498. M. Ding, W. Zheng, W. Hong, and J. Tang, “Cogview2: Faster and better text-to-image generation via hierarchical transformers,” Advances in Neural Information Processing Systems, vol. 35, pp. 16 890–16 902, 2022.
  499. J. Chen, J. Yu, C. Ge, L. Yao, E. Xie, Y. Wu, Z. Wang, J. Kwok, P. Luo, H. Lu et al., “Pixart-α𝛼\alphaitalic_α: Fast training of diffusion transformer for photorealistic text-to-image synthesis,” arXiv preprint arXiv:2310.00426, 2023.
  500. Y. Zhang, Y. Wei, X. Lin, Z. Hui, P. Ren, X. Xie, X. Ji, and W. Zuo, “Videoelevator: Elevating video generation quality with versatile text-to-image diffusion models,” arXiv preprint arXiv:2403.05438, 2024.
  501. R. Henschel, L. Khachatryan, D. Hayrapetyan, H. Poghosyan, V. Tadevosyan, Z. Wang, S. Navasardyan, and H. Shi, “Streamingt2v: Consistent, dynamic, and extendable long video generation from text,” arXiv preprint arXiv:2403.14773, 2024.
  502. R. Or-El, X. Luo, M. Shan, E. Shechtman, J. J. Park, and I. Kemelmacher-Shlizerman, “Stylesdf: High-resolution 3d-consistent image and geometry generation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 13 503–13 513.
  503. X. Huang, W. Li, J. Hu, H. Chen, and Y. Wang, “Refsr-nerf: Towards high fidelity and super resolution view synthesis,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 8244–8253.
  504. F.-Y. Wang, W. Chen, G. Song, H.-J. Ye, Y. Liu, and H. Li, “Gen-l-video: Multi-text to long video generation via temporal co-denoising,” arXiv preprint arXiv:2305.18264, 2023.
  505. J. Yoo, S. Kim, D. Lee, C. Kim, and S. Hong, “Towards end-to-end generative modeling of long videos with memory-efficient bidirectional transformers,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22 888–22 897.
  506. L. Lin, G. Xia, Y. Zhang, and J. Jiang, “Arrange, inpaint, and refine: Steerable long-term music audio generation and editing via content-based controls,” 2024.
  507. L. Zhang, A. Rao, and M. Agrawala, “Adding conditional control to text-to-image diffusion models,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 3836–3847.
  508. M. Zhao, R. Wang, F. Bao, C. Li, and J. Zhu, “Controlvideo: Adding conditional control for one shot text-to-video editing,” arXiv preprint arXiv:2305.17098, 2023.
  509. R. Liu, D. Garrette, C. Saharia, W. Chan, A. Roberts, S. Narang, I. Blok, R. Mical, M. Norouzi, and N. Constant, “Character-aware models improve visual text rendering,” arXiv preprint arXiv:2212.10562, 2022.
  510. J. Ma, M. Zhao, C. Chen, R. Wang, D. Niu, H. Lu, and X. Lin, “Glyphdraw: Learning to draw chinese characters in image synthesis models coherently,” arXiv preprint arXiv:2303.17870, 2023.
  511. C. Chen, X. Yang, F. Yang, C. Feng, Z. Fu, C.-S. Foo, G. Lin, and F. Liu, “Sculpt3d: Multi-view consistent text-to-3d generation with sparse 3d prior,” arXiv preprint arXiv:2403.09140, 2024.
  512. S. Woo, B. Park, H. Go, J.-Y. Kim, and C. Kim, “Harmonyview: Harmonizing consistency and diversity in one-image-to-3d,” arXiv preprint arXiv:2312.15980, 2023.
  513. J. Ye, P. Wang, K. Li, Y. Shi, and H. Wang, “Consistent-1-to-3: Consistent image to 3d view synthesis via geometry-aware diffusion models,” arXiv preprint arXiv:2310.03020, 2023.
  514. Q. Zuo, X. Gu, L. Qiu, Y. Dong, Z. Zhao, W. Yuan, R. Peng, S. Zhu, Z. Dong, L. Bo et al., “Videomv: Consistent multi-view generation based on large video generative model,” arXiv preprint arXiv:2403.12010, 2024.
  515. P. Wang, S. Wang, J. Lin, S. Bai, X. Zhou, J. Zhou, X. Wang, and C. Zhou, “One-peace: Exploring one general representation model toward unlimited modalities,” arXiv preprint arXiv:2305.11172, 2023.
  516. C. Boletsis, A. Lie, O. Prillard, K. Husby, and J. Li, “The invizar project: Augmented reality visualization for non-destructive testing data from jacket platforms,” 2023.
  517. S. Chen, H. Li, Q. Wang, Z. Zhao, M. Sun, X. Zhu, and J. Liu, “Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset,” Advances in Neural Information Processing Systems, vol. 36, 2024.
  518. J. Kaplan, S. McCandlish, T. Henighan, T. B. Brown, B. Chess, R. Child, S. Gray, A. Radford, J. Wu, and D. Amodei, “Scaling laws for neural language models,” arXiv preprint arXiv:2001.08361, 2020.
  519. P. Micikevicius, S. Narang, J. Alben, G. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh et al., “Mixed precision training,” arXiv preprint arXiv:1710.03740, 2017.
  520. B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko, “Quantization and training of neural networks for efficient integer-arithmetic-only inference,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 2704–2713.
  521. T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “Gpt3. int8 (): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems, vol. 35, pp. 30 318–30 332, 2022.
  522. Y. Choukroun, E. Kravchik, F. Yang, and P. Kisilev, “Low-bit quantization of neural networks for efficient inference,” in 2019 IEEE/CVF International Conference on Computer Vision Workshop (ICCVW).   IEEE, 2019, pp. 3009–3018.
  523. X. Wu, C. Li, R. Y. Aminabadi, Z. Yao, and Y. He, “Understanding int4 quantization for language models: latency speedup, composability, and failure cases,” in International Conference on Machine Learning.   PMLR, 2023, pp. 37 524–37 539.
  524. Y. Qu, X. Shen, X. He, M. Backes, S. Zannettou, and Y. Zhang, “Unsafe diffusion: On the generation of unsafe images and hateful memes from text-to-image models,” in Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, 2023, pp. 3403–3417.
  525. D. Ha and J. Schmidhuber, “World models,” arXiv preprint arXiv:1803.10122, 2018.
  526. H. Liu, W. Yan, M. Zaharia, and P. Abbeel, “World model on million-length video and language with ringattention,” arXiv preprint arXiv:2402.08268, 2024.
  527. Y. LeCun, “A path towards autonomous machine intelligence version 0.9. 2, 2022-06-27,” Open Review, vol. 62, no. 1, 2022.
  528. C. Min, D. Zhao, L. Xiao, Y. Nie, and B. Dai, “Uniworld: Autonomous driving pre-training via world models,” arXiv preprint arXiv:2308.07234, 2023.
  529. D. Hafner, J. Pasukonis, J. Ba, and T. Lillicrap, “Mastering diverse domains through world models,” arXiv preprint arXiv:2301.04104, 2023.
Authors (16)
  1. Yingqing He
  2. Zhaoyang Liu
  3. Jingye Chen
  4. Zeyue Tian
  5. Hongyu Liu
  6. Xiaowei Chi
  7. Runtao Liu
  8. Ruibin Yuan
  9. Yazhou Xing
  10. Wenhai Wang
  11. Jifeng Dai
  12. Yong Zhang
  13. Wei Xue
  14. Qifeng Liu
  15. Yike Guo
  16. Qifeng Chen