MM-LLMs: Recent Advances in MultiModal Large Language Models (2401.13601v5)

Published 24 Jan 2024 in cs.CL

Abstract: In the past year, MultiModal LLMs (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Initially, we outline general design formulations for model architecture and training pipeline. Subsequently, we introduce a taxonomy encompassing 126 MM-LLMs, each characterized by its specific formulations. Furthermore, we review the performance of selected MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Finally, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

Introduction

The field of MultiModal LLMs (MM-LLMs) has expanded rapidly by leveraging pre-trained unimodal models, mitigating the computational cost of training every component from scratch. The resulting models not only excel at natural language understanding and generation but also process and generate MultiModal (MM) content, moving a step closer to artificial general intelligence.

Architectural Composition and Training Pipeline

MM-LLMs are composed of five architectural components: a Modality Encoder, an Input Projector, an LLM Backbone, an Output Projector, and a Modality Generator. The diversity of modalities these components handle underscores both the complexity and the capability of MM-LLMs. The training pipeline comprises two stages, MM Pre-Training (PT) and MM Instruction-Tuning (IT), which extend a pre-trained text-only LLM to accept MM inputs and produce MM outputs while preserving its language abilities. Given the exorbitant cost of training MM-LLMs end to end, a notable shift in the field is toward training strategies that optimize efficiency, typically by freezing the heavy pre-trained parts and updating only lightweight projectors or other parameter-efficient modules.
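The division of labor described above can be made concrete with a short sketch. The PyTorch code below is a minimal illustration of the five-component layout and of the cost-effective recipe of freezing the pre-trained backbones while training only the projectors; the module choices, dimensions, and the use of simple linear projectors are assumptions made for brevity, not the architecture of any specific MM-LLM.

```python
import torch
import torch.nn as nn

class MinimalMMLLM(nn.Module):
    """Toy sketch of the five-component MM-LLM layout (dimensions illustrative)."""

    def __init__(self, enc_dim=256, llm_dim=512, gen_dim=128):
        super().__init__()
        # 1. Modality Encoder: stand-in for a frozen image/audio/video encoder.
        self.modality_encoder = nn.Linear(3 * 32 * 32, enc_dim)
        # 2. Input Projector: aligns encoder features with the LLM token space.
        self.input_projector = nn.Linear(enc_dim, llm_dim)
        # 3. LLM Backbone: stand-in for a frozen pre-trained language model.
        self.llm_backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # 4. Output Projector: maps LLM states to the generator's conditioning space.
        self.output_projector = nn.Linear(llm_dim, gen_dim)
        # 5. Modality Generator: stand-in for a frozen generator (e.g., a diffusion model).
        self.modality_generator = nn.Linear(gen_dim, 3 * 32 * 32)

        # Cost-effective recipe: freeze the heavy pre-trained parts and train
        # only the two projectors (optionally plus LoRA/prompt parameters).
        for frozen in (self.modality_encoder, self.llm_backbone, self.modality_generator):
            for p in frozen.parameters():
                p.requires_grad = False

    def forward(self, flat_image, text_embeds):
        feats = self.modality_encoder(flat_image)             # [B, enc_dim]
        mm_tokens = self.input_projector(feats).unsqueeze(1)  # [B, 1, llm_dim]
        seq = torch.cat([mm_tokens, text_embeds], dim=1)      # prepend MM tokens to text
        hidden = self.llm_backbone(seq)                       # [B, 1 + T, llm_dim]
        cond = self.output_projector(hidden[:, -1])           # conditioning signal
        return self.modality_generator(cond)                  # generated modality (flattened)

model = MinimalMMLLM()
out = model(torch.randn(2, 3 * 32 * 32), torch.randn(2, 5, 512))
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(out.shape, f"trainable: {trainable}/{total} params")
```

In this toy setup only the two projectors contribute trainable parameters, which mirrors the survey's observation that MM-LLMs typically tune a small fraction of their total parameter count.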

State-of-the-Art Models

A broad spectrum of MM-LLMs, each with distinct design choices, has been introduced to address various MM tasks. Models such as Flamingo, BLIP-2, and MiniGPT-4 focus on MM understanding, generating text conditioned on visual inputs and natural-language prompts, while MiniGPT-5 extends this line toward interleaved multimodal output. More recent systems such as NExT-GPT and CoDi-2 pursue end-to-end any-to-any generation, avoiding cascades of separately trained specialist models.
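To illustrate how understanding-oriented models connect a frozen vision encoder to a frozen LLM, the sketch below implements a learned-query input projector loosely in the spirit of BLIP-2's Q-Former or Flamingo's Perceiver Resampler: a small set of trainable query vectors cross-attends to image patch features and is projected into the LLM's embedding space as "soft prompt" tokens. This is a simplified, hypothetical sketch rather than the actual Q-Former, which is a multi-layer transformer with its own pre-training objectives; all dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class LearnedQueryProjector(nn.Module):
    """Sketch of a learned-query input projector (Q-Former / Perceiver-Resampler style).

    A fixed number of query vectors cross-attends to frozen image patch
    features, compressing them into a short sequence of soft-prompt tokens
    that a frozen LLM can consume alongside text embeddings.
    """

    def __init__(self, num_queries=32, feat_dim=1024, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, feat_dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
        self.to_llm = nn.Linear(feat_dim, llm_dim)

    def forward(self, image_feats):                    # image_feats: [B, num_patches, feat_dim]
        B = image_feats.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)   # [B, num_queries, feat_dim]
        attended, _ = self.cross_attn(q, image_feats, image_feats)
        return self.to_llm(attended)                   # [B, num_queries, llm_dim] soft prompts

# Example: compress 257 ViT patch features into 32 tokens in the LLM embedding space.
projector = LearnedQueryProjector()
soft_prompts = projector(torch.randn(2, 257, 1024))
print(soft_prompts.shape)  # torch.Size([2, 32, 4096])
```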

Benchmarks and Emerging Directions

The performance of MM-LLMs has been assessed on numerous mainstream benchmarks, providing insight into model effectiveness and guiding future improvements. Promising trajectories include broadening the supported modalities and LLM backbones, improving datasets, and progressing toward any-to-any modality conversion. The survey also calls for more comprehensive, practical, and challenging benchmarks to evaluate MM-LLMs thoroughly, and highlights lightweight deployment, integration with embodied intelligence, and continual IT as a roadmap for future research.

By modeling the interplay between modalities and harnessing the strengths of pre-existing LLMs, MM-LLMs continue to expand the capabilities of AI systems, bringing them closer to human-like intelligence within practical computational limits. This survey serves as a compass for researchers navigating the MM-LLMs landscape, pointing out directions that remain to be explored.

References (191)
  1. Jointly Training Large Autoregressive Multimodal Models. arXiv preprint arXiv:2309.15564.
  2. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221.
  3. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. arXiv preprint arXiv:2305.08844.
  4. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  5. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46.
  6. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
  7. Qwen technical report. arXiv preprint arXiv:2309.16609.
  8. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. CoRR, abs/2308.12966.
  9. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738.
  10. Introducing our Multimodal Models.
  11. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16548–16558.
  12. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR.
  13. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  14. Coyo-700m: Image-text pair dataset.
  15. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the ieee conference on computer vision and pattern recognition, pages 961–970.
  16. Cerspense. 2023. Zeroscope: Diffusion-based text-to-video synthesis.
  17. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
  18. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 20(1):38–56.
  19. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160.
  20. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
  21. Shikra: Unleashing Multimodal LLM’s Referential Dialogue Magic. arXiv preprint arXiv:2306.15195.
  22. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. arXiv preprint arXiv:2311.12793.
  23. BEATs: Audio Pre-Training with Acoustic Tokenizers. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 5178–5193.
  24. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678.
  25. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv preprint arXiv:2305.18565.
  26. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
  27. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
  28. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081.
  29. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
  30. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  31. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.
  32. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
  33. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  34. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 958–979.
  35. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems.
  36. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
  37. Heterogeneous forgetting compensation for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11742–11751.
  38. Federated Incremental Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3934–3943.
  39. Cif: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6079–6083. IEEE.
  40. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
  41. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
  42. A Survey of Vision-Language Pre-Trained Models. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5436–5443.
  43. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
  44. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.
  45. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.
  46. DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding. arXiv preprint arXiv:2311.11810.
  47. Foundation Models in Robotics: Applications, Challenges, and the Future. arXiv preprint arXiv:2312.07843.
  48. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
  49. AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2608–2621.
  50. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108.
  51. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190.
  52. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
  53. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
  54. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913.
  55. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431.
  56. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617.
  57. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206.
  58. Towards a Unified View of Parameter-Efficient Transfer Learning. In International Conference on Learning Representations.
  59. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
  60. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
  61. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.
  62. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
  63. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
  64. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
  65. Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey. arXiv preprint arXiv:2312.16602.
  66. Audiogpt: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995.
  67. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
  68. Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709.
  69. IDEFICS. 2023. Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model.
  70. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
  71. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR.
  72. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. In Thirty-seventh Conference on Neural Information Processing Systems.
  73. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656.
  74. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
  75. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798.
  76. Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
  77. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624.
  78. Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
  79. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73.
  80. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.
  81. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
  82. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125.
  83. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 19730–19742.
  84. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
  85. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705.
  86. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
  87. M³IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387.
  88. Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
  89. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer.
  90. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253.
  91. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
  92. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models. arXiv preprint arXiv:2311.06607.
  93. GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse. arXiv preprint arXiv:2401.01523.
  94. VILA: On Pre-training for Visual Language Models. arXiv preprint arXiv:2312.07533.
  95. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  96. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651.
  97. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 21450–21474.
  98. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. CoRR, abs/2308.05734.
  99. Improved Baselines with Visual Instruction Tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
  100. Visual Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems.
  101. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
  102. Vision-and-Language Pretrained Models: A Survey. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5530–5537.
  103. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
  104. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
  105. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
  106. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv preprint arXiv:2306.05424.
  107. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209.
  108. Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
  109. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395.
  110. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE.
  111. Embodiedgpt: Vision-language pre-training via embodied chain of thought. In Thirty-seventh Conference on Neural Information Processing Systems.
  112. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.
  113. OpenAI. 2022. OpenAI: Introducing ChatGPT.
  114. OpenAI. 2023. GPT-4 Technical Report.
  115. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24.
  116. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  117. X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning. arXiv preprint arXiv:2311.18799.
  118. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824.
  119. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
  120. Robust Speech Recognition via Large-Scale Weak Supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 28492–28518.
  121. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
  122. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30.
  123. Anthony Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146.
  124. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.
  125. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer.
  126. Ludan Ruan and Qin Jin. 2022. Survey: Transformer based video-language pre-training. AI Open, 3:1–13.
  127. Salesforce. 2022. Ulip.
  128. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
  129. Laion coco: 600m synthetic captions from laion2b-en.
  130. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
  131. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer.
  132. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
  133. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580.
  134. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer.
  135. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326.
  136. How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model. arXiv preprint arXiv:2311.07594.
  137. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355.
  138. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525.
  139. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128.
  140. CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation. arXiv preprint arXiv:2311.18775.
  141. Any-to-Any Generation via Composable Diffusion. In Thirty-seventh Conference on Neural Information Processing Systems.
  142. Ul2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.
  143. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  144. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  145. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  146. Attention is all you need. Advances in neural information processing systems, 30.
  147. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR.
  148. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
  149. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.
  150. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations.
  151. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.
  152. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181.
  153. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475.
  154. Multimodal large language models: A survey. arXiv preprint arXiv:2311.13165.
  155. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519.
  156. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296.
  157. Self Correspondence Distillation For End-to-End Weakly-Supervised Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence.
  158. Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Transactions on Image Processing, 32:1052–1064.
  159. Video-text pre-training with learned regions. arXiv preprint arXiv:2112.01194.
  160. Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15671–15680.
  161. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381.
  162. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
  163. A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549.
  164. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687.
  165. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
  166. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer.
  167. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
  168. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322.
  169. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387.
  170. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations.
  171. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. In International Conference on Machine Learning, pages 25994–26009. PMLR.
  172. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15757–15773.
  173. Continual Named Entity Recognition without Catastrophic Forgetting. In The 2023 Conference on Empirical Methods in Natural Language Processing.
  174. Task relation distillation and prototypical pseudo label for incremental named entity recognition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3319–3329.
  175. Decomposing Logits Distillation for Incremental Named Entity Recognition. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1919–1923.
  176. Recent Advances and New Frontiers in Spiking Neural Networks. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5670–5677.
  177. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 543–553.
  178. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer.
  179. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
  180. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.
  181. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087.
  182. Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning. arXiv preprint arXiv:2307.09474.
  183. EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
  184. A survey of large language models. arXiv preprint arXiv:2303.18223.
  185. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581.
  186. Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer. arXiv preprint arXiv:2401.09181.
  187. Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models. arXiv preprint arXiv:2312.07887.
  188. Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239.
  189. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  190. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939.
  191. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004.
Authors (7)
  1. Duzhen Zhang
  2. Yahan Yu
  3. Chenxing Li
  4. Jiahua Dong
  5. Dan Su
  6. Chenhui Chu
  7. Dong Yu
Citations (115)