A Survey on Multimodal Large Language Models (2306.13549v4)

Published 23 Jun 2023 in cs.CV, cs.AI, cs.CL, and cs.LG

Abstract: Recently, the Multimodal LLM (MLLM), represented by GPT-4V, has become a rising research hotspot that uses powerful LLMs as a brain to perform multimodal tasks. The surprising emergent capabilities of MLLMs, such as writing stories based on images and OCR-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, and evaluation. Then, we introduce research topics on how MLLMs can be extended to support more granularity, modalities, languages, and scenarios. We continue with multimodal hallucination and extended techniques, including Multimodal ICL (M-ICL), Multimodal CoT (M-CoT), and LLM-Aided Visual Reasoning (LAVR). To conclude, we discuss existing challenges and point out promising research directions. Given that the era of the MLLM has only just begun, we will keep updating this survey and hope it inspires more research. An associated GitHub repository collecting the latest papers is available at https://github.com/BradyFU/Awesome-Multimodal-Large-Language-Models.

An Expert Survey on Multimodal LLMs

The paper "A Survey on Multimodal LLMs" authored by Shukang Yin et al., offers a comprehensive review of the recent advancements and methodologies involved in the development of Multimodal LLMs (MLLM). These models leverage LLMs as a core reasoning entity to handle multimodal tasks that incorporate both vision and language. This survey explores the emergent capabilities, underlying methodologies, key challenges, and promising directions in MLLM research.

Overview of MLLM Capabilities

MLLMs mark a significant departure from traditional unimodal models by incorporating multimodal inputs for more complex tasks. The authors classify MLLM advancements into four broad techniques: Multimodal Instruction Tuning (M-IT), Multimodal In-Context Learning (M-ICL), Multimodal Chain of Thought (M-CoT), and LLM-Aided Visual Reasoning (LAVR).

Multimodal Instruction Tuning (M-IT)

M-IT extends the concept of instruction tuning, initially validated in text-based LLMs, to the multimodal field. Techniques like In-Context Learning (ICL) and multimodal chain-of-thought (CoT) reasoning have shown effectiveness in adapting LLMs to new tasks without extensive retraining. M-IT necessitates both architectural and data adaptations:

  1. Data Collection: The survey details various approaches to constructing multimodal instruction datasets. These approaches include benchmark adaptation, self-instruction, and hybrid composition of multimodal and language-only data.
  2. Modality Bridging: Connecting visual inputs with LLMs involves learnable interfaces or expert models. A common method is to use projection-based or query-based techniques to integrate visual features into the LLM’s reasoning process (a minimal sketch follows this list).
  3. Evaluation: Evaluating MLLMs requires both closed-set and open-set methodologies. Closed-set evaluation uses predefined datasets, while open-set evaluation involves manual or GPT-based scoring to assess the flexibility and generalization of the model in new or unseen tasks.
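
To make projection-based bridging concrete, below is a minimal sketch (not code from the survey) of a learnable interface: a single linear layer that maps frozen vision-encoder patch features into the LLM's token-embedding space. The dimensions (1024 for CLIP-like patch features, 4096 for the LLM) and the ProjectionBridge class are illustrative assumptions; query-based bridges in the Q-Former style would replace the linear map with learned queries and cross-attention.

```python
import torch
import torch.nn as nn

class ProjectionBridge(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM's token-embedding space."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # A single linear layer is the simplest learnable interface; only this
        # layer is trained while the vision encoder and LLM stay frozen.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from a frozen image encoder
        # returns:        (batch, num_patches, llm_dim) "visual tokens" that are
        #                 prepended to the text embeddings before the LLM forward pass
        return self.proj(patch_features)

# Usage with dummy CLIP-like features.
bridge = ProjectionBridge()
visual_tokens = bridge(torch.randn(1, 256, 1024))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```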

Multimodal In-Context Learning (M-ICL)

Building on the emergent abilities of LLMs in ICL, M-ICL allows multimodal models to generalize from a few examples provided in the context of new queries. This is especially practical for visual reasoning tasks and tool usage scenarios where generating step-by-step reasoning or action plans is crucial.
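
As a rough illustration (not code from the survey), the sketch below shows how a multimodal few-shot prompt might be assembled: image placeholders, questions, and answers for a handful of demonstrations are interleaved before the new query. The <image_k> placeholders and the build_micl_prompt helper are hypothetical; real systems substitute actual visual tokens at those positions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Example:
    image_ref: str   # placeholder for an image token/handle, e.g. "<image_1>"
    question: str
    answer: str

def build_micl_prompt(demos: List[Example], query_image: str, query_question: str) -> str:
    """Interleave a few image-question-answer demonstrations before the actual query."""
    parts = ["You are a helpful multimodal assistant."]
    for d in demos:
        parts.append(f"{d.image_ref}\nQ: {d.question}\nA: {d.answer}")
    # The final query follows the same pattern but leaves the answer open.
    parts.append(f"{query_image}\nQ: {query_question}\nA:")
    return "\n\n".join(parts)

demos = [
    Example("<image_1>", "How many apples are on the table?", "Three."),
    Example("<image_2>", "What color is the car?", "Red."),
]
print(build_micl_prompt(demos, "<image_3>", "What is the person holding?"))
```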

Multimodal Chain of Thought (M-CoT)

M-CoT adapts the CoT mechanism to the multimodal context, enabling LLMs to generate intermediate reasoning steps. This not only aids in better task performance but also enhances model interpretability. Key aspects of M-CoT include:

  1. Modality Bridging: Similar to M-IT, modality bridging in M-CoT either involves direct integration of visual features or the use of expert models for generating descriptive text from visual inputs (illustrated in the sketch after this list).
  2. Learning Paradigms: Methods like finetuning and zero/few-shot learning are employed. Finetuning involves specific datasets designed for CoT learning, while zero/few-shot setups use predefined or adaptive chain configurations.
  3. Chain Configuration: Chains can be either pre-defined or adaptively determined during reasoning, impacting the model's ability to handle complex problems flexibly.
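
The sketch below is an illustrative assumption rather than the survey's implementation: it combines expert-model bridging with a zero-shot chain configuration, where a captioner verbalizes the image and the standard zero-shot CoT cue elicits intermediate reasoning from the LLM. The caption_image function is a hypothetical stand-in for an expert captioning model.

```python
def caption_image(image_path: str) -> str:
    # Hypothetical stand-in for an expert captioning model; in the expert-model
    # bridging style, the image is first verbalized as text for the LLM.
    return "A bar chart showing monthly sales: Jan 120, Feb 95, Mar 180."

def build_mcot_prompt(image_path: str, question: str) -> str:
    """Zero-shot M-CoT: bridge the image via a caption, then elicit step-by-step reasoning."""
    caption = caption_image(image_path)
    return (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        # The trailing cue is the standard zero-shot CoT trigger.
        "Let's think step by step."
    )

print(build_mcot_prompt(
    "sales_chart.png",
    "Which month had the highest sales, and by how much over the lowest?",
))
```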

LLM-Aided Visual Reasoning (LAVR)

In LAVR, LLMs enhance visual reasoning tasks by acting as controllers, decision-makers, or semantics refiners. The survey examines these systems along three axes:

  1. Training Paradigms: Most LAVR systems operate in a training-free manner, leveraging pre-trained models for zero/few-shot learning. A notable exception is GPT4Tools, which uses finetuning with an M-IT dataset.
  2. Functions:
    • Controller: LLMs decompose complex tasks into simpler sub-tasks, assigning them to appropriate tools or modules (a minimal sketch follows this list).
    • Decision Maker: In multi-round setups, LLMs continuously evaluate and refine responses based on iterative feedback.
    • Semantics Refiner: LLMs utilize their linguistic knowledge to generate or refine textual information.
  3. Evaluation: Performance is assessed using benchmark metrics or manual evaluation, depending on the task's nature and the model's capabilities.
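
A minimal sketch of the controller role is shown below, assuming a toy tool registry and a hard-coded plan standing in for the controller LLM; plan_with_llm and the tool names are hypothetical. The pattern (decompose the request into tool calls, then aggregate their outputs) is what systems in this family implement with real detectors, captioners, and OCR modules.

```python
import json
from typing import Callable, Dict, List

# A toy tool registry; real LAVR systems wire in detectors, captioners, OCR, etc.
TOOLS: Dict[str, Callable[[str], str]] = {
    "detect_objects": lambda arg: "[person, bicycle, traffic light]",
    "caption_image": lambda arg: "A cyclist waiting at a crosswalk.",
}

def plan_with_llm(user_request: str) -> List[dict]:
    # Hypothetical stand-in for the controller LLM, which would be prompted to
    # decompose the request into a JSON list of tool calls.
    return json.loads(
        '[{"tool": "detect_objects", "input": "street.jpg"},'
        ' {"tool": "caption_image", "input": "street.jpg"}]'
    )

def run_controller(user_request: str) -> List[str]:
    """Execute each sub-task the controller planned and collect the tool outputs."""
    results = []
    for step in plan_with_llm(user_request):
        tool = TOOLS[step["tool"]]
        results.append(f'{step["tool"]}: {tool(step["input"])}')
    return results

print(run_controller("Describe what is happening in street.jpg and list the objects."))
```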

Challenges and Future Directions

The paper emphasizes that MLLM research is nascent and identifies several key challenges:

  • The perception capabilities of MLLMs are still limited by information bottlenecks in modality integration.
  • The robustness of reasoning chains in MLLMs requires more investigation, especially under complex querying conditions.
  • Enhancing instruction following and alleviating object hallucination are crucial for practical reliability.
  • Novel, parameter-efficient training techniques are necessary to harness the full potential of MLLMs.
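
As one concrete, generic example of parameter-efficient tuning, the sketch below adds a LoRA-style low-rank update to a frozen linear layer, so only the two small matrices are trained. This is an illustrative assumption about what such techniques look like, not a method proposed in the survey.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen linear layer augmented with a low-rank, trainable update (LoRA-style)."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # the original weights stay frozen
        self.lora_a = nn.Linear(base.in_features, rank, bias=False)
        self.lora_b = nn.Linear(rank, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)     # start as an identity update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(f"trainable params: {trainable}")        # only the two low-rank matrices
```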

Conclusion

The authors provide an exhaustive review of MLLM advancements, methodologies, challenges, and potential research directions. The survey is valuable for researchers aiming to leverage multimodal data and approaches to improve AI's reasoning and perception capabilities. The integration of M-IT, M-ICL, M-CoT, and LAVR strategies holds promise for the next generation of intelligent systems capable of more nuanced and complex task handling. This work is a critical resource, offering a structured perspective on current progress and future opportunities in the burgeoning field of MLLMs.

Authors
  1. Shukang Yin
  2. Chaoyou Fu
  3. Sirui Zhao
  4. Ke Li
  5. Xing Sun
  6. Tong Xu
  7. Enhong Chen
Citations: 374