
Abstract

Traditional computer vision generally solves each task independently with a dedicated model in which the task instruction is implicitly encoded in the model architecture, giving rise to two limitations: (1) it leads to task-specific models, which require multiple models for different tasks and restrict the potential synergies across diverse tasks; (2) it leads to a pre-defined and fixed model interface that has limited interactivity and adaptability in following users' task instructions. To address these limitations, Visual Instruction Tuning (VIT) has been intensively studied recently: it fine-tunes a large vision model with language as task instructions, aiming to learn, from a wide range of vision tasks described by language instructions, a general-purpose multimodal model that can follow arbitrary instructions and thus solve arbitrary tasks specified by the user. This work provides a systematic review of visual instruction tuning, covering (1) the background, which presents computer vision task paradigms and the development of VIT; (2) the foundations of VIT, which introduce commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) the datasets commonly used in visual instruction tuning and evaluation; (4) a review of existing VIT methods that categorizes them with a taxonomy according to both the studied vision task and the method design, and highlights their major contributions, strengths, and shortcomings; (5) a comparison and discussion of VIT methods over various instruction-following benchmarks; and (6) several challenges, open directions, and possible future works in visual instruction tuning research.
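The core VIT recipe summarized above (visual features projected into a language model's embedding space, concatenated with a language instruction, and trained with an autoregressive loss on the response tokens) can be illustrated with a short sketch. The PyTorch code below is a toy illustration under stated assumptions, not the implementation of any particular method surveyed here: the module sizes, the tiny stand-in vision encoder and language model, the toy vocabulary, and the random inputs are all illustrative.

```python
# Minimal sketch of a visual instruction tuning step: project visual features into the
# LM token space, concatenate [visual tokens | instruction | response], and supervise
# only the response tokens with a next-token cross-entropy loss. Toy sizes throughout.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, DIM, IGNORE = 1000, 256, -100

class ToyVITModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Stand-in for a pretrained vision encoder (often a frozen CLIP ViT in practice).
        self.vision_encoder = nn.Sequential(
            nn.Conv2d(3, DIM, kernel_size=16, stride=16), nn.Flatten(2))
        # Lightweight projector mapping visual features into the LM embedding space.
        self.projector = nn.Linear(DIM, DIM)
        self.tok_emb = nn.Embedding(VOCAB, DIM)
        layer = nn.TransformerEncoderLayer(d_model=DIM, nhead=4, batch_first=True)
        self.lm = nn.TransformerEncoder(layer, num_layers=2)  # stand-in for a large LLM
        self.lm_head = nn.Linear(DIM, VOCAB)

    def forward(self, image, instr_ids, resp_ids):
        vis = self.vision_encoder(image).transpose(1, 2)       # (B, num_patches, DIM)
        vis = self.projector(vis)
        txt = self.tok_emb(torch.cat([instr_ids, resp_ids], dim=1))
        seq = torch.cat([vis, txt], dim=1)                     # [visual | instruction | response]

        # Causal (autoregressive) attention mask over the whole sequence.
        S = seq.size(1)
        causal = torch.triu(torch.full((S, S), float("-inf")), diagonal=1)
        hidden = self.lm(seq, mask=causal)
        logits = self.lm_head(hidden)

        # Next-token targets: ignore visual and instruction positions, supervise the response.
        targets = torch.full(seq.shape[:2], IGNORE, dtype=torch.long)
        targets[:, -resp_ids.size(1):] = resp_ids
        return F.cross_entropy(logits[:, :-1].reshape(-1, VOCAB),
                               targets[:, 1:].reshape(-1), ignore_index=IGNORE)

# One illustrative training step on random data.
model = ToyVITModel()
image = torch.randn(2, 3, 224, 224)
instr_ids = torch.randint(0, VOCAB, (2, 12))   # e.g. tokens of "Describe the image in detail."
resp_ids = torch.randint(0, VOCAB, (2, 20))    # tokens of the ground-truth response
loss = model(image, instr_ids, resp_ids)
loss.backward()
```

In many VIT methods the vision encoder is a large pretrained model and the language model is a pretrained LLM; typically the projector is trained from scratch while the other components are kept frozen or updated with parameter-efficient techniques such as LoRA.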

