COCO is "ALL" You Need for Visual Instruction Fine-tuning (2401.08968v1)
Abstract: Multi-modal LLMs (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' outputs with users' intentions. High-quality and diverse instruction-following data is key to this fine-tuning process. Recent studies propose constructing visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 to rewrite annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted a similar approach to construct LLaVA-mix-665k, one of the simplest, most widely used, yet most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we noticed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, traditional caption and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k dataset; it may be a potential issue in any IFT dataset constructed from image captioning or VQA sources, though the extent of the issue may vary. We argue that datasets with diverse, high-quality, detailed instruction-following annotations are both essential and sufficient for MLLM IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions. Our experiments show that, when fine-tuned with our proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog settings.
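To make the abstract's notion of rule-based IFT data construction concrete, below is a minimal sketch (not code from the paper) of how a template pipeline might turn a COCO-style caption plus VQA annotations into a LLaVA-style multi-round instruction-following record. The template strings, field names, and the helper `to_ift_record` are illustrative assumptions, not the authors' actual construction procedure.

```python
# Illustrative sketch: rule-based conversion of a caption + QA pairs into a
# multi-round instruction-following record (LLaVA-style JSON). All names and
# templates here are assumptions for demonstration purposes.
import json
import random

# Hypothetical instruction templates for rephrasing closed-form QA as dialog turns.
QUESTION_TEMPLATES = [
    "{question}",
    "{question} Answer briefly.",
    "Looking at the image, {question_lower}",
]

def to_ift_record(image_id: str, caption: str, qa_pairs: list[dict]) -> dict:
    """Build a multi-round conversation: a detailed-description turn followed by QA turns."""
    conversations = [
        {"from": "human", "value": "<image>\nDescribe this image in detail."},
        {"from": "gpt", "value": caption},
    ]
    for qa in qa_pairs:
        template = random.choice(QUESTION_TEMPLATES)
        question = template.format(
            question=qa["question"],
            question_lower=qa["question"][0].lower() + qa["question"][1:],
        )
        conversations.append({"from": "human", "value": question})
        conversations.append({"from": "gpt", "value": qa["answer"]})
    return {"id": image_id, "image": f"coco/{image_id}.jpg", "conversations": conversations}

if __name__ == "__main__":
    record = to_ift_record(
        "000000123456",
        "A gray cat is sleeping on a red couch next to a window.",
        [{"question": "What color is the cat?", "answer": "The cat is gray."}],
    )
    print(json.dumps(record, indent=2))
```

In practice, such records can be further diversified (as the paper advocates) by rewriting the template-generated turns into more natural, detailed responses rather than keeping the short closed-form answers.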
- OpenAI. Gpt-4 technical report, 2023.
- Hugo Touvron et al. Llama 2: Open foundation and fine-tuned chat models, 2023.
- Albert Q. Jiang et al. Mistral 7b, 2023.
- Visual instruction tuning, 2023.
- Wenliang Dai et al. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023.
- Mimic-it: Multi-modal in-context instruction tuning, 2023.
- Llavar: Enhanced visual instruction tuning for text-rich image understanding, 2023.
- Svit: Scaling up visual instruction tuning, 2023.
- Zhenfei Yin et al. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark, 2023.
- To see is to believe: Prompting gpt-4v for better visual instruction tuning, 2023.
- Harsh Agrawal et al. nocaps: novel object captioning at scale. In Proceedings of the IEEE International Conference on Computer Vision, pages 8948–8957, 2019.
- Microsoft coco captions: Data collection and evaluation server, 2015.
- Gqa: A new dataset for real-world visual reasoning and compositional question answering, 2019.
- Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering. In Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
- Danna Gurari et al. Vizwiz grand challenge: Answering visual questions from blind people, 2018.
- Cider: Consensus-based image description evaluation, 2015.
- Chaoyou Fu et al. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2023.
- Seed-bench-2: Benchmarking multimodal large language models, 2023.
- Xiang Yue et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi, 2023.
- Weihao Yu et al. Mm-vet: Evaluating large multimodal models for integrated capabilities, 2023.
- Infimm-eval: Complex open-ended reasoning evaluation for multi-modal large language models, 2023.
- Improved baselines with visual instruction tuning, 2023.
- Tsung-Yi Lin et al. Microsoft coco: Common objects in context, 2015.
- Ranjay Krishna et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations, 2016.
- Zhengyuan Yang et al. Mm-react: Prompting chatgpt for multimodal reasoning and action, 2023.
- Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face, 2023.
- Visual chatgpt: Talking, drawing and editing with visual foundation models, 2023.
- Qwen-vl: A versatile vision-language model for understanding, localization, text reading, and beyond, 2023.
- Jean-Baptiste Alayrac et al. Flamingo: a visual language model for few-shot learning, 2022.
- Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models, 2023.
- Minigpt-4: Enhancing vision-language understanding with advanced large language models, 2023.
- Christoph Schuhmann et al. Laion-5b: An open large-scale dataset for training next generation image-text models, 2022.
- Coyo-700m: Image-text pair dataset, 2022.
- Bart Thomee et al. Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2):64–73, January 2016.
- Wanrong Zhu et al. Multimodal C4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939, 2023.
- Hugo Laurençon et al. Obelics: An open web-scale filtered dataset of interleaved image-text documents, 2023.
- Weihan Wang et al. Cogvlm: Visual expert for pretrained language models, 2023.
- Haozhe Zhao et al. Mmicl: Empowering vision-language model with multi-modal in-context learning. arXiv preprint arXiv:2309.07915, 2023.
- Sharegpt4v: Improving large multi-modal models with better captions, 2023.
- A hierarchical approach for generating descriptive image paragraphs. In Computer Vision and Pattern Recognition (CVPR), 2017.
- Exploring models and data for image question answering, 2015.
- Visual7W: Grounded Question Answering in Images. In IEEE Conference on Computer Vision and Pattern Recognition, 2016.
- The color of the cat is gray: 1 million full-sentences visual question answering (fsvqa), 2016.
- Visual spatial reasoning, 2023.
- Ok-vqa: A visual question answering benchmark requiring external knowledge. In Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
- A-okvqa: A benchmark for visual question answering using world knowledge. arXiv, 2022.
- An analysis of visual question answering algorithms, 2017.
- Select, substitute, search: A new benchmark for knowledge-augmented visual question answering. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR ’21. ACM, July 2021.
- R-vqa: Learning visual relation facts with semantic attention for visual question answering. In SIGKDD, 2018.
- Interpretable counting for visual question answering, 2018.
- Tallyqa: Answering complex counting questions, 2018.
- Vqa-e: Explaining, elaborating, and enhancing your answers for visual questions, 2018.
- Scene text visual question answering, 2019.
- Point and ask: Incorporating pointing into visual question answering, 2022.
- Improving generative visual dialog by answering diverse questions. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2019.
- The photobook dataset: Building common ground through visually-grounded dialogue, 2019.
- Sparkles: Unlocking chats across multiple images for multimodal instruction-following models. arXiv preprint arXiv:2308.16463, 2023.
- Flickr30k entities: Collecting region-to-phrase correspondences for richer image-to-sentence models, 2016.
- Jason Wei et al. Chain-of-thought prompting elicits reasoning in large language models, 2023.
- Xiaotian Han (46 papers)
- Yiqi Wang (39 papers)
- Bohan Zhai (13 papers)
- Quanzeng You (41 papers)
- Hongxia Yang (130 papers)