COCO is "ALL" You Need for Visual Instruction Fine-tuning (2401.08968v1)

Published 17 Jan 2024 in cs.CV

Abstract: Multi-modal LLMs (MLLMs) are increasingly prominent in the field of artificial intelligence. Visual instruction fine-tuning (IFT) is a vital process for aligning MLLMs' output with users' intentions. High-quality and diversified instruction-following data is key to this fine-tuning process. Recent studies propose to construct visual IFT datasets through a multifaceted approach: transforming existing datasets with rule-based templates, employing GPT-4 to rewrite annotations, and utilizing GPT-4V for visual dataset pseudo-labeling. LLaVA-1.5 adopted a similar approach and constructed LLaVA-mix-665k, which is one of the simplest, most widely used, yet most effective IFT datasets today. Notably, when properly fine-tuned with this dataset, MLLMs can achieve state-of-the-art performance on several benchmarks. However, we noticed that models trained with this dataset often struggle to follow user instructions properly in multi-round dialog. In addition, traditional caption and VQA evaluation benchmarks, with their closed-form evaluation structure, are not fully equipped to assess the capabilities of modern open-ended generative MLLMs. This problem is not unique to the LLaVA-mix-665k dataset, but may be a potential issue in all IFT datasets constructed from image captioning or VQA sources, though the extent of the issue may vary. We argue that datasets with diverse and high-quality detailed instruction-following annotations are essential and adequate for MLLM IFT. In this work, we establish a new IFT dataset, with images sourced from the COCO dataset along with more diverse instructions. Our experiments show that when fine-tuned with our proposed dataset, MLLMs achieve better performance on open-ended evaluation benchmarks in both single-round and multi-round dialog settings.
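
For readers unfamiliar with how such instruction-following data is typically serialized, the sketch below shows what a multi-round visual IFT record over a COCO image might look like. This is an illustration only, not the authors' pipeline: the field names follow the widely used LLaVA-style conversation convention ("image", "conversations", "from"/"value"), and the helper function, file path, and dialog content are hypothetical.

```python
# Minimal sketch of a LLaVA-style multi-round visual IFT record (assumed format,
# not the paper's actual data). The image path and dialog turns are illustrative.
import json

def build_ift_record(image_file: str, rounds: list[tuple[str, str]]) -> dict:
    """Pack (instruction, response) pairs into one multi-round training sample.

    The first user turn carries the <image> placeholder token, marking where the
    visual features are injected during fine-tuning.
    """
    conversations = []
    for i, (instruction, response) in enumerate(rounds):
        prefix = "<image>\n" if i == 0 else ""
        conversations.append({"from": "human", "value": prefix + instruction})
        conversations.append({"from": "gpt", "value": response})
    return {"image": image_file, "conversations": conversations}

# Hypothetical example: a detailed-description turn followed by a follow-up
# question, i.e. the multi-round dialog setting the abstract emphasizes.
record = build_ift_record(
    "coco/train2017/000000391895.jpg",
    [
        ("Describe this image in detail.",
         "A man rides a motorcycle along a dirt road while two onlookers watch from the side."),
        ("What is the likely weather in this scene?",
         "The sky is clear and the shadows are sharp, so it is most likely a sunny day."),
    ],
)
print(json.dumps(record, indent=2))
```

The design point the abstract argues for is visible in the shape of the record: instead of a single closed-form caption or VQA answer, each sample carries diverse, detailed responses across multiple turns, which is what open-ended generative MLLMs are evaluated on.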

Authors (5)
  1. Xiaotian Han (46 papers)
  2. Yiqi Wang (39 papers)
  3. Bohan Zhai (13 papers)
  4. Quanzeng You (41 papers)
  5. Hongxia Yang (130 papers)