
MMInstruct: A High-Quality Multi-Modal Instruction Tuning Dataset with Extensive Diversity (2407.15838v2)

Published 22 Jul 2024 in cs.CV

Abstract: Vision-language supervised fine-tuning is effective in enhancing the performance of Vision Large Language Models (VLLMs). However, existing visual instruction tuning datasets have the following limitations: (1) Instruction annotation quality: although existing VLLMs exhibit strong performance, instructions generated by those advanced VLLMs may still suffer from inaccuracies such as hallucinations. (2) Instruction and image diversity: the limited range of instruction types and the lack of diversity in image data may restrict the model's ability to generate diverse outputs that better reflect real-world scenarios. To address these challenges, we construct MMInstruct, a high-quality, diverse visual instruction tuning dataset consisting of 973K instructions from 24 domains. There are four instruction types: Judgement, Multiple-Choice, Long Visual Question Answering, and Short Visual Question Answering. To construct MMInstruct, we propose an instruction generation data engine that leverages GPT-4V, GPT-3.5, and manual correction. This engine enables semi-automatic, low-cost, multi-domain instruction generation at 1/6 the cost of manual construction. Through extensive validation and ablation experiments, we demonstrate that MMInstruct significantly improves the performance of VLLMs; for example, a model fine-tuned on MMInstruct achieves new state-of-the-art performance on 10 out of 12 benchmarks. The code and data will be available at https://github.com/yuecao0119/MMInstruct.
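
The abstract outlines a three-stage, semi-automatic engine: a vision model (GPT-4V) drafts image descriptions, a text model (GPT-3.5) expands them into instructions of the four listed types, and human annotators correct the drafts. The Python sketch below only illustrates that flow; the helper names (describe_image, write_instruction), the "Q:/A:" output format, and the review step are assumptions for illustration and are not taken from the paper or its released code.

# Hypothetical sketch of a semi-automatic instruction-generation pipeline in the
# spirit of the abstract: a vision model describes each image, a text model turns
# the description into instructions of four types, and a human pass corrects them.
# Helper names, prompts, and formats are illustrative assumptions only.
from dataclasses import dataclass
from typing import Callable, List

INSTRUCTION_TYPES = [
    "Judgement",
    "Multiple-Choice",
    "Long Visual Question Answering",
    "Short Visual Question Answering",
]

@dataclass
class InstructionSample:
    image_path: str
    domain: str
    instruction_type: str
    question: str
    answer: str
    human_verified: bool = False

def generate_for_image(
    image_path: str,
    domain: str,
    describe_image: Callable[[str], str],          # e.g. a GPT-4V call (assumed)
    write_instruction: Callable[[str, str], str],  # e.g. a GPT-3.5 call (assumed)
) -> List[InstructionSample]:
    """Produce one draft instruction of each type for a single image."""
    caption = describe_image(image_path)  # a detailed description grounds the questions
    samples = []
    for itype in INSTRUCTION_TYPES:
        qa_text = write_instruction(caption, itype)  # assumed to return "Q: ...\nA: ..."
        question, _, answer = qa_text.partition("\nA:")
        samples.append(
            InstructionSample(
                image_path=image_path,
                domain=domain,
                instruction_type=itype,
                question=question.removeprefix("Q:").strip(),
                answer=answer.strip(),
            )
        )
    return samples

def human_review(samples: List[InstructionSample]) -> List[InstructionSample]:
    """Placeholder for the manual-correction stage: annotators fix hallucinated
    or inaccurate drafts before marking them verified."""
    for s in samples:
        s.human_verified = True  # in practice, edited and approved by an annotator
    return samples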

Authors (12)
  1. Yangzhou Liu (6 papers)
  2. Yue Cao (147 papers)
  3. Zhangwei Gao (9 papers)
  4. Weiyun Wang (20 papers)
  5. Zhe Chen (237 papers)
  6. Wenhai Wang (123 papers)
  7. Hao Tian (146 papers)
  8. Lewei Lu (55 papers)
  9. Xizhou Zhu (73 papers)
  10. Tong Lu (85 papers)
  11. Yu Qiao (563 papers)
  12. Jifeng Dai (131 papers)
Citations (6)