Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data (2410.18558v1)

Published 24 Oct 2024 in cs.CL

Abstract: Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
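The abstract outlines a data pipeline: annotation-conditioned question generation with an open-source VLM, followed by quality filtering and deduplication. The sketch below is a minimal illustration of that idea, not the authors' released code; the `generate_with_vlm` hook, the question styles, and the length threshold are all assumptions chosen for illustration.

    # Minimal sketch of a synthetic-instruction pipeline, assuming an
    # open-source VLM is available behind a hypothetical generate_with_vlm hook.
    import hashlib
    from dataclasses import dataclass

    @dataclass
    class Sample:
        image_id: str
        annotation: str   # detailed image annotation / caption
        question: str
        answer: str

    def generate_with_vlm(prompt: str) -> str:
        """Hypothetical hook for calling an open-source VLM; not part of the paper."""
        raise NotImplementedError

    def synthesize(image_id: str, annotation: str, question_styles: list[str]) -> list[Sample]:
        # Generate diverse questions (and answers) conditioned on the image annotation.
        samples = []
        for style in question_styles:
            q = generate_with_vlm(
                f"Given this image description:\n{annotation}\nWrite a {style} question about the image.")
            a = generate_with_vlm(
                f"Description:\n{annotation}\nQuestion: {q}\nAnswer:")
            samples.append(Sample(image_id, annotation, q, a))
        return samples

    def filter_and_dedup(samples: list[Sample], min_len: int = 8) -> list[Sample]:
        # Crude quality filter (answer length) plus exact-duplicate removal by hashing.
        seen, kept = set(), []
        for s in samples:
            if len(s.answer) < min_len:
                continue
            key = hashlib.md5((s.question + s.answer).lower().encode()).hexdigest()
            if key in seen:
                continue
            seen.add(key)
            kept.append(s)
        return kept

The real dataset presumably uses far stronger filtering and near-duplicate detection; this sketch only fixes the shape of the pipeline described in the abstract.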

Authors (19)
  1. Shuhao Gu (21 papers)
  2. Jialing Zhang (4 papers)
  3. Siyuan Zhou (27 papers)
  4. Kevin Yu (20 papers)
  5. Zhaohu Xing (16 papers)
  6. Liangdong Wang (10 papers)
  7. Zhou Cao (2 papers)
  8. Jintao Jia (1 paper)
  9. Zhuoyi Zhang (4 papers)
  10. Yixuan Wang (95 papers)
  11. Zhenchong Hu (1 paper)
  12. Bo-Wen Zhang (15 papers)
  13. Jijie Li (11 papers)
  14. Dong Liang (154 papers)
  15. Yingli Zhao (5 papers)
  16. Yulong Ao (7 papers)
  17. Yaoqi Liu (4 papers)
  18. Fangxiang Feng (15 papers)
  19. Guang Liu (30 papers)
Citations (3)
