Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data (2410.18558v1)
Abstract: Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance relative to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset of 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained Aquila-VL-2B, a 2-billion-parameter VLM that achieves state-of-the-art (SOTA) performance among models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
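The abstract names two data-side techniques: quality filtering with deduplication, and synthetic instruction generation that prompts open-source VLMs with detailed image annotations to produce diverse questions. The paper's concrete heuristics and prompts are not given here, so the following is only a minimal Python sketch of how such a two-stage pipeline could look; the `image_bytes` and `answer` fields, the length cutoff, and the `vlm_generate` callable are illustrative assumptions, not the authors' actual pipeline or API.

```python
import hashlib
from typing import Callable, Iterable, Iterator

def dedup_and_filter(samples: Iterable[dict]) -> Iterator[dict]:
    """Stage 1 (assumed): drop trivially short answers and exact-duplicate images."""
    seen: set[str] = set()
    for s in samples:
        if len(s.get("answer", "").strip()) < 3:   # crude quality heuristic
            continue
        fp = hashlib.sha256(s["image_bytes"]).hexdigest()  # exact-duplicate fingerprint
        if fp in seen:                              # deduplication: skip repeated images
            continue
        seen.add(fp)
        yield s

def synthesize_instructions(annotation: str,
                            vlm_generate: Callable[[str], str],
                            n_questions: int = 3) -> list[dict]:
    """Stage 2 (assumed): prompt an open-source VLM with a detailed image
    annotation to produce question/answer pairs of several styles."""
    pairs = []
    for style in ("descriptive", "reasoning", "counting")[:n_questions]:
        prompt = (f"Image annotation: {annotation}\n"
                  f"Write one {style} question about the image, then answer it.")
        pairs.append({"style": style, "qa": vlm_generate(prompt)})
    return pairs
```

A production pipeline would likely replace the exact SHA-256 fingerprint with perceptual hashing or embedding similarity to catch near-duplicates, and score answer quality with a model rather than a length cutoff.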
Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu