Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data (2410.18558v1)
Abstract: Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance relative to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset of 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained Aquila-VL-2B, a 2-billion-parameter VLM that achieves state-of-the-art (SOTA) performance among models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.
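The abstract names two data-side techniques: quality filtering with deduplication, and synthetic instruction generation that prompts open-source VLMs with detailed image annotations to produce diverse questions. The paper's concrete heuristics and prompts are not given here, so the following is only a minimal Python sketch of how such a two-stage pipeline could look; the `image_bytes` and `answer` fields, the length cutoff, and the `vlm_generate` callable are illustrative assumptions, not the authors' actual pipeline or API.

```python
import hashlib
from typing import Callable, Iterable, Iterator

def dedup_and_filter(samples: Iterable[dict]) -> Iterator[dict]:
    """Stage 1 (assumed): drop trivially short answers and exact-duplicate images."""
    seen: set[str] = set()
    for s in samples:
        if len(s.get("answer", "").strip()) < 3:   # crude quality heuristic
            continue
        fp = hashlib.sha256(s["image_bytes"]).hexdigest()  # exact-duplicate fingerprint
        if fp in seen:                              # deduplication: skip repeated images
            continue
        seen.add(fp)
        yield s

def synthesize_instructions(annotation: str,
                            vlm_generate: Callable[[str], str],
                            n_questions: int = 3) -> list[dict]:
    """Stage 2 (assumed): prompt an open-source VLM with a detailed image
    annotation to produce question/answer pairs of several styles."""
    pairs = []
    for style in ("descriptive", "reasoning", "counting")[:n_questions]:
        prompt = (f"Image annotation: {annotation}\n"
                  f"Write one {style} question about the image, then answer it.")
        pairs.append({"style": style, "qa": vlm_generate(prompt)})
    return pairs
```

A production pipeline would likely replace the exact SHA-256 fingerprint with perceptual hashing or embedding similarity to catch near-duplicates, and score answer quality with a model rather than a length cutoff.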
Authors: Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu