M$^3$IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning (2306.04387v2)
Abstract: Instruction tuning has significantly advanced LLMs such as ChatGPT, enabling them to align with human instructions across diverse tasks. However, progress in open vision-language models (VLMs) has been limited by the scarcity of high-quality instruction datasets. To tackle this challenge and promote research in the vision-language field, we introduce the Multi-Modal, Multilingual Instruction Tuning (M$^3$IT) dataset, designed to optimize VLM alignment with human instructions. Our M$^3$IT dataset comprises 40 carefully curated datasets, including 2.4 million instances and 400 manually written task instructions, reformatted into a vision-to-text structure. Key tasks are translated into 80 languages with an advanced translation system, ensuring broader accessibility. M$^3$IT surpasses previous datasets in task coverage, number of instructions, and instance scale. Moreover, we develop Ying-VLM, a VLM trained on our M$^3$IT dataset, showcasing its potential to answer complex questions requiring world knowledge, generalize to unseen video tasks, and comprehend unseen instructions in Chinese. We have open-sourced the dataset to encourage further research.
- Lei Li
- Yuwei Yin
- Shicheng Li
- Liang Chen
- Peiyi Wang
- Shuhuai Ren
- Mukai Li
- Yazheng Yang
- Jingjing Xu
- Xu Sun
- Lingpeng Kong
- Qi Liu
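
The abstract states that each of the 40 source datasets is reformatted into a unified vision-to-text instruction structure. Below is a minimal, illustrative sketch of what such an instance record could look like; the field names (`instruction`, `inputs`, `image`, `outputs`, `language`) and the helper `to_prompt` are assumptions for illustration, not the released M$^3$IT schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VisionTextInstance:
    """One vision-to-text instruction-tuning example.

    Illustrative schema only; field names are assumptions,
    not the released M^3IT format.
    """
    instruction: str       # a manually written task instruction
    inputs: str            # task-specific input text, e.g. a question
    image: Optional[str]   # image reference, e.g. a file path or base64 string
    outputs: str           # target answer the VLM should learn to generate
    language: str = "en"   # key tasks are also translated into other languages


def to_prompt(ex: VisionTextInstance) -> str:
    """Flatten an instance into a single text prompt paired with its image."""
    return f"{ex.instruction}\n{ex.inputs}\nAnswer: {ex.outputs}"


# Example: a visual question answering instance rewritten in the unified format.
example = VisionTextInstance(
    instruction="Answer the question based on the image.",
    inputs="What sport is being played?",
    image="path/to/image.jpg",
    outputs="Baseball.",
)
print(to_prompt(example))
```

In such a layout, the manually written instructions and the translated variants can share the same record structure, which is what allows heterogeneous tasks (VQA, captioning, video tasks) to be mixed into a single instruction-tuning corpus.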