Towards Robust Instruction Tuning on Multimodal Large Language Models (2402.14492v2)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: Fine-tuning LLMs on multi-task instruction-following data has proven to be a powerful learning paradigm for improving their zero-shot capabilities on new tasks. Recent work on high-quality instruction-following data generation and selection requires considerable human labor to conceive model-understandable instructions for the given tasks and to carefully filter the LLM-generated data. In this work, we introduce INSTRAUG, an automatic instruction augmentation method for multimodal tasks. It starts from a handful of basic and straightforward meta instructions but can expand an instruction-following dataset by 30 times. Results on two popular multimodal instruction-following benchmarks, MULTIINSTRUCT and InstructBLIP, show that INSTRAUG significantly improves the alignment of multimodal LLMs (MLLMs) across 12 multimodal tasks, with benefits comparable to scaling up the training data multiple times.
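The abstract sketches the core idea: a small pool of generic meta instructions is applied to existing task instructions to multiply the size of an instruction-following dataset. The snippet below is a minimal, hypothetical sketch of that kind of template-based augmentation, assuming a simple list-of-dicts data format and invented meta-instruction templates; it is not the paper's actual INSTRAUG implementation.

```python
import random

# Hypothetical illustration of meta-instruction-based augmentation.
# The templates and data fields below are assumptions for illustration only.

META_INSTRUCTIONS = [
    lambda inst: f"Please {inst[0].lower()}{inst[1:]}",        # polite rephrasing
    lambda inst: f"{inst} Answer in a single short phrase.",   # add an answer-format hint
    lambda inst: f"Task: {inst}",                              # prepend a task marker
    lambda inst: f"You are shown an image. {inst}",            # add modality context
]

def augment(dataset, factor=4, seed=0):
    """Expand an instruction-following dataset by rewriting each example's
    instruction with randomly sampled meta instructions."""
    rng = random.Random(seed)
    augmented = []
    for example in dataset:
        for _ in range(factor):
            rewrite = rng.choice(META_INSTRUCTIONS)
            augmented.append({**example, "instruction": rewrite(example["instruction"])})
    return augmented

if __name__ == "__main__":
    toy = [{"image": "coco_0001.jpg",
            "instruction": "Describe the main object in the image.",
            "answer": "a dog"}]
    for ex in augment(toy):
        print(ex["instruction"])
```

With a larger template pool and `factor` set accordingly, the same pattern yields the order-of-magnitude dataset expansion the abstract describes.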

References (56)
  1. Gpt-4 technical report. arXiv preprint arXiv:2303.08774.
  2. Medic: a multi-task learning dataset for disaster image classification. Neural Computing and Applications, 35(3):2609–2632.
  3. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
  4. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
  5. Qwen technical report. arXiv preprint arXiv:2309.16609.
  6. Introducing our multimodal models.
  7. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
  8. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality.
  9. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
  10. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
  11. Instructblip: Towards general-purpose vision-language models with instruction tuning.
  12. Visual dialog. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 326–335.
  13. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617.
  14. Imagebind-llm: Multi-modality instruction tuning. arXiv preprint arXiv:2309.03905.
  15. Unnatural instructions: Tuning language models with (almost) no human labor. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14409–14428, Toronto, Canada. Association for Computational Linguistics.
  16. Survey of hallucination in natural language generation. ACM Computing Surveys, 55(12):1–38.
  17. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624.
  18. Segment anything. arXiv preprint arXiv:2304.02643.
  19. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73.
  20. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
  21. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726.
  22. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. arXiv preprint arXiv:2301.12597.
  23. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
  24. Self-alignment with instruction backtranslation. arXiv preprint arXiv:2308.06259.
  25. Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81.
  26. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
  27. Visual instruction tuning. arXiv preprint arXiv:2304.08485.
  28. What makes good data for alignment? a comprehensive study of automatic data selection in instruction tuning. arXiv preprint arXiv:2312.15685.
  29. The flan collection: Designing data and methods for effective instruction tuning. arXiv preprint arXiv:2301.13688.
  30. Learn to explain: Multimodal reasoning via thought chains for science question answering. In The 36th Conference on Neural Information Processing Systems (NeurIPS).
  31. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. arXiv preprint arXiv:2110.13214.
  32. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
  33. Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824.
  34. Multitask prompted training enables zero-shot task generalization. In International Conference on Learning Representations.
  35. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326.
  36. Principle-driven self-alignment of language models from scratch with minimal human supervision. arXiv preprint arXiv:2305.03047.
  37. Mirac Suzgun and Adam Tauman Kalai. 2024. Meta-prompting: Enhancing language models with task-agnostic scaffolding. arXiv preprint arXiv:2401.12954.
  38. Alpaca: A strong, replicable instruction-following model. Stanford Center for Research on Foundation Models, 3(6):7.
  39. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
  40. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
  41. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
  42. Transformers learn in-context by gradient descent. In International Conference on Machine Learning, pages 35151–35174. PMLR.
  43. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR.
  44. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
  45. Self-instruct: Aligning language models with self-generated instructions. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13484–13508, Toronto, Canada. Association for Computational Linguistics.
  46. Super-NaturalInstructions: Generalization via declarative instructions on 1600+ NLP tasks. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 5085–5109, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  47. Finetuned language models are zero-shot learners. In International Conference on Learning Representations.
  48. Less: Selecting influential data for targeted instruction tuning. arXiv preprint arXiv:2402.04333.
  49. Wizardlm: Empowering large language models to follow complex instructions. arXiv preprint arXiv:2304.12244.
  50. MultiInstruct: Improving multi-modal zero-shot learning via instruction tuning. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11445–11465, Toronto, Canada. Association for Computational Linguistics.
  51. Dataset pruning: Reducing training data by examining generalization influence. arXiv preprint arXiv:2205.09329.
  52. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502.
  53. Llama-adapter: Efficient fine-tuning of language models with zero-init attention. arXiv preprint arXiv:2303.16199.
  54. Lima: Less is more for alignment. arXiv preprint arXiv:2305.11206.
  55. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
  56. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004.
Authors (3)
  1. Wei Han (202 papers)
  2. Hui Chen (298 papers)
  3. Soujanya Poria (138 papers)