Overview of mPLUG-Owl: Modularization for Multi-Modality in LLMs
The paper "mPLUG-Owl: Modularization Empowers LLMs with Multimodality" explores an innovative paradigm for enhancing LLMs with multimodal capabilities through a modularized training approach. This paper addresses the integration of visual information into LLMs, a challenging area given the traditionally text-centric focus of such models.
Methodology
mPLUG-Owl employs a modularized architecture comprising three core components: a foundation LLM, a visual knowledge module (a pretrained visual encoder), and a visual abstractor module that condenses image features into a small set of tokens the LLM can consume. This structure supports the alignment and integration of image and text data. Training is divided into two stages (a code sketch of the architecture and of both stages follows this list):
- Visual Knowledge and Alignment: In the first stage, the visual modules are trained on image-text pairs while the LLM remains frozen. This aligns visual representations with the language model's embedding space, letting the model acquire and internalize visual knowledge without disturbing its language abilities.
- Joint Instruction Tuning: In the second stage, a mixture of language-only and multimodal instruction datasets is used to fine-tune a low-rank adaptation (LoRA) module on the LLM together with the visual abstractor, while the visual knowledge module remains frozen (see the stage-freezing sketch after the architecture code below). This phase strengthens both unimodal and multimodal capabilities, promoting more robust instruction-following and reasoning.
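To make the modular layout concrete, here is a minimal PyTorch-style sketch of the three components and of how visual tokens reach the language model. All class names, dimensions, the learnable-query abstractor design, and the `inputs_embeds` interface are illustrative assumptions based on the paper's description, not its released code.

```python
# Minimal, assumption-laden sketch of the modular layout: a visual encoder
# (visual knowledge module), a query-based visual abstractor, and a language
# model. Names, dimensions, and interfaces are illustrative, not the paper's code.
import torch
import torch.nn as nn


class VisualAbstractor(nn.Module):
    """Compresses many image patch features into a small, fixed number of
    visual tokens via learnable queries and cross-attention, then projects
    them into the LLM's embedding space."""

    def __init__(self, vis_dim=1024, llm_dim=4096, num_queries=64, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, vis_dim))
        self.cross_attn = nn.MultiheadAttention(vis_dim, num_heads, batch_first=True)
        self.proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, patch_feats):                        # (B, N_patches, vis_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        summarized, _ = self.cross_attn(q, patch_feats, patch_feats)
        return self.proj(summarized)                       # (B, num_queries, llm_dim)


class MPlugOwlSketch(nn.Module):
    """Prepends the abstractor's visual tokens to the text embeddings and feeds
    the combined sequence to a decoder-only LLM (assumed to accept
    `inputs_embeds`, as Hugging Face causal LMs do)."""

    def __init__(self, visual_encoder, abstractor, llm):
        super().__init__()
        self.visual_encoder = visual_encoder   # e.g. a ViT returning patch features
        self.abstractor = abstractor
        self.llm = llm

    def forward(self, images, input_ids):
        patch_feats = self.visual_encoder(images)          # (B, N_patches, vis_dim)
        visual_tokens = self.abstractor(patch_feats)       # (B, num_queries, llm_dim)
        text_embeds = self.llm.get_input_embeddings()(input_ids)
        inputs_embeds = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.llm(inputs_embeds=inputs_embeds)
```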
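The two stages differ mainly in which parameters are trainable. The sketch below expresses each stage as a freezing policy over the model above; using the `peft` library for low-rank adaptation, and the specific target modules and rank, are assumptions for illustration rather than the paper's exact setup.

```python
# Sketch of the two training stages as parameter-freezing policies over the
# model above. The `peft`-based LoRA injection and its hyperparameters are
# illustrative assumptions.
from peft import LoraConfig, get_peft_model


def configure_stage1(model):
    """Stage 1 (visual knowledge and alignment): train the visual encoder and
    the abstractor on image-text pairs while the LLM stays frozen."""
    for p in model.llm.parameters():
        p.requires_grad = False
    for p in model.visual_encoder.parameters():
        p.requires_grad = True
    for p in model.abstractor.parameters():
        p.requires_grad = True


def configure_stage2(model, lora_rank=8):
    """Stage 2 (joint instruction tuning): freeze the visual encoder, keep the
    abstractor trainable, and tune only low-rank adapters inside the LLM."""
    for p in model.visual_encoder.parameters():
        p.requires_grad = False
    for p in model.abstractor.parameters():
        p.requires_grad = True
    lora_cfg = LoraConfig(
        r=lora_rank,
        lora_alpha=16,
        target_modules=["q_proj", "v_proj"],  # assumed attention projections
        task_type="CAUSAL_LM",
    )
    model.llm = get_peft_model(model.llm, lora_cfg)  # base LLM weights stay frozen
```

In a training loop, the appropriate `configure_stage*` function would be called before building the optimizer, so that only the stage's trainable parameters receive gradient updates.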
Experimental Results
Evaluated on OwlEval, an instruction-following test set built by the authors, mPLUG-Owl outperformed other multimodal models such as MiniGPT-4 and LLaVA. The evaluation covered instruction understanding, visual understanding, knowledge reasoning, and multi-turn conversation. The model also exhibited unexpected abilities such as multi-image correlation and scene text understanding, suggesting potential for harder applications like vision-only document comprehension.
The paper provides a detailed side-by-side comparison of mPLUG-Owl's responses with those of existing models on the OwlEval prompts, with response quality judged by human evaluators.
Implications and Future Directions
The introduction of mPLUG-Owl represents a significant advancement in bridging the gap between visual and textual modalities within LLMs. The use of modularization not only empowers these models with enhanced visual understanding but also preserves and potentially improves their generative capabilities. This approach opens avenues for more sophisticated AI systems capable of handling intricate multimodal instructions and scenarios.
Future research may explore extending the current model’s capabilities further, particularly in areas like multilingual processing and complex scene interpretation. As the field progresses, such modular frameworks could become foundational in the development of AI systems that interact seamlessly across various forms of media and language.
Conclusion
The modular training paradigm of mPLUG-Owl presents a promising direction for integrating multimodality into LLMs. By maintaining the strengths of LLMs and augmenting them with powerful visual modules, this research contributes significantly to the advancement of AI technologies capable of understanding and generating across multiple modalities.