
InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model

Published 29 Jan 2024 in cs.CV and cs.CL | (2401.16420v1)

Abstract: We introduce InternLM-XComposer2, a cutting-edge vision-LLM excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters are publicly available at https://github.com/InternLM/InternLM-XComposer.

Citations (174)

Summary

  • The paper introduces a novel approach using Partial LoRA and a richly curated dataset to enhance free-form text-image composition and comprehension.
  • It demonstrates superior performance across benchmarks, significantly surpassing open-source models and rivaling advanced systems like GPT-4V.
  • The methodology offers robust multimodal integration that paves the way for innovative applications in vision-language technologies.

Introduction

InternLM-XComposer2 represents a significant advancement in the field of vision-language models (VLMs). It excels both at comprehending visual elements and at free-form text-image composition, offering highly customizable content creation across a wide spectrum of application contexts.

Partial LoRA and Data Foundation

The model's capabilities rest on two critical design elements. The first is Partial LoRA (PLoRA), which applies additional LoRA parameters exclusively to image tokens, preserving the pre-trained language knowledge while adding visual capability and thereby balancing composition and comprehension. The second is a high-quality, diverse data foundation: the curated dataset spans a wide range of complexity, from simple instruction following to highly customized content creation drawing on varied reference materials.
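The core idea of PLoRA can be sketched as a frozen linear layer whose low-rank update is added only at image-token positions. This is a minimal illustration, not the paper's implementation; the class name, rank, and mask handling here are assumptions for exposition.

```python
import torch
import torch.nn as nn

class PartialLoRALinear(nn.Module):
    """Sketch of Partial LoRA (PLoRA): the frozen base projection is
    applied to every token, while the trainable low-rank update is
    added only where the token is an image token."""

    def __init__(self, in_features: int, out_features: int, rank: int = 8):
        super().__init__()
        self.base = nn.Linear(in_features, out_features)
        # Freeze the pre-trained weights to preserve language knowledge.
        for p in self.base.parameters():
            p.requires_grad_(False)
        # Trainable low-rank factors (the LoRA update B @ A).
        self.lora_A = nn.Linear(in_features, rank, bias=False)
        self.lora_B = nn.Linear(rank, out_features, bias=False)
        nn.init.zeros_(self.lora_B.weight)  # start as a no-op update

    def forward(self, x: torch.Tensor, image_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features); image_mask: (batch, seq_len) bool
        out = self.base(x)
        lora_out = self.lora_B(self.lora_A(x))
        # Apply the low-rank update only at image-token positions.
        return out + lora_out * image_mask.unsqueeze(-1).to(out.dtype)
```

Because the mask zeroes the update at text positions, text tokens pass through the original pre-trained weights untouched, which is how the design keeps language quality intact while adapting vision tokens.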

Performance Benchmarks and Advances

InternLM-XComposer2’s performance across various benchmarks is noteworthy. It not only surpasses existing open-source MLLMs by a significant margin but also competes with advanced proprietary models such as GPT-4V and Gemini Pro, matching or even exceeding them in certain assessments, and it particularly excels in free-form text-image composition as evaluated on the OpenCompass platform.

The Future of Vision-Language Understanding

The sophistication of InternLM-XComposer2, combined with robust methodologies such as Partial LoRA and a rich data foundation, holds promise for the future of multimodal understanding. Its proficiency in nuanced perception, intricate reasoning, and knowledge integration places it at the forefront of VLM advancements, with potential applications ranging from content generation to AI-augmented creative endeavors.
