InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output (2407.03320v1)

Published 3 Jul 2024 in cs.CV and cs.CL

Abstract: We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large vision language model that supports long-contextual input and output. IXC-2.5 excels in various text-image comprehension and composition applications, achieving GPT-4V level capabilities with merely 7B LLM backend. Trained with 24K interleaved image-text contexts, it can seamlessly extend to 96K long contexts via RoPE extrapolation. This long-context capability allows IXC-2.5 to excel in tasks requiring extensive input and output contexts. Compared to its previous 2.0 version, InternLM-XComposer-2.5 features three major upgrades in vision-language comprehension: (1) Ultra-High Resolution Understanding, (2) Fine-Grained Video Understanding, and (3) Multi-Turn Multi-Image Dialogue. In addition to comprehension, IXC-2.5 extends to two compelling applications using extra LoRA parameters for text-image composition: (1) Crafting Webpages and (2) Composing High-Quality Text-Image Articles. IXC-2.5 has been evaluated on 28 benchmarks, outperforming existing open-source state-of-the-art models on 16 benchmarks. It also surpasses or competes closely with GPT-4V and Gemini Pro on 16 key tasks. The InternLM-XComposer-2.5 is publicly available at https://github.com/InternLM/InternLM-XComposer.

InternLM-XComposer-2.5: A Versatile Large Vision Language Model Supporting Long-Contextual Input and Output

The paper presents InternLM-XComposer-2.5 (IXC-2.5), a Large Vision-Language Model (LVLM) that emphasizes long-contextual input and output, enabling a range of sophisticated applications across text-image comprehension and composition. The model marks substantial progress over its predecessor, IXC-2.0, owing primarily to its enhanced architecture and expanded capabilities.

Key Model Enhancements

IXC-2.5 incorporates three primary enhancements aimed at advancing vision-language comprehension:

  1. Ultra-High Resolution Understanding: Utilizing a 560 × 560 Vision Transformer (ViT) encoder, IXC-2.5 processes high-resolution images of arbitrary aspect ratio (a rough tiling sketch follows this list).
  2. Fine-Grained Video Understanding: Videos are treated as high-resolution composite images comprising numerous frames, capturing fine details via dense sampling of each frame.
  3. Multi-Turn Multi-Image Dialogue: The model supports extended, complex interactions involving multiple images over many turns, improving the fluidity of human-like conversations.
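
The first two upgrades can be pictured with a short sketch. The grid-selection rule and frame-sampling parameters below are illustrative assumptions drawn from the description above, not the exact recipe used by IXC-2.5:

```python
from PIL import Image

TILE = 560  # the ViT input resolution named above

def split_into_tiles(img: Image.Image, tile: int = TILE, max_tiles: int = 12):
    """Illustrative dynamic tiling: resize the image to a grid of tile-sized
    crops that roughly preserves its aspect ratio, then cut it up. The grid
    bounds and selection rule here are assumptions, not IXC-2.5's exact ones."""
    w, h = img.size
    cols = max(1, min(max_tiles, round(w / tile)))
    rows = max(1, min(max_tiles // cols, round(h / tile)))
    resized = img.resize((cols * tile, rows * tile))
    tiles = [
        resized.crop((c * tile, r * tile, (c + 1) * tile, (r + 1) * tile))
        for r in range(rows) for c in range(cols)
    ]
    # A downscaled global view is commonly kept alongside the local tiles.
    return [img.resize((tile, tile))] + tiles

def video_as_composite(frames: list[Image.Image], max_frames: int = 64):
    """Treat a video as a 'high-resolution composite image': sample frames
    densely and feed each one through the same tiling/encoding path."""
    step = max(1, len(frames) // max_frames)
    return [f.resize((TILE, TILE)) for f in frames[::step]]
```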

Furthermore, IXC-2.5 extends its capabilities to two critical text-image composition applications:

  1. Crafting Webpages: Leveraging additional LoRA parameters, IXC-2.5 can generate webpage source code from text-image instructions (a PEFT-style sketch of this adapter setup follows the list).
  2. Composing High-Quality Text-Image Articles: Implementing Chain-of-Thought (CoT) and Direct Preference Optimization (DPO) techniques, IXC-2.5 produces high-quality written content with corresponding images.
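
Both composition applications reuse the same 7B backbone and add only small task-specific LoRA weights. The sketch below shows the general pattern with the HuggingFace PEFT library; the adapter paths and names are hypothetical, and IXC-2.5 distributes its own adapter weights through its repository:

```python
# Minimal sketch of the "extra LoRA parameters" idea using HuggingFace PEFT.
# Adapter paths/names are hypothetical placeholders, not IXC-2.5 release artifacts.
from transformers import AutoModelForCausalLM
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "internlm/internlm-xcomposer2d5-7b", trust_remote_code=True
)

# Attach a task-specific LoRA adapter for webpage generation ...
model = PeftModel.from_pretrained(base, "path/to/webpage-lora", adapter_name="webpage")
# ... and another for text-image article composition.
model.load_adapter("path/to/article-lora", adapter_name="article")

# Only the small LoRA matrices differ between tasks; the 7B backbone is shared.
model.set_adapter("webpage")   # route webpage-crafting requests here
model.set_adapter("article")   # or switch to article composition
```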

Training and Model Architecture

The IXC-2.5 model emphasizes long-contextual interaction: it is trained with 24K-token interleaved image-text contexts and can extend to 96K tokens via RoPE extrapolation. The architecture comprises the OpenAI ViT-L/14 vision encoder, the InternLM2-7B language model, and Partial LoRA modules that align visual features with the LLM.
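
One common way to run a RoPE-based model beyond its 24K training length is to rescale positions before computing the rotary angles (position interpolation); the sketch below illustrates that idea, though the exact extrapolation scheme used by IXC-2.5 may differ:

```python
import torch

def rope_frequencies(head_dim: int, max_pos: int, base: float = 10000.0,
                     scale: float = 1.0):
    """Rotary position embedding angles. A model trained at 24K tokens can be
    stretched toward 96K by rescaling positions (scale=4.0), one standard way
    to extend RoPE; the paper's precise recipe is not reproduced here."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(max_pos).float() / scale
    angles = torch.outer(positions, inv_freq)      # shape: (max_pos, head_dim / 2)
    return torch.cos(angles), torch.sin(angles)

# Trained at 24K tokens, run at 96K by interpolating positions 4x.
cos_24k, sin_24k = rope_frequencies(128, 24_000)
cos_96k, sin_96k = rope_frequencies(128, 96_000, scale=96_000 / 24_000)
```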

The pre-training phase targets three objectives drawn from a diverse dataset: General Semantic Alignment, World Knowledge Alignment, and Vision Capability Enhancement. This stage prepares the model to handle a wide range of vision-language inputs.
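
As a rough illustration only, the three objectives can be thought of as a weighted data mixture; the dataset kinds and sampling weights below are hypothetical placeholders, not the paper's actual recipe:

```python
import random

# Hypothetical data-mixture config for the three pre-training objectives.
PRETRAIN_MIXTURE = {
    "general_semantic_alignment":    {"kind": "image-caption pairs",              "weight": 0.4},
    "world_knowledge_alignment":     {"kind": "interleaved image-text documents", "weight": 0.3},
    "vision_capability_enhancement": {"kind": "OCR / chart / grounding data",     "weight": 0.3},
}

def sample_task(rng: random.Random) -> str:
    """Pick a pre-training objective in proportion to its mixture weight."""
    names = list(PRETRAIN_MIXTURE)
    weights = [PRETRAIN_MIXTURE[n]["weight"] for n in names]
    return rng.choices(names, weights=weights, k=1)[0]

print(sample_task(random.Random(0)))  # e.g. 'general_semantic_alignment'
```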

Benchmark Performance

IXC-2.5 demonstrates state-of-the-art performance across a wide range of benchmarks:

  • Video Understanding: Outperformed existing models on four out of five benchmarks, including MVBench and MME-Video, underscoring its proficiency in fine-grained video tasks.
  • Structural High-Resolution Benchmarks: Achieved notable results on DocVQA, ChartQA, and TextVQA, demonstrating its capacity to handle complex, text-rich visual information.
  • General Visual QA Benchmarks: Excelled in MMStar, RealWorldQA, and others, showcasing its versatility.
  • Multi-Image Multi-Turn Dialogue: Surpassed prior models in MMDU, highlighting advanced conversational abilities.

Webpage Generation and Article Composition

For webpage generation, IXC-2.5 covers three settings:

  1. Screenshot-to-code: Achieved high scores on the Design2Code benchmark, demonstrating near GPT-4V-level performance in translating visual designs into code (a usage sketch follows this list).
  2. Instruction-Aware Webpage Generation: Trained on synthetic and real-world datasets to convert textual instructions into webpage designs, including interactive elements like JavaScript.
  3. Resume-to-homepage: Created personal homepages from resumes, showcasing practical applicability.
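
A hypothetical usage sketch for the screenshot-to-code setting is shown below, assuming the chat-style interface exposed by the public checkpoint; the exact method names and arguments should be checked against https://github.com/InternLM/InternLM-XComposer:

```python
# Hypothetical usage sketch; the API shape is an assumption, not the documented interface.
import torch
from transformers import AutoModel, AutoTokenizer

ckpt = "internlm/internlm-xcomposer2d5-7b"
model = AutoModel.from_pretrained(ckpt, torch_dtype=torch.bfloat16,
                                  trust_remote_code=True).cuda().eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

query = "Reproduce this page as a single HTML file with embedded CSS."
images = ["./screenshot.png"]  # design screenshot to translate into code

with torch.no_grad():
    html, _history = model.chat(tokenizer, query, images, do_sample=False)
print(html)
```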

For article composition, the model uses a multi-step pipeline of supervised fine-tuning, reward modeling, preference-data collection, and DPO alignment, resulting in stable, high-quality text-image articles.
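
The final alignment step can be summarized by the standard DPO objective; the sketch below is a generic implementation of that loss, not the paper's training code:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard Direct Preference Optimization loss: push the policy to prefer
    the 'chosen' article over the 'rejected' one relative to a frozen reference
    model. Inputs are summed log-probabilities of each completion."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Toy call with made-up log-probabilities, for illustration only.
loss = dpo_loss(torch.tensor([-12.3]), torch.tensor([-15.9]),
                torch.tensor([-13.0]), torch.tensor([-15.0]))
```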

Implications and Future Directions

The enhancements and comprehensive capabilities of IXC-2.5 make it a robust tool for a variety of practical applications, from webpage design to intricate visual QA tasks. The model’s ability to handle long-contextual interactions positions it as a pivotal advancement for future AI developments. Future research can extend IXC-2.5’s long-context capabilities to more complex and extended multi-modal environments, such as continuous video streams or prolonged dialogue histories, thus broadening its applicability in real-world scenarios.

Authors (27)
  1. Pan Zhang (153 papers)
  2. Xiaoyi Dong (73 papers)
  3. Yuhang Zang (54 papers)
  4. Yuhang Cao (41 papers)
  5. Rui Qian (50 papers)
  6. Lin Chen (384 papers)
  7. Qipeng Guo (72 papers)
  8. Haodong Duan (55 papers)
  9. Bin Wang (750 papers)
  10. Linke Ouyang (12 papers)
  11. Songyang Zhang (116 papers)
  12. Wenwei Zhang (77 papers)
  13. Yining Li (29 papers)
  14. Yang Gao (761 papers)
  15. Peng Sun (210 papers)
  16. Xinyue Zhang (63 papers)
  17. Wei Li (1121 papers)
  18. Jingwen Li (29 papers)
  19. Wenhai Wang (123 papers)
  20. Hang Yan (86 papers)
Citations (51)