Overview of mPLUG-Owl3: Enhancing Long Image-Sequence Understanding in Multi-Modal LLMs
The paper presents mPLUG-Owl3, a Multi-Modal LLM (MLLM) designed to advance long image-sequence understanding in multi-modal scenarios. The model introduces novel hyper attention blocks that efficiently fuse vision and language, addressing limitations of current multi-modal models such as high computational overhead and loss of fine-grained visual information.
Key Contributions
The paper identifies several significant improvements introduced by mPLUG-Owl3:
- Hyper Attention Blocks: Unlike traditional models, which either concatenate visual and text features into the language sequence or insert cross-attention layers extensively throughout the network, mPLUG-Owl3 integrates hyper attention blocks sparsely into the language model. This design fuses textual and visual representations while preserving fine-grained visual details and reducing computational cost.
- Comprehensive Benchmarks: mPLUG-Owl3 is evaluated on a diverse set of 20 benchmarks covering single-image, multi-image, and video tasks, and achieves state-of-the-art results on 14 of them, outperforming other models of similar size.
- Distractor Resistance Evaluation: The authors propose a new evaluation, named Distractor Resistance, to assess model performance on ultra-long visual sequences, demonstrating mPLUG-Owl3's ability to maintain focus amid visual distractions (sketched below).
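The sketch below illustrates one way such a Distractor Resistance protocol can be run: a question about a single target image is answered while a growing number of unrelated distractor images is mixed into the visual context, and accuracy is tracked as the distractor count increases. This is an assumption-laden illustration of the idea described above; the helper names (`model.answer`, `distractor_pool`, etc.) are hypothetical and not taken from the paper's released code.

```python
# Illustrative sketch of a Distractor Resistance style evaluation.
# Assumption: the target image is shuffled into a pool of unrelated images and
# the model must still answer the question about it correctly.
import random

def distractor_resistance(model, samples, distractor_pool, num_distractors):
    """samples: iterable of (target_image, question, answer) triples."""
    correct = 0
    total = 0
    for target_image, question, answer in samples:
        distractors = random.sample(distractor_pool, num_distractors)
        # insert the target image at a random position among the distractors
        images = list(distractors)
        images.insert(random.randrange(len(images) + 1), target_image)
        prediction = model.answer(images=images, question=question)  # hypothetical API
        correct += int(prediction.strip().lower() == answer.strip().lower())
        total += 1
    return correct / total
```

Sweeping `num_distractors` then yields a curve showing how gracefully accuracy degrades as the visual context grows.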
Model Architecture
mPLUG-Owl3 consists of a visual encoder, a linear projection layer, and a decoder-only LLM. It introduces the Hyper Attention Transformer Block (HATB), which runs cross-attention in parallel with self-attention, allowing visual information to be incorporated efficiently into the text-based LLM.
- Efficient Cross-Attention: The model runs cross-attention in parallel with self-attention to integrate visual and textual features, reusing the language query to select and extract relevant visual features from the sequence. This preserves fine-grained information and lets visual context be supplemented adaptively according to the textual semantics.
- Adaptive Gating and MI-Rope: The hyper attention blocks include adaptive gating based on textual features and a Multimodal-Interleaved Rotary Position Embedding (MI-Rope) that preserves the positional information of images within the interleaved sequence, ensuring effective contextual alignment (a minimal sketch of such a block follows this list).
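To make the parallel cross-attention and adaptive gating concrete, here is a minimal PyTorch-style sketch of a hyper attention block. Module and parameter names are hypothetical, the causal mask and MI-Rope are omitted, and the real HATB sits inside a full decoder layer, so this is an illustration of the idea rather than the paper's implementation.

```python
import torch
import torch.nn as nn

class HyperAttentionBlock(nn.Module):
    """Sketch of a hyper attention block: self-attention over text tokens plus a
    parallel cross-attention into visual tokens, fused by a per-token gate."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.txt_norm = nn.LayerNorm(dim)
        self.vis_norm = nn.LayerNorm(dim)      # separate normalization for visual features
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # adaptive gate predicted from the token's own (textual) representation
        self.gate = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, hidden, visual, attn_mask=None):
        h = self.txt_norm(hidden)
        v = self.vis_norm(visual)
        # self-attention over the language sequence (causal mask passed via attn_mask)
        txt_out, _ = self.self_attn(h, h, h, attn_mask=attn_mask)
        # cross-attention reuses the same language query to pull in visual features
        vis_out, _ = self.cross_attn(h, v, v)
        gate = self.gate(h)                        # per-token gate in [0, 1]
        return hidden + txt_out + gate * vis_out   # residual fusion

# usage sketch: 32 text tokens attending over 4 images x 49 projected visual tokens
block = HyperAttentionBlock(dim=512)
text_tokens = torch.randn(1, 32, 512)
visual_tokens = torch.randn(1, 4 * 49, 512)
fused = block(text_tokens, visual_tokens)          # shape (1, 32, 512)
```

Because only a handful of such blocks are interleaved into the LLM and the visual tokens never enter the language sequence itself, the context length and compute cost stay low even for long image sequences.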
Training Paradigm
The training approach for mPLUG-Owl3 is divided into three stages:
- Pre-training: Utilizes large-scale image-text pairs for foundational multimodal alignment.
- Multi-Image Training: Incorporates diverse datasets including interleaved image-text formats, text-rich images, and video captions to enhance multi-image understanding.
- Supervised Fine-tuning: Fine-tunes the model on a mix of supervised datasets to optimize performance across tasks involving both single and multiple images (the full recipe is summarized in the sketch below).
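As a reading aid, the three stages can be condensed into a small configuration sketch. The data groupings and goals mirror the description above; the structure and names are illustrative rather than taken from the paper's training code.

```python
# Illustrative summary of the three-stage training recipe described above.
# Stage names and data labels follow the text; everything else is an assumption.
TRAINING_STAGES = [
    {
        "stage": "pre-training",
        "data": ["large-scale image-text pairs"],
        "goal": "foundational multimodal alignment",
    },
    {
        "stage": "multi-image training",
        "data": ["interleaved image-text", "text-rich images", "video captions"],
        "goal": "multi-image understanding",
    },
    {
        "stage": "supervised fine-tuning",
        "data": ["single-image supervised data", "multi-image supervised data"],
        "goal": "task performance across single- and multi-image settings",
    },
]

for cfg in TRAINING_STAGES:
    print(f"{cfg['stage']}: {', '.join(cfg['data'])} -> {cfg['goal']}")
```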
Experimental Results
The experimental evaluation of mPLUG-Owl3 covers visual question answering, general MLLM benchmarks, and multi-image and video understanding tasks. The model consistently outperforms existing 8B-level MLLMs such as LLaVA-1.5, LLaVA-NeXT, and Idefics2:
- Visual Question Answering: Achieves superior accuracy on datasets like VQAv2, OK-VQA, GQA, and VizWizQA.
- General Multi-Modal Tasks: Demonstrates the best performance on tasks in MMBench-EN/CN, MM-Vet, and POPE, showcasing its balanced capabilities in both textual and visual comprehension.
- Multi-Image and Video Understanding: Surpasses existing models on benchmarks like NExT-QA, MVBench, and NLVR2, emphasizing its robustness and efficiency in long visual contexts.
Implications and Future Directions
The results presented in the paper suggest that mPLUG-Owl3 provides a strong foundation for further developments in multi-modal LLMs, particularly in handling long and complex visual sequences. The versatility and efficiency of its architecture indicate a potential shift towards more computationally efficient models that do not compromise on performance.
Future Developments:
- Improvement in Visual Encoders: Fine-tuning the visual encoder could further enhance the model's ability to capture detailed visual information, especially for tasks requiring intricate visual distinctions like TextVQA.
- Extension of Multi-Image Training: Expanding the diversity and scale of multi-image training data could improve the model's generalization across multi-image scenarios.
- Enhanced Cross-Modal Representation: Refining the adaptive gating mechanisms and positional embeddings could lead to even better integration of visual and textual information.
In conclusion, mPLUG-Owl3 sets a new benchmark for multi-modal LLMs by effectively addressing the challenges of long image-sequence understanding. Its innovative architecture and training paradigms promise to accelerate the development of more capable and efficient multi-modal models in the future.