EVLM: An Efficient Vision-LLM for Visual Understanding
The research paper "EVLM: An Efficient Vision-LLM for Visual Understanding" introduces an optimized vision-LLM that aims to advance visual understanding while significantly reducing computational overhead. The model addresses a central challenge in integrating visual and textual information, namely the cost of processing long visual token sequences, by combining hierarchical Vision Transformer (ViT) features, a Mixture of Experts (MoE) mechanism, and a Flamingo-inspired cross-attention strategy.
Overview of the Model Architecture
The core architecture integrates three primary components:
- Visual Encoder: The model employs EVA2-CLIP-E-Plus, with the normalization and head layers removed after the final transformer block. Hierarchical features are sampled uniformly from multiple layers, yielding multi-grained visual representations.
- Gated Cross-Attention Layer: This layer mediates the interaction between textual tokens and visual features while remaining computationally efficient even for long visual token sequences. Learnable tokens, similar in spirit to Q-Former queries, take part in the attention to enrich the feature interaction (see the sketch after this list).
- LLM: The language backbone is Qwen-14B-Chat 1.0, with gated cross-attention layers inserted before each transformer layer of the LLM so that visual cues are aligned with text processing throughout the network.
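To make the interaction concrete, below is a minimal PyTorch sketch of the two architectural ideas described above: uniform sampling of hierarchical ViT features and a Flamingo-style gated cross-attention block with learnable tokens. The dimensions, zero-initialized tanh gates, and helper names (`hierarchical_vit_features`, `GatedCrossAttentionBlock`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


def hierarchical_vit_features(vit_layer_outputs, num_levels=4):
    """Uniformly sample hidden states from a list of per-layer ViT outputs
    and concatenate them, giving a coarse-to-fine visual representation."""
    n = len(vit_layer_outputs)
    idx = torch.linspace(0, n - 1, num_levels).round().long().tolist()
    return torch.cat([vit_layer_outputs[i] for i in idx], dim=1)  # (B, num_levels*V, d)


class GatedCrossAttentionBlock(nn.Module):
    """Flamingo-style gated cross-attention: text hidden states attend to
    visual features; tanh gates initialized at zero let the block start as
    an identity mapping so the pretrained LLM is undisturbed."""

    def __init__(self, d_model, n_heads, n_learnable_tokens=64):
        super().__init__()
        # Learnable tokens appended to the visual features, loosely analogous
        # to Q-Former queries.
        self.learnable_tokens = nn.Parameter(0.02 * torch.randn(n_learnable_tokens, d_model))
        self.norm_q = nn.LayerNorm(d_model)
        self.norm_kv = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        # Gates start at zero, so the block initially passes text through unchanged.
        self.attn_gate = nn.Parameter(torch.zeros(1))
        self.ffn_gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_hidden, visual_feats):
        # text_hidden: (B, T, d_model); visual_feats: (B, V, d_model),
        # assumed already projected into the LLM's hidden size.
        batch = visual_feats.size(0)
        tokens = self.learnable_tokens.unsqueeze(0).expand(batch, -1, -1)
        kv = self.norm_kv(torch.cat([visual_feats, tokens], dim=1))
        attn_out, _ = self.attn(self.norm_q(text_hidden), kv, kv)
        x = text_hidden + torch.tanh(self.attn_gate) * attn_out
        x = x + torch.tanh(self.ffn_gate) * self.ffn(x)
        return x
```

A block like this would sit in front of an LLM transformer layer; because the gates start at zero, the pretrained LLM behaves exactly as before training begins, which is the usual motivation for the Flamingo-style design.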
Training Process
The training procedure is divided into three distinct stages:
- Multi-modal Pre-training: This stage focuses on cross-modal alignment and on building intrinsic relationships within multi-modal data. It leverages 2.5 billion image-text pairs and 50 million web-type multi-modal samples, together with dedicated pre-processing steps, to establish a robust initial model.
- Multi-task Continual Pre-training: Targeted at strengthening high-level visual question-answering capabilities, this stage incorporates data from diverse domains, including VQA, OCR, and NLP, to impart comprehensive knowledge. The image resolution is increased, and only critical layers are unfrozen to keep training efficient.
- Supervised Fine-Tuning: The final stage applies instruction fine-tuning on 2.3 million samples from high-quality datasets, refining the model's instruction-following abilities. In addition, an MoE configuration is applied to the gated cross-attention layers, providing the fine-grained parameter scaling needed for improved performance (see the sketch after this list).
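As a rough illustration of the MoE idea, the sketch below shows a top-k routed mixture-of-experts feed-forward block of the kind that could replace the dense FFN inside a gated cross-attention layer. The expert count, routing scheme, and class name `MoEFeedForward` are assumptions for illustration rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFeedForward(nn.Module):
    """Top-k routed mixture-of-experts FFN: each token is dispatched to its
    k highest-scoring experts, and the expert outputs are mixed with the
    renormalized router probabilities."""

    def __init__(self, d_model, d_hidden, n_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_hidden),
                nn.GELU(),
                nn.Linear(d_hidden, d_model),
            )
            for _ in range(n_experts)
        )

    def forward(self, x):
        # x: (B, T, d_model) -> route each token independently.
        bsz, seq, dim = x.shape
        flat = x.reshape(-1, dim)
        scores = self.router(flat)                            # (B*T, n_experts)
        topk_scores, topk_idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_scores, dim=-1)              # mix weights over the chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.top_k):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(bsz, seq, dim)
```

In a setup like this, each expert is typically initialized from the dense FFN weights so that fine-tuning starts from the pretrained behavior, with the router and additional experts supplying the extra capacity.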
Evaluation and Performance
Empirical evaluation of EVLM across various benchmarks shows notable performance improvements over existing models:
- General VQA Benchmarks: The model demonstrates superior accuracy on tasks such as ScienceQA, outperforming models using higher resolution inputs.
- Text-oriented VQA Benchmarks: EVLM shows a robust understanding of fine-grained text within images, particularly excelling on datasets such as AI2 Diagram (AI2D).
- General Multimodal Benchmarks: Demonstrating its adaptability, EVLM performs strongly on the MME, MMBench, and POPE benchmarks, attesting to the efficiency of its multimodal fusion.
Implications and Future Directions
The implications of this research are manifold. Practically, the model's efficient handling of extensive visual tokens makes it highly suitable for real-world applications requiring robust visual and textual integration, such as automated document analysis, detailed image captioning, and video understanding. Theoretically, the hierarchical feature integration and MoE-based scaling provide insightful pathways for enhancing both the depth and breadth of LLM capabilities in vision-language tasks.
Future research directions could explore the integration of larger, more powerful LLMs, further experimentation with cross-attention mechanisms to handle extremely long video sequences, and augmentation of multimodal datasets to cover broader contextual nuances. The continuous optimization of computational efficiency while retaining rich multimodal feature representation remains a key area for developing more capable and scalable vision-LLMs.