Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution (2409.12191v2)

Published 18 Sep 2024 in cs.CV, cs.AI, and cs.CL

Abstract: We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL .

Qwen2-VL: Advancements in Large Vision-Language Models

The research paper titled "Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution" presents the Qwen2-VL series, an upgrade of the earlier Qwen-VL models. The paper describes several advances in large vision-language models (LVLMs), focusing on dynamic image resolution processing and the integration of multimodal inputs for improved perception and interpretation.

Introduction and Objectives

The Qwen2-VL series aims to address limitations of existing LVLMs, particularly their reliance on predetermined image resolutions and static visual encoders. The paper details two mechanisms: Naive Dynamic Resolution, which lets the model process images of varying resolutions, and Multimodal Rotary Position Embedding (M-RoPE), which encodes position consistently across text, images, and videos.

Innovations in Model Architecture

Naive Dynamic Resolution

The dynamic resolution mechanism allows the Qwen2-VL models to process images of varying resolutions, converting each image into a number of visual tokens that depends on its size. This is crucial for capturing details at multiple scales and brings the model's perception closer to human vision. The method employs a Vision Transformer (ViT) with 2D-RoPE in place of fixed absolute position embeddings, so no fixed input resolution is required, and the resulting visual tokens are compressed before they reach the language model.
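
To make the token arithmetic concrete, here is a minimal Python sketch. It relies on details from the Qwen2-VL paper rather than from this summary: a ViT patch size of 14, a merge of each adjacent 2x2 group of patch tokens into a single token, and two special tokens that mark the start and end of the visual sequence. The function name visual_token_count and the choice to round each side up to a whole number of merged patches are illustrative assumptions, not the released implementation.

```python
import math


def visual_token_count(height, width, patch_size=14, merge=2, special_tokens=2):
    """Estimate how many tokens one image contributes to the LLM input.

    Assumptions (hypothetical sketch): each side is rounded up to a multiple
    of patch_size * merge, and two special tokens wrap the visual sequence.
    """
    unit = patch_size * merge          # pixels covered per merged token, per side
    rows = math.ceil(height / unit)    # merged tokens along the height
    cols = math.ceil(width / unit)     # merged tokens along the width
    return rows * cols + special_tokens


if __name__ == "__main__":
    for h, w in [(224, 224), (448, 448), (224, 448)]:
        print(f"{h}x{w} -> {visual_token_count(h, w)} tokens")
```

Under these assumptions a 224x224 image yields 66 tokens and a 448x448 image yields 258, so the token cost grows with image area instead of being fixed by a predetermined resolution.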

Experimental results demonstrate that dynamic resolution processing not only retains more detailed information from high-resolution images but also optimizes token usage, thereby improving the model's efficiency.

Multimodal Rotary Position Embedding (M-RoPE)

The M-RoPE mechanism decomposes the rotary position embedding into temporal, height, and width components. This lets Qwen2-VL models represent the two spatial dimensions of images and the additional temporal dimension of videos, which a traditional one-dimensional position embedding cannot express directly.
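
The decomposition can be illustrated with a small Python sketch that assigns a (temporal, height, width) position triple to every token in a mixed sequence. It follows the scheme described in the paper: text tokens use the same ID on all three axes, image tokens keep a constant temporal ID while height and width follow the token's grid position, video frames increment the temporal ID, and each new segment starts one past the largest ID used so far. The segment format, the function names, and the choice to offset all three axes of a visual segment by its start index are assumptions made for illustration.

```python
def text_positions(start, length):
    # Text tokens share one index on all three axes (equivalent to 1D RoPE).
    return [(start + i, start + i, start + i) for i in range(length)]


def visual_positions(start, frames, rows, cols):
    # Temporal ID advances per frame; height/width follow the grid position.
    # An image is simply the frames == 1 case, so its temporal ID is constant.
    return [
        (start + t, start + r, start + c)
        for t in range(frames)
        for r in range(rows)
        for c in range(cols)
    ]


def build_position_ids(segments):
    """segments: ("text", length) or ("visual", frames, rows, cols) tuples."""
    ids = []
    next_start = 0
    for seg in segments:
        if seg[0] == "text":
            block = text_positions(next_start, seg[1])
        else:
            _, frames, rows, cols = seg
            block = visual_positions(next_start, frames, rows, cols)
        if block:
            ids.extend(block)
            # The next segment begins one past the largest ID on any axis.
            next_start = max(max(p) for p in block) + 1
    return ids


if __name__ == "__main__":
    for triple in build_position_ids([("text", 2), ("visual", 1, 2, 2), ("text", 1)]):
        print(triple)
```

For two text tokens, a 2x2 image, and one trailing text token, the sketch produces (0,0,0), (1,1,1), (2,2,2), (2,2,3), (2,3,2), (2,3,3), (4,4,4). The four image tokens consume only two position values, so the trailing text resumes at 4 rather than 6. Keeping position IDs small in this way is what the paper credits for better extrapolation to long inputs.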

Evaluations indicate that M-RoPE improves performance on video comprehension tasks and allows the model to extrapolate at inference time to sequences longer than the maximum length seen during training, supporting its effectiveness in capturing spatiotemporal information.

Model Scaling and Comprehensive Evaluation

The Qwen2-VL series includes models at three parameter sizes (2B, 8B, and 72B) to explore the impact of scaling on model performance. Comparative evaluations against state-of-the-art models across multiple benchmarks show that Qwen2-VL models are competitive with, and on several benchmarks surpass, leading models, particularly in document understanding, multilingual OCR, and general visual perception tasks.

Multimodal Benchmarks and Robust Performance

Qwen2-VL-72B demonstrates highly competitive performance on various benchmarks, including DocVQA, InfoVQA, AI2D, ChartQA, and TextVQA, as well as video comprehension benchmarks like MVBench and PerceptionTest. The model's ability to comprehend and reason over extended-duration video content is particularly noteworthy, making it suitable for complex, high-resolution visual tasks.

The paper also highlights the model's enhanced OCR capabilities: it outperforms other LVLMs on multilingual OCR tasks, which underscores its proficiency in recognizing and understanding text across diverse languages and scripts.

Practical Implications and Future Directions

The advanced capabilities of the Qwen2-VL series open several practical applications, ranging from mobile device operations and robotic control to more sophisticated tasks involving multi-step decision-making and agent-based interactions. The model's integration with real-world devices like mobile phones and robots showcases its potential in autonomous operation based on visual and textual inputs.

Scalability and Efficient Training

The research underscores the importance of scalability in LVLMs, demonstrating that both model size and the amount of training data significantly influence performance across various dimensions. The robust infrastructure support from Alibaba Cloud, coupled with optimized storage and parallelism techniques, has been pivotal in efficiently training these large models.

Conclusion

The Qwen2-VL series represents a significant stride in the evolution of large vision-language models. By introducing dynamic resolution processing and M-RoPE, the Qwen2-VL models enhance visual perception and comprehension capabilities across resolutions and modalities. The model's open-source availability fosters further research and development, promoting advancements in AI applications.

The paper's contributions chart a path for future investigations into extending sequence lengths, optimizing dynamic resolution strategies, and further exploring the scaling laws for large multimodal models. As such, the Qwen2-VL series stands as a formidable framework for advancing AI's capability to mimic human-like perceptual and cognitive processes.