An Overview of Qwen-VL: Advancements in Vision-LLMs
The paper presents Qwen-VL, a series of Large Vision-Language Models (LVLMs) that integrate advanced visual perception with language capabilities. Built on the foundational Qwen-LM, these models handle tasks such as image understanding, object localization (visual grounding), and text reading. The authors employ a staged, comprehensive training methodology and demonstrate significant performance improvements across a broad range of benchmarks.
Model Design and Architecture
The architecture of Qwen-VL is built around three core components: an LLM, a Visual Encoder based on the Vision Transformer (ViT), and a Position-aware Vision-Language Adapter. The LLM is initialized with pre-trained weights from Qwen-7B, providing a robust language foundation. The visual encoder converts input images into sequences of patch features, and the adapter compresses these long sequences into a fixed number of visual tokens via cross-attention with learnable queries, injecting positional information so that fine-grained spatial detail is preserved when the tokens are fed into the LLM.
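To make the adapter concrete, here is a minimal PyTorch-style sketch of a cross-attention adapter that compresses variable-length patch features into a fixed set of visual tokens. The class name, dimensions, and the way positional information is injected are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class PositionAwareAdapter(nn.Module):
    """Compress a variable-length sequence of ViT patch features into a fixed
    number of visual tokens using cross-attention with learnable queries.
    Names and dimensions are illustrative, not Qwen-VL's released code."""

    def __init__(self, vis_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.proj = nn.Linear(vis_dim, llm_dim)              # map ViT features to the LLM width
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats, pos_emb):
        # patch_feats: (B, N_patches, vis_dim); pos_emb: (B, N_patches, llm_dim)
        kv = self.proj(patch_feats) + pos_emb                # inject positional information
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        visual_tokens, _ = self.cross_attn(q, kv, kv)        # (B, num_queries, llm_dim)
        return visual_tokens                                 # inserted into the LLM's input sequence
```

The key design choice this illustrates is that the number of visual tokens reaching the LLM is fixed by the learnable queries, regardless of image resolution, which keeps the language model's sequence length bounded.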
Comprehensive Training Methodology
The training of Qwen-VL is organized as a three-stage pipeline (a sketch of which components are trained at each stage follows the list):
- Pre-Training: Uses a vast corpus of weakly labeled image-text pairs for foundational vision-language alignment. The LLM is frozen in this stage, and only the visual encoder and adapter are optimized.
- Multi-Task Pre-Training: Incorporates fine-grained and multilingual datasets across multiple tasks such as VQA, captioning, and visual grounding, with the whole model trained jointly, enhancing the model's versatility.
- Supervised Fine-Tuning: Instruction tuning refines the model's interactive abilities, yielding the Qwen-VL-Chat variant with stronger real-world instruction-following and multi-turn dialogue performance.
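The per-stage freeze/unfreeze schedule the paper describes can be summarized in a short sketch; the attribute names (`visual_encoder`, `adapter`, `llm`) are placeholders rather than the released implementation.

```python
# Sketch of the per-stage freeze/unfreeze schedule described in the paper.
# Module attribute names are hypothetical placeholders.

def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad = flag

def configure_stage(model, stage):
    if stage == "pretrain":        # image-text pairs: train visual encoder + adapter, freeze LLM
        set_trainable(model.visual_encoder, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, False)
    elif stage == "multitask":     # fine-grained multi-task data: train the whole model
        set_trainable(model.visual_encoder, True)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    elif stage == "sft":           # instruction tuning: freeze visual encoder, tune adapter + LLM
        set_trainable(model.visual_encoder, False)
        set_trainable(model.adapter, True)
        set_trainable(model.llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```

The progression moves compute where it matters: cheap, noisy data first aligns the visual side to a frozen LLM, then the full model is trained on richer multi-task data, and finally instruction tuning polishes dialogue behavior without disturbing the visual encoder.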
Evaluation and Results
The Qwen-VL models demonstrate superior performance across a variety of benchmarks:
- Image Captioning and General VQA: Achieves top-tier results, outperforming other generalist models. For instance, on Flickr30K, Qwen-VL scores 85.8 CIDEr, establishing itself as a leader in zero-shot image captioning.
- Text-Oriented VQA: Shows strong capabilities in handling text within images, as evidenced by results on datasets like TextVQA and DocVQA.
- Referring Expression Comprehension: Excels in tasks requiring precise object localization, surpassing existing generalist models in accuracy on datasets like RefCOCO; a sketch of how grounded box outputs can be parsed follows this list.
- Instruction Following: In benchmarks like TouchStone and SEED-Bench, Qwen-VL-Chat demonstrates an ability to handle complex multimodal instructions effectively.
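Grounding results of this kind are expressed directly in the model's text output as coordinates wrapped in special tokens. The convention assumed below, a `<box>(x1,y1),(x2,y2)</box>` span with coordinates normalized to a 0-1000 range and a `<ref>...</ref>` tag marking the referred phrase, is our reading of Qwen-VL's format and should be treated as an assumption; the parsing code is purely illustrative.

```python
import re

# Parse grounded model output of the form
#   "<ref>the dog</ref><box>(120,80),(560,700)</box>"
# into pixel-space bounding boxes. The tag names and the 0-1000
# coordinate normalization are assumptions about the output convention.

BOX_PATTERN = re.compile(r"<box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_boxes(text, image_width, image_height):
    boxes = []
    for x1, y1, x2, y2 in BOX_PATTERN.findall(text):
        # Rescale normalized coordinates to the actual image size.
        boxes.append((
            int(x1) / 1000 * image_width,
            int(y1) / 1000 * image_height,
            int(x2) / 1000 * image_width,
            int(y2) / 1000 * image_height,
        ))
    return boxes

print(parse_boxes("<ref>the dog</ref><box>(120,80),(560,700)</box>", 640, 480))
# -> [(76.8, 38.4, 358.4, 336.0)]
```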
Implications and Future Directions
Qwen-VL's approach highlights several implications for advancing Vision-LLMs:
- Multimodal Integration: Successfully integrating visual perception into LLMs points toward more sophisticated AI systems capable of comprehensive interaction with diverse data forms.
- Efficiency in Training: The staged pipeline, moving from noisy web-scale image-text pairs to curated multi-task and instruction data, illustrates an effective strategy for extracting the most capability from large datasets.
Looking forward, integrating additional modalities such as video and speech, alongside scaling model size and training data, presents clear avenues for further advances. The authors also aim to strengthen the model's multimodal generation abilities, pointing to a broader scope of AI applications.
By releasing these models openly, the authors facilitate future research and development, encouraging the exploration of LVLMs in both academic and practical settings. This work stands as a testament to the continual evolution of AI capabilities, providing a strong foundation for more nuanced and integrated multimodal systems.