The paper introduces Qwen2.5-VL, the latest iteration in the Qwen vision-language series, emphasizing advancements in visual recognition, object localization, document parsing, and long-video comprehension. The model aims to establish a robust foundation for large vision-language models (LVLMs) and to enhance real-world applications.
Key features of Qwen2.5-VL include:
- Object localization using bounding boxes or points.
- Structured data extraction from documents.
- Analysis of charts, diagrams, and layouts.
- Dynamic resolution processing and absolute time encoding for handling variable-size images and extended videos.
The model is available in three sizes: 3B, 7B, and 72B, with the flagship 72B model matching state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, especially in document and diagram understanding. The smaller models (7B and 3B) outperform comparable competitors in resource-constrained environments. Qwen2.5-VL also maintains the linguistic performance of the Qwen2.5 LLM.
Approach
The Qwen2.5-VL architecture consists of three main components:
- LLM: Uses the Qwen2.5 LLM as its foundation, modified with Multimodal Rotary Position Embedding (MRoPE) Aligned to Absolute Time.
- Vision Encoder: Employs a redesigned Vision Transformer (ViT) architecture with 2D-RoPE and window attention for native input resolutions.
- MLP-based Vision-Language Merger: Compresses feature sequences before feeding them into the LLM.
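As a rough illustration of the merger component, the sketch below concatenates groups of adjacent patch features and projects them to the LLM hidden size with a small MLP. The group size, dimensions, and activation are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageMerger(nn.Module):
    """Minimal sketch of an MLP-based merger: groups of adjacent patch
    features are concatenated and projected to the LLM hidden size.
    Hyperparameters (vit_dim, llm_dim, group_size) are illustrative."""

    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584, group_size: int = 4):
        super().__init__()
        self.group_size = group_size
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * group_size, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, vit_dim); num_patches divisible by group_size
        n, d = patch_features.shape
        grouped = patch_features.reshape(n // self.group_size, d * self.group_size)
        return self.mlp(grouped)  # (num_patches / group_size, llm_dim)
```

Compressing the sequence this way shortens the visual context handed to the LLM without discarding local spatial detail.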
Vision Encoder
To address computational load imbalances during training and inference, windowed attention is introduced in most layers of the ViT, ensuring computational cost scales linearly with the number of patches. Only four layers use full self-attention. The architecture uses RMSNorm for normalization and SwiGLU as the activation function.
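The toy sketch below contrasts the two attention patterns: within-window attention has cost linear in the number of patches, while full attention is quadratic. It uses single-head attention over 1D windows for brevity; the actual encoder partitions 2D windows and uses multi-head attention with 2D-RoPE.

```python
import torch

def windowed_self_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    """Toy single-head window attention over a flattened patch sequence.
    x: (num_patches, dim). Attention is restricted to non-overlapping
    windows of `window` patches, so cost grows linearly with num_patches."""
    n, d = x.shape
    assert n % window == 0, "pad the sequence so it divides evenly into windows"
    xw = x.view(n // window, window, d)                      # (num_windows, window, dim)
    attn = torch.softmax(xw @ xw.transpose(1, 2) / d**0.5, dim=-1)
    return (attn @ xw).reshape(n, d)

def full_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Quadratic-cost attention, kept in only a handful of layers."""
    attn = torch.softmax(x @ x.T / x.shape[-1] ** 0.5, dim=-1)
    return attn @ x
```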
Native Dynamic Resolution and Frame Rate
Qwen2.5-VL dynamically converts images of varying sizes into token sequences and uses actual image dimensions to represent spatial features. For video inputs, it incorporates dynamic frame rate (FPS) training and absolute time encoding, aligning MRoPE IDs with timestamps.
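A minimal sketch of the resulting token budget is shown below: the number of visual tokens follows the actual image size rather than a fixed square resize. The patch size and the per-side merge factor are assumed values for illustration only.

```python
def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Rough sketch of native dynamic resolution: the token count scales
    with the true image area. `patch` and `merge` are assumed values."""
    grid_h = round(height / patch)
    grid_w = round(width / patch)
    return (grid_h * grid_w) // (merge * merge)

# A large document page yields far more tokens than a small thumbnail,
# and coordinates can be expressed directly in the original pixel space.
print(visual_token_count(966, 1288), visual_token_count(336, 336))
```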
Multimodal Rotary Position Embedding Aligned to Absolute Time
The MRoPE decomposes position embedding into temporal, height, and width components. In Qwen2.5-VL, the temporal component of MRoPE is aligned with absolute time, enabling the model to learn consistent temporal alignment across videos with different FPS sampling rates.
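The following hedged sketch shows how absolute-time alignment can be expressed: each video patch gets a (temporal, height, width) position ID, and the temporal ID is derived from the frame's real timestamp rather than its frame index, so the same interval in seconds maps to the same ID gap at any sampling FPS. The `ticks_per_second` scaling constant is an assumption, not the paper's value.

```python
def mrope_ids(timestamps_s, grid_h, grid_w, ticks_per_second: float = 2.0):
    """Sketch of absolute-time-aligned MRoPE IDs for sampled video frames.
    Each patch receives a (temporal, height, width) triple; the temporal ID
    comes from the frame's timestamp in seconds, not from the frame index."""
    ids = []
    for t in timestamps_s:                 # one sampled frame per timestamp
        t_id = int(round(t * ticks_per_second))
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((t_id, h, w))
    return ids

# Frames covering the same 2-second clip at 1 FPS or 4 FPS differ in count,
# but their temporal IDs span the same range.
print(mrope_ids([0.0, 1.0, 2.0], grid_h=1, grid_w=2))
```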
Pre-Training
The pre-training dataset was expanded from 1.2 trillion tokens to 4.1 trillion tokens, constructed through cleaning raw web data and synthesizing data. It includes image captions, interleaved image-text data, OCR (Optical Character Recognition) data, visual knowledge, multimodal academic questions, localization data, document parsing data, video descriptions, video localization, and agent-based interaction data.
Training Recipe
A Vision Transformer (ViT) was trained from scratch using DataComp and in-house datasets. The pre-training process is divided into three phases:
- Training only the ViT to align with the LLM.
- Training all model parameters on multimodal image data.
- Incorporating video and agent-based data with increased sequence length.
To balance computational loads, data samples are dynamically packed based on their input sequence lengths to the LLM.
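As a simple illustration of length-based packing, the first-fit sketch below groups samples so that each pack carries a comparable token count. The production pipeline additionally balances loads across devices; this pass only conveys the idea, and `max_len` is an illustrative value.

```python
def pack_samples(sample_lengths, max_len: int = 8192):
    """Minimal sketch of sequence-length-based packing: greedily fill each
    pack until the next sample would exceed `max_len`."""
    packs, current, used = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if used + length > max_len and current:
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs

print(pack_samples([4000, 3000, 2500, 6000, 1500], max_len=8192))
# [[0, 1], [2], [3, 4]]  -> sample indices grouped into similarly sized packs
```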
Post-Training
The post-training alignment framework employs Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). SFT uses the ChatML format to structure instruction-following data, while DPO refines the model based on human preferences.
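For reference, ChatML wraps each conversational turn in role-tagged markers; the sketch below serializes a multimodal turn in that style. How image placeholders are spelled inside the user content is an assumption for illustration.

```python
def to_chatml(messages):
    """Sketch of ChatML-style serialization used to structure SFT data:
    each turn is wrapped in <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    return "\n".join(parts)

example = [
    {"role": "user", "content": "<image>\nWhat is the total on this receipt?"},
    {"role": "assistant", "content": "The total is $23.40."},
]
print(to_chatml(example))
```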
Instruction Data
The SFT phase uses a dataset of approximately 2 million entries, evenly distributed between pure text data and multimodal data. The dataset includes single-turn and multi-turn interactions, contextualized by scenarios ranging from single-image inputs to multi-image sequences.
Data Filtering Pipeline
A two-stage data filtering pipeline is implemented to enhance the quality of the SFT dataset:
- Domain-Specific Categorization: Uses a classification model to categorize question-answer (QA) pairs into eight primary domains and 30 fine-grained subcategories.
- Domain-Tailored Filtering: Integrates rule-based and model-based approaches to eliminate low-quality entries.
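A hedged sketch of how the two stages compose is given below; `classify`, `rule_checks`, and `quality_score` are hypothetical callables standing in for the internal classification and reward-style models described in the paper.

```python
def filter_sft_data(qa_pairs, classify, rule_checks, quality_score, threshold=0.5):
    """Two-stage filtering sketch: (1) assign each QA pair to a domain,
    (2) apply that domain's rule-based checks plus a model-based score."""
    kept = []
    for qa in qa_pairs:
        domain = classify(qa)                          # stage 1: categorization
        checks = rule_checks.get(domain, [])
        if any(not check(qa) for check in checks):     # stage 2a: rule-based
            continue
        if quality_score(qa, domain) < threshold:      # stage 2b: model-based
            continue
        kept.append(qa)
    return kept
```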
Rejection Sampling for Enhanced Reasoning
Rejection sampling refines the dataset to enhance reasoning capabilities, particularly for tasks requiring complex inference. An intermediate version of the Qwen2.5-VL model evaluates the generated responses against the ground truth, retaining only samples where the model's output matches the expected answers.
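The sketch below captures this loop under stated assumptions: `generate` and `matches_ground_truth` are hypothetical stand-ins for the intermediate checkpoint's decoding call and the answer-checking logic, and `n_samples` is illustrative.

```python
def rejection_sample(examples, generate, matches_ground_truth, n_samples: int = 8):
    """Sketch of rejection sampling for reasoning data: sample candidate
    responses and keep only those whose answer matches the ground truth."""
    kept = []
    for ex in examples:
        for _ in range(n_samples):
            response = generate(ex["prompt"])
            if matches_ground_truth(response, ex["answer"]):
                kept.append({"prompt": ex["prompt"], "response": response})
                break   # one verified response per example is enough
    return kept
```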
Training Recipe
The post-training process consists of SFT and DPO phases, with the ViT parameters frozen.
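For the DPO phase, the standard DPO objective on sequence log-probabilities is sketched below; this is the generic formulation rather than the paper's exact training setup, and `beta` is an illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one, measured relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```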
Experiments
The performance of Qwen2.5-VL is evaluated across a variety of datasets and compared with state-of-the-art models such as Claude-3.5-Sonnet-0620, GPT-4o-0513, InternVL2.5, and different sizes of Qwen2-VL.
Performance on Pure Text Tasks
Qwen2.5-VL exhibits leading performance on pure text tasks, including general tasks, mathematics and science tasks, coding tasks, and alignment tasks.
General Visual Question Answering
Qwen2.5-VL demonstrates state-of-the-art performance in various VQA tasks, subjective evaluations, multilingual scenarios, and multi-image questions. The smaller-scale versions of Qwen2.5-VL (7B and 3B) also exhibit highly competitive performance.
Document Understanding and OCR
Qwen2.5-VL models achieve impressive performance on OCR (Optical Character Recognition), chart, and document understanding benchmarks. For OCR-related parsing benchmarks, the Qwen2.5-VL-72B model sets a new state-of-the-art.
Spatial Understanding
Qwen2.5-VL achieves leading performance across different benchmarks from box-grounding and point-grounding to counting. The model demonstrates an ability to understand, locate, and reason about specific image details and shows progress in counting ability, achieving a leading accuracy on CountBench.
Video Understanding and Grounding
Qwen2.5-VL achieves remarkable results on LVBench and MLVU, which evaluate long-form video understanding capabilities. On the Charades-STA dataset, Qwen2.5-VL-72B achieves an mIoU score surpassing the performance of GPT-4o.
Agent
The performance of Qwen2.5-VL-72B demonstrates exceptional advancements across GUI grounding benchmarks. The results show that Qwen2.5-VL-72B can outperform the baselines on AndroidWorld and MobileMiniWob++ and achieve comparable performance on OSWorld in online evaluation without auxiliary marks.