The paper introduces Qwen2.5-VL, the latest iteration in the Qwen vision-language series, emphasizing advancements in visual recognition, object localization, document parsing, and long-video comprehension. The model aims to establish a robust foundation for large vision-language models (LVLMs) and to enhance real-world applications.
Key features of Qwen2.5-VL include:
- Object localization using bounding boxes or points.
- Structured data extraction from documents.
- Analysis of charts, diagrams, and layouts.
- Dynamic resolution processing and absolute time encoding for handling variable-size images and extended videos.
The model is available in three sizes: 3B, 7B, and 72B, with the flagship 72B model matching state-of-the-art models like GPT-4o and Claude 3.5 Sonnet, especially in document and diagram understanding. The smaller models (7B and 3B) outperform comparable competitors in resource-constrained environments. Qwen2.5-VL also maintains the linguistic performance of the Qwen2.5 LLM.
Approach
The Qwen2.5-VL architecture consists of three main components:
- LLM: Uses the Qwen2.5 LLM as its foundation, modified with Multimodal Rotary Position Embedding (MRoPE) Aligned to Absolute Time.
- Vision Encoder: Employs a redesigned Vision Transformer (ViT) architecture with 2D-RoPE and window attention for native input resolutions.
- MLP-based Vision-Language Merger: Compresses feature sequences before feeding them into the LLM.
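As a rough illustration of the merger component, the sketch below concatenates groups of adjacent patch features and projects them to the LLM hidden size with a small MLP. The group size, dimensions, and activation are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn

class VisionLanguageMerger(nn.Module):
    """Minimal sketch of an MLP-based merger: groups of adjacent patch
    features are concatenated and projected to the LLM hidden size.
    Hyperparameters (vit_dim, llm_dim, group_size) are illustrative."""

    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584, group_size: int = 4):
        super().__init__()
        self.group_size = group_size
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * group_size, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (num_patches, vit_dim); num_patches divisible by group_size
        n, d = patch_features.shape
        grouped = patch_features.reshape(n // self.group_size, d * self.group_size)
        return self.mlp(grouped)  # (num_patches / group_size, llm_dim)
```

Compressing the sequence this way shortens the visual context handed to the LLM without discarding local spatial detail.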
Vision Encoder
To address computational load imbalances during training and inference, windowed attention is introduced in most layers of the ViT, ensuring computational cost scales linearly with the number of patches. Only four layers use full self-attention. The architecture uses RMSNorm for normalization and SwiGLU as the activation function.
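The toy sketch below contrasts the two attention patterns: within-window attention has cost linear in the number of patches, while full attention is quadratic. It uses single-head attention over 1D windows for brevity; the actual encoder partitions 2D windows and uses multi-head attention with 2D-RoPE.

```python
import torch

def windowed_self_attention(x: torch.Tensor, window: int) -> torch.Tensor:
    """Toy single-head window attention over a flattened patch sequence.
    x: (num_patches, dim). Attention is restricted to non-overlapping
    windows of `window` patches, so cost grows linearly with num_patches."""
    n, d = x.shape
    assert n % window == 0, "pad the sequence so it divides evenly into windows"
    xw = x.view(n // window, window, d)                      # (num_windows, window, dim)
    attn = torch.softmax(xw @ xw.transpose(1, 2) / d**0.5, dim=-1)
    return (attn @ xw).reshape(n, d)

def full_self_attention(x: torch.Tensor) -> torch.Tensor:
    """Quadratic-cost attention, kept in only a handful of layers."""
    attn = torch.softmax(x @ x.T / x.shape[-1] ** 0.5, dim=-1)
    return attn @ x
```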
Native Dynamic Resolution and Frame Rate
Qwen2.5-VL dynamically converts images of varying sizes into token sequences and uses actual image dimensions to represent spatial features. For video inputs, it incorporates dynamic frame rate (FPS) training and absolute time encoding, aligning MRoPE IDs with timestamps.
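A minimal sketch of the resulting token budget is shown below: the number of visual tokens follows the actual image size rather than a fixed square resize. The patch size and the per-side merge factor are assumed values for illustration only.

```python
def visual_token_count(height: int, width: int,
                       patch: int = 14, merge: int = 2) -> int:
    """Rough sketch of native dynamic resolution: the token count scales
    with the true image area. `patch` and `merge` are assumed values."""
    grid_h = round(height / patch)
    grid_w = round(width / patch)
    return (grid_h * grid_w) // (merge * merge)

# A large document page yields far more tokens than a small thumbnail,
# and coordinates can be expressed directly in the original pixel space.
print(visual_token_count(966, 1288), visual_token_count(336, 336))
```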
Multimodal Rotary Position Embedding Aligned to Absolute Time
The MRoPE decomposes position embedding into temporal, height, and width components. In Qwen2.5-VL, the temporal component of MRoPE is aligned with absolute time, enabling the model to learn consistent temporal alignment across videos with different FPS sampling rates.
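The following hedged sketch shows how absolute-time alignment can be expressed: each video patch gets a (temporal, height, width) position ID, and the temporal ID is derived from the frame's real timestamp rather than its frame index, so the same interval in seconds maps to the same ID gap at any sampling FPS. The `ticks_per_second` scaling constant is an assumption, not the paper's value.

```python
def mrope_ids(timestamps_s, grid_h, grid_w, ticks_per_second: float = 2.0):
    """Sketch of absolute-time-aligned MRoPE IDs for sampled video frames.
    Each patch receives a (temporal, height, width) triple; the temporal ID
    comes from the frame's timestamp in seconds, not from the frame index."""
    ids = []
    for t in timestamps_s:                 # one sampled frame per timestamp
        t_id = int(round(t * ticks_per_second))
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((t_id, h, w))
    return ids

# Frames covering the same 2-second clip at 1 FPS or 4 FPS differ in count,
# but their temporal IDs span the same range.
print(mrope_ids([0.0, 1.0, 2.0], grid_h=1, grid_w=2))
```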
Pre-Training
The pre-training dataset was expanded from 1.2 trillion tokens to 4.1 trillion tokens, constructed through cleaning raw web data and synthesizing data. It includes image captions, interleaved image-text data, OCR (Optical Character Recognition) data, visual knowledge, multimodal academic questions, localization data, document parsing data, video descriptions, video localization, and agent-based interaction data.
Training Recipe
A Vision Transformer (ViT) was trained from scratch using DataComp and in-house datasets. The pre-training process is divided into three phases:
- Training only the ViT to align with the LLM.
- Training all model parameters on multimodal image data.
- Incorporating video and agent-based data with increased sequence length.
To balance computational loads, data samples are dynamically packed based on their input sequence lengths to the LLM.
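As a simple illustration of length-based packing, the first-fit sketch below groups samples so that each pack carries a comparable token count. The production pipeline additionally balances loads across devices; this pass only conveys the idea, and `max_len` is an illustrative value.

```python
def pack_samples(sample_lengths, max_len: int = 8192):
    """Minimal sketch of sequence-length-based packing: greedily fill each
    pack until the next sample would exceed `max_len`."""
    packs, current, used = [], [], 0
    for idx, length in enumerate(sample_lengths):
        if used + length > max_len and current:
            packs.append(current)
            current, used = [], 0
        current.append(idx)
        used += length
    if current:
        packs.append(current)
    return packs

print(pack_samples([4000, 3000, 2500, 6000, 1500], max_len=8192))
# [[0, 1], [2], [3, 4]]  -> sample indices grouped into similarly sized packs
```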
Post-Training
The post-training alignment framework employs Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). SFT uses the ChatML format to structure instruction-following data, while DPO refines the model based on human preferences.
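For reference, ChatML wraps each conversational turn in role-tagged markers; the sketch below serializes a multimodal turn in that style. How image placeholders are spelled inside the user content is an assumption for illustration.

```python
def to_chatml(messages):
    """Sketch of ChatML-style serialization used to structure SFT data:
    each turn is wrapped in <|im_start|>role ... <|im_end|> markers."""
    parts = []
    for msg in messages:
        parts.append(f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>")
    return "\n".join(parts)

example = [
    {"role": "user", "content": "<image>\nWhat is the total on this receipt?"},
    {"role": "assistant", "content": "The total is $23.40."},
]
print(to_chatml(example))
```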
Instruction Data
The SFT phase uses a dataset of approximately 2 million entries, evenly distributed between pure text data and multimodal data. The dataset includes single-turn and multi-turn interactions, contextualized by scenarios ranging from single-image inputs to multi-image sequences.
Data Filtering Pipeline
A two-stage data filtering pipeline is implemented to enhance the quality of the SFT dataset:
- Domain-Specific Categorization: Uses a classification model to categorize question-answer (QA) pairs into eight primary domains and 30 fine-grained subcategories.
- Domain-Tailored Filtering: Integrates rule-based and model-based approaches to eliminate low-quality entries.
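A hedged sketch of how the two stages compose is given below; `classify`, `rule_checks`, and `quality_score` are hypothetical callables standing in for the internal classification and reward-style models described in the paper.

```python
def filter_sft_data(qa_pairs, classify, rule_checks, quality_score, threshold=0.5):
    """Two-stage filtering sketch: (1) assign each QA pair to a domain,
    (2) apply that domain's rule-based checks plus a model-based score."""
    kept = []
    for qa in qa_pairs:
        domain = classify(qa)                          # stage 1: categorization
        checks = rule_checks.get(domain, [])
        if any(not check(qa) for check in checks):     # stage 2a: rule-based
            continue
        if quality_score(qa, domain) < threshold:      # stage 2b: model-based
            continue
        kept.append(qa)
    return kept
```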
Rejection Sampling for Enhanced Reasoning
Rejection sampling refines the dataset to enhance reasoning capabilities, particularly for tasks requiring complex inference. An intermediate version of the Qwen2.5-VL model evaluates the generated responses against the ground truth, retaining only samples where the model's output matches the expected answers.
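The sketch below captures this loop under stated assumptions: `generate` and `matches_ground_truth` are hypothetical stand-ins for the intermediate checkpoint's decoding call and the answer-checking logic, and `n_samples` is illustrative.

```python
def rejection_sample(examples, generate, matches_ground_truth, n_samples: int = 8):
    """Sketch of rejection sampling for reasoning data: sample candidate
    responses and keep only those whose answer matches the ground truth."""
    kept = []
    for ex in examples:
        for _ in range(n_samples):
            response = generate(ex["prompt"])
            if matches_ground_truth(response, ex["answer"]):
                kept.append({"prompt": ex["prompt"], "response": response})
                break   # one verified response per example is enough
    return kept
```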
Training Recipe
The post-training process consists of SFT and DPO phases, with the ViT parameters frozen.
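For the DPO phase, the standard DPO objective on sequence log-probabilities is sketched below; this is the generic formulation rather than the paper's exact training setup, and `beta` is an illustrative value.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """Standard DPO loss: push the policy to prefer the chosen response over
    the rejected one, measured relative to a frozen reference model."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()
```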
Experiments
The performance of Qwen2.5-VL is evaluated across a variety of datasets and compared with state-of-the-art models such as Claude-3.5-Sonnet-0620, GPT-4o-0513, InternVL2.5, and different sizes of Qwen2-VL.
Performance on Pure Text Tasks
Qwen2.5-VL exhibits leading performance on pure text tasks, including general tasks, mathematics and science tasks, coding tasks, and alignment tasks.
General Visual Question Answering
Qwen2.5-VL demonstrates state-of-the-art performance in various VQA tasks, subjective evaluations, multilingual scenarios, and multi-image questions. The smaller-scale versions of Qwen2.5-VL (7B and 3B) also exhibit highly competitive performance.
Document Understanding and OCR
Qwen2.5-VL models achieve impressive performance on OCR (Optical Character Recognition), chart, and document understanding benchmarks. For OCR-related parsing benchmarks, the Qwen2.5-VL-72B model sets a new state-of-the-art.
Spatial Understanding
Qwen2.5-VL achieves leading performance across different benchmarks from box-grounding and point-grounding to counting. The model demonstrates an ability to understand, locate, and reason about specific image details and shows progress in counting ability, achieving a leading accuracy on CountBench.
Video Understanding and Grounding
Qwen2.5-VL achieves remarkable results on LVBench and MLVU, which evaluate long-form video understanding capabilities. On the Charades-STA dataset, Qwen2.5-VL-72B achieves an mIoU score surpassing the performance of GPT-4o.
Agent
The performance of Qwen2.5-VL-72B demonstrates exceptional advancements across GUI grounding benchmarks. The results show that Qwen2.5-VL-72B can outperform the baselines on AndroidWorld and MobileMiniWob++ and achieve comparable performance on OSWorld in online evaluation without auxiliary marks.