Qwen2.5-VL: Advanced Vision-Language Model
- Qwen2.5-VL is a multimodal model that integrates a dynamic-resolution Vision Transformer with efficient window attention and MRoPE to support arbitrary spatial and temporal scales.
- It addresses tasks from visual question answering to long-video comprehension using innovative pretraining and reinforcement fine-tuning methods.
- Empirical results demonstrate state-of-the-art performance, rivaling proprietary systems like GPT-4o across diverse visual reasoning benchmarks.
Qwen2.5-VL is a family of large-scale vision–language models designed for high-fidelity multimodal understanding across tasks ranging from visual question answering (VQA), mathematical reasoning, and document parsing to long-video comprehension and agentic interaction. The core architectural innovation combines a dynamic-resolution Vision Transformer (ViT) backbone with efficient window attention and multimodal rotary positional embeddings (MRoPE), enabling native support for arbitrary spatial and temporal scales. Qwen2.5-VL models are released in 3B, 7B, and 72B parameter configurations, with the 72B variant achieving state-of-the-art results across a spectrum of visual reasoning benchmarks, directly rivaling proprietary systems such as GPT-4o and Claude 3.5 Sonnet. The technical advancements, training methodologies, and empirical behaviors of Qwen2.5-VL have been substantiated through technical and applied research (Bai et al., 19 Feb 2025; Kyem et al., 13 Oct 2025; Wang et al., 10 Apr 2025).
1. Model Family: Architecture and Multimodal Processing
The Qwen2.5-VL architecture centers around a custom ViT backbone configured for dynamic resolution:
- Vision Transformer (ViT): D=1280 hidden size, L=32 layers, H=16 heads, patch size p=14 px. Window attention is applied to 28 of the 32 layers, restricting full self-attention to four key layers and reducing the attention cost from $O(N^2)$ to approximately $O(N \cdot W)$ FLOPs in the number of patches $N$ (with $W$ = window length).
- Vision–Language Merger: Aggregates each 2×2 patch block to form super-tokens, then maps features via a two-layer MLP to match the input dimensionality of the downstream LLM. Output channel sizes scale with model size: 2048 (3B), 3584 (7B), 8192 (72B).
- LLM: Initialized from Qwen2.5, supporting sequence lengths up to 32,768, vocabulary size 151,646, with hidden sizes, layer counts, and intermediate widths scaling by model size:
| Model | Hidden Size | Layers | Intermediate Size | KV Heads | Head Size |
|---|---|---|---|---|---|
| 3B | 2048 | 36 | 4864 | 2 | 128 |
| 7B | 3584 | 28 | 18944 | 4 | 128 |
| 72B | 8192 | 80 | 29568 | 8 | 128 |
- Self-Attention & Positional Embeddings: 2D rotary embeddings encode spatial patch positions; MRoPE augments this with absolute time encoding for videos by decomposing position into time ($t$), height ($h$), and width ($w$) components, with the rotary embedding applied per coordinate.
This design supports full-scale event localization in videos and pixel-level grounding in images without reliance on fixed input sizes or normalization.
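To make the merger step concrete, the following PyTorch sketch groups each 2×2 block of ViT patch features into a super-token and projects it to the LLM width; the class name, MLP width, and grouping order are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Sketch of the vision-language merger: group each 2x2 block of ViT patch
    features into one super-token, then map it to the LLM hidden size with a
    two-layer MLP. Names and ordering are assumptions of this illustration."""
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584):  # 3584 = 7B LLM width
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * 4, vit_dim * 4),
            nn.GELU(),
            nn.Linear(vit_dim * 4, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # patch_feats: (grid_h * grid_w, vit_dim) features from the ViT
        x = patch_feats.view(grid_h // 2, 2, grid_w // 2, 2, -1)
        x = x.permute(0, 2, 1, 3, 4).reshape((grid_h // 2) * (grid_w // 2), -1)
        return self.mlp(x)  # (num_super_tokens, llm_dim)

merger = PatchMerger()
feats = torch.randn(32 * 32, 1280)   # e.g. a 448x448 image -> 32x32 patches of 14 px
tokens = merger(feats, 32, 32)
print(tokens.shape)                  # torch.Size([256, 3584])
```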
2. Dynamic Resolution and Temporal Event Encoding
Qwen2.5-VL’s dynamic-resolution ViT processes arbitrary-resolution inputs by resizing image dimensions to the nearest multiples of 28 and segmenting them into 14×14 patches. No coordinate normalization is performed, so all spatial grounding tasks are executed with reference to true pixel coordinates in the raw image. For a resized input of height $H$ and width $W$, the ViT sees $(H/14)\times(W/14)$ patches, and the 2×2 merger yields an effective visual token sequence length of approximately $(H/28)\times(W/28)$.
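The resizing and token-count arithmetic can be sketched as follows; the rounding convention (nearest multiple of 28) is an assumption of this illustration.

```python
def qwen_vl_token_count(height: int, width: int, patch: int = 14, merge: int = 2):
    """Round each image side to a multiple of 28 (= patch * merge), then count
    the 14x14 ViT patches and the 2x2-merged visual tokens fed to the LLM."""
    unit = patch * merge                           # 28 px
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    vit_patches = (h // patch) * (w // patch)      # tokens seen by the ViT
    llm_tokens = (h // unit) * (w // unit)         # tokens after 2x2 merging
    return h, w, vit_patches, llm_tokens

print(qwen_vl_token_count(1080, 1920))  # -> (1092, 1932, 10764, 2691)
```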
For videos, absolute time encoding is applied: frame indices are mapped to second-based timestamps $t_i$, with MRoPE embedding each token as a triad $(t, h, w)$ for improved perception of temporal event relationships. This enables Qwen2.5-VL to ground actions or objects at specific times regardless of input FPS.
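A minimal sketch of the absolute-time position triads, assuming timestamps are derived as frame_index / fps (the exact mapping used by the released model is not reproduced here):

```python
def mrope_position_ids(num_frames: int, fps: float, grid_h: int, grid_w: int):
    """Build (t, h, w) position triads for video tokens: the temporal index is an
    absolute timestamp in seconds, so spacing reflects real elapsed time
    regardless of the sampling rate. Illustrative sketch only."""
    positions = []
    for frame_idx in range(num_frames):
        t = frame_idx / fps                 # absolute time in seconds (assumed mapping)
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, h, w))
    return positions

# Two clips sampled at different FPS cover the same ~2 s of video:
ids_2fps = mrope_position_ids(num_frames=4, fps=2.0, grid_h=2, grid_w=2)
ids_4fps = mrope_position_ids(num_frames=8, fps=4.0, grid_h=2, grid_w=2)
print(ids_2fps[-1][0], ids_4fps[-1][0])     # 1.5 1.75 -- comparable temporal spans
```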
3. Pretraining, Fine-Tuning, and Data Efficiency
The Qwen2.5-VL models follow a three-stage pretraining sequence:
- Visual Pretraining (ViT only): 1.5T tokens covering image captioning, knowledge recognition, and OCR.
- Multimodal Pretraining: 2T tokens with pure text, interleaved image/text, VQA, video grounding, agent data.
- Long-Context Pretraining: 0.6T tokens, including long videos, transcripts, and documents; sequence lengths up to 32,768.
Total pretraining corpus approaches 4.1T tokens, encompassing over 10,000 object categories and multilingual OCR, synthetic and real forms, charts, diagrams, and UI agent scenarios.
Instruction-tuned models (“-Instruct”) undergo supervised fine-tuning on up to 1M multimodal instruction–response examples, optimizing next-token LM loss for prompt adherence.
The “SoTA with Less” protocol (Wang et al., 10 Apr 2025) introduces a reinforcement fine-tuning (RFT) regime guided by MCTS-based sample difficulty selection. Reasoning complexity is quantified by the number of MCTS iterations the base VLM requires to solve each sample, and only the hardest cases (high iteration counts or “unsolved” samples) are retained; Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-72B-Instruct are then efficiently improved using only 11k and 7.5k challenging instances, respectively. The RFT objective is the standard Group Relative Policy Optimization (GRPO) objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\epsilon,\,1+\epsilon\right)A_i\right)-\beta\,D_{\mathrm{KL}}\!\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)\right],$$

with group size $G$, clipping parameter $\epsilon$, and KL weight $\beta$ as fixed hyperparameters, and group-normalized advantages $A_i = \bigl(r_i - \mathrm{mean}(r_{1:G})\bigr)/\mathrm{std}(r_{1:G})$.
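The group-relative advantage and the difficulty-based sample filter can be sketched compactly; reward definitions, thresholds, and the sampling loop are omitted, and the names below are illustrative rather than the authors' implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each sampled group of G responses so that the
    advantage of response i is (r_i - mean) / std, as in the GRPO objective above."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def select_hard_samples(mcts_iterations, threshold):
    """Difficulty-aware selection sketch: keep samples the base VLM needed at least
    `threshold` MCTS iterations to solve, or never solved at all (None). The
    threshold value is a hyperparameter of the original protocol, not given here."""
    return [i for i, n in enumerate(mcts_iterations) if n is None or n >= threshold]

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]])  # one group, G = 8
print(grpo_advantages(rewards))
print(select_hard_samples([1, 7, None, 2, 12], threshold=5))        # -> [1, 2, 4]
```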
4. Benchmarking and Empirical Performance
Qwen2.5-VL achieves strong baselines prior to RFT:
| Model | MathVista | MathVision | MathVerse | MMMU | MMStar | MMBench | MM-Vet | AI2D | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 7B-Instruct | 67.8% | 23.6% | 44.5% | 50.6% | 61.7% | 80.7% | 66.0% | 82.6% | 59.69 |
| 72B-Instruct | 74.8% | 39.8% | 57.6% | 70.2% | 70.8% | 88.6% | 76.2% | 88.5% | 70.81 |
After MCTS-guided RFT (“ThinkLite-VL”):
- ThinkLite-VL-7B: MathVista 75.1% (prev. 67.8%), average +7% across eight tasks, surpassing all open 7B models and outperforming larger models like Qwen2.5-VL-72B-Instruct and GPT-4o (MathVista: 63.8%).
- ThinkLite-VL-72B: MathVista 79.7%, avg. +5.8% improvement, establishing new open-source SOTA.
On MMBench-EN and related benchmarks covering counting, chart QA, document parsing, and open-vocabulary detection, Qwen2.5-VL-72B matches or exceeds Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. Notable results include 93.6% (CountBench), 89.5% (ChartQA), and 92.7% (RefCOCO_val). For agentic UI tasks, success rates of 68% (MobileMiniWob++) and 35% (AndroidWorld) indicate robust integration of perception and action.
5. Specialized Applications and Agentic Control
Qwen2.5-VL supports structured extraction from complex documents, including invoices, forms, HTML tables, and chemical formulas (output in the QwenVL-HTML format), achieving 96.4% DocVQA accuracy and low normalized edit distances (0.226 EN / 0.324 ZH) on OmniDocBench. For video analysis, the system leverages frame-level absolute time encoding for sub-second event localization, demonstrated by a LongVideoBench_val score of 60.7% and a temporal mIoU of 50.9 on Charades-STA.
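A typical inference call for document parsing follows the published model-card usage pattern, assuming a transformers release with Qwen2.5-VL support and the qwen-vl-utils helper package; the prompt wording and file path are placeholders.

```python
# Requires: transformers >= 4.49 (Qwen2.5-VL support), qwen-vl-utils, torch.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder image path and prompt; adjust to the document being parsed.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/invoice.png"},
        {"type": "text", "text": "Extract this invoice as an HTML table with all fields."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```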
In VQA for traffic safety (Kyem et al., 13 Oct 2025), the model's architecture exploits frame-wise dynamic resolution and absolute time encodings. Specialized fine-tuning via Low-Rank Adaptation (LoRA) isolates VQA optimization, mitigating “task interference” from joint captioning. Applied to the WTS dataset (AI City 2025), Qwen2.5-VL achieves 60.80% VQA accuracy, outperforming VideoLLaMA3 (58.61%) and joint-training baselines by +8.6%.
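A hedged sketch of attaching LoRA adapters for VQA-only fine-tuning with the Hugging Face peft library is shown below; the rank, scaling, and target modules are illustrative assumptions, not the configuration reported by Kyem et al.

```python
# Hypothetical LoRA setup via Hugging Face peft; r, alpha, dropout, and
# target_modules are illustrative assumptions, not the cited study's settings.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                                                     # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the LoRA adapters are updated
# Fine-tune on VQA-formatted samples alone, keeping captioning out of the mix
# to avoid the task interference discussed above.
```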
6. Scaling, Efficiency, and Deployment Scenarios
Window attention in the ViT backbone considerably lowers the computational burden: most spatial layers (28 of 32) limit attention to local windows at $O(N \cdot W)$ cost, while only a minority (4 of 32) retain global $O(N^2)$ context. The dynamic-resolution design removes the necessity for costly resampling or canonical normalization, permitting efficient and accurate tokenization of large, high-resolution images and videos.
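A back-of-the-envelope comparison of attention cost under the 28-window / 4-global layer split; the per-window patch count is a free parameter of this sketch, not a reported value.

```python
def attention_flops(n_patches: int, dim: int, window: int,
                    n_window_layers: int = 28, n_global_layers: int = 4):
    """Rough attention-score cost: O(N^2 * D) per global layer vs. O(N * W * D)
    per windowed layer (W = patches per window). Constants and projection costs
    are ignored; this is an order-of-magnitude sketch only."""
    mixed = n_global_layers * n_patches ** 2 * dim + n_window_layers * n_patches * window * dim
    full = (n_window_layers + n_global_layers) * n_patches ** 2 * dim
    return mixed, full

mixed, full = attention_flops(n_patches=10_000, dim=1280, window=64)
print(f"mixed/full cost ratio: {mixed / full:.2%}")   # roughly 13% of full attention
```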
Model sizes address deployment scenarios from edge AI (3B; competitive with prior open 3B models) and mid-range (7B; SoTA in visual reasoning at its size) to flagship (72B; matching or exceeding closed-source SOTA systems). All variants retain robust language competencies from the Qwen2.5 LLM initialization.
A plausible implication is that Qwen2.5-VL’s scalable architecture and temporal-spatial positional encoding mechanisms offer a blueprint for next-generation multimodal models supporting fine-grained perception, reasoning, and agentic manipulation, particularly where input dimensionality and content diversity challenge fixed-size architectures.
7. Summary of Capabilities and Future Considerations
Qwen2.5-VL integrates dynamic-resolution vision transformers, efficient window attention, and multimodal rotary positional embeddings to natively process arbitrary-size images and long video sequences. Substantial empirical improvements are achieved through MCTS-guided, difficulty-aware reinforcement fine-tuning, enabling data-efficient SOTA performance with minimal sample counts and no external knowledge distillation. Supported applications span VQA, mathematical reasoning, document parsing, chart/diagram analysis, object localization, agentic UI control, and long-duration event grounding. The model’s architecture and open-source performance figures position it as a reference point for future research on multimodal reasoning and data-efficient self-improvement methodologies (Bai et al., 19 Feb 2025, Wang et al., 10 Apr 2025, Kyem et al., 13 Oct 2025).