Qwen2.5-VL: Advanced Vision-Language Model
- Qwen2.5-VL is a multimodal model that integrates a dynamic-resolution Vision Transformer with efficient window attention and MRoPE to support arbitrary spatial and temporal scales.
- It addresses tasks from visual question answering to long-video comprehension using innovative pretraining and reinforcement fine-tuning methods.
- Empirical results demonstrate state-of-the-art performance, rivaling proprietary systems like GPT-4o across diverse visual reasoning benchmarks.
Qwen2.5-VL is a family of large-scale vision–language models designed for high-fidelity multimodal understanding across tasks ranging from visual question answering (VQA), mathematical reasoning, and document parsing to long-video comprehension and agentic interaction. The core architectural innovation combines a dynamic-resolution Vision Transformer (ViT) backbone with efficient window attention and multimodal rotary positional embeddings (MRoPE), enabling native support for arbitrary spatial and temporal scales. Qwen2.5-VL models are released in 3B, 7B, and 72B parameter configurations, with the 72B variant achieving state-of-the-art results across a spectrum of visual reasoning benchmarks, directly rivaling proprietary systems such as GPT-4o and Claude 3.5 Sonnet. The technical advancements, training methodologies, and empirical behaviors of Qwen2.5-VL have been substantiated through technical and applied research (Bai et al., 19 Feb 2025; Kyem et al., 13 Oct 2025; Wang et al., 10 Apr 2025).
1. Model Family: Architecture and Multimodal Processing
The Qwen2.5-VL architecture centers around a custom ViT backbone configured for dynamic resolution:
- Vision Transformer (ViT): D=1280 hidden size, L=32 layers, H=16 heads, patch size p=14 px. Window attention is applied to 28 of the 32 layers, restricting full self-attention to four key layers and reducing the attention cost from $O(N^2)$ to approximately $O(N \cdot W)$ FLOPs in the number of patches $N$ (with $W$ = window length).
- Vision–Language Merger: Aggregates each 2×2 patch block to form super-tokens, then maps features via a two-layer MLP to match the input dimensionality of the downstream LLM. Output channel sizes scale with model size: 2048 (3B), 3584 (7B), 8192 (72B).
- LLM: Initialized from Qwen2.5, supporting sequence lengths up to 32,768, vocabulary size 151,646, with hidden sizes, layer counts, and intermediate widths scaling by model size:
| Model | Hidden Size | Layers | Intermediate Size | KV Heads | Head Size |
|---|---|---|---|---|---|
| 3B | 2048 | 36 | 4864 | 2 | 128 |
| 7B | 3584 | 28 | 18944 | 4 | 128 |
| 72B | 8192 | 80 | 29568 | 8 | 128 |
- Self-Attention & Positional Embeddings: 2D rotary embeddings encode spatial patch positions; MRoPE augments this with absolute time encoding for videos by decomposing position into time ($t$), height ($h$), and width ($w$) components, with the rotary embedding applied per coordinate.
This design supports full-scale event localization in videos and pixel-level grounding in images without reliance on fixed input sizes or normalization.
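To make the merger step concrete, the following PyTorch sketch groups each 2×2 block of ViT patch features into a super-token and projects it to the LLM width; the class name, MLP width, and grouping order are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Sketch of the vision-language merger: group each 2x2 block of ViT patch
    features into one super-token, then map it to the LLM hidden size with a
    two-layer MLP. Names and ordering are assumptions of this illustration."""
    def __init__(self, vit_dim: int = 1280, llm_dim: int = 3584):  # 3584 = 7B LLM width
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * 4, vit_dim * 4),
            nn.GELU(),
            nn.Linear(vit_dim * 4, llm_dim),
        )

    def forward(self, patch_feats: torch.Tensor, grid_h: int, grid_w: int) -> torch.Tensor:
        # patch_feats: (grid_h * grid_w, vit_dim) features from the ViT
        x = patch_feats.view(grid_h // 2, 2, grid_w // 2, 2, -1)
        x = x.permute(0, 2, 1, 3, 4).reshape((grid_h // 2) * (grid_w // 2), -1)
        return self.mlp(x)  # (num_super_tokens, llm_dim)

merger = PatchMerger()
feats = torch.randn(32 * 32, 1280)   # e.g. a 448x448 image -> 32x32 patches of 14 px
tokens = merger(feats, 32, 32)
print(tokens.shape)                  # torch.Size([256, 3584])
```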
2. Dynamic Resolution and Temporal Event Encoding
Qwen2.5-VL’s dynamic-resolution ViT processes arbitrary-resolution inputs by resizing image dimensions to the nearest multiples of 28 and segmenting them into 14×14 patches. No coordinate normalization is performed, so all spatial grounding tasks are executed with reference to true pixel coordinates in the raw image. For a resized input of height $H$ and width $W$, the ViT sees $(H/14)\times(W/14)$ patches, and the 2×2 merger yields an effective visual token sequence length of approximately $(H/28)\times(W/28)$.
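The resizing and token-count arithmetic can be sketched as follows; the rounding convention (nearest multiple of 28) is an assumption of this illustration.

```python
def qwen_vl_token_count(height: int, width: int, patch: int = 14, merge: int = 2):
    """Round each image side to a multiple of 28 (= patch * merge), then count
    the 14x14 ViT patches and the 2x2-merged visual tokens fed to the LLM."""
    unit = patch * merge                           # 28 px
    h = max(unit, round(height / unit) * unit)
    w = max(unit, round(width / unit) * unit)
    vit_patches = (h // patch) * (w // patch)      # tokens seen by the ViT
    llm_tokens = (h // unit) * (w // unit)         # tokens after 2x2 merging
    return h, w, vit_patches, llm_tokens

print(qwen_vl_token_count(1080, 1920))  # -> (1092, 1932, 10764, 2691)
```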
For videos, absolute time encoding is applied: frame indices are mapped to second-based timestamps $t_i$, with MRoPE embedding each token as a triad $(t, h, w)$ for improved perception of temporal event relationships. This enables Qwen2.5-VL to ground actions or objects at specific times regardless of input FPS.
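A minimal sketch of the absolute-time position triads, assuming timestamps are derived as frame_index / fps (the exact mapping used by the released model is not reproduced here):

```python
def mrope_position_ids(num_frames: int, fps: float, grid_h: int, grid_w: int):
    """Build (t, h, w) position triads for video tokens: the temporal index is an
    absolute timestamp in seconds, so spacing reflects real elapsed time
    regardless of the sampling rate. Illustrative sketch only."""
    positions = []
    for frame_idx in range(num_frames):
        t = frame_idx / fps                 # absolute time in seconds (assumed mapping)
        for h in range(grid_h):
            for w in range(grid_w):
                positions.append((t, h, w))
    return positions

# Two clips sampled at different FPS cover the same ~2 s of video:
ids_2fps = mrope_position_ids(num_frames=4, fps=2.0, grid_h=2, grid_w=2)
ids_4fps = mrope_position_ids(num_frames=8, fps=4.0, grid_h=2, grid_w=2)
print(ids_2fps[-1][0], ids_4fps[-1][0])     # 1.5 1.75 -- comparable temporal spans
```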
3. Pretraining, Fine-Tuning, and Data Efficiency
The Qwen2.5-VL models follow a three-stage pretraining sequence:
- Visual Pretraining (ViT only): 1.5T tokens covering image captioning, knowledge recognition, and OCR.
- Multimodal Pretraining: 2T tokens with pure text, interleaved image/text, VQA, video grounding, agent data.
- Long-Context Pretraining: 0.6T tokens, including long videos, transcripts, and documents; sequence lengths up to 32,768.
Total pretraining corpus approaches 4.1T tokens, encompassing over 10,000 object categories and multilingual OCR, synthetic and real forms, charts, diagrams, and UI agent scenarios.
Instruction-tuned models (“-Instruct”) undergo supervised fine-tuning on up to 1M multimodal instruction–response examples, optimizing next-token LM loss for prompt adherence.
The “SoTA with Less” protocol (Wang et al., 10 Apr 2025) introduces a reinforcement fine-tuning (RFT) regime guided by MCTS-based sample difficulty selection. Reasoning complexity is quantified by the number of MCTS iterations the base VLM requires to solve each sample, and only the hardest cases (high iteration counts or “unsolved” samples) are retained; Qwen2.5-VL-7B-Instruct and Qwen2.5-VL-72B-Instruct are then efficiently improved using only 11k and 7.5k challenging instances, respectively. The RFT objective is the standard Group Relative Policy Optimization (GRPO) objective:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta)=\mathbb{E}\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)}A_i,\ \mathrm{clip}\!\left(\frac{\pi_\theta(o_i\mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i\mid q)},\,1-\epsilon,\,1+\epsilon\right)A_i\right)-\beta\,D_{\mathrm{KL}}\!\left(\pi_\theta\,\|\,\pi_{\mathrm{ref}}\right)\right],$$

with group size $G$, clipping parameter $\epsilon$, and KL weight $\beta$ as fixed hyperparameters, and group-normalized advantages $A_i = \bigl(r_i - \mathrm{mean}(r_{1:G})\bigr)/\mathrm{std}(r_{1:G})$.
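The group-relative advantage and the difficulty-based sample filter can be sketched compactly; reward definitions, thresholds, and the sampling loop are omitted, and the names below are illustrative rather than the authors' implementation.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize rewards within each sampled group of G responses so that the
    advantage of response i is (r_i - mean) / std, as in the GRPO objective above."""
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def select_hard_samples(mcts_iterations, threshold):
    """Difficulty-aware selection sketch: keep samples the base VLM needed at least
    `threshold` MCTS iterations to solve, or never solved at all (None). The
    threshold value is a hyperparameter of the original protocol, not given here."""
    return [i for i, n in enumerate(mcts_iterations) if n is None or n >= threshold]

rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 1.0]])  # one group, G = 8
print(grpo_advantages(rewards))
print(select_hard_samples([1, 7, None, 2, 12], threshold=5))        # -> [1, 2, 4]
```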
4. Benchmarking and Empirical Performance
Qwen2.5-VL achieves strong baselines prior to RFT:
| Model | MathVista | MathVision | MathVerse | MMMU | MMStar | MMBench | MM-Vet | AI2D | Avg |
|---|---|---|---|---|---|---|---|---|---|
| 7B-Instruct | 67.8% | 23.6% | 44.5% | 50.6% | 61.7% | 80.7% | 66.0% | 82.6% | 59.69 |
| 72B-Instruct | 74.8% | 39.8% | 57.6% | 70.2% | 70.8% | 88.6% | 76.2% | 88.5% | 70.81 |
After MCTS-guided RFT (“ThinkLite-VL”):
- ThinkLite-VL-7B: MathVista 75.1% (prev. 67.8%), average +7% across eight tasks, surpassing all open 7B models and outperforming larger models like Qwen2.5-VL-72B-Instruct and GPT-4o (MathVista: 63.8%).
- ThinkLite-VL-72B: MathVista 79.7%, avg. +5.8% improvement, establishing new open-source SOTA.
On MMBench-EN and related benchmarks covering counting, chart QA, document parsing, and open-vocabulary detection, Qwen2.5-VL-72B matches or exceeds Claude 3.5 Sonnet, Gemini 1.5 Pro, and GPT-4o. Notable results include 93.6% (CountBench), 89.5% (ChartQA), and 92.7% (RefCOCO_val). For agentic UI tasks, success rates of 68% (MobileMiniWob++) and 35% (AndroidWorld) indicate robust integration of perception and action.
5. Specialized Applications and Agentic Control
Qwen2.5-VL supports structured extraction from complex documents, including invoices, forms, HTML tables, and chemical formulas (output in the QwenVL-HTML format), achieving 96.4% DocVQA accuracy and low normalized edit distances (0.226 EN / 0.324 ZH) on OmniDocBench. For video analysis, the system leverages frame-level absolute time encoding for sub-second event localization, demonstrated by a LongVideoBench_val score of 60.7% and a temporal mIoU of 50.9 on Charades-STA.
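A typical inference call for document parsing follows the published model-card usage pattern, assuming a transformers release with Qwen2.5-VL support and the qwen-vl-utils helper package; the prompt wording and file path are placeholders.

```python
# Requires: transformers >= 4.49 (Qwen2.5-VL support), qwen-vl-utils, torch.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder image path and prompt; adjust to the document being parsed.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "file:///path/to/invoice.png"},
        {"type": "text", "text": "Extract this invoice as an HTML table with all fields."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=1024)
trimmed = [o[len(i):] for i, o in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```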
In VQA for traffic safety (Kyem et al., 13 Oct 2025), the model's architecture exploits frame-wise dynamic resolution and absolute time encodings. Specialized fine-tuning via Low-Rank Adaptation (LoRA) isolates VQA optimization, mitigating “task interference” from joint captioning. Applied to the WTS dataset (AI City 2025), Qwen2.5-VL achieves 60.80% VQA accuracy, outperforming VideoLLaMA3 (58.61%) and joint-training baselines by +8.6%.
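A hedged sketch of attaching LoRA adapters for VQA-only fine-tuning with the Hugging Face peft library is shown below; the rank, scaling, and target modules are illustrative assumptions, not the configuration reported by Kyem et al.

```python
# Hypothetical LoRA setup via Hugging Face peft; r, alpha, dropout, and
# target_modules are illustrative assumptions, not the cited study's settings.
import torch
from transformers import Qwen2_5_VLForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)

lora_cfg = LoraConfig(
    r=16,                                                     # low-rank dimension (assumed)
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # LLM attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()   # only the LoRA adapters are updated
# Fine-tune on VQA-formatted samples alone, keeping captioning out of the mix
# to avoid the task interference discussed above.
```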
6. Scaling, Efficiency, and Deployment Scenarios
Window attention in the ViT backbone considerably lowers the computational burden: most spatial layers (28 of 32) limit attention to local windows at $O(N \cdot W)$ cost, while only a minority (4 of 32) retain global $O(N^2)$ context. The dynamic-resolution design removes the necessity for costly resampling or canonical normalization, permitting efficient and accurate tokenization of large, high-resolution images and videos.
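A back-of-the-envelope comparison of attention cost under the 28-window / 4-global layer split; the per-window patch count is a free parameter of this sketch, not a reported value.

```python
def attention_flops(n_patches: int, dim: int, window: int,
                    n_window_layers: int = 28, n_global_layers: int = 4):
    """Rough attention-score cost: O(N^2 * D) per global layer vs. O(N * W * D)
    per windowed layer (W = patches per window). Constants and projection costs
    are ignored; this is an order-of-magnitude sketch only."""
    mixed = n_global_layers * n_patches ** 2 * dim + n_window_layers * n_patches * window * dim
    full = (n_window_layers + n_global_layers) * n_patches ** 2 * dim
    return mixed, full

mixed, full = attention_flops(n_patches=10_000, dim=1280, window=64)
print(f"mixed/full cost ratio: {mixed / full:.2%}")   # roughly 13% of full attention
```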
Model sizes address deployment scenarios from edge AI (3B; competitive with prior open 3B models) and mid-range (7B; SoTA in visual reasoning at its size) to flagship (72B; matching or exceeding closed-source SOTA systems). All variants retain robust language competencies from the Qwen2.5 LLM initialization.
A plausible implication is that Qwen2.5-VL’s scalable architecture and temporal-spatial positional encoding mechanisms offer a blueprint for next-generation multimodal models supporting fine-grained perception, reasoning, and agentic manipulation, particularly where input dimensionality and content diversity challenge fixed-size architectures.
7. Summary of Capabilities and Future Considerations
Qwen2.5-VL integrates dynamic-resolution vision transformers, efficient window attention, and multimodal rotary positional embeddings to natively process arbitrary-size images and long video sequences. Substantial empirical improvements are achieved through MCTS-guided, difficulty-aware reinforcement fine-tuning, enabling data-efficient SOTA performance with minimal sample counts and no external knowledge distillation. Supported applications span VQA, mathematical reasoning, document parsing, chart/diagram analysis, object localization, agentic UI control, and long-duration event grounding. The model’s architecture and open-source performance figures position it as a reference point for future research on multimodal reasoning and data-efficient self-improvement methodologies (Bai et al., 19 Feb 2025, Wang et al., 10 Apr 2025, Kyem et al., 13 Oct 2025).