Qwen2.5-VL Multimodal Encoder

Updated 27 February 2026

The paper demonstrates that Qwen2.5-VL achieves state-of-the-art performance in multimodal tasks through a Vision Transformer backbone combined with dynamic resolution and window attention.
Its architecture efficiently fuses visual and textual data using cross-modal compression, enhancing object localization, document parsing, and long-video comprehension.
The model scales efficiently with tailored training stages and data augmentation, delivering robust benchmark performance on tasks like VQA, OCR parsing, and long-video analysis.

Qwen2.5-VL is the flagship multimodal encoder of the Qwen vision-LLM series, providing foundational capabilities in image and video understanding, object localization, document parsing, and structured data extraction. Its architecture, training regimen, and functional range are designed for advanced cross-modal reasoning tasks while maintaining computational efficiency and scalability. The following sections detail the key aspects of the Qwen2.5-VL Multimodal Encoder, citing its technical foundations and performance characteristics (Bai et al., 19 Feb 2025).

1. Vision Transformer Backbone with Dynamic Resolution and Window Attention

Qwen2.5-VL employs a Vision Transformer (ViT) trained from scratch to operate directly on arbitrary image resolutions, avoiding resizing and cropping. For a given input of height $H$ and width $W$ , the image is partitioned into non-overlapping square patches of size $S=14$ pixels. The effective patch grid is

$H' = \lceil H/S \rceil, \quad W' = \lceil W/S \rceil$

resulting in $H'\times W'$ tokens, which are projected and processed by a 32-layer Transformer.
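
The patch-grid formula can be checked with a few lines of Python. This is a minimal sketch: the function names `patch_grid` and `num_tokens` are illustrative and are not taken from the Qwen2.5-VL codebase.

```python
PATCH_SIZE = 14  # S in the text


def patch_grid(height, width, patch=PATCH_SIZE):
    """Return (H', W') = (ceil(H / S), ceil(W / S)) using integer arithmetic."""
    if height <= 0 or width <= 0:
        raise ValueError("image dimensions must be positive")
    # -(-a // b) is integer ceiling division and avoids float rounding.
    return -(-height // patch), -(-width // patch)


def num_tokens(height, width, patch=PATCH_SIZE):
    """Number of patch tokens H' * W' produced for an H x W image."""
    rows, cols = patch_grid(height, width, patch)
    return rows * cols
```

For a $224\times224$ input this gives a $16\times16$ grid, i.e. 256 tokens. Non-square inputs keep their aspect ratio in the grid shape.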

To mitigate the quadratic cost of global self-attention over $N=H'\cdot W'$ tokens, Window Attention is applied in 28 of the 32 layers. Each window covers $w\times w$ patches ( $w=8$ , i.e. $112\times112$ pixels), and self-attention is restricted to tokens in the same window. The remaining four layers ( $\ell \in \{7,15,23,31\}$ ) use global attention to capture non-local dependencies. Both variants use standard scaled dot-product attention and differ only in which keys each query can see:

$\text{Attn}(Q,K,V) = \mathrm{softmax} \left( \frac{QK^\top}{\sqrt{d_k}} \right) V$

This scheme reduces the per-layer computational cost from $\mathcal{O}((H'W')^2)$ to $\mathcal{O}(w^2H'W')$ , preserving native image resolution and aspect ratio throughout both training and inference (Bai et al., 19 Feb 2025).
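
The cost reduction can be illustrated by counting query-key pairs per layer. The sketch below assigns each patch to a window and sums the squared window sizes. The helper names are hypothetical, and the handling of partial windows at the image border is an assumption that the text does not specify.

```python
def window_id(row, col, grid_w, w=8):
    """Index of the w x w window containing patch (row, col)."""
    windows_per_row = -(-grid_w // w)  # ceiling division
    return (row // w) * windows_per_row + (col // w)


def attention_pairs(grid_h, grid_w, w=8):
    """Return (global_pairs, windowed_pairs) for a single attention layer."""
    n = grid_h * grid_w
    sizes = {}
    for r in range(grid_h):
        for c in range(grid_w):
            wid = window_id(r, c, grid_w, w)
            sizes[wid] = sizes.get(wid, 0) + 1
    # Inside a window every token attends to every token of that window.
    windowed = sum(s * s for s in sizes.values())
    return n * n, windowed
```

When the grid is an exact multiple of $w$, the windowed count equals $w^2 H'W'$. With ragged borders it is smaller, so $w^2 H'W'$ acts as an upper bound.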

2. Temporal Encoding and Long-Video Processing

For long-video comprehension, Qwen2.5-VL generalizes rotary position embedding (RoPE) from 2D to 3D (temporal $\times$ spatial), aligning the temporal component to absolute time in seconds. Each sampled frame is assigned a timestamp $\tau$ equal to its absolute time $t$ in seconds, and the temporal position embedding is computed as follows, where $i$ is the frequency index and $d$ is the embedding dimension:

$\omega_i = 1/10000^{2i/d},\quad \mathrm{PE}_t(t)[2i] = \sin(\omega_i t),\quad \mathrm{PE}_t(t)[2i+1] = \cos(\omega_i t)$

The embedding at $(\tau, x, y)$ is

$e(\tau,x,y) = (R_{\mathrm{2D}}(x, y) \oplus R_t(\tau)) \odot h$

where $R_{\mathrm{2D}}$ encodes the spatial coordinates, $R_t$ encodes wall-clock time, $\oplus$ denotes concatenation, and $\odot$ denotes applying the resulting rotation to the token feature $h$ (Bai et al., 19 Feb 2025). This enables second-level event localization in long videos.
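
The temporal term can be sketched directly from the sinusoidal formula above. This is an illustration, not the model's implementation. The helper `frame_timestamps` and the assumption that frames are sampled uniformly at a fixed rate are introduced here for the example only.

```python
import math


def temporal_pe(t_seconds, dim):
    """Sinusoidal embedding of an absolute timestamp, following the formula above."""
    if dim % 2 != 0:
        raise ValueError("dim must be even")
    pe = []
    for k in range(dim // 2):
        omega = 1.0 / (10000 ** (2 * k / dim))
        pe.append(math.sin(omega * t_seconds))  # even slot
        pe.append(math.cos(omega * t_seconds))  # odd slot
    return pe


def frame_timestamps(num_frames, fps):
    """Assumed uniform sampling: frame n is taken at n / fps seconds."""
    return [n / fps for n in range(num_frames)]
```

Because the embedding depends on seconds rather than on the frame index, a frame sampled at 2.0 s receives the same temporal code whether the video was sampled at 1 FPS or 2 FPS.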

3. Multimodal Fusion and Cross-Modal Compression

Patch token sequences from the vision encoder are compressed into higher-level representations before fusion with language. Four spatially adjacent visual tokens are grouped:

$g_j = \mathrm{concat}(v_{4j}, v_{4j+1}, v_{4j+2}, v_{4j+3})$

Each $g_j$ is projected by a two-layer MLP with SwiGLU activation:

$h_j = W_2 \cdot \sigma(W_1 g_j + b_1) + b_2$

where $\sigma$ is the activation function, shown here in simplified elementwise form, and the output dimension matches the LLM embedding dimension.
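
The grouping and projection steps can be sketched in plain Python with lists standing in for tensors. The function names are illustrative. The SiLU default for $\sigma$ is an assumption, since the formula above does not fix the activation, and tokens are assumed to arrive already ordered so that each run of four is spatially adjacent.

```python
import math


def group_tokens(tokens, group=4):
    """Concatenate each run of `group` consecutive token vectors into one vector."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be a multiple of the group size")
    return [
        [x for tok in tokens[j:j + group] for x in tok]
        for j in range(0, len(tokens), group)
    ]


def silu(x):
    """Assumed placeholder activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def linear(weights, bias, vec):
    """Dense layer: weights is a list of rows, one row per output unit."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weights, bias)]


def merge_mlp(g, w1, b1, w2, b2, act=silu):
    """h = W2 * act(W1 g + b1) + b2 for one grouped vector g."""
    hidden = [act(x) for x in linear(w1, b1, g)]
    return linear(w2, b2, hidden)
```

Grouping reduces the visual sequence length by a factor of four before the tokens reach the LLM, and the second linear layer sets the output width.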

These compressed visual features $\{h_j\}$ are interleaved with textual tokens and passed through the LLM's self-attention mechanism, realizing cross-modal interaction without separate cross-attention modules or any fusion layers beyond the MLP merger described above. In extended applications such as Qwen-Image, this interleaving is implemented via simple token sequence concatenation, with scalable 2D RoPE (“MSRoPE”) for unified positional alignment across modalities (Wu et al., 4 Aug 2025).

4. Object Localization, Structured Extraction, and Grounding

Qwen2.5-VL natively supports object localization using bounding boxes and point grounding. The final cross-modal feature for a grounding query token $q$ is used to predict normalized coordinates:

$\mathbf{b} = [x_1, y_1, x_2, y_2] = W_{\mathrm{bbox}} q + b_{\mathrm{bbox}}$

$\mathbf{p} = [x, y] = W_{\mathrm{point}} q + b_{\mathrm{point}}$

Training employs a composite loss:

$\mathcal{L}_{\mathrm{bbox}} = \lambda_1 \| \mathbf{b} - \mathbf{b}^* \|_1 + \lambda_2 (1 - \mathrm{IoU}(\mathbf{b},\mathbf{b}^*)),\quad \mathcal{L}_{\text{point}} = \| \mathbf{p} - \mathbf{p}^* \|_2^2$
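
The composite loss can be written out for a single prediction as follows. This is a sketch under stated assumptions: boxes are `[x1, y1, x2, y2]` with `x1 <= x2` and `y1 <= y2`, and the default weights `lam1 = lam2 = 1.0` are placeholders because the text gives no values for $\lambda_1$ and $\lambda_2$.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def bbox_loss(pred, target, lam1=1.0, lam2=1.0):
    """lam1 * L1 distance + lam2 * (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return lam1 * l1 + lam2 * (1.0 - iou(pred, target))


def point_loss(pred, target):
    """Squared Euclidean distance between predicted and target points."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))
```

The L1 term gives a useful gradient even when the boxes do not overlap, while the IoU term is scale-invariant. Combining the two is a common design in box regression.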

For structured extraction, pre-training utilizes “QwenVL HTML” format documents embedding layout boxes and nested elements in HTML with bounding box attributes. The model is trained to faithfully output this structure, enabling unified parsing of text, tables, charts, and diagrams (Bai et al., 19 Feb 2025).
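
To make the idea of HTML with bounding-box attributes concrete, the sketch below parses a small hand-written sample using Python's standard `html.parser`. The attribute name `data-bbox`, the space-separated integer coordinate format, and the sample markup are assumptions made for illustration. They are not a specification of the actual QwenVL HTML format.

```python
from html.parser import HTMLParser


class BBoxCollector(HTMLParser):
    """Collect (tag, (x1, y1, x2, y2)) for every element carrying a box attribute."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "data-bbox" and value:  # assumed attribute name
                self.elements.append((tag, tuple(int(v) for v in value.split())))


def extract_boxes(html):
    collector = BBoxCollector()
    collector.feed(html)
    collector.close()
    return collector.elements


# Hypothetical sample document, not taken from the training data.
SAMPLE = (
    "<html><body>"
    '<p data-bbox="10 20 300 60">Quarterly report</p>'
    '<table data-bbox="10 80 300 240"><tr><td>Q1</td><td>42</td></tr></table>'
    "</body></html>"
)
```

Calling `extract_boxes(SAMPLE)` yields one entry for the paragraph and one for the table, each paired with its layout box.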

5. Training Pipeline, Objectives, and Data

Pre-training proceeds in three sequential stages:

Stage 1: 1.5T tokens, sequence length 8192, ViT-only CLIP-style pre-training (image captioning, visual knowledge, OCR), with contrastive loss (InfoNCE).
Stage 2: 2T tokens, sequence length 8192, end-to-end ViT+LLM (image-text interleaving, VQA, multimodal math, agent tasks, pure text), with cross-entropy for generative tasks and regression losses for grounding.
Stage 3: 0.6T tokens, sequence length 32768, long-document and long-video pre-training using dynamic FPS sampling and extended contexts.
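
The three stages can be summarized as a small configuration structure, which makes the token budget and context lengths easy to compare. The class and field names below are illustrative only. The numbers are the ones listed above, with token counts expressed in billions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    tokens_billions: int
    seq_len: int


STAGES = (
    Stage("stage1_vit_pretraining", 1500, 8192),
    Stage("stage2_multimodal_pretraining", 2000, 8192),
    Stage("stage3_long_context_pretraining", 600, 32768),
)


def total_tokens_billions(stages=STAGES):
    """Total pre-training budget across all stages, in billions of tokens."""
    return sum(stage.tokens_billions for stage in stages)
```

Summing the stages gives 4,100 billion tokens (4.1T) in total. Stage 3 uses a context four times longer than the first two stages.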

Data augmentation includes random cropping at native resolution, copy-paste object augmentation, synthetic table/chart generation, and dynamic sampling rates for video (Bai et al., 19 Feb 2025). Optimizers and data choices are standard for large-scale vision-language systems.

6. Computational Efficiency and Scaling

The use of dynamic native resolution and Window Attention reduces the vision encoder's attention FLOPs by approximately fourfold in windowed layers for typical images, and lowers peak memory correspondingly. For example, a $224\times224$ image yields $16\times16 = 256$ tokens, so the per-layer attention cost drops from $256^2 d = 65\,536\,d$ (global) to $256\cdot 64\,d = 16\,384\,d$ (windowed), 25% of the original. Dynamic sequence packing across modalities maintains high hardware utilization (Bai et al., 19 Feb 2025). In video tasks, query-based token selection methods such as QTSplus achieve up to $89\%$ reduction in visual token count and $28\%$ lower latency with minimal accuracy degradation (Li et al., 14 Nov 2025).
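
The $224\times224$ arithmetic can be extended to the whole encoder by mixing 28 windowed layers with 4 global ones. The sketch below counts only query-key pairs, ignoring MLP and projection FLOPs. It assumes the grid is a multiple of the window size and that the layer indices quoted earlier are 0-based. The names are illustrative.

```python
NUM_LAYERS = 32
GLOBAL_LAYERS = frozenset({7, 15, 23, 31})  # layers that keep global attention
WINDOW = 8   # window side length, in patches
PATCH = 14   # patch side length, in pixels


def pair_counts(height, width):
    """Per-layer (global, windowed) query-key pair counts for an H x W image."""
    rows = -(-height // PATCH)
    cols = -(-width // PATCH)
    n = rows * cols
    # Assumes rows and cols are multiples of WINDOW; otherwise this overestimates.
    return n * n, n * WINDOW * WINDOW


def encoder_attention_ratio(height, width):
    """Attention pairs of the mixed schedule relative to an all-global encoder."""
    g, w = pair_counts(height, width)
    mixed = sum(g if layer in GLOBAL_LAYERS else w for layer in range(NUM_LAYERS))
    return mixed / (NUM_LAYERS * g)
```

Under these assumptions the per-layer ratio for a windowed layer is 25%, while the ratio over all 32 layers is $11/32 \approx 34\%$, because the four global layers still pay the full quadratic cost. The gap between windowed and global layers widens as resolution grows.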

7. Benchmark Performance and Applications

Qwen2.5-VL-72B matches or outperforms GPT-4o and Claude 3.5 Sonnet-0620 on major benchmarks:

MMBench-EN (General VQA): 88.6%
CC-OCR (OCR Parsing): 79.8
OmniDocBench (Document Parsing): Edit Distance 0.226 (EN), 0.324 (ZH); lower is better
ChartQA (Chart Understanding): 89.5
LVBench (Long-Video Comprehension): 47.3

The smaller 7B and 3B variants similarly exceed comparably-sized open-source models, especially on fine-grained perception, temporal grounding, and structured extraction (Bai et al., 19 Feb 2025). Integration into image generation systems (e.g., Qwen-Image) leverages both semantic and reconstructive encoding, supporting advanced editing and compositional generation (Wu et al., 4 Aug 2025).

Task	GPT-4o	Claude 3.5 Sonnet	Qwen2.5-VL-72B
MMBench-EN (General VQA)	54.2%	52.1%	88.6%
CC-OCR (OCR Parsing)	66.9	62.5	79.8
OmniDocBench (Edit Distance EN/ZH, lower is better)	0.265/0.435	0.330/0.381	0.226/0.324
ChartQA (Chart Understanding)	86.7	81.2	89.5
LVBench (Long-Video Comprehension)	30.8	33.1	47.3

Qwen2.5-VL’s architecture, multimodal fusion, and scaling properties enable deployment across real-world scenarios including interactive agents, document intelligence, and vision-language generation platforms.

References

Bai et al. (19 Feb 2025). Qwen2.5-VL Technical Report.

Wu et al. (4 Aug 2025). Qwen-Image Technical Report.

Li et al. (14 Nov 2025). Seeing the Forest and the Trees: Query-Aware Tokenizer for Long-Video Multimodal Language Models.
