
Qwen2.5-VL Multimodal Encoder

Updated 27 February 2026
  • The paper demonstrates that Qwen2.5-VL achieves state-of-the-art performance in multimodal tasks through a Vision Transformer backbone combined with dynamic resolution and window attention.
  • Its architecture efficiently fuses visual and textual data using cross-modal compression, enhancing object localization, document parsing, and long-video comprehension.
  • The model scales efficiently with tailored training stages and data augmentation, delivering robust benchmark performance on tasks like VQA, OCR parsing, and long-video analysis.

Qwen2.5-VL is the flagship multimodal encoder of the Qwen vision-LLM series, providing foundational capabilities in image and video understanding, object localization, document parsing, and structured data extraction. Its architecture, training regimen, and functional range are designed for advanced cross-modal reasoning tasks while maintaining computational efficiency and scalability. The following sections detail the key aspects of the Qwen2.5-VL Multimodal Encoder, citing its technical foundations and performance characteristics (Bai et al., 19 Feb 2025).

1. Vision Transformer Backbone with Dynamic Resolution and Window Attention

Qwen2.5-VL employs a Vision Transformer (ViT) trained from scratch to operate directly on arbitrary image resolutions, avoiding resizing and cropping. For a given input of height $H$ and width $W$, the image is partitioned into non-overlapping square patches of size $S = 14$ pixels. The effective patch grid is

$$H' = \lceil H/S \rceil, \quad W' = \lceil W/S \rceil$$

resulting in $H' \times W'$ tokens, which are projected and processed by a 32-layer Transformer.
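
As a quick check of the grid formula, the following minimal sketch computes the patch grid and token count for a given resolution. The function name `patch_grid` is hypothetical and not taken from any Qwen2.5-VL codebase; only the patch size of 14 pixels and the ceiling formula come from the text.

```python
def patch_grid(height: int, width: int, patch: int = 14) -> tuple[int, int, int]:
    """Return (rows, cols, tokens) of the patch grid for an image in pixels."""
    if height <= 0 or width <= 0 or patch <= 0:
        raise ValueError("height, width and patch must be positive")
    rows = (height + patch - 1) // patch  # integer form of ceil(height / patch)
    cols = (width + patch - 1) // patch   # integer form of ceil(width / patch)
    return rows, cols, rows * cols


# A 224x224 image gives a 16x16 grid; a 1000x750 image rounds up on both axes.
small = patch_grid(224, 224)
large = patch_grid(1000, 750)
```

Because the grid follows the input size, the token count (and therefore attention cost) varies per image rather than being fixed by a resize step.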

To mitigate the quadratic cost of global self-attention over $N = H' \cdot W'$ tokens, a Window Attention mechanism is applied in 28 of the 32 layers. Each window covers $w \times w$ patches ($w = 8$, i.e. $112 \times 112$ pixels), and standard self-attention is computed within each window. The remaining four layers ($\ell \in \{7, 15, 23, 31\}$) use global attention to capture non-local dependencies. In both cases the attention operator is

$$\text{Attn}(Q,K,V) = \mathrm{softmax}\left( \frac{QK^\top}{\sqrt{d_k}} \right) V$$

This scheme reduces the per-layer computational cost from $\mathcal{O}((H'W')^2)$ to $\mathcal{O}(w^2 H'W')$, preserving native image resolution and aspect ratio throughout both training and inference (Bai et al., 19 Feb 2025).
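
The layer schedule and its effect on cost can be sketched as follows. This is an illustration under stated assumptions, not the model's implementation: the layer indices are treated as zero-based (the text does not say), partial windows at the grid edge are simply smaller windows, and the helper names are hypothetical.

```python
GLOBAL_LAYERS = {7, 15, 23, 31}  # global-attention layers quoted in the text
NUM_LAYERS = 32
WINDOW = 8  # window side length in patches


def window_id(row: int, col: int, window: int = WINDOW) -> tuple[int, int]:
    """Window that the patch at grid position (row, col) falls into."""
    return row // window, col // window


def attended_pairs(rows: int, cols: int, layer: int, window: int = WINDOW) -> int:
    """Number of query-key pairs scored by one layer on a rows x cols grid."""
    if layer in GLOBAL_LAYERS:
        return (rows * cols) ** 2  # every token attends to every token
    sizes: dict[tuple[int, int], int] = {}
    for r in range(rows):
        for c in range(cols):
            key = window_id(r, c, window)
            sizes[key] = sizes.get(key, 0) + 1
    # Inside a window of n tokens, attention scores n * n pairs.
    return sum(n * n for n in sizes.values())


def stack_pairs(rows: int, cols: int) -> int:
    """Total query-key pairs over all layers of the encoder."""
    return sum(attended_pairs(rows, cols, layer) for layer in range(NUM_LAYERS))
```

Under these assumptions a windowed layer on a 16 by 16 grid scores a quarter of the pairs of a global layer, while the saving over the whole stack is smaller because four layers remain global.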

2. Temporal Encoding and Long-Video Processing

For long video comprehension, Qwen2.5-VL generalizes rotary position embedding (RoPE) from 2D to 3D (temporal $\times$ spatial), aligning the temporal component to absolute time in seconds. Each video frame $i$ is assigned a timestamp $\tau_i = t_i$, where $t_i$ is the frame's time in seconds, and the temporal position embedding for frequency index $k$ is computed by:

$$\omega_k = 1/10000^{2k/d},\quad \mathrm{PE}_t(t)[2k] = \sin(\omega_k t),\quad \mathrm{PE}_t(t)[2k+1] = \cos(\omega_k t)$$

The embedding at $(\tau, x, y)$ is

$$e(\tau,x,y) = (R_{\mathrm{2D}}(x, y) \oplus R_t(\tau)) \odot h$$

where $R_{\mathrm{2D}}$ is the rotation for spatial coordinates, $R_t$ encodes wall-clock time, $\oplus$ denotes concatenation, and $\odot$ is the rotation operation applied to the token representation $h$ (Bai et al., 19 Feb 2025). This enables precise second-level event localization in videos of extended duration.
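
To illustrate why tying the temporal component to seconds matters, the sketch below evaluates the sinusoidal temporal embedding at absolute timestamps. It covers only the temporal term, not the full 3D rotary embedding, and the function names and the uniform-sampling assumption are hypothetical.

```python
import math


def frame_timestamps(num_frames: int, fps: float) -> list[float]:
    """Absolute time in seconds of each frame, assuming uniform sampling."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return [i / fps for i in range(num_frames)]


def temporal_pe(t: float, dim: int, base: float = 10000.0) -> list[float]:
    """Sinusoidal embedding of time t with interleaved sin and cos entries."""
    if dim <= 0 or dim % 2:
        raise ValueError("dim must be a positive even number")
    pe = []
    for k in range(dim // 2):
        omega = 1.0 / base ** (2 * k / dim)  # frequency of the k-th pair
        pe.append(math.sin(omega * t))
        pe.append(math.cos(omega * t))
    return pe


# The instant t = 2.0 s is frame 4 at 2 fps and frame 8 at 4 fps.
slow = frame_timestamps(5, 2.0)
fast = frame_timestamps(9, 4.0)
same_instant = temporal_pe(slow[4], 8) == temporal_pe(fast[8], 8)
```

Indexing by frame number would give those two frames different positions; indexing by time gives them the same one, which is what makes the encoding independent of the sampling rate.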

3. Multimodal Fusion and Cross-Modal Compression

Patch token sequences from the vision encoder are compressed into higher-level representations before fusion with language. Four spatially adjacent visual tokens are grouped:

$$g_j = \mathrm{concat}(v_{4j}, v_{4j+1}, v_{4j+2}, v_{4j+3})$$

Each $g_j$ is projected by a two-layer MLP with SwiGLU activation:

$$h_j = W_2 \cdot \sigma(W_1 g_j + b_1) + b_2$$

where the output matches the LLM embedding dimension.
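
The grouping and projection steps can be sketched in plain Python. Several things here are assumptions rather than facts from the source: the tokens are taken to be ordered so that indices $4j$ to $4j+3$ are spatially adjacent, $\sigma$ is treated as an elementwise nonlinearity with SiLU as a stand-in (a gated SwiGLU unit would need an additional weight matrix that the formula does not show), and all names are hypothetical.

```python
import math

Vector = list[float]
Matrix = list[Vector]


def group_tokens(tokens: list[Vector], group: int = 4) -> list[Vector]:
    """Concatenate each run of `group` consecutive tokens into one vector."""
    if len(tokens) % group:
        raise ValueError("token count must be divisible by the group size")
    return [sum(tokens[i:i + group], []) for i in range(0, len(tokens), group)]


def affine(weight: Matrix, bias: Vector, x: Vector) -> Vector:
    """Compute weight @ x + bias for a row-major weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def silu(z: float) -> float:
    """Stand-in activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def merge(g: Vector, w1: Matrix, b1: Vector, w2: Matrix, b2: Vector, act=silu) -> Vector:
    """Two-layer MLP h = W2 * act(W1 g + b1) + b2 applied to one grouped token."""
    hidden = [act(z) for z in affine(w1, b1, g)]
    return affine(w2, b2, hidden)
```

Grouping by four cuts the visual sequence length handed to the LLM to a quarter, while the MLP maps the concatenated vector (four times the ViT width) to the LLM embedding size.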

These compressed visual features $\{h_j\}$ are interleaved with textual tokens and passed through the LLM's self-attention mechanism, realizing cross-modal interaction without separate cross-attention modules or any projection beyond the MLP merger. In extended applications such as Qwen-Image, this interleaving is implemented via simple token sequence concatenation, with scalable 2D RoPE (“MSRoPE”) for unified positional alignment across modalities (Wu et al., 4 Aug 2025).

4. Object Localization, Structured Extraction, and Grounding

Qwen2.5-VL natively supports object localization using bounding boxes and point grounding. The final cross-modal feature for a grounding query token $q$ is used to predict normalized coordinates:

$$\mathbf{b} = [x_1, y_1, x_2, y_2] = W_{\mathrm{bbox}} q + b_{\mathrm{bbox}}$$

$$\mathbf{p} = [x, y] = W_{\mathrm{point}} q + b_{\mathrm{point}}$$

Training employs a composite loss:

$$\mathcal{L}_{\mathrm{bbox}} = \lambda_1 \| \mathbf{b} - \mathbf{b}^* \|_1 + \lambda_2 (1 - \mathrm{IoU}(\mathbf{b},\mathbf{b}^*)),\quad \mathcal{L}_{\text{point}} = \| \mathbf{p} - \mathbf{p}^* \|_2^2$$
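
A minimal sketch of the two loss terms for a single prediction follows. The weights $\lambda_1$ and $\lambda_2$ default to 1.0 purely as placeholders (the text does not give their values), boxes are assumed to be `(x1, y1, x2, y2)` tuples, and the function names are hypothetical.

```python
Box = tuple[float, float, float, float]  # (x1, y1, x2, y2), normalized


def box_area(box: Box) -> float:
    """Area of a box; degenerate boxes count as zero area."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = box_area(overlap)
    union = box_area(a) + box_area(b) - inter
    return inter / union if union > 0 else 0.0


def bbox_loss(pred: Box, target: Box, lam1: float = 1.0, lam2: float = 1.0) -> float:
    """lam1 * L1 distance between coordinates + lam2 * (1 - IoU)."""
    l1 = sum(abs(p - t) for p, t in zip(pred, target))
    return lam1 * l1 + lam2 * (1.0 - iou(pred, target))


def point_loss(pred: tuple[float, float], target: tuple[float, float]) -> float:
    """Squared Euclidean distance between predicted and target points."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))
```

The L1 term still gives a useful signal when the boxes do not overlap at all, whereas the IoU term is insensitive to scale, which is the usual motivation for combining the two.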

For structured extraction, pre-training utilizes “QwenVL HTML” format documents embedding layout boxes and nested elements in HTML with bounding box attributes. The model is trained to faithfully output this structure, enabling unified parsing of text, tables, charts, and diagrams (Bai et al., 19 Feb 2025).
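
To make the idea of layout-annotated HTML concrete, the sketch below reads bounding boxes back out of such a document with Python's standard `html.parser`. The attribute name `data-bbox`, its space-separated integer format, and the sample document are all assumptions made for illustration; the source only states that elements carry bounding box attributes.

```python
from html.parser import HTMLParser

# Hypothetical sample; not taken from the model's training data.
SAMPLE = (
    '<html><body>'
    '<p data-bbox="10 20 300 60">Quarterly report</p>'
    '<table data-bbox="10 80 300 240"><tr><td>Revenue</td><td>42</td></tr></table>'
    '</body></html>'
)


class BBoxCollector(HTMLParser):
    """Collect (tag, box) pairs for every element with a data-bbox attribute."""

    def __init__(self) -> None:
        super().__init__()
        self.elements: list[tuple[str, tuple[int, ...]]] = []

    def handle_starttag(self, tag, attrs):
        raw = dict(attrs).get("data-bbox")  # assumed attribute name
        if raw is not None:
            self.elements.append((tag, tuple(int(v) for v in raw.split())))


def collect_boxes(html: str) -> list[tuple[str, tuple[int, ...]]]:
    """Parse an HTML string and return its layout elements in document order."""
    parser = BBoxCollector()
    parser.feed(html)
    parser.close()
    return parser.elements
```

Keeping geometry as attributes on ordinary tags means one output format can carry both the reading-order content and its position on the page.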

5. Training Pipeline, Objectives, and Data

Pre-training proceeds in three sequential stages:

  • Stage 1: 1.5T tokens, sequence length 8192, ViT-only CLIP-style pre-training (image captioning, visual knowledge, OCR), with contrastive loss (InfoNCE).
  • Stage 2: 2T tokens, sequence length 8192, end-to-end ViT+LLM (image-text interleaving, VQA, multimodal math, agent tasks, pure text), with cross-entropy for generative tasks and regression losses for grounding.
  • Stage 3: 0.6T tokens, sequence length 32768, long-document and long-video pre-training using dynamic FPS sampling and extended contexts.
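
The three stages above can be written down as a small configuration table, which makes the overall token budget easy to check. The stage names and the `Stage` type are hypothetical, and the sequence count assumes every sequence is packed to full length; only the token counts and sequence lengths come from the list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str             # hypothetical label
    tokens_billions: int  # training tokens, in billions
    seq_len: int          # maximum sequence length in tokens


STAGES = (
    Stage("stage1_vit_pretraining", 1500, 8192),
    Stage("stage2_multimodal_pretraining", 2000, 8192),
    Stage("stage3_long_context", 600, 32768),
)


def total_tokens_billions(stages=STAGES) -> int:
    """Total pre-training tokens across all stages, in billions."""
    return sum(stage.tokens_billions for stage in stages)


def full_sequences(stage: Stage) -> int:
    """Number of full-length sequences, assuming perfect packing."""
    return stage.tokens_billions * 10**9 // stage.seq_len
```

Summing the stages gives 4.1T tokens in total, with the long-context stage using a quarter as many sequences per token as the earlier stages because its sequences are four times longer.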

Data augmentation includes random cropping at native resolution, copy-paste object augmentation, synthetic table/chart generation, and dynamic sampling rates for video (Bai et al., 19 Feb 2025). Optimizers and data choices are standard for large-scale vision-language systems.

6. Computational Efficiency and Scaling

The use of dynamic native resolution and Window Attention reduces the vision encoder FLOPs by approximately fourfold for typical images, with a corresponding reduction in peak memory. For example, for a $224 \times 224$ image, the attention cost of a windowed layer drops from $65\,536\,d$ (global) to $16\,384\,d$ (windowed), 25% of the original. Dynamic sequence packing across modalities maintains high hardware utilization (Bai et al., 19 Feb 2025). In video tasks, query-based token selection methods such as QTSplus achieve up to $89\%$ reduction in visual token count and $28\%$ lower latency with minimal accuracy degradation (Li et al., 14 Nov 2025).
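
The figures in the $224 \times 224$ example can be reproduced from the patch size $S = 14$ and window size $w = 8$ given in Section 1. The derivation below counts only the query-key products of one layer and treats $d$ as the per-token feature width, which is an assumption about what the quoted numbers measure.

```latex
\begin{aligned}
H' = W' &= 224 / 14 = 16, \qquad N = H' W' = 256 \\
\text{global:}\quad & N^2 d = 256^2\, d = 65\,536\, d \\
\text{windowed:}\quad & w^2 N d = 8^2 \cdot 256\, d = 16\,384\, d \\
\text{ratio:}\quad & \frac{w^2 N d}{N^2 d} = \frac{w^2}{N} = \frac{64}{256} = 0.25
\end{aligned}
```

Because the ratio is $w^2 / N$, the relative saving in windowed layers grows with the number of tokens, while the four global-attention layers keep their quadratic cost.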

7. Benchmark Performance and Applications

Qwen2.5-VL-72B matches or outperforms GPT-4o and Claude 3.5 Sonnet-0620 on major benchmarks:

  • MMBench-EN (General VQA): 88.6%
  • CC-OCR (OCR Parsing): 79.8
  • OmniDocBench (Document Parsing): edit distance 0.226 (EN), 0.324 (ZH); lower is better
  • ChartQA (Chart Understanding): 89.5
  • LVBench (Long-Video Comprehension): 47.3

The smaller 7B and 3B variants similarly exceed comparably-sized open-source models, especially on fine-grained perception, temporal grounding, and structured extraction (Bai et al., 19 Feb 2025). Integration into image generation systems (e.g., Qwen-Image) leverages both semantic and reconstructive encoding, supporting advanced editing and compositional generation (Wu et al., 4 Aug 2025).

| Task | GPT-4o | Claude 3.5 Sonnet | Qwen2.5-VL-72B |
| --- | --- | --- | --- |
| MMBench-EN (VQA) | 54.2% | 52.1% | 88.6% |
| CC-OCR (OCR Parsing) | 66.9 | 62.5 | 79.8 |
| OmniDocBench (edit distance, EN/ZH) | 0.265/0.435 | 0.330/0.381 | 0.226/0.324 |
| ChartQA (Chart Understanding) | 86.7 | 81.2 | 89.5 |
| LVBench (Long-Video) | 30.8 | 33.1 | 47.3 |

Qwen2.5-VL’s architecture, multimodal fusion, and scaling properties enable deployment across real-world scenarios including interactive agents, document intelligence, and vision-language generation platforms.
