
Qwen2-VL Series: Advancing Vision-Language Models

Updated 22 January 2026
  • The Qwen2-VL Series comprises large-scale vision-language models that use dynamic visual tokenization and multimodal positional encoding to achieve state-of-the-art results.
  • They integrate cross-modal distillation, region-level context modeling, and bootstrapped perception to excel in document parsing, object localization, and video comprehension.
  • Robust multi-phase pretraining and fine-tuning with reinforcement and preference optimization enable these models to deliver competitive benchmark performance.

The Qwen2-VL Series encompasses a family of large vision-language models (LVLMs) built around a unified paradigm of dynamic visual tokenization, multimodal positional encoding, and scalable architectures, achieving state-of-the-art results across a broad range of benchmarks. Designed to process and reason over images, videos, and rich textual context at arbitrary resolutions, Qwen2-VL models, across both the original and the Qwen2.5-VL generations, demonstrate strong capabilities in document parsing, agent-based interaction, object localization, visual-spatial reasoning, and the integration of region-level textual context. The series has been further extended through specialized training approaches such as R1-Zero-like reinforcement learning (vsGRPO), cross-modal distillation, and bootstrapped perception enhancement (ViPER), attaining performance competitive with or superior to leading closed-source systems.

1. Core Architecture and Dynamic Visual Processing

Qwen2-VL introduces a Naive Dynamic Resolution mechanism that replaces the conventional fixed-size image preprocessing found in prior LVLMs. For an input image of height H and width W, the image is divided into non-overlapping patches of size p × p (typically 14 × 14 in Qwen2.5-VL). The patches are embedded and then compressed via an MLP that merges every 2 × 2 group, reducing the token sequence length while maintaining spatial detail (Wang et al., 2024, Bai et al., 19 Feb 2025):

N_\mathrm{tokens} = \bigl\lceil H/p \bigr\rceil \cdot \bigl\lceil W/p \bigr\rceil \;\rightarrow\; \bigl\lceil \lceil H/p \rceil / 2 \bigr\rceil \cdot \bigl\lceil \lceil W/p \rceil / 2 \bigr\rceil + 2

(with special start/end vision tokens).
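The token-count mapping above can be sketched in a few lines (a minimal reimplementation of the formula, not the released preprocessing code):

```python
import math

def qwen2vl_token_count(h: int, w: int, patch: int = 14) -> int:
    """Token count for an h x w image under Naive Dynamic Resolution.

    Patches of size `patch` x `patch` are embedded, then an MLP merges
    every 2 x 2 group of patch embeddings; two special vision start/end
    tokens are appended.
    """
    grid_h = math.ceil(h / patch)
    grid_w = math.ceil(w / patch)
    merged = math.ceil(grid_h / 2) * math.ceil(grid_w / 2)
    return merged + 2
```

For example, a 448 × 448 image yields a 32 × 32 patch grid, merged down to 16 × 16 = 256 tokens, plus the two delimiters.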

Qwen2-VL models employ Multimodal Rotary Position Embedding (M-RoPE), a generalization of rotary encodings that incorporates temporal (frame), height, and width indices into the positional signal. This enables the processing of both images and videos with unified code paths and precise spatio-temporal awareness:

Q' = R_t(\theta^t)\, R_h(\theta^h)\, R_w(\theta^w)\, Q

where \theta^t, \theta^h, \theta^w encode the temporal and spatial coordinates. In Qwen2.5-VL, absolute time encoding further grounds model attention to real-time positions in long videos (Bai et al., 19 Feb 2025).
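A minimal numpy sketch of the M-RoPE idea: the head dimension is split into three chunks, and each chunk is rotated by the temporal, height, or width index respectively. The even three-way split below is an illustrative assumption, not the released models' split ratio:

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: float, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE: rotate consecutive pairs of x by position-dependent angles."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x, dtype=float)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def mrope(q: np.ndarray, t: int, h: int, w: int) -> np.ndarray:
    """M-RoPE sketch: apply rotary encoding per axis to thirds of the head dim.

    Assumes the head dim splits into three even chunks (illustrative only).
    """
    d = q.shape[-1]
    assert d % 6 == 0, "head dim must split into three even chunks"
    c = d // 3
    return np.concatenate([
        rope_rotate(q[:c], t),       # temporal (frame) index
        rope_rotate(q[c:2 * c], h),  # height index
        rope_rotate(q[2 * c:], w),   # width index
    ])
```

Because each chunk undergoes a pure rotation, the transform preserves vector norms, and setting t = h = w = 0 leaves the query unchanged, matching ordinary RoPE behavior at the origin.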

Vision backbones are high-capacity Vision Transformers (ViTs), pretrained from scratch to handle native dynamic input. For videos, the temporal dimension is handled by 3D convolutional front-ends or equivalent windowed self-attention (Wang et al., 2024, Bai et al., 19 Feb 2025).

2. Model Scaling, Training Strategy, and Data

The Qwen2-VL Series features models at various scales: 2B, 7B, and 72B parameters, with pretraining on vast multi-modal corpora (up to 2 trillion tokens, with up to 800B image tokens at stage 2 for the 7B model). The scaling laws for different tasks are explicitly measured: performance P(N) grows as P(N) \sim a_c N^{\alpha_c}, with reported exponents \alpha_c for math, OCR, and video ranging from 0.1 to 0.3 depending on capability (Wang et al., 2024).
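Such a power law lets one estimate the exponent from two (parameter count, score) points and extrapolate between scales. The helper below uses purely illustrative numbers, not measurements from the papers:

```python
import math

def scaling_exponent(n1: float, p1: float, n2: float, p2: float) -> float:
    """Estimate alpha in P(N) ~ a * N**alpha from two (params, score) points."""
    return math.log(p2 / p1) / math.log(n2 / n1)

def predict(p_ref: float, n_ref: float, n_new: float, alpha: float) -> float:
    """Extrapolate performance under the power law (illustrative use only)."""
    return p_ref * (n_new / n_ref) ** alpha
```

For instance, with a hypothetical exponent of 0.2, scaling from 2B to 72B parameters (a 36x increase) would multiply the score by 36^0.2, roughly a factor of 2.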

The pretraining regime is multi-phased:

  • Stage 1: CLIP-style vision-only contrastive training to strengthen patch-level semantics.
  • Stage 2: Joint multimodal (images, videos, interleaved vision-text, VQA, agentic) objectives, with all parameters updated.
  • Stage 3: Extended context (up to 32k tokens) to support hours-long video and document sequences, and agent-based reasoning (Bai et al., 19 Feb 2025).

Fine-tuning includes supervised instruction tuning (SFT), Direct Preference Optimization (DPO) using human feedback, and domain-specific reinforcement strategies (e.g., Group Relative Policy Optimization for spatial reasoning (Liao et al., 1 Apr 2025)).
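The core of GRPO is a group-relative advantage: each sampled response is scored against the mean and spread of its own sampling group, avoiding a learned value model. A minimal sketch of that step only (the full objective adds a clipped policy ratio and a KL penalty toward the reference policy):

```python
import statistics

def grpo_advantages(rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's sampled responses.

    Each reward is normalized by the group's mean and (population)
    standard deviation; responses above the group mean get positive
    advantage, those below get negative advantage.
    """
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mu) / sigma for r in rewards]
```

By construction the advantages of a group sum to zero, so a group where every rollout earns the same reward contributes no gradient signal.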

The models' dynamic resolution, native absolute-time grounding, and windowed self-attention enable direct processing of high-resolution images at their native aspect ratios and of lengthy, variable-frame-rate videos.
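The resize-to-patch-grid step implied by dynamic resolution can be sketched as follows. The factor of 28 (14-pixel patches merged 2 × 2) follows from the architecture described above, but the token budget and rounding policy here are illustrative simplifications, not the exact released preprocessing:

```python
import math

def fit_to_patch_grid(h: int, w: int, factor: int = 28,
                      max_tokens: int = 1280) -> tuple[int, int]:
    """Round image sides to multiples of `factor` and respect a token budget.

    `factor` = patch size (14) x 2x2 merge; `max_tokens` caps the number
    of merged visual tokens. A simplified sketch of the preprocessing idea.
    """
    h2 = max(factor, round(h / factor) * factor)
    w2 = max(factor, round(w / factor) * factor)
    tokens = (h2 // factor) * (w2 // factor)
    if tokens > max_tokens:
        # Downscale uniformly so the merged-token grid fits the budget.
        scale = math.sqrt(max_tokens / tokens)
        h2 = max(factor, math.floor(h * scale / factor) * factor)
        w2 = max(factor, math.floor(w * scale / factor) * factor)
    return h2, w2
```

Small images keep roughly their native size (snapped to the grid), while very large images are scaled down just enough to fit the token budget, preserving aspect ratio.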

3. Functional Capabilities and Specialized Extensions

Qwen2-VL supports precise object localization (bounding boxes, points), robust structured data extraction (invoices, tables, charts), and long-video comprehension with second-level temporal localization. Model outputs use structured HTML schemas for document parsing, and interactive function call APIs for agentic operation.
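Grounding responses in the Qwen2.5-VL style are JSON lists of `bbox_2d`/`label` entries in the pixel coordinates of the (possibly resized) input image. A small parsing helper might look like the sketch below; the rescaling convention back to the original resolution is an assumption on the caller's side:

```python
import json

def parse_boxes(model_output: str, in_w: int, in_h: int,
                out_w: int, out_h: int) -> list[dict]:
    """Parse a grounding response like
    [{"bbox_2d": [x1, y1, x2, y2], "label": "..."}, ...]
    and rescale boxes from the model's input size to the original image size.
    """
    sx, sy = out_w / in_w, out_h / in_h
    boxes = []
    for item in json.loads(model_output):
        x1, y1, x2, y2 = item["bbox_2d"]
        boxes.append({"label": item["label"],
                      "bbox": [x1 * sx, y1 * sy, x2 * sx, y2 * sy]})
    return boxes
```

Keeping the rescaling explicit matters because the dynamic-resolution preprocessing may have snapped the input to a different size than the original image.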

The series has been extended in several directions:

  • Visual-Spatial Reinforcement (vsGRPO): R1-Zero-like GRPO fine-tuning on VSI-100k yields major gains (+12.1 pp for 2B, +8.5 pp for 7B) in visual-spatial intelligence, outperforming GPT-4o and LLaVA-NeXT-Video-72B on video-based spatial reasoning. KL regularization is critical for stability (Liao et al., 1 Apr 2025).
  • Region-Level Context-Aware Modeling: Augmenting Qwen2-VL with object-level box/text tokens (via RCVIT) using the RCMU dataset yields RC-Qwen2-VL. This achieves absolute gains up to +57 points on contextual visual QA and substantial improvements in citation reliability and personalized multimodal tasks. Cross-modal attention mechanisms tightly integrate region-level context (Wei et al., 17 Aug 2025).
  • Cross-Modal Distillation: Bidirectional distillation between Qwen2-VL and Qwen2-Audio, guided by a PANN-based heuristic switch, closes the sensory gap in visible sound recognition, yielding >20% absolute gains in audio classification and robust modality transfer (Jiang et al., 11 May 2025).
  • Self-Bootstrapped Fine-Grained Perception (ViPER): Qwen2-VL and Qwen2.5-VL, under the ViPER regimen, are further enhanced by self-consistency loops between caption generation, image reconstruction (via diffusion models), and RL-based closed-loop self-prediction. Qwen-Viper models achieve average gains of 1.6–1.7% and up to 6% on fine-grained perception, demonstrating emergent reciprocal learning between generation and understanding (Zhang et al., 28 Oct 2025).
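The distillation component in these extensions reduces, at its core, to matching temperature-softened output distributions between a teacher and a student. A generic numpy sketch of that objective (not the exact loss or heuristic switching logic of the cited work, where either model may act as teacher per example):

```python
import numpy as np

def softmax(z, temp: float = 1.0) -> np.ndarray:
    """Numerically stable softmax over temperature-scaled logits."""
    z = np.asarray(z, dtype=float) / temp
    e = np.exp(z - z.max())
    return e / e.sum()

def distill_kl(teacher_logits, student_logits, temp: float = 2.0) -> float:
    """KL(teacher || student) over temperature-softened class distributions,
    the standard knowledge-distillation objective."""
    p = softmax(teacher_logits, temp)
    q = softmax(student_logits, temp)
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when student and teacher agree and grows as their predicted distributions diverge, which is what drives the student's logits toward the teacher's.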

4. Empirical Performance and Benchmark Coverage

Qwen2-VL-72B and Qwen2.5-VL-72B models consistently attain or surpass the highest scores on a broad spectrum of benchmarks:

| Benchmark | Claude 3.5 Sonnet | GPT-4o | Qwen2-VL-72B | Qwen2.5-VL-72B |
|---|---|---|---|---|
| DocVQA_test | 95.2 | 92.8 | 96.5 | – |
| InfoVQA_test | – | – | 84.5 | 87.3 |
| ChartQA_test | 90.8 | 85.7 | 88.3 | – |
| TextVQA_val | – | – | 85.5 | – |
| OCRBench | 788 | 736 | 877 | – |
| LVBench (Long Video) | – | 30.8 | – | 47.3 |
| Charades-STA (mIoU) | – | 35.7 | – | 50.9 |
| MMBench-EN (%) | 82.6 | 83.4 | – | 88.6 |

(– indicates a score not reported in this summary.)

RC-Qwen2-VL provides region-level context performance boosts up to +57 absolute points, while ViPER-augmented Qwen-Viper yields +1.6–1.7% overall and up to +6% on fine-grained axes (Wei et al., 17 Aug 2025, Zhang et al., 28 Oct 2025).

On visible sound recognition, Qwen2-VL-7B achieves 80.8% test accuracy (video-only, VGGSound, 286 classes), far exceeding Qwen2-Audio (69.0%) and surpassing human eye-only performance (74.1%), with bidirectional distillation further raising Qwen2-Audio to 89.4% (Jiang et al., 11 May 2025).

5. Agent-Based Interaction and Long-Context Reasoning

Qwen2-VL and especially Qwen2.5-VL integrate with the Qwen-Agent framework for high-level agentic operations. This enables:

  • GUI grounding on smartphones and computers (ScreenSpot accuracy up to 87.1%)
  • Offline Android control tasks (High-EM 67.36%)
  • Online environment tasks such as AndroidWorld SR (35%), OSWorld (8.83)

The models natively handle event localization in hours-long videos with absolute time anchoring. Sequence length support extends to 32,768 tokens, critical for long-form document and video comprehension (Bai et al., 19 Feb 2025).

6. Implementation, Open Source Releases, and Reproducibility

Qwen2-VL is fully open-sourced at https://github.com/QwenLM/Qwen2-VL, supporting PyTorch 2.1.2, CUDA 11.8, FlashAttention, and Apex fused kernels. Training leverages DeepSpeed for large model parallelism. Released weights include 2B, 7B, and 72B checkpoints across both Qwen2-VL and Qwen2.5-VL variants. Agent APIs and benchmarks such as RCMU and RC-P-Bench are public for evaluation and further research (Wang et al., 2024, Bai et al., 19 Feb 2025, Wei et al., 17 Aug 2025).

7. Open Challenges and Perspectives

Current challenges include reward hacking in RL fine-tuning (e.g., format score inflation under vsGRPO), plateaus in reward-derived learning, and the limits of vanilla CoT prompting for spatial or fine-grained perception (Liao et al., 1 Apr 2025). Region-level and perceptual self-improvement strategies indicate that tight integration of context and bidirectional critique loops are essential for next-generation VLMs. Emergent behaviors such as “thinking-with-images” and generation-understanding reciprocity, as evidenced in ViPER, suggest promising directions where models can internalize and self-refine perceptual skills without heavy reliance on external labels. The Qwen2-VL architecture, with its native-dynamic processing, agentic wrappers, and extensible RL ecosystem, forms a foundation for multimodal AI systems that natively bridge language, vision, and context across diverse environments.
