
Qwen-VL: Multimodal Language Models

Updated 29 November 2025
  • Qwen-VL is a family of large-scale multimodal language models that integrate visual and textual inputs for comprehensive image, document, and video understanding.
  • The architecture features innovations like two-tower design, dynamic resolution, and interleaved tokenization, enabling robust processing of high-resolution visuals.
  • Training involves staged multilingual pre-training and fine-tuning, achieving state-of-the-art benchmarks in VQA, OCR, and visual grounding tasks.

Qwen-VL Series is a family of large-scale multimodal LLMs enabling sophisticated joint reasoning over visual and linguistic inputs. Designed primarily for robust image, document, and video understanding, these models employ architectural innovations and high-quality, multilingual data curation. The Qwen-VL development lineage includes several major versions with evolving capabilities, culminating in Qwen3-VL, which supports dense and Mixture-of-Experts architectures, ultra-long context, and advanced agentic applications (Bai et al., 2023, Wang et al., 18 Sep 2024, Bai et al., 19 Feb 2025, Bai et al., 26 Nov 2025).

1. Architectural Foundations and Evolution

The original Qwen-VL employs a two-tower design: a visual backbone (ViT-bigG from OpenCLIP) connected via a cross-attention VL-Adapter to a frozen Qwen-7B LLM. Visual inputs (typically resized to 224×224 or 448×448) are embedded as patch sequences and compressed to 256 visual tokens using cross-attention and 2D position encodings. These tokens interleave with text tokens for unified Transformer autoregression. Structured outputs (bounding boxes, references, OCR text) utilize explicit token markers and position serialization (Bai et al., 2023).
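
To make the token-compression step concrete, the following is a minimal sketch of a cross-attention adapter in the spirit described above: a fixed set of 256 learned queries attends to the ViT patch sequence and emits 256 visual tokens. The class name, hidden sizes, and single-attention-layer structure are illustrative assumptions; the actual adapter also injects 2D positional information into the cross-attention, which the sketch omits.

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    """Single cross-attention resampler: N patch features -> 256 visual tokens (sketch)."""
    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256, num_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, llm_dim)      # project ViT features to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_feats):                     # patch_feats: (B, N_patches, vit_dim)
        kv = self.kv_proj(patch_feats)                  # (B, N_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        tokens, _ = self.attn(q, kv, kv)                # (B, 256, llm_dim)
        return tokens                                   # interleaved with text embeddings downstream

# Example: a 448x448 input at an assumed 14-px patch size gives 32*32 = 1024 patches.
# VLAdapter()(torch.randn(2, 1024, 1664)).shape  # -> (2, 256, 4096)
```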

Qwen2-VL redesigned the vision encoder for native dynamic resolution. This custom ViT variant processes arbitrary H×W images, applies 2D-RoPE positional encoding, and merges adjacent patch features through an MLP. For videos, 3D convolutions extract spatio-temporal tubes, and Multimodal Rotary Position Embedding (M-RoPE) fuses height, width, and time axes. All visual and textual features enter a shared LLM backbone (Wang et al., 18 Sep 2024).
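
A rough sketch of the adjacent-patch merging step, assuming a 2×2 grouping (consistent with the token-count formula in Section 3) and illustrative hidden sizes: four neighboring patch features are concatenated and projected by an MLP, so the visual token count is a quarter of the raw patch count.

```python
import torch
import torch.nn as nn

class PatchMerger(nn.Module):
    """Concatenate each 2x2 block of patch features and project to the LLM width (sketch)."""
    def __init__(self, vit_dim=1280, llm_dim=3584, group=2):
        super().__init__()
        self.group = group
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * group * group, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_feats, h, w):               # patch_feats: (h*w, vit_dim), row-major grid
        g, d = self.group, patch_feats.size(-1)
        x = patch_feats.view(h // g, g, w // g, g, d)
        x = x.permute(0, 2, 1, 3, 4).reshape(-1, g * g * d)   # one row per 2x2 patch group
        return self.mlp(x)                              # ((h/2)*(w/2), llm_dim) visual tokens

# PatchMerger()(torch.randn(32 * 32, 1280), 32, 32).shape  # -> (256, 3584)
```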

Dynamic resolution is a persistent design motif. From Qwen2-VL onwards, absolute position embeddings are eliminated, and token counts adapt to input granularity, which preserves high-resolution detail and supports large document images, charts, or dense videos (Bai et al., 19 Feb 2025).

Qwen2.5-VL introduces windowed self-attention within the ViT (local blocks of 112×112 pixels) to further reduce computational overhead, together with absolute time encoding for second-level event localization in long videos. The vision encoder is trained from scratch for dynamic resolution and paired with a merger MLP for LLM compatibility. Structured document parsing and agentic GUI interaction become core functionalities (Bai et al., 19 Feb 2025).

Qwen3-VL further unifies multimodal sequences under an interleaved 256K-token context window and employs an enhanced MRoPE with axis interleaving. DeepStack integration feeds features from multiple ViT depths into the LLM, boosting alignment and fine-grained visual reasoning. For video, textual timestamp tokens replace the earlier rotary temporal embeddings, providing explicit time grounding (Bai et al., 26 Nov 2025).
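
As an illustration of the textual timestamp scheme, the sketch below interleaves "<t seconds>" strings with per-frame visual tokens; the frame sampling rate and the exact packing of frame tokens are assumptions for illustration.

```python
def interleave_timestamps(frame_tokens, fps=2.0):
    """frame_tokens: list of per-frame visual-token lists sampled at `fps` frames/sec (assumed)."""
    sequence = []
    for i, tokens in enumerate(frame_tokens):
        sequence.append(f"<{i / fps:.1f} seconds>")  # explicit textual time anchor
        sequence.extend(tokens)
    return sequence

# interleave_timestamps([["<img>"] * 2] * 3, fps=1.0)
# -> ['<0.0 seconds>', '<img>', '<img>', '<1.0 seconds>', '<img>', '<img>', '<2.0 seconds>', ...]
```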

2. Training Paradigms and Data Composition

Qwen-VL and its successors follow a staged curriculum, summarized in the configuration sketch after the list:

  • Stage 1: Vision–language pre-training on cleaned web-scraped image–text pairs, with multilingual coverage (LAION, DataComp, COYO, MS-COCO, in-house Chinese datasets).
  • Stage 2: Multi-task pre-training, interleaving captioning, VQA, grounding, OCR (with coordinate output), and chain-of-thought text tasks.
  • Stage 3: Supervised instruction fine-tuning, using multimodal dialogues in ChatML format, multi-image examples, and function-calling for agentic tasks (Bai et al., 2023).
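
The three stages can be written down as a small configuration sketch. Dataset and task names come from the list above; the per-stage trainable/frozen split is an assumption for illustration (the text only states that the LLM sits frozen behind the adapter in the initial alignment stage).

```python
# Hedged sketch of the staged curriculum; component freezing per stage is assumed.
STAGES = [
    {
        "name": "vl_pretraining",
        "data": ["LAION", "DataComp", "COYO", "MS-COCO", "in-house Chinese image-text"],
        "trainable": ["vision_encoder", "vl_adapter"],         # LLM kept frozen (from the text)
    },
    {
        "name": "multitask_pretraining",
        "data": ["captioning", "vqa", "grounding", "ocr_with_coordinates", "chain_of_thought"],
        "trainable": ["vision_encoder", "vl_adapter", "llm"],  # full model unfrozen (assumed)
    },
    {
        "name": "supervised_finetuning",
        "data": ["chatml_dialogues", "multi_image_examples", "function_calling"],
        "trainable": ["vl_adapter", "llm"],                    # vision encoder frozen (assumed)
    },
]
```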

Qwen2-VL and Qwen2.5-VL scale model size (2B, 7B, 72B params) and data volume (1.4T to 4.1T tokens), investigating empirical scaling laws with observed performance exponents α≈0.2–0.3 for model size and β≈0.1–0.2 for data (Wang et al., 18 Sep 2024, Bai et al., 19 Feb 2025). The Qwen3-VL family expands data mixture to 2T tokens, including STEM diagrams and agent trajectories, and adapts training for ultra-long contexts (Bai et al., 26 Nov 2025).
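
One way to read the reported exponents is through an assumed saturating power-law; the specific parameterization below is illustrative and not quoted from the cited reports.

```latex
% Assumed power-law form relating evaluation loss L to model size N (parameters)
% and data size D (tokens); A, B, and L_inf are fitted constants.
L(N, D) \;\approx\; L_{\infty} + A\,N^{-\alpha} + B\,D^{-\beta},
\qquad \alpha \approx 0.2\text{--}0.3,\quad \beta \approx 0.1\text{--}0.2
```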

3. Multimodal Positional Encoding and Context Handling

Multimodal Rotary Position Embedding (M-RoPE) factorizes relative position angles for height, width, and time, with modality-specific fusion during self-attention:

$$\phi_{k}(h,w,t) = \phi^{\mathrm{H}}_{k}(h) + \phi^{\mathrm{W}}_{k}(w) + \phi^{\mathrm{T}}_{k}(t)$$

Vector subdimensions are rotated accordingly, fusing spatial and temporal cues (Wang et al., 18 Sep 2024). In Qwen2.5-VL, video time indices align to real-world seconds. Qwen3-VL further generalizes RoPE by interleaving axis-specific frequencies across embedding dimensions and prefixes video sequences with explicit timestamp tokens (e.g., "<3.0 seconds>") to sharpen temporal alignment (Bai et al., 26 Nov 2025).
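
A hedged sketch of the factorized angle computation: each visual token carries a (t, h, w) position, and every rotary frequency band draws its angle from one of the three axes, consistent with the per-axis terms above when bands not assigned to an axis contribute zero. The round-robin band-to-axis assignment and the base frequency are illustrative assumptions.

```python
import numpy as np

def mrope_angles(t, h, w, dim=128, base=10000.0):
    """Rotary angle per frequency band for a token at spatio-temporal position (t, h, w)."""
    k = np.arange(dim // 2)
    inv_freq = base ** (-2.0 * k / dim)                  # standard RoPE frequency schedule
    # Assign each band to one axis in turn: temporal, height, width (assumed interleaving).
    axis_pos = np.select([k % 3 == 0, k % 3 == 1], [t, h], default=w)
    return axis_pos * inv_freq                           # applied as per-band cos/sin rotations

# Text tokens can reuse the same machinery with t = h = w = sequence index,
# which reduces to ordinary 1D RoPE.
```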

Native dynamic resolution allows scalable tokenization of images and videos. For patch size $p$ and an $H \times W$ image:

$$N_{\text{patch}} = \lceil H/p \rceil \times \lceil W/p \rceil, \qquad N_{\text{tok}} = \lceil H/2p \rceil \times \lceil W/2p \rceil$$

Sequences may include up to 16,384 visual tokens, capped dynamically at inference (Wang et al., 18 Sep 2024).
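
The formulas translate directly into a small helper; the default 14-pixel patch size is an assumption, while the 16,384-token cap comes from the text above.

```python
import math

def visual_token_count(height, width, patch=14, max_tokens=16384):
    """Patch and merged-token counts for an image processed at native resolution."""
    n_patch = math.ceil(height / patch) * math.ceil(width / patch)
    n_tok = math.ceil(height / (2 * patch)) * math.ceil(width / (2 * patch))  # after 2x2 merging
    return n_patch, min(n_tok, max_tokens)

# visual_token_count(1080, 1920)  -> (10764, 2691)
# visual_token_count(448, 448)    -> (1024, 256)
```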

4. Advanced Capabilities and Applications

Qwen-VL Series supports general VQA, image captioning, visual grounding (bounding boxes, points), OCR (text reading with quad coordinates), document parsing (HTML-like output), referring expression comprehension, and multi-image dialogue. Specialized functionalities include agentic decision-making—GUI element identification and tool calling—as well as detailed analysis of charts, diagrams, and mathematical problems via chain-of-thought (Bai et al., 2023, Bai et al., 19 Feb 2025).

Qwen2.5-VL and Qwen3-VL add second-level event localization in videos and interactive agent behavior in mobile and desktop interfaces ("Tap on Settings", etc.). Structured extraction uses data-bbox attributes for paragraphs, tables, and figures (Bai et al., 19 Feb 2025). Real-time event retrieval in video, cross-document referencing over long contexts (up to 256K tokens in Qwen3-VL), and multi-hour video comprehension are natively supported (Bai et al., 26 Nov 2025).
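
For a concrete picture of the data-bbox style of structured output, the snippet below shows a hypothetical HTML-like parse and a tiny extractor; the element names, attribute format, and pixel-coordinate convention are illustrative assumptions, not the model's exact schema.

```python
import re

# Hypothetical HTML-like parse in the spirit of the data-bbox outputs described above.
parsed = """
<p data-bbox="72,118,540,164">Quarterly revenue rose 12% year over year.</p>
<table data-bbox="72,190,540,420">...</table>
<figure data-bbox="72,450,540,700"><figcaption>Figure 1: Revenue by region</figcaption></figure>
"""

# Extract (element, x1, y1, x2, y2) tuples; the coordinate convention is assumed.
for tag, box in re.findall(r'<(\w+) data-bbox="([\d,]+)"', parsed):
    x1, y1, x2, y2 = (int(v) for v in box.split(","))
    print(tag, (x1, y1, x2, y2))
```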

5. Performance Benchmarks and Quantitative Comparison

Across a range of multimodal benchmarks, Qwen-VL Series achieves competitive or state-of-the-art results:

| Benchmark | Qwen-VL-72B | GPT-4o | Claude 3.5 Sonnet |
|---|---|---|---|
| DocVQA (test) | 96.5 | 92.8 | 95.2 |
| OCRBench | 877 | 736 | 788 |
| RefCOCO val (box) | 92.7 | 69.1 | 73.2 |
| MMBench-EN (VQA) | 88.6 | 83.4 | 82.6 |
| LVBench (video) | 47.3 | 30.8 | 30.8 |

Models demonstrate stable few-shot/in-context learning, robust multilingual performance (Chinese/English; on OCRBench_v2, Qwen2.5-VL-72B scores 61.5%/63.7%), and superior agentic task execution (Wang et al., 18 Sep 2024, Bai et al., 19 Feb 2025). Corresponding Qwen3-VL benchmark results are reported in (Bai et al., 26 Nov 2025).

A plausible implication is that Qwen-VL models are particularly strong in document parsing, grounding, and long-context video QA.

6. Compression, Quantization, and Deployment

For deployment, LUQ (Layerwise Ultra-Low-Bit Quantization) compresses Qwen2.5-VL by assigning layers with low Shannon activation entropy to ultra-low-bit quantization (∼1.08 bits, via BiLLM) and higher-entropy layers to standard 4-bit quantization (GPTQ). For the 7B model, 12 layers are quantized to 1.08 bits and 16 to 4 bits, yielding an average of 2.75 bits per parameter and a 31.25% memory reduction relative to uniform 4-bit quantization. Performance drops are under 10% in the hardest case (MME) and 3–5% on VQAv2/TextVQA, enabling edge deployment (Bhatnagar et al., 28 Sep 2025).

LUQ quantization flow includes:

  • Activation entropy measurement per layer via K-means clustering and Shannon entropy.
  • Greedy layer selection and mixed precision PTQ over calibration data.
  • Binary search over memory/performance budget, with progressive inference validation.

Mixed-modal calibration ensures robust dynamic-range estimation for multimodal activations at sub-4-bit precision (Bhatnagar et al., 28 Sep 2025).
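
A hedged sketch of the entropy-driven layer selection described above: cluster each layer's calibration activations with K-means, compute Shannon entropy over cluster occupancy, and send the lowest-entropy layers to the ~1-bit quantizer, the rest to 4-bit. Function names, the cluster count, and the use of scikit-learn are illustrative; only the 12/16 layer split for the 7B model comes from the text.

```python
import numpy as np
from sklearn.cluster import KMeans

def activation_entropy(acts, n_clusters=16, seed=0):
    """Shannon entropy (bits) of K-means cluster occupancy for one layer's activations."""
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(acts)
    p = np.bincount(labels, minlength=n_clusters) / len(labels)
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def split_layers_by_entropy(layer_acts, n_ultra_low=12):
    """Greedy split: the n lowest-entropy layers get ~1.08-bit PTQ, the rest get 4-bit."""
    entropies = {name: activation_entropy(a) for name, a in layer_acts.items()}
    ranked = sorted(entropies, key=entropies.get)
    return ranked[:n_ultra_low], ranked[n_ultra_low:]   # (ultra_low_bit_layers, four_bit_layers)
```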

7. Perception Evolution, Limitations, and Future Directions

The ViPER framework structures perceptual learning as a coarse-to-fine, self-bootstrapping curriculum: image-level caption refinement and instance-level operation prediction, both reconstructed and critiqued via diffusion models. Qwen-Viper models (Qwen2.5-VL enhanced with ViPER) achieve gains of up to +6.0 points on fine-grained perception and +1.7 points on overall multimodal benchmarks. Self-critique and self-prediction loops synthesize new training data, facilitating autonomous evolution of visual perception capabilities (Zhang et al., 28 Oct 2025).

Limitations include residual performance loss on extremely sparse spatial domains, occasional temporal drift in video, and dependence on generative model fidelity for self-supervision. Continued research explores integrated multimodal tool chains, learned task selection, and extended self-supervised loops for agentic autonomy and alignment (Bai et al., 26 Nov 2025, Zhang et al., 28 Oct 2025).

