
Qwen3-VL: Scalable Cloud & Mobile LVLMs

Updated 26 November 2025
  • Qwen3-VL Model Series is a collection of large-scale vision-language models that integrate advanced multimodal pretraining, scalable architecture design, and hardware-aware optimizations.
  • It features both cloud-oriented and mobile-focused variants, achieving robust performance across image captioning, VQA, visual grounding, and multilingual multimodal understanding.
  • Innovative techniques like the 1+N LoRA approach and quantization-aware training ensure parameter efficiency and enable real-time on-device deployment.

Qwen3-VL Model Series comprises large vision-language models (LVLMs) designed to integrate state-of-the-art cross-modal reasoning with efficient mobile deployment. Built on the Qwen and Qwen3 LLM families, Qwen3-VL unifies advances in multimodal pretraining, scalable architecture design, parameter-efficient adaptation, and hardware-aware optimization. The series includes both cloud-oriented (Qwen-VL, Qwen-VL-Chat) and mobile-side (AndesVL) variants, and demonstrates leading benchmark performance across image captioning, general and text-oriented visual question answering (VQA), visual grounding, multilingual multimodal understanding, and multimodal instruction-following (Bai et al., 2023, Jin et al., 13 Oct 2025).

1. Architectural Design

Qwen3-VL models employ a three-component architecture: a high-capacity visual encoder, a projection adapter, and a transformer-based LLM. In the original Qwen-VL, a Vision Transformer (ViT-bigG) encodes the image as patch tokens, which are enriched with 2D absolute positional encodings. A learnable, position-aware VL Adapter implements cross-attention from query embeddings ($Q_0 \in \mathbb{R}^{256\times d}$) to compress the ViT output into a fixed-length sequence ($Z \in \mathbb{R}^{256\times d}$), which is serialized and injected into the LLM between <img>... </img> special tokens.
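
The compression step can be illustrated with a minimal PyTorch sketch; the hidden size, head count, and module names below are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VLAdapter(nn.Module):
    """Minimal sketch of a position-aware cross-attention adapter that compresses
    variable-length ViT patch tokens into a fixed set of 256 query embeddings."""
    def __init__(self, d_model=1664, n_queries=256, n_heads=16):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)  # learnable Q_0
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, vit_tokens, pos_emb):
        # vit_tokens: (B, N_patches, d); pos_emb: (N_patches, d) 2D absolute positional encodings
        batch = vit_tokens.size(0)
        q = self.queries.unsqueeze(0).expand(batch, -1, -1)   # (B, 256, d)
        kv = vit_tokens + pos_emb                              # position-aware keys/values
        z, _ = self.cross_attn(q, kv, kv)                      # (B, 256, d) fixed-length output Z
        return z
```

The 256 output tokens are then wrapped between the <img> and </img> special tokens in the LLM input sequence.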

AndesVL (the Qwen3-VL mobile-focused line) generalizes this paradigm. Its structure consists of:

  • A ViT-based visual encoder (e.g., AIMv2-Large or SigLIP2-Base) with native-resolution input and 2D-RoPE positional embeddings;
  • Pixel-shuffle downsampling, which groups $4\times4$ patches into macro-patches;
  • An MLP projector aligning visual features to the LLM token dimension (sketched after this list):

Z = \mathrm{ReLU}(XW_1 + b_1)W_2 + b_2

where $X \in \mathbb{R}^{N\times D_v}$, $W_1 \in \mathbb{R}^{D_v\times R}$, $W_2 \in \mathbb{R}^{R\times D_\ell}$;

  • A Qwen3 transformer LLM, available in configurations with 0.6B, 1.7B, or 4B parameters.
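
A rough sketch of the pixel-shuffle merge and the MLP projector follows; dimensions are illustrative and the grouping factor simply mirrors the 4×4 description above.

```python
import torch
import torch.nn as nn

def pixel_shuffle_merge(x, grid_h, grid_w, factor=4):
    """Group factor x factor neighbouring patch tokens into one macro-patch token.
    x: (B, grid_h * grid_w, D) -> (B, (grid_h//factor) * (grid_w//factor), factor*factor*D)."""
    B, N, D = x.shape
    x = x.view(B, grid_h, grid_w, D)
    x = x.view(B, grid_h // factor, factor, grid_w // factor, factor, D)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(
        B, (grid_h // factor) * (grid_w // factor), factor * factor * D
    )
    return x

class Projector(nn.Module):
    """Two-layer MLP implementing Z = ReLU(X W1 + b1) W2 + b2."""
    def __init__(self, d_vision, d_hidden, d_llm):
        super().__init__()
        self.fc1 = nn.Linear(d_vision, d_hidden)   # W1, b1
        self.fc2 = nn.Linear(d_hidden, d_llm)      # W2, b2

    def forward(self, x):
        return self.fc2(torch.relu(self.fc1(x)))
```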

Data is formatted for the autoregressive LLM as a unified input sequence containing textual content, serialized bounding boxes (<box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>), image tokens, and references.
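
A toy serializer for this unified format; the helper name and placeholder strings are hypothetical, only the special tokens come from the description above.

```python
def serialize_grounded_sample(image_tokens: str, text: str, phrase: str, box: tuple) -> str:
    """Build a single training string with image tokens, text, and a serialized bounding box."""
    x_tl, y_tl, x_br, y_br = box
    return (
        f"<img>{image_tokens}</img>{text}"
        f"<ref>{phrase}</ref><box>({x_tl},{y_tl}),({x_br},{y_br})</box>"
    )

# Example:
# serialize_grounded_sample("...", "A photo of ", "a red car", (120, 48, 512, 300))
```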

| Model | Parameters (B) | Vision Encoder | LLM |
|---|---|---|---|
| AndesVL-0.6B | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| AndesVL-2B | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |

2. Training Regimens and Multimodal Data

Qwen3-VL training follows a staged pipeline utilizing a cleaned, multilingual, multimodal corpus.

Qwen-VL Stages:

  1. Vision-Language Pre-training:
    • 1.4B image-caption pairs (77.3% English, 22.7% Chinese); ViT and adapter are optimized under cross-entropy while the LM is frozen.
  2. Multi-Task Pre-training:
    • Seven interleaved tasks (captioning, VQA, visual grounding, referring expressions, grounded captioning, OCR, text auto-regression); LLM is unfrozen for joint optimization; mixed sequence lengths up to 2048.
  3. Instruction Tuning:
    • 350K multimodal dialogues (ChatML format) for Qwen-VL-Chat; ViT frozen, LLM and adapter tuned (the stage-wise freezing schedule is sketched below).
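
A minimal sketch of the stage-wise freezing described above; attribute names such as visual_encoder, vl_adapter, and llm are assumptions rather than the released module names.

```python
def configure_stage(model, stage: str):
    """Toggle trainable components per Qwen-VL training stage (sketch)."""
    def set_trainable(module, flag: bool):
        for p in module.parameters():
            p.requires_grad = flag

    if stage == "vl_pretrain":       # stage 1: optimize ViT + adapter, keep the LM frozen
        set_trainable(model.visual_encoder, True)
        set_trainable(model.vl_adapter, True)
        set_trainable(model.llm, False)
    elif stage == "multitask":       # stage 2: unfreeze the LLM for joint optimization
        set_trainable(model.visual_encoder, True)
        set_trainable(model.vl_adapter, True)
        set_trainable(model.llm, True)
    elif stage == "instruction":     # stage 3: freeze the ViT, tune adapter + LLM
        set_trainable(model.visual_encoder, False)
        set_trainable(model.vl_adapter, True)
        set_trainable(model.llm, True)
```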

AndesVL Stages:

  1. Vision–Language Alignment:
    • Captions, OCR, VQA; only ViT and MLP trainable.
  2. Joint Pre-training:
    • Full model trained on mixed image–text and text-only data.
  3. Multi-task Pre-training.

Post-training steps include SFT (supervised fine-tuning, 16M ChatML-formatted examples), Mixed Preference Optimization (MPO, 80K preference pairs), and, in "Thinking" models, on-policy RL with a curriculum of STEM-centric multi-modal chain-of-thought samples.
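
The section above does not spell out the MPO objective; purely as a rough illustration, a DPO-style preference term over (chosen, rejected) response pairs could look like the following, with any additional loss terms MPO may mix in omitted.

```python
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss on sequence log-probs under the policy and a frozen reference model (sketch)."""
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    return -F.logsigmoid(margin).mean()
```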

Data cleaning involves aspect-ratio filtering, CLIP-score ranking, language and length controls, and OCR-specific augmentation. For grounding, balanced "Text-in-Region"/"Region-in-Text" instances are curated.
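
A toy version of such a cleaning pass, assuming CLIP-style encoders exposing encode_image/encode_text and illustrative thresholds; the exact filters and cutoffs used in practice are not specified above.

```python
import torch

def clip_score(clip_model, image_tensor, text_tokens):
    """Cosine similarity between image and caption embeddings, used as a ranking signal."""
    with torch.no_grad():
        img = clip_model.encode_image(image_tensor.unsqueeze(0))
        txt = clip_model.encode_text(text_tokens)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (img @ txt.T).item()

def keep_pair(width, height, caption, score,
              max_aspect=3.0, min_words=3, max_words=256, min_score=0.25):
    """Combine aspect-ratio, caption-length, and CLIP-score filters (thresholds are illustrative)."""
    aspect_ok = max(width / height, height / width) <= max_aspect
    length_ok = min_words <= len(caption.split()) <= max_words
    return aspect_ok and length_ok and score >= min_score
```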

| Source | Original (M) | Cleaned (M) | Remaining (%) |
|---|---|---|---|
| LAION-en | 2,000 | 280 | 14 |
| LAION-COCO | 600 | 300 | 50 |
| DataComp | 1,400 | 300 | 21 |
| Coyo | 700 | 200 | 28 |
| CC12M/CC3M/SBU | ~17 | ~12 | 70 |
| LAION-zh | 108 | 105 | 97 |
| In-house | 220 | 220 | 100 |
| Total | 5,000 | 1,400 | 28 |

3. Alignment of Visual Grounding and Text Reading

Qwen3-VL integrates grounding and text-reading tasks into the core sequence-to-sequence paradigm. All boxed outputs and references are generated as serialized tokens, eliminating the need for explicit bounding-box regression heads. For grounding, outputs follow:

<ref>phrase</ref><box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>

For OCR with grounding:

<ref>…</ref><quad>(x_1,y_1)…(x_4,y_4)</quad>
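
A small parser for the serialized grounding output; the regex and helper name are illustrative, not part of the released tooling.

```python
import re

REF_BOX = re.compile(r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>")

def parse_grounding(generated: str):
    """Extract (phrase, (x_tl, y_tl, x_br, y_br)) pairs from a generated sequence."""
    return [(m.group(1), tuple(int(v) for v in m.groups()[1:]))
            for m in REF_BOX.finditer(generated)]

# parse_grounding("<ref>the red car</ref><box>(120,48),(512,300)</box>")
# -> [('the red car', (120, 48, 512, 300))]
```
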
Within AndesVL, grounding and UI understanding leverage scenario-specialized adapters. Parameter-efficient adaptation is enabled by the 1+N LoRA approach: for each frozen weight matrix $W$, an adapter for each scenario $s$ learns low-rank update factors such that

W' = W + \sum_{s\in\mathcal{S}} U^{(s)} V^{(s)\top}

This enables adding, removing, or mixing scenario-specialized capabilities (e.g., OCR, UI, chart parsing) with minimal overhead and no re-quantization of the main model.
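
A minimal sketch of a linear layer carrying one frozen base matrix plus N scenario adapters; the scenario names, rank, and initialization below are illustrative.

```python
import torch
import torch.nn as nn

class MultiScenarioLoRALinear(nn.Module):
    """One frozen base matrix W plus N low-rank scenario adapters (1+N LoRA sketch)."""
    def __init__(self, d_in, d_out, scenarios=("ocr", "ui", "chart"), rank=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False        # the "1": frozen (possibly quantized) base weights
        self.U = nn.ParameterDict({s: nn.Parameter(torch.zeros(d_out, rank)) for s in scenarios})
        self.V = nn.ParameterDict({s: nn.Parameter(torch.randn(d_in, rank) * 0.01) for s in scenarios})

    def forward(self, x, active=("ocr",)):
        # W'x = Wx + sum_s U^(s) (V^(s)^T x) over the currently active scenario adapters
        y = self.base(x)
        for s in active:
            y = y + (x @ self.V[s]) @ self.U[s].T
        return y
```

Adding a new scenario amounts to registering one more (U, V) pair; removing or mixing scenarios changes only which adapters are active, leaving the frozen base weights untouched.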

4. Benchmark Performance

Qwen3-VL and AndesVL variants achieve state-of-the-art or first-tier results across a suite of vision-language evaluation domains at both cloud and edge scales.

Qwen-VL Family (Table 3/4/5 Extracts):

| Model | Nocaps | Flickr30K | VQAv2 | OKVQA | GQA | VizWiz | TextVQA | DocVQA | OCR-VQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL | 121.4 | 85.8 | 79.5 | 58.6 | 59.3 | 35.2 | 63.8 | 65.1 | 75.7 |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 38.9 | 61.5 | 62.6 | 70.5 |

AndesVL ("Thinking" models, Table 3):

| Model | Text-rich | Reasoning & Math | Multi-img | Gen VQA | Halluc. | Multiling. | Overall |
|---|---|---|---|---|---|---|---|
| AndesVL-4B-Thinking | 86.0 | 58.3 | 67.8 | 73.8 | 74.8 | 64.9 | 70.9 |
| InternVL3.5-4B | 82.6 | 56.9 | 62.3 | 72.8 | 69.6 | 62.1 | 67.7 |

Few-shot in-context learning: Qwen-VL-7B's performance curves on OKVQA, VizWiz, TextVQA, and Flickr30K exceed those of Flamingo-9B/80B and the IDEFICS models.

On real-world multimodal instruction benchmarks (TouchStone, SEED-Bench, MME), Qwen-VL-Chat leads with up to +43 points over InstructBLIP and provides full Chinese-language support.

Latency and memory for AndesVL models on Dimensity 9500 (floating-point baseline):

  • 4B: ~2.4 GB RAM, ~3 tokens/s;
  • 2B: ~1.2 GB, ~5 tokens/s;
  • 1B: ~650 MB, ~9 tokens/s;
  • 0.6B: ~400 MB, ~15 tokens/s.

5. Parameter- and Memory-Efficient Fine-Tuning

AndesVL introduces the 1+N LoRA approach for efficient scenario-specific adaptation. This method freezes the core Qwen3 weights and trains per-scenario adapters, each with a low-rank decomposition. Multiple adapters can be loaded or swapped without re-quantizing the base model, yielding $O(N \cdot 2dr)$ total parameter overhead across $N$ scenarios ($r \ll d$), which is favorable for mobile device memory and task modularity.
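
For a sense of scale, assuming a square weight matrix with hidden size 2560 and rank 16 (both illustrative), one scenario adapter adds roughly 1% of the parameters of the dense matrix it augments:

```python
d = 2560                       # assumed hidden size of one square weight matrix
r = 16                         # assumed LoRA rank
full = d * d                   # 6,553,600 parameters in the dense matrix
per_scenario = 2 * d * r       # 81,920 parameters for one (U, V) pair, i.e. the 2dr term
print(per_scenario / full)     # 0.0125 -> ~1.25% overhead per scenario, times N scenarios
```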

Quantization-aware training (QAT) is supported for weights (2-8 bits) and activations (8/16 bits, mixed precision). AndesVL-4B achieves 95.8% Top-1 overlap with FP32 baselines on OCR benchmarks under QAT plus PTQ. Quantization-aware LoRA fine-tuning (QALFT) keeps accuracy degradation below 3% relative to FP32 LoRA.
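
A minimal straight-through fake-quantization sketch in the spirit of QAT; the bit width, rounding scheme, and function names are assumptions, not the exact quantizer used here.

```python
import torch
import torch.nn.functional as F

class FakeQuant(torch.autograd.Function):
    """Symmetric per-tensor fake quantization with a straight-through gradient (sketch)."""
    @staticmethod
    def forward(ctx, w, n_bits):
        qmax = 2 ** (n_bits - 1) - 1
        scale = w.abs().max().clamp(min=1e-8) / qmax
        return torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None          # straight-through: pass gradients to the fp weights

def qat_linear(x, weight, bias=None, n_bits=4):
    """Linear layer evaluated with fake-quantized weights during training."""
    return F.linear(x, FakeQuant.apply(weight, n_bits), bias)
```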

6. Mobile Deployment Optimizations

Qwen3-VL's AndesVL line implements several hardware-aware optimizations for real-time mobile execution:

  • Cache Eviction (OKV): Maintains a linear-size K/V store by evicting the entries with minimal attention scores. At 50% eviction, ROUGE-L drops only to 0.39 versus a 0.42 baseline, outperforming SnapKV (a minimal eviction sketch follows the table below).
  • Speculative Decoding: Lightweight draft models propose token blocks that are verified in parallel, achieving block efficiency up to 7.9; combined with up to 20% structural sparsity, this yields a 6.7× speedup, 1.8 bits/weight, and a 30.9% RAM reduction.
  • On-device Compression: Structured sparsification and hardware-aware quantization cumulatively improve speed and memory efficiency without substantial performance loss, as illustrated by the following table:

| Technique | Speedup | Bits/weight |
|---|---|---|
| PTQ only (baseline) | 1.0× | 3.0 |
| + Hardware-aware compression | 1.1× | 3.0 |
| + Structured sparsification | 1.6× | 1.8 |
| + Speculative decoding | 6.7× | 1.8 |
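
A minimal sketch of attention-score-based K/V eviction in the spirit of the OKV strategy above; the scoring rule and keep ratio are illustrative.

```python
import torch

def evict_kv(keys, values, attn_mass, keep_ratio=0.5):
    """Keep the most-attended cached entries and drop the rest.
    keys, values: (n_cached, d); attn_mass: (n_cached,) accumulated attention received per entry."""
    n_keep = max(1, int(keys.size(0) * keep_ratio))
    keep_idx = torch.topk(attn_mass, n_keep).indices.sort().values   # preserve original token order
    return keys[keep_idx], values[keep_idx]
```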

These strategies enable AndesVL models up to 4B parameters to run in real-time on Dimensity 9500-class phones.

7. Public Availability and Research Implications

All primary Qwen-VL code, checkpoints, and model variations are open-sourced, with demos at https://github.com/QwenLM/Qwen-VL (Bai et al., 2023). The consolidation of cloud-scale and mobile-optimized LVLMs under the Qwen3-VL umbrella suggests a practical convergence of state-of-the-art vision-language reasoning with deployment efficiency. The explicit unification of vision-language alignment, parameter-efficient adaptation (1+N LoRA), and sophisticated quantization augurs expanded accessibility and customization for real-world multimodal applications. A plausible implication is accelerated progress in on-device multimodal AI for diverse languages and domains, as well as broadened research in LVLM adaptation, benchmarking, and efficiency-aware model design (Jin et al., 13 Oct 2025).
