Qwen3-VL: Scalable Cloud & Mobile LVLMs
- Qwen3-VL Model Series is a collection of large-scale vision-language models that integrate advanced multimodal pretraining, scalable architecture design, and hardware-aware optimizations.
- It features both cloud-oriented and mobile-focused variants, achieving robust performance across image captioning, VQA, visual grounding, and multilingual multimodal understanding.
- Innovative techniques like the 1+N LoRA approach and quantization-aware training ensure parameter efficiency and enable real-time on-device deployment.
The Qwen3-VL Model Series comprises large-scale vision-language models (LVLMs) designed to integrate state-of-the-art cross-modal reasoning with efficient mobile deployment. Built on the Qwen and Qwen3 LLM families, Qwen3-VL unifies advances in multimodal pretraining, scalable architecture design, parameter-efficient adaptation, and hardware-aware optimization. The series includes both cloud-oriented (Qwen-VL, Qwen-VL-Chat) and mobile-side (AndesVL) variants, and demonstrates leading benchmark performance across image captioning, general and text-oriented visual question answering (VQA), visual grounding, multilingual multimodal understanding, and multimodal instruction following (Bai et al., 2023; Jin et al., 13 Oct 2025).
1. Architectural Design
Qwen3-VL models employ a three-component architecture: a high-capacity visual encoder, a projection adapter, and a transformer-based LLM. In the original Qwen-VL, a frozen Vision Transformer (ViT-bigG) encodes the image as patch tokens, which are enriched with 2D absolute positional encodings. A learnable, position-aware VL Adapter applies cross-attention from a fixed set of learnable query embeddings to compress the variable-length ViT output into a fixed-length visual token sequence, which is serialized and injected into the LLM between special tokens such as <img>...</img>.
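As a concrete illustration, the following minimal PyTorch-style sketch shows how a fixed set of learnable queries can cross-attend to ViT patch tokens and emit a fixed-length visual sequence. The class name, dimensions, and single-attention-layer design are assumptions for illustration, not the released Qwen-VL adapter.

```python
import torch
import torch.nn as nn

class QueryCompressionAdapter(nn.Module):
    """Hypothetical sketch of a position-aware VL adapter: a fixed set of
    learnable queries cross-attends to ViT patch tokens and emits a
    fixed-length visual sequence for the LLM."""
    def __init__(self, num_queries=256, vit_dim=1664, llm_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, llm_dim)      # map ViT features to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):                    # (B, N_patches, vit_dim)
        kv = self.kv_proj(patch_tokens)                 # (B, N_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, kv, kv)            # (B, num_queries, llm_dim)
        return compressed                               # fixed-length visual tokens

# Example: compress 1024 patch tokens into a fixed-length visual sequence.
feats = torch.randn(2, 1024, 1664)
print(QueryCompressionAdapter()(feats).shape)           # torch.Size([2, 256, 4096])
```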
AndesVL (the Qwen3-VL mobile-focused line) generalizes this paradigm. Its structure consists of:
- A ViT-based visual encoder (e.g., AIMv2-Large or SigLIP2-Base) with native-resolution, 2D-RoPE positional embeddings;
- Pixel-shuffle downsampling (groups patches into macro-patches);
- An MLP projector aligning visual features to the LLM token dimension, t = W_2 σ(W_1 v + b_1) + b_2, where v ∈ R^{d_v} is a pixel-shuffled visual feature, W_1 ∈ R^{d_h×d_v}, W_2 ∈ R^{d_t×d_h}, and d_t is the LLM embedding dimension (a minimal projector sketch follows this list);
- A Qwen3 transformer LLM, available in configurations with 0.6B, 1.7B, or 4B parameters.
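The pixel-shuffle plus MLP projector path can be sketched as below, assuming a 2×2 shuffle and a two-layer GELU MLP; the class name, dimensions, and activation are illustrative, not the AndesVL implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Hypothetical sketch: group 2x2 neighborhoods of ViT patches into
    macro-patches (4x fewer tokens), then project to the LLM width with a
    two-layer MLP. Dimensions and activation are illustrative assumptions."""
    def __init__(self, vit_dim=1024, llm_dim=2048, shuffle=2):
        super().__init__()
        self.shuffle = shuffle
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * shuffle * shuffle, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x, h, w):                       # x: (B, h*w, vit_dim), row-major grid
        b, _, d = x.shape
        s = self.shuffle
        x = x.view(b, h // s, s, w // s, s, d)        # expose s x s spatial blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // s) * (w // s), s * s * d)
        return self.mlp(x)                            # (B, h*w/s^2, llm_dim)

tokens = torch.randn(1, 32 * 32, 1024)                # native-resolution patch grid
print(PixelShuffleProjector()(tokens, 32, 32).shape)  # torch.Size([1, 256, 2048])
```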
Data is formatted for the autoregressive LLM as a unified input sequence containing textual content, serialized bounding boxes (<box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>), image tokens, and references.
| Model | Parameters (B) | Vision Encoder | LLM |
|---|---|---|---|
| AndesVL-0.6B | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| AndesVL-2B | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |
2. Training Regimens and Multimodal Data
Qwen3-VL training follows a staged pipeline utilizing a cleaned, multilingual, multimodal corpus.
Qwen-VL Stages:
- Vision-Language Pre-training:
- 1.4B image-caption pairs (77.3% English, 22.7% Chinese); ViT and adapter are optimized under cross-entropy while the LM is frozen.
- Multi-Task Pre-training:
- Seven interleaved tasks (captioning, VQA, visual grounding, referring expressions, grounded captioning, OCR, text auto-regression); LLM is unfrozen for joint optimization; mixed sequence lengths up to 2048.
- Instruction Tuning:
- 350K multimodal dialogues (ChatML format) for Qwen-VL-Chat; ViT frozen, LLM and adapter tuned.
AndesVL Stages:
- Vision–Language Alignment:
- Captions, OCR, VQA; only ViT and MLP trainable.
- Joint Pre-training:
- Full model trained on mixed image–text and text-only data.
- Multi-task Pre-training:
- Tasks include VQA, captioning, OCR, UI, and multi-modal chain-of-thought (CoT) reasoning (CoT only in "Thinking" models).
Post-training steps include SFT (supervised fine-tuning, 16M ChatML-formatted examples), Mixed Preference Optimization (MPO, 80K preference pairs), and, in "Thinking" models, on-policy RL with a curriculum of STEM-centric multi-modal chain-of-thought samples.
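The stage-wise freezing schedule described in the pipelines above can be wired as a small helper; the module handles and stage names below are assumptions for illustration rather than the released training code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(vit: nn.Module, adapter: nn.Module, llm: nn.Module, stage: str) -> None:
    """Hypothetical stage schedule mirroring the text: alignment/pre-training tunes
    the ViT and adapter with the LLM frozen, multi-task/joint pre-training unfreezes
    the LLM, and instruction tuning freezes the ViT again."""
    if stage == "vl_pretrain":
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, False)
    elif stage == "multitask_pretrain":
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, True)
    elif stage == "instruction_tuning":
        set_trainable(vit, False); set_trainable(adapter, True); set_trainable(llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```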
Data cleaning involves aspect-ratio filtering, CLIP-score ranking, language and length controls, and OCR-specific augmentation. For grounding, balanced "Text-in-Region"/"Region-in-Text" instances are curated.
| Source | Original (M) | Cleaned (M) | Remain (%) |
|---|---|---|---|
| LAION-en | 2,000 | 280 | 14 |
| LAION-COCO | 600 | 300 | 50 |
| DataComp | 1,400 | 300 | 21 |
| Coyo | 700 | 200 | 28 |
| CC12M/CC3M/SBU | ~17 | ~12 | 70 |
| LAION-zh | 108 | 105 | 97 |
| In-house | 220 | 220 | 100 |
| Total | 5,000 | 1,400 | 28 |
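The cleaning criteria above can be illustrated with a minimal per-sample filter; all thresholds and argument names are assumptions, not the values used to produce the table.

```python
def keep_sample(width, height, clip_score, caption,
                min_side=28, max_aspect=4.0, min_clip=0.28,
                min_len=3, max_len=256):
    """Hypothetical caption-pair filter mirroring the cleaning steps described
    above: aspect-ratio and size checks, CLIP-score thresholding, and caption
    length control. All thresholds are illustrative assumptions."""
    if min(width, height) < min_side:
        return False
    aspect = max(width, height) / max(1, min(width, height))
    if aspect > max_aspect:                      # drop extreme aspect ratios
        return False
    if clip_score < min_clip:                    # drop weakly aligned image-text pairs
        return False
    n_words = len(caption.split())
    return min_len <= n_words <= max_len         # drop too-short or too-long captions
```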
3. Alignment of Visual Grounding and Text Reading
Qwen3-VL integrates grounding and text-reading tasks into the core sequence-to-sequence paradigm. All boxed outputs and references are generated as serialized tokens, eliminating the need for explicit bounding-box regression heads. For grounding, outputs follow:
- Grounding: <ref>phrase</ref><box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>
- OCR with grounding: <ref>…</ref><quad>(x_1,y_1),…,(x_4,y_4)</quad>
This enables adding, removing, or mixing scenario-specialized capabilities (e.g., OCR, UI, chart parsing) with minimal overhead and no re-quantization of the main model.
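A minimal sketch of serializing and parsing the grounding format above; the coordinate convention (integer pixel values) is an assumption for illustration.

```python
import re

def format_grounding(phrase, box):
    """Serialize a phrase and its bounding box in the <ref>/<box> scheme
    described above. `box` is (x_tl, y_tl, x_br, y_br); whether coordinates
    are absolute pixels or normalized to a fixed range is an assumption here."""
    x_tl, y_tl, x_br, y_br = box
    return f"<ref>{phrase}</ref><box>({x_tl},{y_tl}),({x_br},{y_br})</box>"

def parse_grounding(text):
    """Recover (phrase, box) pairs from a generated sequence."""
    pattern = r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    return [(m[0], tuple(int(v) for v in m[1:]))
            for m in re.findall(pattern, text)]

s = format_grounding("the red umbrella", (120, 48, 340, 310))
print(parse_grounding(s))   # [('the red umbrella', (120, 48, 340, 310))]
```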
4. Benchmark Performance
Qwen3-VL and AndesVL variants achieve state-of-the-art or first-tier results across a suite of vision-language evaluation domains at both cloud and edge scales.
Qwen-VL Family (Table 3/4/5 Extracts):
| Model | Nocaps | Flickr30K | VQAv2 | OKVQA | GQA | VizWiz | TextVQA | DocVQA | OCR-VQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL | 121.4 | 85.8 | 79.5 | 58.6 | 59.3 | 35.2 | 63.8 | 65.1 | 75.7 |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 38.9 | 61.5 | 62.6 | 70.5 |
AndesVL ("Thinking" models, Table 3):
| Model | Text-rich | Reasoning & Math | Multi-img | Gen VQA | Halluc. | Multiling. | Overall |
|---|---|---|---|---|---|---|---|
| AndesVL-4B-Thinking | 86.0 | 58.3 | 67.8 | 73.8 | 74.8 | 64.9 | 70.9 |
| InternVL3.5-4B | 82.6 | 56.9 | 62.3 | 72.8 | 69.6 | 62.1 | 67.7 |
Few-shot in-context learning: Qwen-VL 7B's performance curves on OKVQA, VizWiz, TextVQA, and Flickr30K exceed Flamingo-9B/80B and IDEFICS models.
On real-world multimodal instruction (TouchStone, SEED-Bench, MME), Qwen-VL-Chat leads with up to +43 points over InstructBLIP, and provides full Chinese language support.
Latency and memory for AndesVL models on Dimensity 9500 (floating-point baseline):
- 4B: ~2.4 GB RAM, ~3 tokens/s;
- 2B: ~1.2 GB, ~5 tokens/s;
- 1B: ~650 MB, ~9 tokens/s;
- 0.6B: ~400 MB, ~15 tokens/s.
5. Parameter- and Memory-Efficient Fine-Tuning
AndesVL introduces the 1+N LoRA approach for efficient scenario-specific adaptation. This method freezes the core Qwen3 weights and trains N per-scenario adapters, each a low-rank decomposition of the weight update (ΔW = BA with rank r ≪ d). Multiple adapters can be loaded or swapped without re-quantizing the base model, adding only r(d_in + d_out) parameters per adapted weight matrix for each scenario, which is favorable for mobile device memory and task modularity.
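A minimal sketch of the 1+N idea, assuming a frozen base linear layer plus named, swappable low-rank adapters; the class name, rank, and scenario labels are illustrative.

```python
import torch
import torch.nn as nn

class SwappableLoRALinear(nn.Module):
    """Hypothetical sketch of the 1+N idea: one frozen base projection plus N
    named low-rank adapters (rank r), only one of which is active at a time.
    Each scenario adds just r * (d_in + d_out) parameters."""
    def __init__(self, d_in, d_out, rank=8, scenarios=("ocr", "ui", "chart")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # frozen core weights
        self.adapters = nn.ModuleDict({
            name: nn.ModuleDict({
                "A": nn.Linear(d_in, rank, bias=False),
                "B": nn.Linear(rank, d_out, bias=False),
            }) for name in scenarios
        })
        for a in self.adapters.values():                      # start adapters at zero update
            nn.init.zeros_(a["B"].weight)
        self.active = None

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:
            a = self.adapters[self.active]
            y = y + a["B"](a["A"](x))                         # low-rank update B(Ax)
        return y

layer = SwappableLoRALinear(1024, 1024)
layer.active = "ocr"                                          # swap scenarios at runtime
print(layer(torch.randn(2, 1024)).shape)                      # torch.Size([2, 1024])
```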
Quantization-aware training (QAT) is supported for weights (2–8 bits) and activations (8/16 bits, mixed precision). Under QAT combined with post-training quantization (PTQ), AndesVL-4B retains 95.8% Top-1 overlap with its FP32 baseline on OCR benchmarks. Quantization-aware LoRA fine-tuning (QALFT) limits accuracy degradation to roughly 3% relative to FP32 LoRA fine-tuning.
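A minimal sketch of one QAT building block, assuming symmetric per-tensor fake quantization with a straight-through estimator; the bit width and scheme are illustrative, not the QALFT recipe.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Hypothetical QAT building block: symmetric per-tensor fake quantization
    with a straight-through estimator so gradients flow to the FP32 weights.
    The bit width and symmetric scheme are illustrative assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()        # forward: quantized values; backward: identity

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                          # gradients reach the full-precision weights
print(w.grad is not None)                # True
```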
6. Mobile Deployment Optimizations
Qwen3-VL's AndesVL line implements several hardware-aware optimizations for real-time mobile execution:
- Cache Eviction (OKV): Bounds the key/value cache by evicting entries that receive minimal attention, keeping cache growth linear in the retained budget. At a 50% eviction ratio, Rouge-L drops only from 0.42 (full cache) to 0.39, outperforming SnapKV; a minimal eviction sketch appears at the end of this section.
- Speculative Decoding: Lightweight draft models propose token blocks that the target model verifies in parallel, achieving block efficiency up to 7.9; combined with up to 20% structural sparsity, this yields a 6.7× speedup at 1.8 bits/weight and a 30.9% RAM reduction.
- On-device Compression: Structured sparsification and hardware-aware quantization cumulatively improve speed and memory footprint without substantial performance loss, as illustrated by the following table:
| Technique | Speedup | Bits/weight |
|---|---|---|
| PTQ only (baseline) | 1.0× | 3.0 |
| + Hardware-aware compression | 1.1× | 3.0 |
| + Structured sparsification | 1.6× | 1.8 |
| + Speculative decoding | 6.7× | 1.8 |
These strategies enable AndesVL models up to 4B parameters to run in real-time on Dimensity 9500-class phones.
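A minimal sketch of attention-score-based cache eviction in the spirit of OKV; the scoring rule (cumulative attention from recent queries) and keep ratio are assumptions, not the exact OKV policy.

```python
import torch

def evict_kv(keys, values, attn_weights, keep_ratio=0.5):
    """Hypothetical cache-eviction step: rank cached entries by the attention
    mass recent queries placed on them and keep only the top fraction.

    keys, values:  (T, d) cached tensors
    attn_weights:  (Q, T) attention from the last Q queries to the T entries
    """
    t = keys.size(0)
    budget = max(1, int(t * keep_ratio))
    scores = attn_weights.sum(dim=0)                          # cumulative attention per entry
    keep = torch.topk(scores, budget).indices.sort().values   # keep original ordering
    return keys[keep], values[keep], keep

k, v = torch.randn(128, 64), torch.randn(128, 64)
attn = torch.softmax(torch.randn(8, 128), dim=-1)
k2, v2, kept = evict_kv(k, v, attn, keep_ratio=0.5)
print(k2.shape)                                               # torch.Size([64, 64])
```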
7. Public Availability and Research Implications
All primary Qwen-VL code, checkpoints, and model variations are open-sourced, with demos at https://github.com/QwenLM/Qwen-VL (Bai et al., 2023). The consolidation of cloud-scale and mobile-optimized LVLMs under the Qwen3-VL umbrella suggests a practical convergence of state-of-the-art vision-language reasoning with deployment efficiency. The explicit unification of vision-language alignment, parameter-efficient adaptation (1+N LoRA), and sophisticated quantization augurs expanded accessibility and customization for real-world multimodal applications. A plausible implication is accelerated progress in on-device multimodal AI for diverse languages and domains, as well as broadened research in LVLM adaptation, benchmarking, and efficiency-aware model design (Jin et al., 13 Oct 2025).