Qwen3-VL: Scalable Cloud & Mobile LVLMs
- Qwen3-VL Model Series is a collection of large-scale vision-language models that integrate advanced multimodal pretraining, scalable architecture design, and hardware-aware optimizations.
- It features both cloud-oriented and mobile-focused variants, achieving robust performance across image captioning, VQA, visual grounding, and multilingual multimodal understanding.
- Innovative techniques like the 1+N LoRA approach and quantization-aware training ensure parameter efficiency and enable real-time on-device deployment.
The Qwen3-VL Model Series comprises large-scale vision-language models (LVLMs) designed to integrate state-of-the-art cross-modal reasoning with efficient mobile deployment. Built on the Qwen and Qwen3 LLM families, Qwen3-VL unifies advances in multimodal pretraining, scalable architecture design, parameter-efficient adaptation, and hardware-aware optimization. The series includes both cloud-oriented (Qwen-VL, Qwen-VL-Chat) and mobile-side (AndesVL) variants, and demonstrates leading benchmark performance across image captioning, general and text-oriented visual question answering (VQA), visual grounding, multilingual multimodal understanding, and multimodal instruction following (Bai et al., 2023; Jin et al., 13 Oct 2025).
1. Architectural Design
Qwen3-VL models employ a three-component architecture: a high-capacity visual encoder, a projection adapter, and a transformer-based LLM. In the original Qwen-VL, a frozen Vision Transformer (ViT-bigG) encodes the image as patch tokens, which are enriched with 2D absolute positional encodings. A learnable, position-aware VL Adapter applies cross-attention from a fixed set of learnable query embeddings to compress the variable-length ViT output into a fixed-length visual token sequence, which is serialized and injected into the LLM between special tokens such as <img>...</img>.
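As a concrete illustration, the following minimal PyTorch-style sketch shows how a fixed set of learnable queries can cross-attend to ViT patch tokens and emit a fixed-length visual sequence. The class name, dimensions, and single-attention-layer design are assumptions for illustration, not the released Qwen-VL adapter.

```python
import torch
import torch.nn as nn

class QueryCompressionAdapter(nn.Module):
    """Hypothetical sketch of a position-aware VL adapter: a fixed set of
    learnable queries cross-attends to ViT patch tokens and emits a
    fixed-length visual sequence for the LLM."""
    def __init__(self, num_queries=256, vit_dim=1664, llm_dim=4096, num_heads=8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
        self.kv_proj = nn.Linear(vit_dim, llm_dim)      # map ViT features to LLM width
        self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

    def forward(self, patch_tokens):                    # (B, N_patches, vit_dim)
        kv = self.kv_proj(patch_tokens)                 # (B, N_patches, llm_dim)
        q = self.queries.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
        compressed, _ = self.attn(q, kv, kv)            # (B, num_queries, llm_dim)
        return compressed                               # fixed-length visual tokens

# Example: compress 1024 patch tokens into a fixed-length visual sequence.
feats = torch.randn(2, 1024, 1664)
print(QueryCompressionAdapter()(feats).shape)           # torch.Size([2, 256, 4096])
```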
AndesVL (the Qwen3-VL mobile-focused line) generalizes this paradigm. Its structure consists of:
- A ViT-based visual encoder (e.g., AIMv2-Large or SigLIP2-Base) with native-resolution, 2D-RoPE positional embeddings;
- Pixel-shuffle downsampling (groups patches into macro-patches);
- An MLP projector aligning visual features to the LLM token dimension, t = W_2 σ(W_1 v + b_1) + b_2, where v ∈ R^{d_v} is a pixel-shuffled visual feature, W_1 ∈ R^{d_h×d_v}, W_2 ∈ R^{d_t×d_h}, and d_t is the LLM embedding dimension (a minimal projector sketch follows this list);
- A Qwen3 transformer LLM, available in configurations with 0.6B, 1.7B, or 4B parameters.
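The pixel-shuffle plus MLP projector path can be sketched as below, assuming a 2×2 shuffle and a two-layer GELU MLP; the class name, dimensions, and activation are illustrative, not the AndesVL implementation.

```python
import torch
import torch.nn as nn

class PixelShuffleProjector(nn.Module):
    """Hypothetical sketch: group 2x2 neighborhoods of ViT patches into
    macro-patches (4x fewer tokens), then project to the LLM width with a
    two-layer MLP. Dimensions and activation are illustrative assumptions."""
    def __init__(self, vit_dim=1024, llm_dim=2048, shuffle=2):
        super().__init__()
        self.shuffle = shuffle
        self.mlp = nn.Sequential(
            nn.Linear(vit_dim * shuffle * shuffle, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, x, h, w):                       # x: (B, h*w, vit_dim), row-major grid
        b, _, d = x.shape
        s = self.shuffle
        x = x.view(b, h // s, s, w // s, s, d)        # expose s x s spatial blocks
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(b, (h // s) * (w // s), s * s * d)
        return self.mlp(x)                            # (B, h*w/s^2, llm_dim)

tokens = torch.randn(1, 32 * 32, 1024)                # native-resolution patch grid
print(PixelShuffleProjector()(tokens, 32, 32).shape)  # torch.Size([1, 256, 2048])
```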
Data is formatted for the autoregressive LLM as a unified input sequence containing textual content, serialized bounding boxes (<box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>), image tokens, and references.
| Model | Parameters (B) | Vision Encoder | LLM |
|---|---|---|---|
| AndesVL-0.6B | 0.695 | SigLIP2-Base | Qwen3-0.6B |
| AndesVL-1B | 0.927 | AIMv2-Large | Qwen3-0.6B |
| AndesVL-2B | 2.055 | AIMv2-Large | Qwen3-1.7B |
| AndesVL-4B | 4.360 | AIMv2-Large | Qwen3-4B |
2. Training Regimens and Multimodal Data
Qwen3-VL training follows a staged pipeline utilizing a cleaned, multilingual, multimodal corpus.
Qwen-VL Stages:
- Vision-Language Pre-training:
- 1.4B image-caption pairs (77.3% English, 22.7% Chinese); ViT and adapter are optimized under cross-entropy while the LM is frozen.
- Multi-Task Pre-training:
- Seven interleaved tasks (captioning, VQA, visual grounding, referring expressions, grounded captioning, OCR, text auto-regression); LLM is unfrozen for joint optimization; mixed sequence lengths up to 2048.
- Instruction Tuning:
- 350K multimodal dialogues (ChatML format) for Qwen-VL-Chat; ViT frozen, LLM and adapter tuned.
AndesVL Stages:
- Vision–Language Alignment:
- Captions, OCR, VQA; only ViT and MLP trainable.
- Joint Pre-training:
- Full model trained on mixed image–text and text-only data.
- Multi-task Pre-training:
- Tasks include VQA, captioning, OCR, UI, and multi-modal chain-of-thought (CoT) reasoning (CoT only in "Thinking" models).
Post-training steps include SFT (supervised fine-tuning, 16M ChatML-formatted examples), Mixed Preference Optimization (MPO, 80K preference pairs), and, in "Thinking" models, on-policy RL with a curriculum of STEM-centric multi-modal chain-of-thought samples.
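The stage-wise freezing schedule described in the pipelines above can be wired as a small helper; the module handles and stage names below are assumptions for illustration rather than the released training code.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter in a module."""
    for p in module.parameters():
        p.requires_grad = trainable

def configure_stage(vit: nn.Module, adapter: nn.Module, llm: nn.Module, stage: str) -> None:
    """Hypothetical stage schedule mirroring the text: alignment/pre-training tunes
    the ViT and adapter with the LLM frozen, multi-task/joint pre-training unfreezes
    the LLM, and instruction tuning freezes the ViT again."""
    if stage == "vl_pretrain":
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, False)
    elif stage == "multitask_pretrain":
        set_trainable(vit, True); set_trainable(adapter, True); set_trainable(llm, True)
    elif stage == "instruction_tuning":
        set_trainable(vit, False); set_trainable(adapter, True); set_trainable(llm, True)
    else:
        raise ValueError(f"unknown stage: {stage}")
```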
Data cleaning involves aspect-ratio filtering, CLIP-score ranking, language and length controls, and OCR-specific augmentation. For grounding, balanced "Text-in-Region"/"Region-in-Text" instances are curated.
| Source | Original (M) | Cleaned (M) | Remain (%) |
|---|---|---|---|
| LAION-en | 2,000 | 280 | 14 |
| LAION-COCO | 600 | 300 | 50 |
| DataComp | 1,400 | 300 | 21 |
| Coyo | 700 | 200 | 28 |
| CC12M/CC3M/SBU | ~17 | ~12 | 70 |
| LAION-zh | 108 | 105 | 97 |
| In-house | 220 | 220 | 100 |
| Total | 5,000 | 1,400 | 28 |
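The cleaning criteria above can be illustrated with a minimal per-sample filter; all thresholds and argument names are assumptions, not the values used to produce the table.

```python
def keep_sample(width, height, clip_score, caption,
                min_side=28, max_aspect=4.0, min_clip=0.28,
                min_len=3, max_len=256):
    """Hypothetical caption-pair filter mirroring the cleaning steps described
    above: aspect-ratio and size checks, CLIP-score thresholding, and caption
    length control. All thresholds are illustrative assumptions."""
    if min(width, height) < min_side:
        return False
    aspect = max(width, height) / max(1, min(width, height))
    if aspect > max_aspect:                      # drop extreme aspect ratios
        return False
    if clip_score < min_clip:                    # drop weakly aligned image-text pairs
        return False
    n_words = len(caption.split())
    return min_len <= n_words <= max_len         # drop too-short or too-long captions
```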
3. Alignment of Visual Grounding and Text Reading
Qwen3-VL integrates grounding and text-reading tasks into the core sequence-to-sequence paradigm. All boxed outputs and references are generated as serialized tokens, eliminating the need for explicit bounding-box regression heads. For grounding, outputs follow:
- Grounding: <ref>phrase</ref><box>(x_{tl},y_{tl}),(x_{br},y_{br})</box>
- OCR with grounding: <ref>…</ref><quad>(x_1,y_1),…,(x_4,y_4)</quad>
This enables adding, removing, or mixing scenario-specialized capabilities (e.g., OCR, UI, chart parsing) with minimal overhead and no re-quantization of the main model.
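A minimal sketch of serializing and parsing the grounding format above; the coordinate convention (integer pixel values) is an assumption for illustration.

```python
import re

def format_grounding(phrase, box):
    """Serialize a phrase and its bounding box in the <ref>/<box> scheme
    described above. `box` is (x_tl, y_tl, x_br, y_br); whether coordinates
    are absolute pixels or normalized to a fixed range is an assumption here."""
    x_tl, y_tl, x_br, y_br = box
    return f"<ref>{phrase}</ref><box>({x_tl},{y_tl}),({x_br},{y_br})</box>"

def parse_grounding(text):
    """Recover (phrase, box) pairs from a generated sequence."""
    pattern = r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
    return [(m[0], tuple(int(v) for v in m[1:]))
            for m in re.findall(pattern, text)]

s = format_grounding("the red umbrella", (120, 48, 340, 310))
print(parse_grounding(s))   # [('the red umbrella', (120, 48, 340, 310))]
```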
4. Benchmark Performance
Qwen3-VL and AndesVL variants achieve state-of-the-art or first-tier results across a suite of vision-language evaluation domains at both cloud and edge scales.
Qwen-VL Family (Table 3/4/5 Extracts):
| Model | Nocaps | Flickr30K | VQAv2 | OKVQA | GQA | VizWiz | TextVQA | DocVQA | OCR-VQA |
|---|---|---|---|---|---|---|---|---|---|
| Qwen-VL | 121.4 | 85.8 | 79.5 | 58.6 | 59.3 | 35.2 | 63.8 | 65.1 | 75.7 |
| Qwen-VL-Chat | 120.2 | 81.0 | 78.2 | 56.6 | 57.5 | 38.9 | 61.5 | 62.6 | 70.5 |
AndesVL ("Thinking" models, Table 3):
| Model | Text-rich | Reasoning & Math | Multi-img | Gen VQA | Halluc. | Multiling. | Overall |
|---|---|---|---|---|---|---|---|
| AndesVL-4B-Thinking | 86.0 | 58.3 | 67.8 | 73.8 | 74.8 | 64.9 | 70.9 |
| InternVL3.5-4B | 82.6 | 56.9 | 62.3 | 72.8 | 69.6 | 62.1 | 67.7 |
Few-shot in-context learning: Qwen-VL 7B's performance curves on OKVQA, VizWiz, TextVQA, and Flickr30K exceed Flamingo-9B/80B and IDEFICS models.
On real-world multimodal instruction (TouchStone, SEED-Bench, MME), Qwen-VL-Chat leads with up to +43 points over InstructBLIP, and provides full Chinese language support.
Latency and memory for AndesVL models on Dimensity 9500 (floating-point baseline):
- 4B: ~2.4 GB RAM, ~3 tokens/s;
- 2B: ~1.2 GB, ~5 tokens/s;
- 1B: ~650 MB, ~9 tokens/s;
- 0.6B: ~400 MB, ~15 tokens/s.
5. Parameter- and Memory-Efficient Fine-Tuning
AndesVL introduces the 1+N LoRA approach for efficient scenario-specific adaptation. This method freezes the core Qwen3 weights and trains N per-scenario adapters, each a low-rank decomposition of the weight update (ΔW = BA with rank r ≪ d). Multiple adapters can be loaded or swapped without re-quantizing the base model, adding only r(d_in + d_out) parameters per adapted weight matrix for each scenario, which is favorable for mobile device memory and task modularity.
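A minimal sketch of the 1+N idea, assuming a frozen base linear layer plus named, swappable low-rank adapters; the class name, rank, and scenario labels are illustrative.

```python
import torch
import torch.nn as nn

class SwappableLoRALinear(nn.Module):
    """Hypothetical sketch of the 1+N idea: one frozen base projection plus N
    named low-rank adapters (rank r), only one of which is active at a time.
    Each scenario adds just r * (d_in + d_out) parameters."""
    def __init__(self, d_in, d_out, rank=8, scenarios=("ocr", "ui", "chart")):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad = False               # frozen core weights
        self.adapters = nn.ModuleDict({
            name: nn.ModuleDict({
                "A": nn.Linear(d_in, rank, bias=False),
                "B": nn.Linear(rank, d_out, bias=False),
            }) for name in scenarios
        })
        for a in self.adapters.values():                      # start adapters at zero update
            nn.init.zeros_(a["B"].weight)
        self.active = None

    def forward(self, x):
        y = self.base(x)
        if self.active is not None:
            a = self.adapters[self.active]
            y = y + a["B"](a["A"](x))                         # low-rank update B(Ax)
        return y

layer = SwappableLoRALinear(1024, 1024)
layer.active = "ocr"                                          # swap scenarios at runtime
print(layer(torch.randn(2, 1024)).shape)                      # torch.Size([2, 1024])
```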
Quantization-aware training (QAT) is supported for weights (2–8 bits) and activations (8/16 bits, mixed precision). Under QAT combined with post-training quantization (PTQ), AndesVL-4B retains 95.8% Top-1 overlap with its FP32 baseline on OCR benchmarks. Quantization-aware LoRA fine-tuning (QALFT) limits accuracy degradation to roughly 3% relative to FP32 LoRA fine-tuning.
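A minimal sketch of one QAT building block, assuming symmetric per-tensor fake quantization with a straight-through estimator; the bit width and scheme are illustrative, not the QALFT recipe.

```python
import torch

def fake_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Hypothetical QAT building block: symmetric per-tensor fake quantization
    with a straight-through estimator so gradients flow to the FP32 weights.
    The bit width and symmetric scheme are illustrative assumptions."""
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max().clamp(min=1e-8) / qmax
    w_q = torch.clamp(torch.round(w / scale), -qmax - 1, qmax) * scale
    return w + (w_q - w).detach()        # forward: quantized values; backward: identity

w = torch.randn(256, 256, requires_grad=True)
loss = fake_quantize(w, bits=4).pow(2).sum()
loss.backward()                          # gradients reach the full-precision weights
print(w.grad is not None)                # True
```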
6. Mobile Deployment Optimizations
Qwen3-VL's AndesVL line implements several hardware-aware optimizations for real-time mobile execution:
- Cache Eviction (OKV): Bounds the key/value cache by evicting entries that receive minimal attention, keeping cache growth linear in the retained budget. At a 50% eviction ratio, Rouge-L drops only from 0.42 (full cache) to 0.39, outperforming SnapKV; a minimal eviction sketch appears at the end of this section.
- Speculative Decoding: Lightweight draft models propose token blocks that the target model verifies in parallel, achieving block efficiency up to 7.9; combined with up to 20% structural sparsity, this yields a 6.7× speedup at 1.8 bits/weight and a 30.9% RAM reduction.
- On-device Compression: Structured sparsification and hardware-aware quantization cumulatively improve speed and memory footprint without substantial performance loss, as illustrated by the following table:
| Technique | Speedup | Bits/weight |
|---|---|---|
| PTQ only (baseline) | 1.0× | 3.0 |
| + Hardware-aware compression | 1.1× | 3.0 |
| + Structured sparsification | 1.6× | 1.8 |
| + Speculative decoding | 6.7× | 1.8 |
These strategies enable AndesVL models up to 4B parameters to run in real-time on Dimensity 9500-class phones.
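A minimal sketch of attention-score-based cache eviction in the spirit of OKV; the scoring rule (cumulative attention from recent queries) and keep ratio are assumptions, not the exact OKV policy.

```python
import torch

def evict_kv(keys, values, attn_weights, keep_ratio=0.5):
    """Hypothetical cache-eviction step: rank cached entries by the attention
    mass recent queries placed on them and keep only the top fraction.

    keys, values:  (T, d) cached tensors
    attn_weights:  (Q, T) attention from the last Q queries to the T entries
    """
    t = keys.size(0)
    budget = max(1, int(t * keep_ratio))
    scores = attn_weights.sum(dim=0)                          # cumulative attention per entry
    keep = torch.topk(scores, budget).indices.sort().values   # keep original ordering
    return keys[keep], values[keep], keep

k, v = torch.randn(128, 64), torch.randn(128, 64)
attn = torch.softmax(torch.randn(8, 128), dim=-1)
k2, v2, kept = evict_kv(k, v, attn, keep_ratio=0.5)
print(k2.shape)                                               # torch.Size([64, 64])
```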
7. Public Availability and Research Implications
All primary Qwen-VL code, checkpoints, and model variations are open-sourced, with demos at https://github.com/QwenLM/Qwen-VL (Bai et al., 2023). The consolidation of cloud-scale and mobile-optimized LVLMs under the Qwen3-VL umbrella suggests a practical convergence of state-of-the-art vision-language reasoning with deployment efficiency. The explicit unification of vision-language alignment, parameter-efficient adaptation (1+N LoRA), and sophisticated quantization augurs expanded accessibility and customization for real-world multimodal applications. A plausible implication is accelerated progress in on-device multimodal AI for diverse languages and domains, as well as broadened research in LVLM adaptation, benchmarking, and efficiency-aware model design (Jin et al., 13 Oct 2025).