Qwen-VL-Chat: Advanced Multimodal LVLM
- Qwen-VL-Chat is a multimodal vision-language model that combines image understanding and text interpretation for advanced dialogue and reasoning tasks.
- It employs a ViT-bigG based encoder and a three-stage training pipeline to merge high-resolution visual inputs with language models effectively.
- The model sets new benchmarks in tasks like image captioning, VQA, and OCR, demonstrating robust performance in both general and text-oriented evaluations.
Qwen-VL-Chat is a large vision-language model (LVLM) developed within the Qwen-VL series, characterized by integrated visual and linguistic reasoning, grounded localization, and advanced text-reading abilities. Built atop the 7B-parameter Qwen-7B language model, Qwen-VL-Chat establishes new state-of-the-art results for generalist LVLMs of similar scale across a spectrum of benchmarks, including image captioning, visual question answering (VQA), visual grounding, and real-world multimodal dialogue tasks (Bai et al., 2023).
1. Model Architecture
1.1 Visual Receptor
The vision encoder is based on ViT-bigG with patch size 14, initialized from OpenCLIP's pre-trained weights. During Stage 1 pre-training, images are input at 224×224 resolution, yielding a visual sequence length of 256; in subsequent stages, the input resolution is 448×448, with a sequence length of 1024. The encoder outputs one embedding per 14×14 patch.
Following the encoder, a single-layer cross-attention "position-aware vision–language adapter" is applied, featuring 256 learnable query vectors $Q$. Keys and values derive from the ViT patch embeddings $X$, and 2D absolute positional information is incorporated into the query–key pairs as a bias $B_{\mathrm{2D}}$ on each dot product:

$$\mathrm{Adapter}(Q, X) = \mathrm{softmax}\!\left(\frac{Q\,(X W_K)^{\top}}{\sqrt{d}} + B_{\mathrm{2D}}\right) X W_V$$
This compresses the variable-length ViT output to a fixed-length sequence of 256 "visual tokens," later mapped to the LLM embedding space.
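As a rough sketch of this adapter (not the released implementation), the following single-head PyTorch module uses 256 learnable queries, a 2D positional bias on the query–key scores, and a projection into the LLM embedding space; the dimensions (1664 for ViT-bigG patches, 4096 for Qwen-7B) are assumptions for illustration.

```python
# Minimal single-head sketch of the position-aware vision-language adapter.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionAwareAdapter(nn.Module):
    def __init__(self, vit_dim=1664, llm_dim=4096, num_queries=256):
        super().__init__()
        # 256 learnable query vectors that absorb the variable-length patch sequence
        self.queries = nn.Parameter(torch.randn(num_queries, vit_dim))
        self.to_k = nn.Linear(vit_dim, vit_dim)
        self.to_v = nn.Linear(vit_dim, vit_dim)
        self.proj = nn.Linear(vit_dim, llm_dim)   # map to the LLM embedding space

    def forward(self, patch_feats, pos_bias):
        # patch_feats: (B, N, vit_dim) ViT patch embeddings (N = 256 or 1024)
        # pos_bias:    (num_queries, N) 2D positional bias added to query-key scores
        q = self.queries.unsqueeze(0).expand(patch_feats.size(0), -1, -1)
        k, v = self.to_k(patch_feats), self.to_v(patch_feats)
        scores = q @ k.transpose(1, 2) / (q.size(-1) ** 0.5) + pos_bias
        attn = F.softmax(scores, dim=-1)
        visual_tokens = attn @ v                   # (B, 256, vit_dim), fixed length
        return self.proj(visual_tokens)            # (B, 256, llm_dim)
```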
1.2 Input–Output Interface
Image inputs are marked by <img>…</img> tokens; visual tokens are interleaved into the LLM’s input sequence. Bounding boxes in outputs are normalized to the range [0, 1000), formatted as "(x₁,y₁),(x₂,y₂)", and wrapped in <box>…</box> tokens; the referenced text spans use <ref>…</ref>. All tokens (textual and visual) are concatenated and autoregressively processed by the Transformer LLM (Qwen-7B). At inference, the model produces serializations for answers, captions, or box coordinates via greedy decoding (Bai et al., 2023).
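A minimal sketch of this box serialization, assuming coordinates are quantized to integers in [0, 1000) relative to image width and height; the helper name is illustrative.

```python
# Normalize pixel coordinates and wrap them in <box>...</box> tokens.
def serialize_box(x1, y1, x2, y2, img_w, img_h):
    def norm(v, size):
        return min(int(v / size * 1000), 999)   # quantize into [0, 1000)
    return (f"<box>({norm(x1, img_w)},{norm(y1, img_h)}),"
            f"({norm(x2, img_w)},{norm(y2, img_h)})</box>")

# e.g. serialize_box(50, 80, 400, 360, img_w=640, img_h=480)
# -> "<box>(78,166),(625,750)</box>"
```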
2. Three-Stage Training Pipeline
2.1 Stage 1: Vision–Language Pre-training
This stage utilizes 1.4 billion cleaned image-text pairs (77.3% English, 22.7% Chinese). The LLM is frozen, while the ViT and adapter are trained. The loss is a standard autoregressive cross-entropy over text tokens:

$$\mathcal{L}_{\text{pre}} = -\sum_{t \in \mathcal{T}} \log p_\theta\left(x_t \mid x_{<t}, I\right)$$

where $\mathcal{T}$ is the set of text-token positions and $I$ denotes the visual tokens. Input images use a 224×224 resolution, and the batch size is 30,720, over 50,000 steps.
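As a rough illustration of this objective (not the authors' training code), the sketch below computes a masked cross-entropy, assuming a hypothetical `model` with an `llm` submodule and labels already shifted for next-token prediction.

```python
# Sketch of the Stage-1 loss: cross-entropy on text tokens only,
# with visual-token positions excluded via the ignore index.
import torch
import torch.nn.functional as F

def stage1_loss(logits, labels, text_mask):
    # logits:    (B, T, V) next-token predictions from the LVLM
    # labels:    (B, T)    target token ids (assumed pre-shifted by one position)
    # text_mask: (B, T)    1 for text positions, 0 for visual-token positions
    labels = labels.masked_fill(text_mask == 0, -100)  # ignore visual positions
    return F.cross_entropy(logits.view(-1, logits.size(-1)),
                           labels.view(-1), ignore_index=-100)

# Freezing the LLM while keeping the ViT and adapter trainable (hypothetical names):
# for p in model.llm.parameters():
#     p.requires_grad_(False)
```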
2.2 Stage 2: Multi-Task Pre-training
This stage covers seven tasks and approximately 69 million samples, including image captioning, VQA, grounding, referring grounding, grounded captioning, OCR, and pure-text autoregression. High-resolution images (448×448) and multi-task sequences up to length 2048 are used. All model parameters are updated. A weighted multi-task cross-entropy is applied:

$$\mathcal{L}_{\text{multi}} = \sum_{k=1}^{7} w_k \, \mathcal{L}_k$$

where the task weighting $w_k$ is typically uniform. Training proceeds for 19,000 steps with batch size 4,096.
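A minimal sketch of how this weighted sum might be combined in code, with the uniform default noted above; the helper name and task keys are illustrative.

```python
# Combine per-task cross-entropy losses with weights w_k (uniform by default).
def multitask_loss(task_losses, weights=None):
    # task_losses: dict mapping task name -> scalar loss tensor
    if weights is None:
        weights = {k: 1.0 / len(task_losses) for k in task_losses}  # uniform w_k
    return sum(weights[k] * task_losses[k] for k in task_losses)
```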
2.3 Stage 3: Supervised Fine-Tuning (SFT) — Qwen-VL-Chat
Instruction tuning uses approximately 350,000 multimodal instruction–response pairs in ChatML format. Dialogues span LLM-generated single-image, manually annotated multi-image/multi-round, grounding, text-reading, and pure-text conversations. The visual encoder is frozen; only the adapter and LLM are trained with cross-entropy on assistant reply tokens:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t \in \mathcal{A}} \log p_\theta\left(x_t \mid x_{<t}\right)$$

where $\mathcal{A}$ is the set of assistant-reply token positions.
Training runs for 8,000 steps with batch size 128.
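The sketch below illustrates the assistant-only supervision mask, assuming role boundaries are available as token-index spans; the helper and the ignore-index convention (PyTorch's -100) are assumptions, not the released training code.

```python
# Mask every token that is not part of an assistant reply.
IGNORE_INDEX = -100

def mask_non_assistant(labels, assistant_spans):
    """labels: list of token ids; assistant_spans: list of (start, end) index pairs."""
    masked = [IGNORE_INDEX] * len(labels)
    for start, end in assistant_spans:
        masked[start:end] = labels[start:end]   # supervise only assistant tokens
    return masked
```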
3. Multilingual and Multimodal Data Pipeline
The pre-training corpus is drawn from multiple sources (LAION-en, LAION-COCO, DataComp, COYO, CC12M, CC3M, SBU, COCO Captions, LAION-zh, and in-house Chinese data), resulting in 1.4 billion filtered image-text pairs (a retention rate of roughly 28% from an initial 5 billion). The data-cleaning procedure includes:
- Filtering on aspect ratio and image size.
- Removal based on CLIP score, non-Latin characters, emojis, HTML, extreme sequence lengths.
- Deduplication of academic captions, retaining the longest.
The pipeline incorporates both synthetic and real data, supporting multilingual and multimodal learning at scale.
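For illustration, a sketch of such a filter covering a subset of the rules listed above; the thresholds (minimum size, aspect ratio, CLIP score, caption length) are placeholders, as the paper does not publish the exact cutoffs.

```python
# Decide whether to keep an image-text pair under assumed thresholds.
import re

def keep_pair(width, height, clip_score, caption,
              min_size=224, max_aspect=3.0, min_clip=0.2, max_len=512):
    aspect = max(width, height) / max(1, min(width, height))
    if min(width, height) < min_size or aspect > max_aspect:
        return False                              # aspect-ratio / size filter
    if clip_score < min_clip:
        return False                              # weak image-text relevance
    if re.search(r"<[^>]+>", caption):
        return False                              # HTML remnants in the caption
    if not (0 < len(caption) <= max_len):
        return False                              # extreme sequence lengths
    return True
```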
4. Grounding and Text-Reading Capabilities
4.1 Grounding
For spatial localization and phrase grounding, Qwen-VL-Chat is trained on noun/phrase grounding datasets (GRiT, Visual Genome, RefCOCO/+/g) with 8.7M samples each for referring grounding and grounded captioning. Phrases to be localized are wrapped in <ref>…</ref>, followed by their bounding boxes in <box>…</box>. Grounding is learned autoregressively from the token serialization rather than with a separate regression loss.
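A small sketch of how a grounded-caption target string could be assembled from (phrase, box) pairs using the <ref>/<box> serialization from Section 1.2; the phrase and coordinates are invented for illustration.

```python
# Compose a grounded-caption target from phrase/box pairs.
def grounded_caption(pairs):
    """pairs: list of (phrase, serialized_box) tuples."""
    return " ".join(f"<ref>{phrase}</ref>{box}" for phrase, box in pairs)

# e.g. grounded_caption([("a black dog", "<box>(120,340),(510,880)</box>")])
# -> "<ref>a black dog</ref><box>(120,340),(510,880)</box>"
```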
4.2 Text Reading (OCR)
The model leverages 24.8 million synthetic OCR samples (SynthDoG on COCO backgrounds, English/Chinese fonts), as well as rendered web-crawled PDFs and HTML, pairing rendered images with text and quadrilateral coordinates. The task is denoted by a textual prefix (“OCR with grounding:”) and quads encoded in <quad>…</quad>. The same cross-entropy framework is used for both content recognition and positional grounding.
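As an illustration, one possible shape for an OCR-with-grounding training target, assuming the quadrilateral is serialized as four (x, y) points inside <quad>…</quad>; the exact ordering of prefix, coordinates, and text, as well as the file name, are assumptions.

```python
# Build an illustrative OCR-with-grounding target string.
def ocr_sample(image_tag, text, quad):
    points = ",".join(f"({x},{y})" for x, y in quad)  # four corner points
    return f"{image_tag}OCR with grounding: <quad>{points}</quad>{text}"

# e.g. ocr_sample("<img>doc_001.png</img>", "Invoice No. 42",
#                 [(10, 12), (380, 12), (380, 60), (10, 60)])
```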
5. Instruction Tuning and Conversational Protocol
Qwen-VL-Chat is distinguished by full instruction tuning for multi-turn multimodal dialogue. Training data is a mixture of LLM-generated and hand-annotated conversations, including multi-image, multi-round, localization, and text-reading examples, together with pure-text dialogues for LLM skill retention. The ChatML format formalizes dialogue turns with <|im_start|> and <|im_end|> tokens, and manages multiple images as “Picture i: <img>…</img>.”
During SFT, the objective is always cross-entropy loss on assistant-side outputs, with the vision encoder weights held fixed for stability. This facilitates robust multimodal conversational capabilities encompassing image, text, spatial, and linguistic reasoning.
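The snippet below sketches this ChatML-style turn structure with the “Picture i:” convention; the helper, file names, and spacing are illustrative rather than the official chat template.

```python
# Assemble an illustrative multi-image, ChatML-style prompt.
def chatml_turn(role, content):
    return f"<|im_start|>{role}\n{content}<|im_end|>\n"

prompt = (
    chatml_turn("user",
                "Picture 1: <img>street.jpg</img>\n"
                "Picture 2: <img>park.jpg</img>\n"
                "Which picture shows a dog, and where is it?")
    + chatml_turn("assistant",
                  "Picture 2 shows a dog: <ref>the dog</ref>"
                  "<box>(212,430),(468,795)</box>")
)
```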
6. Benchmark Performance
Qwen-VL-Chat, together with its base model Qwen-VL, achieves leading results across multiple visual-linguistic evaluation protocols:
Image Captioning and General VQA (Zero-Shot)
- Nocaps (val): Qwen-VL-Chat 120.2
- Flickr30K (Karpathy test): 81.0
- VQAv2: 78.2
- OKVQA: 56.6
- GQA: 57.5
- ScienceQA-Img: 68.2
- VizWiz: 38.9
Text-Oriented VQA
- TextVQA: 61.5
- DocVQA (ANLS): 62.6
- ChartQA: 66.3
- AI2D: 57.7
- OCR-VQA: 70.5
Referring Expression Comprehension
- RefCOCO (val/testA/testB): 88.55/92.27/84.51
- Qwen-VL-Chat achieves top-tier results on RefCOCO+/g and GRiT as well.
Few-Shot In-Context Learning
On OKVQA, VizWiz, TextVQA, and Flickr30K, Qwen-VL (7B) surpasses Flamingo-9B and demonstrably rivals Flamingo-80B performance in 1–8 shot settings.
Real-World Multimodal Dialogue
- TouchStone (GPT-4 score): 645.2 (English), 401.2 (Chinese); best prior ≈605 (English)
- SEED-Bench (MC VQA): 58.2 overall (Image 65.4, Video 37.8); vs. InstructBLIP 53.4
- MME (Yes/No per-question): Perception 1487.6, Cognition 360.7; best prior ≈1212/292
A plausible implication is that the three-stage curriculum (weakly supervised pre-training, multi-task alignment, fine-grained instruction tuning) tightly integrates visual and linguistic processing, enabling flexible, high-performance multimodal reasoning (Bai et al., 2023).