Qwen-VL-Chat: Advanced Multimodal LVLM

Updated 24 December 2025
  • Qwen-VL-Chat is a multimodal vision-language model that combines image understanding and text interpretation for advanced dialogue and reasoning tasks.
  • It employs a ViT-bigG based encoder and a three-stage training pipeline to merge high-resolution visual inputs with language models effectively.
  • The model sets new benchmarks in tasks like image captioning, VQA, and OCR, demonstrating robust performance in both general and text-oriented evaluations.

Qwen-VL-Chat is a large vision-language model (LVLM) developed within the Qwen-VL series, characterized by integrated visual and linguistic reasoning, grounded localization, and advanced text-reading abilities. Built atop the 7B-parameter Qwen-7B LLM, Qwen-VL-Chat establishes new state-of-the-art results for generalist LVLMs of similar scale across a spectrum of benchmarks, including image captioning, visual question answering (VQA), visual grounding, and real-world multimodal dialogue tasks (Bai et al., 2023).

1. Model Architecture

1.1 Visual Receptor

The vision encoder is ViT-bigG (OpenCLIP) with patch size 14, initialized from CLIP pre-training. During Stage 1 pre-training, images are input at 224×224 resolution, yielding a sequence length of 256; in subsequent stages, the input is 448×448 with a sequence length of 1024. The encoder outputs a sequence $\{f_i \in \mathbb{R}^{d_v}\}_{i=1}^{L}$ of per-patch embeddings, where $L = (H/14)\cdot(W/14)$.

Following the encoder, a single cross-attention "position-aware vision–language adapter" is applied, featuring 256 learnable query vectors $\{q_j\}_{j=1}^{256}$. Keys and values derive from the ViT patch embeddings $f_i$ with added 2D positional encoding $p_i$. The attention mechanism receives a 2D positional bias $\phi_{\mathrm{pos}}$ applied to each query–key dot product:

$$\operatorname{Attn}(Q, K, V) = \operatorname{softmax}\!\left( \frac{(QW_q)(KW_k)^\top}{\sqrt{d}} + \phi_{\mathrm{pos}} \right) V W_v$$

This compresses the variable-length ViT output to a fixed-length sequence of 256 "visual tokens," later mapped to the LLM embedding space.
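
The sketch below illustrates this compression as a single cross-attention resampler in PyTorch. It is a minimal sketch under stated assumptions: the hidden widths (1664 for ViT-bigG, 4096 for Qwen-7B), the module and parameter names, and the choice to fold the 2D positional term into the keys instead of an explicit bias $\phi_{\mathrm{pos}}$ are illustrative simplifications, not the released implementation.

```python
# Minimal sketch of the position-aware adapter: one cross-attention block
# that compresses a variable number of ViT patch embeddings into 256 fixed
# "visual tokens". Dimensions and names are illustrative assumptions.
import torch
import torch.nn as nn


class PositionAwareAdapter(nn.Module):
    def __init__(self, d_vit=1664, d_model=4096, n_queries=256, max_grid=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, d_model) * 0.02)
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_vit, d_model)
        self.v_proj = nn.Linear(d_vit, d_model)
        # Learnable 2D positional table, flattened over the patch grid and
        # added to the keys (a simplification of the phi_pos bias term).
        self.pos_emb = nn.Parameter(torch.zeros(max_grid * max_grid, d_vit))

    def forward(self, patch_feats):                      # (B, L, d_vit)
        B, L, _ = patch_feats.shape
        q = self.q_proj(self.queries).expand(B, -1, -1)  # (B, 256, d_model)
        k = self.k_proj(patch_feats + self.pos_emb[:L])  # (B, L, d_model)
        v = self.v_proj(patch_feats)                     # (B, L, d_model)
        attn = torch.softmax(q @ k.transpose(1, 2) / q.shape[-1] ** 0.5, dim=-1)
        return attn @ v                                  # (B, 256, d_model)


# A 448x448 input with patch size 14 yields a 32x32 = 1024 patch grid.
adapter = PositionAwareAdapter()
visual_tokens = adapter(torch.randn(1, 1024, 1664))
print(visual_tokens.shape)  # torch.Size([1, 256, 4096])
```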

1.2 Input–Output Interface

Image inputs are marked by <img>…</img> tokens; visual tokens are interleaved into the LLM’s input space. Bounding boxes for outputs are normalized to $[0, 1000)$, formatted as "(x₁,y₁),(x₂,y₂)", and wrapped in <box>…</box> tokens; referenced regions use <ref>…</ref>. All tokens (textual and visual) are concatenated and autoregressively processed by the Transformer LLM (Qwen-7B). At inference, the model produces serializations for answers, captions, or box coordinates via greedy decoding (Bai et al., 2023).
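
As a concrete illustration of this serialization, the hypothetical helper below normalizes a pixel bounding box to the $[0, 1000)$ grid and wraps it in the <ref>/<box> markup; the function name and example values are made up.

```python
# Illustrative serialization of a referred region in the Qwen-VL text format.
def serialize_box(phrase, box, img_w, img_h):
    """box = (x1, y1, x2, y2) in pixels; coordinates are rescaled to [0, 1000)."""
    x1, y1, x2, y2 = box
    nx1, ny1 = int(1000 * x1 / img_w), int(1000 * y1 / img_h)
    nx2, ny2 = int(1000 * x2 / img_w), int(1000 * y2 / img_h)
    return f"<ref>{phrase}</ref><box>({nx1},{ny1}),({nx2},{ny2})</box>"


print(serialize_box("the dog", (112, 96, 400, 352), 448, 448))
# <ref>the dog</ref><box>(250,214),(892,785)</box>
```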

2. Three-Stage Training Pipeline

2.1 Stage 1: Vision–Language Pre-training

This stage utilizes 1.4 billion cleaned image-text pairs (77.3% English, 22.7% Chinese). The LLM is frozen, while the ViT and adapter are trained. The loss is a standard autoregressive cross-entropy over text tokens:

$$L_1 = - \sum_{(I,T)} \sum_{t=1}^{|T|} \log P\left(T_t \mid T_{<t},\, I\right)$$

Input images use 224×224 resolution, and training runs for 50,000 steps with a batch size of 30,720.
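
A minimal sketch of this objective, assuming the common practice of marking visual/prompt positions with an ignore index so that only text tokens contribute to the loss; the shift-by-one, the freezing pattern in the trailing comments, and all names are illustrative.

```python
# Sketch of the Stage-1 autoregressive cross-entropy over text tokens only.
import torch
import torch.nn.functional as F


def stage1_loss(logits, labels, ignore_index=-100):
    """logits: (B, T, vocab); labels: (B, T), with non-text positions
    (visual tokens, prompt) set to ignore_index."""
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),  # predict token t+1 from prefix
        labels[:, 1:].reshape(-1),
        ignore_index=ignore_index,
    )


logits = torch.randn(2, 8, 32000)
labels = torch.randint(0, 32000, (2, 8))
labels[:, :4] = -100          # mask the visual/prompt positions
print(stage1_loss(logits, labels))

# Stage-1 freezing pattern (assuming standard nn.Module components):
#   llm.requires_grad_(False); vit.requires_grad_(True); adapter.requires_grad_(True)
```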

2.2 Stage 2: Multi-Task Pre-training

This stage covers seven tasks and approximately 69 million samples, including image captioning, VQA, grounding, referring grounding, grounded captioning, OCR, and pure-text autoregression. High-resolution images (448×448) and multi-task sequences up to length 2048 are used. All model parameters are updated. Multi-task cross-entropy is applied:

$$L_2 = \sum_{\mathrm{task}} \alpha_{\mathrm{task}} \, L_{CE}(\mathrm{task}), \qquad L_{CE}(\mathrm{task}) = -\sum_{t} \log P\left(\mathrm{label}_t \mid \mathrm{history}\right)$$

The weights $\alpha_{\mathrm{task}}$ are typically uniform. Training proceeds for 19,000 steps with batch size 4,096.
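
The combination of per-task losses can be written as a small helper like the sketch below; the task names follow the list above, while the dummy losses and the uniform default weights are assumptions for illustration.

```python
# Sketch of the Stage-2 multi-task objective: a weighted sum of per-task
# cross-entropy terms, with weights defaulting to uniform.
import torch


def multitask_loss(per_task_losses, weights=None):
    """per_task_losses: dict mapping task name -> scalar loss tensor."""
    if weights is None:
        weights = {task: 1.0 for task in per_task_losses}
    return sum(weights[task] * loss for task, loss in per_task_losses.items())


tasks = ["captioning", "vqa", "grounding", "ref_grounding",
         "grounded_captioning", "ocr", "text_generation"]
losses = {task: torch.rand(()) for task in tasks}   # dummy per-task losses
print(multitask_loss(losses))
```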

2.3 Stage 3: Supervised Fine-Tuning (SFT) — Qwen-VL-Chat

Instruction-tuning uses approximately 350,000 multimodal instruction–response pairs in ChatML format. Dialogues span LLM-generated single-image, manually annotated multi-image/multi-round, grounding, text-reading, and pure-text conversations. The visual encoder is frozen; only the adapter and LLM are trained with cross-entropy on assistant reply tokens:

$$L_3 = - \sum_{\mathrm{example}} \sum_{t \in \mathrm{answer}} \log P\left(a_t \mid \mathrm{history}\right)$$

Training runs for 8,000 steps with batch size 128.
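
A sketch of the assistant-only supervision: labels are copied from the token ids inside assistant reply spans and everything else is masked with the ignore index. The span bookkeeping is a simplified stand-in for the actual ChatML tokenization.

```python
# Sketch of SFT label construction: only assistant reply tokens are supervised.
def build_sft_labels(token_ids, assistant_spans, ignore_index=-100):
    """token_ids: list[int]; assistant_spans: list of (start, end) index pairs
    covering assistant replies. Returns labels for the cross-entropy loss."""
    labels = [ignore_index] * len(token_ids)
    for start, end in assistant_spans:
        labels[start:end] = token_ids[start:end]
    return labels


ids = list(range(20))                       # dummy token ids
print(build_sft_labels(ids, [(12, 18)]))
# positions 0-11 and 18-19 are -100; positions 12-17 keep their token ids
```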

3. Multilingual and Multimodal Data Pipeline

The pre-training corpus is drawn from multiple sources (LAION-en, LAION-COCO, DataComp, COYO, CC12M, CC3M, SBU, COCO Captions, LAION-zh, in-house Chinese) resulting in 1.4 billion filtered image-text pairs (retention rate 28% from an initial 5 billion). The data-cleaning procedure includes:

  1. Filtering on aspect ratio and image size.
  2. Removal based on CLIP score, non-Latin characters, emojis, HTML, extreme sequence lengths.
  3. Deduplication of academic captions, retaining the longest.

The pipeline incorporates both synthetic and real data, supporting multilingual and multimodal learning at scale.
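
A per-pair filter along the lines of the rules above might look like the sketch below. All thresholds, the emoji check, and the HTML heuristic are illustrative assumptions; the paper does not publish the exact cut-offs that reduce the initial 5 billion pairs to 1.4 billion.

```python
# Illustrative image-text pair filter with assumed thresholds.
import re

EMOJI = re.compile("[\U0001F300-\U0001FAFF]")   # rough emoji block

def keep_pair(width, height, caption, clip_score,
              min_side=224, max_ratio=3.0, min_clip=0.2, max_len=512):
    if min(width, height) < min_side:
        return False                             # image too small
    if max(width, height) / min(width, height) > max_ratio:
        return False                             # extreme aspect ratio
    if clip_score < min_clip:
        return False                             # weak image-text alignment
    if len(caption) > max_len or "<html" in caption.lower():
        return False                             # extreme length / HTML remnants
    if EMOJI.search(caption):
        return False                             # emoji-heavy caption
    return True


print(keep_pair(640, 480, "A dog running on the beach.", 0.31))  # True
print(keep_pair(640, 64, "x" * 1000, 0.05))                      # False
```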

4. Grounding and Text-Reading Capabilities

4.1 Grounding

For spatial localization and phrase grounding, Qwen-VL-Chat is trained on noun/phrase grounding datasets (GRiT, Visual Genome, RefCOCO/+/g), with approximately 8.7M samples each for referring grounding and grounded captioning. Phrases to be localized are wrapped in <ref>…</ref>, followed by their bounding boxes in <box>…</box>. Grounding is learned autoregressively from the token serialization rather than with a separate regression loss.
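
For illustration, a grounded-captioning training target could be serialized as the (made-up) string below; localization supervision comes entirely from predicting these tokens with the usual cross-entropy.

```python
# Made-up grounded-captioning target: boxes are just tokens in the sequence,
# so no separate box-regression head is needed.
target = (
    "<ref>a woman</ref><box>(102,215),(498,870)</box> walking "
    "<ref>a dog</ref><box>(510,640),(790,880)</box> on the beach"
)
print(target)
```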

4.2 Text Reading (OCR)

The model leverages 24.8 million synthetic OCR samples (SynthDoG on COCO backgrounds, English/Chinese fonts), as well as rendered web-crawled PDFs and HTML, pairing rendered images with text and quadrilateral coordinates. The task is denoted by a textual prefix (“OCR with grounding:”) and quads encoded in <quad>…</quad>. The same cross-entropy framework is used for both content recognition and positional grounding.
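
An illustrative "OCR with grounding" target is shown below; the quad markup mirrors the box format above, and the exact serialization details are an assumption.

```python
# Made-up OCR-with-grounding target: recognized text plus its quadrilateral.
sample = (
    "OCR with grounding: "
    "<ref>Grand Opening Sale</ref>"
    "<quad>(120,80),(860,85),(858,190),(118,184)</quad>"
)
print(sample)
```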

5. Instruction Tuning and Conversational Protocol

Qwen-VL-Chat is distinguished by full instruction tuning for multi-turn multimodal dialogue. Training data is a mixture of LLM-generated and hand-annotated conversations, including multi-image, multi-round, localization, and text-reading examples, together with pure-text dialogues for LLM skill retention. The ChatML format formalizes dialog turns with <im_start>, <im_end>, and manages multiple images as “Picture i: <img>…</img>.”
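
A sketch of how a two-image ChatML turn might be laid out, following the delimiters and the "Picture i:" convention described above; the system prompt wording and role names are assumptions.

```python
# Sketch of a ChatML-formatted multi-image prompt; the model continues after
# the final "<im_start>assistant" prefix, and SFT loss covers only that reply.
dialogue = (
    "<im_start>system\nYou are a helpful assistant.<im_end>\n"
    "<im_start>user\n"
    "Picture 1: <img>photo1.jpg</img>\n"
    "Picture 2: <img>photo2.jpg</img>\n"
    "What is the difference between the two pictures?<im_end>\n"
    "<im_start>assistant\n"
)
print(dialogue)
```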

During SFT, the objective is always cross-entropy loss on assistant-side outputs, with the vision encoder weights held fixed for stability. This facilitates robust multimodal conversational capabilities encompassing image, text, spatial, and linguistic reasoning.

6. Benchmark Performance

Qwen-VL-Chat, together with its base model Qwen-VL, achieves leading results across multiple visual-linguistic evaluation protocols:

Image Captioning and General VQA (Zero-Shot)

  • Nocaps (val): Qwen-VL-Chat 120.2
  • Flickr30K (karpathy): 81.0
  • VQAv2: 78.2
  • OKVQA: 56.6
  • GQA: 57.5
  • ScienceQA-Img: 68.2
  • VizWiz: 38.9

Text-Oriented VQA

  • TextVQA: 61.5
  • DocVQA (ANLS): 62.6
  • ChartQA: 66.3
  • AI2D: 57.7
  • OCR-VQA: 70.5

Referring Expression Comprehension

  • RefCOCO (val/testA/testB): 88.55/92.27/84.51
  • Qwen-VL-Chat achieves top-tier results on RefCOCO+/g and GRiT as well.

Few-Shot In-Context Learning

On OKVQA, VizWiz, TextVQA, and Flickr30K, Qwen-VL (7B) surpasses Flamingo-9B and demonstrably rivals Flamingo-80B performance in 1–8 shot settings.

Real-World Multimodal Dialogue

  • TouchStone (GPT-4 score): 645.2 (English), 401.2 (Chinese); best prior ≈605 (English)
  • SEED-Bench (MC VQA): 58.2 overall (Image 65.4, Video 37.8); vs. InstructBLIP 53.4
  • MME (Yes/No per-question): Perception 1487.6, Cognition 360.7; best prior ≈1212/292

A plausible implication is that the three-stage curriculum (weakly supervised pre-training, multi-task alignment, fine-grained instruction tuning) tightly integrates visual and linguistic processing, enabling flexible, high-performance multimodal reasoning (Bai et al., 2023).

References (1)

  1. Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., & Zhou, J. (2023). Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. arXiv:2308.12966.
