Qwen2 Series: Advanced Multimodal LLMs
- The Qwen2 series is a comprehensive suite of large language and multimodal models excelling in language, audio, and vision tasks.
- They integrate dense, MoE, and modality-specialized variants with advanced long-context modeling and instruction-following techniques.
- Open release, quantization support, and extensive benchmark evaluations drive practical research and applied AI innovations.
The Qwen2 series represents a comprehensive suite of LLMs and large multimodal models developed to address the full spectrum of language, audio, and vision tasks, encompassing dense, Mixture-of-Experts (MoE), and modality-specialized variants. Spanning parameter scales from 0.5 billion to 72 billion and integrating advancements in architecture, multilingual proficiency, long-context modeling, and instruction-following, Qwen2 models are openly released and have set new standards on tasks ranging from natural language understanding to code generation, mathematical reasoning, and multimodal perception. Systematic extensions such as Qwen2-VL (vision-language), Qwen2-Audio (audio-language), Qwen2.5-Math, Qwen2.5-Coder, Qwen2.5-VL, and Qwen2.5-Omni further elevate the series’ capabilities into complex cross-modal and specialized domains, establishing Qwen2 as a central open ecosystem for both foundational research and applied AI development (Yang et al., 2024, Qwen et al., 2024, Wang et al., 2024, Bai et al., 19 Feb 2025, Chu et al., 2024, Li et al., 27 Jan 2025, Gupta, 22 Feb 2025, Xu et al., 26 Mar 2025, Yang et al., 2024, Jiang et al., 11 May 2025).
1. Model Suite and Architectural Advances
The Qwen2 series consists of multiple model classes:
- Dense (Standard) LLMs: Qwen2-0.5B, Qwen2-1.5B, Qwen2-7B, Qwen2-72B. All deploy autoregressive Transformers with byte-level BPE tokenization (151,643 tokens), rotary position embeddings (RoPE), and grouped-query attention (GQA). Long-context capability is natively supported (context lengths up to 128,000 tokens in Qwen2.5).
- Mixture-of-Experts (MoE): Qwen2-57B-A14B (64 routed experts plus 8 shared experts, with 8 routed and all 8 shared experts activated per token) and Qwen2.5-Turbo/Plus leverage MoE layers for sparse-activation efficiency (up to 1T total parameters with only ~128B active at runtime); conditional computation delivers strong capacity at cost parity with dense models (Yang et al., 2024, Qwen et al., 2024).
- Modality specialists: Qwen2-VL (vision-language), Qwen2-Audio (audio-language), and Qwen2.5-Omni (unified text/vision/audio/video and streaming response); all employ backbone plug-and-play with the core Qwen2/2.5 architecture.
- Instruction-tuned and domain experts: Qwen2.5-Math (mathematics), Qwen2.5-Coder (code intelligence), and Qwen2.5-VL (advanced VL) are built via domain-centric pretraining, iterative SFT/RLHF, and reward-model sampling (Yang et al., 2024, Hui et al., 2024, Bai et al., 19 Feb 2025).
Dense and MoE models share a decoder-only architecture, with design choices such as RMSNorm pre-normalization, SwiGLU activations, and YaRN plus Dual Chunk Attention for long-context extrapolation. Table 1 summarizes the primary variants and capacities:
| Model | Parameters | Context (Tokens) | Modality |
|---|---|---|---|
| Qwen2-0.5B | 0.5B | up to 32K | Language |
| Qwen2-1.5B | 1.5B | up to 32K | Language |
| Qwen2-7B | 7B | up to 32K | Language |
| Qwen2-72B | 72B | up to 32K | Language |
| Qwen2-57B-A14B (MoE) | 57B total | up to 32K | Language |
| Qwen2.5-72B | 72B | up to 128K | Language |
| Qwen2-VL, Qwen2.5-VL | 2B–72B | up to 80K | Vision-Language |
| Qwen2-Audio | ~8.2B | up to 32K | Audio-Language |
| Qwen2.5-Omni | 7B | up to 32K | Multimodal (T, V, A, Vid) |
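The shared decoder-layer components named above (RMSNorm pre-normalization, SwiGLU feed-forward, grouped-query attention) can be sketched in minimal NumPy form; the dimensions and weights below are toy values for illustration, not the released configurations:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    # Pre-normalization: scale by inverse RMS (no mean subtraction), then gain.
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU feed-forward: silu-gated up-projection, then down-projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def grouped_query_attention(x, wq, wk, wv, wo, n_q_heads, n_kv_heads):
    # GQA: fewer K/V heads than query heads; each K/V head serves a group of
    # query heads, shrinking the KV cache by a factor of n_q_heads // n_kv_heads.
    T, d = x.shape
    hd = d // n_q_heads
    q = (x @ wq).reshape(T, n_q_heads, hd)
    k = (x @ wk).reshape(T, n_kv_heads, hd)
    v = (x @ wv).reshape(T, n_kv_heads, hd)
    group = n_q_heads // n_kv_heads
    k = np.repeat(k, group, axis=1)              # share K/V across query groups
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    scores = scores + np.triu(np.full((T, T), -np.inf), k=1)  # causal mask
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w = w / w.sum(axis=-1, keepdims=True)        # softmax over key positions
    out = np.einsum("hqk,khd->qhd", w, v).reshape(T, d)
    return out @ wo
```

The K/V-head sharing is why GQA halves (or better) the KV-cache footprint relative to full multi-head attention at long context lengths.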
2. Training Data, Objectives, and Methodology
Pretraining
- Corpus scale: Qwen2 models are trained on 7T high-quality, filtered multilingual tokens, expanded to 18T tokens in Qwen2.5, encompassing English, Chinese, and 28+ other languages, as well as specialist corpora for code, mathematics, long-form, and domain tasks. Modality specialists are pre-trained on cross-modal datasets, e.g., Qwen2-Audio receives ∼66K h of speech, ∼14K h of sound, and ∼35K h of music (Chu et al., 2024).
- Tokenization: Byte-level BPE (151,643 tokens) ensures coverage of multilingual and code corpora (Yang et al., 2024).
- Objectives: Next-token autoregressive prediction dominates, with context extension via Dual Chunk Attention and increased RoPE base frequencies (Qwen et al., 2024). For code specialists, Fill-in-the-Middle (FIM) objectives with sentinel tokens are integrated (Hui et al., 2024).
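The FIM objective can be illustrated with a minimal data-formatting sketch; the sentinel strings below stand in for the dedicated special tokens the tokenizer actually uses:

```python
import random

# Stand-in sentinel strings; the real tokenizer reserves special tokens
# for the same prefix / suffix / middle roles.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<|fim_prefix|>", "<|fim_suffix|>", "<|fim_middle|>"

def make_fim_example(code: str, rng: random.Random) -> str:
    """Split a document at two random points and rearrange it in
    prefix-suffix-middle order; the model learns to infill the middle."""
    i, j = sorted(rng.sample(range(len(code) + 1), 2))
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"
```

Because the middle span is moved to the end, ordinary next-token prediction over the rearranged sequence trains the infilling capability directly.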
Post-training
- Supervised Fine-Tuning (SFT): Up to 1M instruction–response or domain-specific samples drawn from coding, reasoning, mathematics, multilingual, and other high-signal domains (Qwen et al., 2024).
- Reinforcement Learning from Human Feedback (RLHF): Direct Preference Optimization (DPO) is consistently employed, notably in Qwen2-Audio and VL, and extended to Group Relative Policy Optimization (GRPO) in self-improving specialists such as Qwen2.5-Math (Yang et al., 2024, Chu et al., 2024, Yang et al., 2024).
- Reward Model (RM): Central to math and domain experts; the RM is iteratively updated from SFT samples and then reused both in RLHF and inference-time sampling to select optimal solution chains (Yang et al., 2024).
- Synthetic and bootstrapped data: Model-generated, reward-labeled, and filtered corpora systematically broaden scale and diversity for code, math, and multilingual specialists (Hui et al., 2024, Yang et al., 2024).
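RM-guided best-of-N selection at inference time reduces to drawing several candidates and keeping the highest-scoring one; `sample_fn` and `reward_fn` below are hypothetical stand-ins for the generator and the reward model:

```python
def best_of_n(prompt, sample_fn, reward_fn, n=8):
    """RM-guided best-of-N: draw n candidate responses for the prompt and
    return the one the reward model scores highest."""
    candidates = [sample_fn(prompt) for _ in range(n)]
    return max(candidates, key=reward_fn)
```

The same reward model can thus be reused twice: once to supply preference signal during RLHF, and again as a reranker at inference time.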
3. Multimodality and Cross-Modal Architectures
The Qwen2 series is distinguished by modular extensions to audio and vision:
- Qwen2-VL and Qwen2.5-VL employ a ViT-based vision encoder (675M params), MLP token merging, and Multimodal Rotary Position Embedding (M-RoPE) to encode visual spatial and temporal position aligned with the text backbone. Native dynamic resolution and window attention support variable-resolution images and long videos without quadratic compute costs. Absolute pixel and time encodings afford precise document, chart, and video event localization (Wang et al., 2024, Bai et al., 19 Feb 2025).
- Qwen2-Audio integrates a Whisper-style encoder operating on mel-spectrogram inputs and applies prompt-centric “prompt mixing” pretraining, with DPO-aligned instruction-following and two implicitly inferred audio modes: analysis and chat (Chu et al., 2024).
- Qwen2.5-Omni fuses text, image, audio, and video via TMRoPE (Time-aligned Multimodal RoPE), block-wise streaming encoders and a bifurcated Thinker–Talker architecture: Thinker (LLM, text generation) and Talker (audio token generation via a sliding-window DiT decoder), jointly trained for synchronized text and streaming speech output (Xu et al., 26 Mar 2025). Encoders employ block-wise attention, and output interleaving unifies multimodal perception.
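A simplified view of the multimodal position-id assignment behind M-RoPE/TMRoPE is sketched below; this is an illustrative approximation (the released scheme differs in details, e.g., how the temporal index advances past an image):

```python
def mrope_positions(segments):
    """Assign (temporal, height, width) position triples to a mixed sequence.
    Text tokens advance all three axes in lockstep (reducing to ordinary RoPE);
    image patches share one temporal index while height/width follow the grid."""
    pos, t = [], 0
    for seg in segments:
        if seg["type"] == "text":
            for _ in range(seg["len"]):
                pos.append((t, t, t))
                t += 1
        else:  # an image of h x w patches
            h, w = seg["h"], seg["w"]
            for i in range(h):
                for j in range(w):
                    pos.append((t, t + i, t + j))
            t += max(h, w)   # advance past the larger spatial extent
    return pos
```

Factorizing the rotary index this way lets one backbone encode 1-D text order and 2-D (or, with time, 3-D) visual layout with the same embedding machinery.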
4. Specialized and Iteratively Improved Variants
- Qwen2.5-Math is constructed via a three-stage self-improvement loop:
  - Pretraining on a corpus amplified with model-generated math pairs.
  - Post-training with iterative SFT and dynamically updated RMs that label, filter, and expand the dataset.
  - Final RLHF (GRPO) and RM-guided best-of-N inference. Both chain-of-thought and tool-integrated reasoning (Python execution) modes are rigorously benchmarked in English and Chinese (Yang et al., 2024).
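The group-relative advantage at the heart of GRPO can be written in a few lines: rewards for a group of responses to the same prompt are standardized against the group's own statistics instead of a learned value baseline (a sketch, not the exact published recipe):

```python
import statistics

def grpo_advantages(rewards):
    """Group Relative Policy Optimization: standardize each response's reward
    against the mean and spread of its group, so the group itself acts as the
    baseline (no separate value network needed)."""
    mu = statistics.mean(rewards)
    sigma = statistics.pstdev(rewards) or 1.0   # guard against a zero-spread group
    return [(r - mu) / sigma for r in rewards]
```

Responses scored above the group mean get positive advantage and are reinforced; the rest are pushed down, which is what drives the iterative self-improvement loop.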
- Qwen2.5-Coder realizes state-of-the-art performance on ≥10 code and reasoning benchmarks, using scaled code-centric pretraining (5.5T tokens), advanced FIM, and repo-level objectives. Data cleaning employs hierarchical weak classifiers and decontamination against major test sets. A 70:10:20 code:math:text mixture, synthetic data validation, and both SFT and DPO fine-tuning support top performance (Hui et al., 2024).
- Fine-tuned small models: Qwen2.5-0.5B/1.5B/3B, when trained with QLoRA, 4-bit quantization, and preference-based DPO, have demonstrated competitive performance in realistic, context-rich settings (e.g., movie dialogue), extending the practical reach of the small open-source LLM paradigm (Gupta, 22 Feb 2025).
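Blockwise 4-bit quantization of the kind used with QLoRA can be sketched as follows; this is a simplified symmetric int4 absmax scheme, not the NF4 data type that bitsandbytes actually implements:

```python
import numpy as np

def quantize_4bit(w, block=64):
    """Blockwise absmax 4-bit quantization: each block of 64 weights gets its
    own scale, and values are rounded to the symmetric int4 range [-7, 7]."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0, 1.0, scale)           # avoid div-by-zero on all-zero blocks
    q = np.clip(np.round(flat / scale), -7, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale, shape):
    # Reconstruct approximate weights; the per-block error is at most scale / 2.
    return (q.astype(np.float32) * scale).reshape(shape)
```

Per-block scales keep the rounding error bounded locally, which is why 4-bit base weights remain usable as the frozen backbone for LoRA adapter training.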
5. Evaluation, Empirical Performance, and Multilinguality
Benchmarks and Comparative Results
- Qwen2-72B base achieves MMLU 84.2, GPQA 37.9, HumanEval 64.6, GSM8K 89.5, BBH 82.4; Qwen2-72B-Instruct attains MT-Bench 9.1, Arena-Hard 48.1, LiveCodeBench 35.7 (Yang et al., 2024).
- Cross-lingual performance is robust: Qwen2-72B scores 76–77% on aggregate M3Exam, IndoMMLU, ruMMLU, and translated MMLU; in direct human evaluation, Qwen2-72B-Instruct consistently outperforms GPT-3.5 and is competitive with GPT-4/Claude-3 across multiple languages (Yang et al., 2024).
Modality-Specific and Multimodal Results
- Qwen2-Audio outperforms Gemini-1.5-pro on AIR-Bench chat, and matches state-of-the-art on ASR (1.6% WER on Librispeech), S2TT, and SER (Chu et al., 2024).
- Qwen2-VL and Qwen2.5-VL reach SOTA accuracy on document parsing (CC-OCR 79.8 vs. 64.7/73.0), chart reasoning (ChartQA 89.5%), and diagram/layout tasks, and match GPT-4o/Sonnet on LVBench (long video) and agentic benchmarks (Wang et al., 2024, Bai et al., 19 Feb 2025).
- Qwen2.5-Omni outperforms Qwen2-Audio and rivals Qwen2.5-VL on multimodal benchmarks (OmniBench average 56.1%), with competitive/leading instruction-following (MMLU, GSM8K) even via speech input and simultaneous speech output (Xu et al., 26 Mar 2025).
- Specialists: Qwen2.5-Math achieves 95.9% (GSM8K) and 85.9% (MATH, 72B), SOTA on AIME/AMC24; Qwen2.5-Coder-32B reaches 65.9%/83.0% HumanEval/MBPP_full, exceeding all prior open baselines (Yang et al., 2024, Hui et al., 2024).
Cross-modal Learning
- Distillation between Qwen2-Audio, VL, and Omni yields up to 20pp improvement in hard-class accuracy, and distilled specialist models approach or surpass the performance of stronger modalities (e.g., audio-only Qwen2-Audio: 72.5→92.6% via distillation) (Jiang et al., 11 May 2025).
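A generic soft-label distillation loss of the kind such cross-modal transfer relies on is shown below (temperature-smoothed KL from teacher to student; a sketch, not the cited work's exact objective):

```python
import math

def distill_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) over temperature-smoothed distributions, scaled
    by T^2 so its gradient magnitude is comparable across temperatures."""
    def softmax(zs, T):
        m = max(zs)                                   # stabilize the exponentials
        e = [math.exp((z - m) / T) for z in zs]
        s = sum(e)
        return [x / s for x in e]
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q)) * T * T
```

Raising the temperature softens the teacher's distribution, exposing the inter-class similarity structure ("dark knowledge") that hard labels discard.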
6. Quantization, Usability, and Community Access
All Qwen2 models are released with open weights (bfloat16, plus 8/4-bit quantized variants), Hugging Face and ModelScope deployment, and scripts for quantization, fine-tuning, and conversion (e.g., GGUF) (Yang et al., 2024). Generic training code and practical instructions (deepspeed-based SFT, bitsandbytes quantization, GGUF conversion) are provided for reproducibility and fast integration. Efficiency optimizations (reduced memory use, sparse parameter activation, multi-stage quantization, blockwise streaming) enable deployment on cost-sensitive or latency-constrained hardware, from small edge devices (e.g., 0.5B) to large-scale inference (Qwen et al., 2024, Xu et al., 26 Mar 2025).
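Weight-only memory footprints at these precisions can be estimated with simple arithmetic (KV cache and activations excluded):

```python
def weight_memory_gib(n_params_billion, bits_per_weight):
    """Rough weight-only memory estimate in GiB: parameter count times bits
    per weight, converted from bits to bytes to GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / 2**30
```

For example, a 72B model's weights drop from roughly 134 GiB at bfloat16 to around 33.5 GiB at 4 bits, which is what moves large variants into reach of single-node deployment.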
7. Design Principles, Extensions, and Future Directions
Key unifying principles include:
- Prompt-centric pretraining: Instruction-like prompts at pretraining align the foundation model closer to instruction-tuning and generalize more robustly across modalities (Chu et al., 2024).
- Modular architecture: Decoupled encoders for modalities plus shared fusion via the Qwen2 backbone (multimodal rotary embeddings, token merging) facilitate rapid extension and cross-modal transfer (Wang et al., 2024, Bai et al., 19 Feb 2025, Xu et al., 26 Mar 2025).
- Iterative self-improvement: Reward models, dynamic SFT/RLHF loops, and model-ranked sampling drive specialist models (math, code) to SOTA (Yang et al., 2024).
- Unified interfaces: Whether for analysis, chat, or fast-mode inference across audio, vision, and text, the task is inferred implicitly within a shared forward pass, minimizing system-level complexity (Chu et al., 2024).
Current limitations include incomplete parity with proprietary models on the most complex multimodal reasoning (e.g., MMMU vs. GPT-4o), capped video length (e.g., ≤768 frames), and open questions about optimal dynamic resolution or cross-modal curriculum design (Wang et al., 2024). Ongoing work extends Qwen2 toward even longer contexts, new modalities (e.g., tactile), and deeper human-aligned abstractions (e.g., via cross-modal distillation, hybrid SSM–LLM clinical agents) (Jiang et al., 11 May 2025, Li et al., 27 Jan 2025).
References: (Yang et al., 2024, Qwen et al., 2024, Wang et al., 2024, Bai et al., 19 Feb 2025, Chu et al., 2024, Li et al., 27 Jan 2025, Gupta, 22 Feb 2025, Xu et al., 26 Mar 2025, Yang et al., 2024, Hui et al., 2024, Jiang et al., 11 May 2025).