Qwen Models: Evolution of Multimodal LLMs
- Qwen models are a family of large-scale Transformer-based models that excel in language, vision, audio, and multimodal tasks, incorporating state-of-the-art architectural innovations.
- They employ techniques like MoE routing, grouped query attention, and adaptive RoPE to efficiently handle long-context dependencies and improve inference performance.
- Their modular design supports specialized variants in math, vision, and audio, enabling scalable and efficient deployment across diverse research and application areas.
Qwen models are an open family of large-scale Transformer-based models originating from Alibaba Cloud, designed to excel in language, vision, audio, and multimodal tasks. Across successive generations (Qwen, Qwen2, Qwen2.5, Qwen3), the series has grown from core LLMs to highly specialized multimodal agents, combining architectural advances, progressive scaling, and task-specific innovations to set new open-source standards in pretraining, inference efficiency, and task generalization.
1. Model Family Evolution and Architectural Innovations
The Qwen series follows an architectural lineage built around decoder-only Transformers with incremental modifications. Early Qwen models introduced LLaMA-derivative blocks, RMSNorm pre-normalization, SwiGLU activations, and QKV-bias in self-attention. Rotary positional embeddings (RoPE) with scalable or adaptive base frequencies are central to supporting long-context extrapolation (Bai et al., 2023).
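For concreteness, the following is a minimal sketch of rotary position embeddings with a configurable base frequency, the knob that later generations scale for long-context extrapolation; the function names and the default base of 10000 are illustrative, not taken from Qwen's implementation.

```python
import torch

def rope_angles(head_dim: int, positions: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Per-position rotation angles; raising `base` (as in ABF) slows rotation
    and stretches the usable context window."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    return torch.outer(positions.float(), inv_freq)          # (seq, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features pairwise; x has shape (seq, head_dim)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

q = torch.randn(16, 64)                                      # 16 positions, head_dim = 64
q_rot = apply_rope(q, rope_angles(64, torch.arange(16)))
print(q_rot.shape)                                           # torch.Size([16, 64])
```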
Qwen2 and later models incorporated Grouped Query Attention (GQA), improving key-value cache efficiency and facilitating larger context windows (Yang et al., 15 Jul 2024). Mixture-of-Experts (MoE) routing appears in Qwen1.5-MoE and, in more advanced forms, in Qwen2.5-Turbo/Plus and Qwen3-Omni (Top-K gating with per-token expert activation), enabling scalable sparse computation with competitive active parameter counts (Qwen et al., 19 Dec 2024, Xu et al., 22 Sep 2025).
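The Top-K gating described here can be sketched as below; the expert count, K, and layer sizes are illustrative and do not correspond to any released Qwen configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Per-token Top-K routing: each token is sent to K experts and the
    outputs are combined with renormalized gate weights."""
    def __init__(self, d_model=256, d_ff=512, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                        # x: (tokens, d_model)
        logits = self.router(x)                  # (tokens, n_experts)
        gates, idx = logits.topk(self.k, dim=-1)
        gates = F.softmax(gates, dim=-1)         # renormalize over the K chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):               # dense loop for clarity; real kernels dispatch sparsely
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += gates[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

moe = TopKMoE()
print(moe(torch.randn(4, 256)).shape)            # torch.Size([4, 256])
```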
Qwen2.5 introduced modularity for specialized variants, notably Qwen2.5-Math (with tool-integrated reasoning stacks), Qwen2.5-VL (dynamic-resolution ViT for vision), Qwen-Audio (Whisper-based audio encoder), and Qwen2.5-Omni (joint text, image, audio, video, streaming speech in one end-to-end model) (Qwen et al., 19 Dec 2024, Yang et al., 18 Sep 2024, Bai et al., 19 Feb 2025, Chu et al., 2023, Xu et al., 26 Mar 2025).
Qwen3 further advances the architecture with QK-Norm (replacing QKV-bias for more stable training), context scaling to 128K via ABF-scaled RoPE plus Dual Chunk Attention, and a unified "Thinking Mode" vs. "Non-Thinking Mode" within the same model, eliminating the need to switch between separate chat and reasoning models (Yang et al., 14 May 2025). Qwen3-Omni uses a Thinker–Talker MoE split that separates text (semantic) generation from speech (streaming, causal ConvNet-mediated) generation (Xu et al., 22 Sep 2025). Table 1 summarizes the main architectural increments:
| Generation | Core | Attention | Context Length | MoE | Multimodal | Unique Features |
|---|---|---|---|---|---|---|
| Qwen | LLaMA-based | RoPE, QKV-bias | 2K | None | None | Early RLHF/agent integration |
| Qwen2 | GQA, RMSNorm | DCA, YaRN | 32K (131K at inference) | Basic | Some | Improved scaling, more tokens and data |
| Qwen2.5 | GQA, SwiGLU | Ultra-long, ABF | 128K–1M | Turbo/Plus | Math, VL, Audio, Omni | Progressive context expansion, modularity |
| Qwen3 | QK-Norm | FlashAttn, DCA | 128K | 235B-A22B | Omni | Thinking/Non-Thinking modes, thinking budget |
| Qwen3-Omni | Thinker–Talker | MoE + FlashAttn | 32K+ | 30B-A3B, 3B-A0.3B | Text/Image/Audio/Video | Multimodal parity, low-latency TTS |
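To illustrate the QK-Norm entry above, the sketch below RMS-normalizes queries and keys per head before the dot product; the exact norm placement and the absence of a learnable scale are simplifying assumptions, not Qwen3's precise layer.

```python
import torch
import torch.nn.functional as F

def rms_norm(x, eps: float = 1e-6):
    """Scale-free RMS normalization over the head dimension."""
    return x * torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)

def qk_norm_attention(q, k, v):
    """q, k, v: (batch, heads, seq, head_dim). Normalizing q and k bounds the
    attention-logit magnitude, which is the stability argument behind QK-Norm."""
    q, k = rms_norm(q), rms_norm(k)
    scores = q @ k.transpose(-2, -1) / (q.size(-1) ** 0.5)
    return F.softmax(scores, dim=-1) @ v

q = k = v = torch.randn(1, 4, 8, 64)
print(qk_norm_attention(q, k, v).shape)   # torch.Size([1, 4, 8, 64])
```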
2. Pretraining and Post-Training Methodologies
Pretraining across all generations uses massive, deduplicated corpora—evolving from 3T tokens (Qwen) to 18T (Qwen2.5) and 36T (Qwen3)—with aggressive data balancing and reward-model filtering in later releases (Qwen et al., 19 Dec 2024, Yang et al., 15 Jul 2024, Yang et al., 14 May 2025).
Long-context capabilities are realized via progressive curriculum expansion (e.g., Qwen2.5-1M’s 5-stage context schedule up to 1M tokens), adaptive RoPE base frequency, and synthetic tasks emphasizing long-range dependencies: Fill-in-the-Middle, paragraph reordering, and keyword retrieval (Yang et al., 26 Jan 2025).
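To make the Fill-in-the-Middle objective concrete, here is a minimal sketch of the standard prefix/suffix/middle rearrangement applied to a training document; the sentinel strings are placeholders, not Qwen's actual special tokens.

```python
import random

def make_fim_example(text: str, rng: random.Random,
                     prefix_tok="<fim_prefix>", suffix_tok="<fim_suffix>",
                     middle_tok="<fim_middle>") -> str:
    """Split a document into (prefix, middle, suffix) and rearrange it so the
    model must generate the middle conditioned on both sides."""
    a, b = sorted(rng.sample(range(len(text)), 2))
    prefix, middle, suffix = text[:a], text[a:b], text[b:]
    return f"{prefix_tok}{prefix}{suffix_tok}{suffix}{middle_tok}{middle}"

rng = random.Random(0)
print(make_fim_example("The quick brown fox jumps over the lazy dog.", rng))
```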
Post-training includes supervised fine-tuning (SFT) on millions of high-quality instruction–response pairs covering multilingual, coding, math, and structured-data scenarios. RLHF methods evolve from Proximal Policy Optimization (PPO) in Qwen to Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO) in later generations, with reward models trained from human and execution feedback. Multistage RL is coupled with chain-of-thought prompts and tool-augmented SFT for specialty models, e.g., Qwen2.5-Math-Instruct (Qwen et al., 19 Dec 2024, Yang et al., 18 Sep 2024).
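For reference, a minimal sketch of the DPO objective mentioned above, written over precomputed sequence-level log-probabilities; the tensor values and the beta coefficient are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta: float = 0.1):
    """DPO: widen the (chosen - rejected) log-prob margin of the policy
    relative to a frozen reference model, without an explicit reward model."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy batch of 4 preference pairs (sequence-level log-probs).
loss = dpo_loss(torch.tensor([-12.0, -9.5, -15.0, -11.0]),
                torch.tensor([-13.5, -10.0, -14.0, -12.5]),
                torch.tensor([-12.5, -9.8, -14.5, -11.2]),
                torch.tensor([-13.0, -10.1, -14.2, -12.0]))
print(loss.item())
```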
Distillation plays a crucial role in Qwen3, where strong dense/MoE teachers transfer "thinking" and "non-thinking" capabilities to smaller models, reducing GPU hours by an order of magnitude versus full RL, and preserving competitive performance at all scales (Yang et al., 14 May 2025).
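As a generic illustration of logit distillation (a common formulation, not necessarily the exact off-policy/on-policy recipe used for Qwen3), the sketch below computes a temperature-scaled KL divergence between teacher and student token distributions.

```python
import torch
import torch.nn.functional as F

def distill_kl(student_logits, teacher_logits, temperature: float = 2.0):
    """KL(teacher || student) on softened distributions; the T^2 factor keeps
    gradient scale comparable across temperatures."""
    t = temperature
    teacher_prob = F.softmax(teacher_logits / t, dim=-1)
    student_logprob = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(student_logprob, teacher_prob, reduction="batchmean") * (t * t)

student = torch.randn(8, 32000)   # (tokens, vocab)
teacher = torch.randn(8, 32000)
print(distill_kl(student, teacher).item())
```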
3. Multimodal, Expert, and Downstream Model Specializations
The Qwen foundation supports a suite of domain- and modality-specific variants:
- Qwen-VL/Qwen2.5-VL: Vision-LLMs combining a ViT encoder (dynamic resolution, window attention) with Qwen LLM decoder. Later versions incorporate absolute-time MRoPE, structured data parsing (HTML from tables/forms), and agentic reasoning for UI and mobile control. Performance matches or surpasses GPT-4o and Claude 3.5 Sonnet on MMMU, document parsing, and GUI tasks (Bai et al., 2023, Bai et al., 19 Feb 2025).
- Qwen2.5-Math: Math-specialized models leveraging iterative self-improvement, reward-model-guided SFT/RL, and tool-integrated reasoning via Python code execution (a loop of this kind is sketched after this list). State-of-the-art on MATH, GSM8K, and contest problems, with bilingual (EN/CH) support (Yang et al., 18 Sep 2024).
- Qwen-Audio: Audio-LLMs utilizing Whisper-derived encoders, hierarchical tag-injection for multi-task data (speech, sound, music, captions), and unified decoding with Qwen's LLM. Outperforms prior open multitask models on ASR, AQA, music classification, and speech emotion (Chu et al., 2023).
- Qwen2.5-Omni/Qwen3-Omni: End-to-end multimodal agents integrating text/image/audio/video understanding and streaming speech generation (discrete codec, causal convnet). Performance is non-degrading across all modalities; e.g., Qwen3-Omni matches/surpasses single-modality Qwen3 models on MMLU, MathVista, GTZAN, and AIME (Xu et al., 26 Mar 2025, Xu et al., 22 Sep 2025).
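The tool-integrated reasoning loop referenced in the Qwen2.5-Math bullet can be sketched as a generate-execute-feedback cycle; the `<python>` tag convention, the `generate` stub, and the step limit are assumptions for illustration, not the released inference stack.

```python
import re
import subprocess
import sys

CODE_BLOCK = re.compile(r"<python>(.*?)</python>", re.DOTALL)

def run_python(code: str) -> str:
    """Execute model-emitted code in a subprocess and capture its output."""
    proc = subprocess.run([sys.executable, "-c", code],
                          capture_output=True, text=True, timeout=10)
    return proc.stdout + proc.stderr

def tool_integrated_reasoning(generate, question: str, max_steps: int = 4) -> str:
    """Alternate model generation with code execution until no more tool calls
    appear (a generic TIR loop, not Qwen's exact protocol)."""
    transcript = question
    for _ in range(max_steps):
        step = generate(transcript)            # user-supplied LLM call
        transcript += "\n" + step
        match = CODE_BLOCK.search(step)
        if match is None:                      # no code emitted -> final answer
            break
        transcript += "\n[tool output]\n" + run_python(match.group(1))
    return transcript

# Toy usage with a canned "model" that solves 2 + 2 via the tool:
fake_steps = iter(["Let me compute.\n<python>print(2 + 2)</python>", "The answer is 4."])
print(tool_integrated_reasoning(lambda _: next(fake_steps), "What is 2 + 2?"))
```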
Secondary innovations include Smoothie-Qwen (post-hoc language bias smoothing for improved multilingual controllability), Qwen-LookAgain (visual reasoning with attention re-injection for hallucination reduction), and the image generation/editing foundation Qwen-Image, which introduces double-stream latent conditioning, MSRoPE for joint image/text encoding, and a curriculum learning pipeline (Ji et al., 8 Jul 2025, Chu et al., 29 May 2025, Wu et al., 4 Aug 2025).
4. Inference, Scaling, and Optimization Frameworks
Qwen models incorporate significant inference-time innovations, especially for long context and high-throughput deployment:
- Dual Chunk Attention + YaRN: Enables efficient context length extrapolation (from 256K to 1M tokens) without retraining by local/global position remapping and attention temperature scaling. Preserves local attention patterns while maintaining stable performance (Yang et al., 26 Jan 2025).
- Sparse Attention with MInference Vertical-Slash: Significantly reduces compute for long-context prefill (up to 10×). Chunked prefill/activation storage and sparsity refinement methods ensure VRAM and throughput efficiency at ultra-long sequences.
- Engine-level Enhancements (BladeLLM, vLLM, Dynamic Chunked Pipeline Parallelism, Asynchronous Generator Scheduling): Kernel-level tuning, pipeline balancing, and fully decoupled scheduler/executor/decoder chains achieve 3–7× end-to-end speedups on 1M-context workloads (Yang et al., 26 Jan 2025).
- FlashAttention, Quantization, LoRA/QLoRA, and NEFTune: Modularization for efficient fine-tuning and inference on commodity GPUs, especially for smaller variants and real-time deployment (as in the Qwen2.5 3B movie dialogue finetuning) (Gupta, 22 Feb 2025); a minimal LoRA/QLoRA configuration sketch follows below.
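As an illustration of the LoRA/QLoRA path on commodity GPUs, here is a configuration sketch assuming the Hugging Face transformers/peft/bitsandbytes stack; the checkpoint name, target modules, and hyperparameters are illustrative choices, not the settings used in the cited fine-tuning work.

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization of the frozen base model (the "Q" in QLoRA).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-3B-Instruct",          # assumed Hub id; substitute the variant you use
    quantization_config=bnb_config,
    device_map="auto",
)

# Low-rank adapters on the attention projections; only these weights train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()        # typically well under 1% of total parameters
```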
5. Empirical Benchmarks and Comparative Performance
Across the family, Qwen models set leading open-source scores on numerous benchmarks:
- Language Understanding/Reasoning: Qwen2.5-72B-Instruct achieves 83.1 (MATH), 86.6 (HumanEval), and 9.35 (MT-Bench), closely matching the far larger Llama-3-405B-Instruct (Qwen et al., 19 Dec 2024).
- Long-Context: Qwen2.5-14B-Instruct-1M attains 100% passkey retrieval at 1M tokens, 95.7 (RULER-128K), 92.2 (RULER-128K slice), and 43.3 (LV-Eval-256K), consistently exceeding GPT-4o-mini (Yang et al., 26 Jan 2025).
- Coding and Math: Qwen2.5-Math-72B-Instruct achieves 91.6 (GSM8K), 66.8 (MATH), with TIR-driven SOTA on Olympiad-level challenges. Qwen2.5-Turbo matches/exceeds GPT-4o-mini on HumanEval and GSM8K at a fraction of infrastructure cost (Yang et al., 18 Sep 2024, Qwen et al., 19 Dec 2024).
- Vision, Audio, Multimodal: Qwen2.5-VL-72B achieves 79.8% parse accuracy on CC-OCR (state of the art) and leading results on document and diagram understanding. Qwen3-Omni achieves an ASR WER of 1.22 on LibriSpeech, 93.0% GTZAN music-genre accuracy, and overall SOTA or parity with Gemini-2.5-Pro, GPT-4o, and Seed-ASR on 22 of 36 multimodal tasks (Bai et al., 19 Feb 2025, Xu et al., 22 Sep 2025).
Qwen3's introduction of dynamic mode switching and a configurable thinking budget enables adaptive balancing of latency and reasoning depth, removing the need for separate chat and chain-of-thought models and delivering unified, flexible performance across scenarios (Yang et al., 14 May 2025).
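The mode switch can be exercised from the standard chat-template interface; the snippet below is a sketch assuming the Hugging Face tokenizer for a Qwen3 checkpoint exposes the `enable_thinking` flag described in the public model cards, and the Hub id is an assumption.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")   # assumed Hub id
messages = [{"role": "user", "content": "Prove that the sum of two even numbers is even."}]

# Thinking mode: the template leaves room for a <think> ... </think> reasoning span.
thinking_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)

# Non-thinking mode: same model, same template, reasoning span suppressed.
direct_prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
```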
6. Multilinguality, Controllability, and Future Directions
Qwen’s multilingual capacity scales from 30 languages (Qwen2) to 119 languages/dialects (Qwen3), measured across Multi-IF, INCLUDE, MMMLU, and MT-AIME2024. The introduction of post-hoc smoothing (Smoothie-Qwen) mitigates language confusion due to token prior imbalance, suppressing unintended output of dominant languages (over 95% decrease in unintended Chinese) with negligible loss in task accuracy (Ji et al., 8 Jul 2025). Qwen3-Omni expands spoken language production to 10 languages and understanding to 19, supporting low-latency streaming in all modalities (Xu et al., 22 Sep 2025).
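The smoothing idea can be illustrated with a runtime logit adjustment that down-weights tokens of an over-dominant language; the token-id set and scaling factor are placeholders, and the published Smoothie-Qwen method operates on model weights rather than runtime logits, so this is an analogue rather than a reimplementation.

```python
import math
import torch

def smooth_language_logits(logits: torch.Tensor,
                           suppressed_token_ids: torch.Tensor,
                           factor: float = 0.5) -> torch.Tensor:
    """Down-weight tokens tied to an over-dominant language by adding
    log(factor) to their logits; the downstream softmax renormalizes."""
    adjusted = logits.clone()
    adjusted[..., suppressed_token_ids] += math.log(factor)
    return adjusted

logits = torch.randn(1, 32000)                      # toy vocabulary
zh_token_ids = torch.tensor([101, 202, 303])        # placeholder ids for illustration
smoothed = smooth_language_logits(logits, zh_token_ids)
before = torch.softmax(logits, dim=-1)[0, zh_token_ids]
after = torch.softmax(smoothed, dim=-1)[0, zh_token_ids]
print(after / before)                               # ratios < 1 after renormalization
```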
Planned directions involve efficient pretraining for >1M-token context, improved block-wise streaming architectures for edge deployment, further multimodal expansion (e.g., music, tactile), context-aware smoothing, and fine-grained modality-aligned RLHF. The paradigm of self-improving, expert-specialized models (as seen in Qwen2.5-Math and vision/audio chat models) is positioned to extend to additional domains such as code, law, and medicine (Yang et al., 18 Sep 2024, Yang et al., 26 Jan 2025).
7. Summary Table: Principal Qwen Model Series
| Series | Key Features | Typical Sizes / Context | Modalities Supported | Licensing |
|---|---|---|---|---|
| Qwen (2023) | LLaMA-based, RLHF, Chat agent | 1.8B–14B / 2K | Text, agent | Apache 2.0 |
| Qwen2 (2024) | GQA, DCA, YaRN, MoE | 0.5B–72B / 32K–131K | Text (VL/MLLM offshoots) | Open weights |
| Qwen2.5 (2024) | ABF, modular specializations, Turbo/Plus | 0.5B–72B+ / 128K–1M | Text, Math, Code, VL, Audio, Omni | Apache 2.0 / Qwen license |
| Qwen3 (2025) | QK-Norm, “Thinking/Non-thinking” fusion | 0.6B–235B / 128K | Text, Omni (VL/Audio/Video TTS) | Apache 2.0 |
| Qwen3-Omni (2025) | Unified SOTA across all modalities | 30B-A3B MoE / 32K | Text, Image, Audio, Video, TTS | Apache 2.0 |
Qwen models have established a flexible ecosystem of scalable, modular, and efficiently deployable LLMs and multimodal agents, consistently achieving or approaching state-of-the-art on open evaluation suites in language, mathematics, vision, coding, audio, and robust agentic interaction (Xu et al., 22 Sep 2025, Yang et al., 14 May 2025, Qwen et al., 19 Dec 2024, Bai et al., 2023, Bai et al., 19 Feb 2025, Yang et al., 18 Sep 2024, Chu et al., 2023, Yang et al., 26 Jan 2025).