Qwen Model Series: Open-Source Multimodal LLMs

Updated 2 December 2025
  • Qwen Model Series is a family of open, foundation-grade large language and multimodal models that support text, vision, audio, and code with robust performance.
  • They leverage architectural and training innovations such as Grouped Query Attention, rotary position embeddings, Mixture-of-Experts layers, and strong-to-weak distillation to achieve state-of-the-art results.
  • The models are released under permissive licenses, enabling researchers and practitioners to extend, quantize, and deploy them across diverse environments.

The Qwen model series is a family of open-weight, foundation-grade large language and multimodal models developed by Alibaba, covering dense, Mixture-of-Experts (MoE), and diffusion architectures for text, vision, audio, and code. Spanning parameter scales from sub-billion to hundreds of billions, Qwen models have established new state-of-the-art baselines throughout the open-source LLM ecosystem by innovating across architecture, scaling, pretraining, alignment, modality integration, and deployment. Qwen models have been made broadly available under permissive licenses, enabling the research community and industry practitioners to extend, quantize, and deploy them across a range of environments.

1. Evolution and Model Lineup

The Qwen series has evolved rapidly since its introduction. The original Qwen (Qwen1.0) comprised transformer-decoder models in the 1.8B–14B parameter class and introduced foundational practices such as rotary positional encoding, untied embeddings, RMSNorm, and large-scale mixed text-code-multilingual pretraining. Building on this, Qwen2 expanded context length to 32 K (up to 131 K with DCA+YARN), introduced Grouped Query Attention (GQA), released dense (including edge-scale) and MoE variants, and grew the pretraining corpus to 7 T tokens (Yang et al., 15 Jul 2024).

Subsequent releases, Qwen2.5 and Qwen3, scaled pretraining to 18 T and 36 T tokens respectively, extended the parameter range to 235 B and context windows up to 1 million tokens (Qwen2.5-1M), and adopted strong-to-weak distillation to obtain high-quality small models (Qwen et al., 19 Dec 2024, Yang et al., 14 May 2025, Yang et al., 26 Jan 2025). Qwen3 unifies chain-of-thought and direct-answer modes (“/think”, “/no_think”) within a shared prompting and flagging strategy, and multilingual coverage expanded from 30 to 119 languages (Yang et al., 14 May 2025).
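As a concrete illustration, the sketch below shows how the two modes can be selected through the Hugging Face chat template. It assumes a published Qwen3 checkpoint (Qwen/Qwen3-8B is used as a placeholder) and the `enable_thinking` template argument described on the Qwen3 model cards; verify both against the release you are using.

```python
# Minimal sketch of switching Qwen3 between thinking and direct modes via the
# Hugging Face chat template; assumes the published Qwen3 checkpoints and the
# `enable_thinking` template kwarg documented on their model cards.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

messages = [{"role": "user", "content": "Explain rotary position embeddings. /no_think"}]

# The "/think" and "/no_think" soft switches live in the prompt itself;
# enable_thinking sets the default behaviour when no switch is present.
prompt = tok.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=False,  # direct-answer mode; True enables chain-of-thought
)
print(prompt)
```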

The series includes:

| Generation | Dense Models | MoE Models | Specializations |
|---|---|---|---|
| Qwen (1.0/1.5) | 1.8B, 7B, 14B | – | Chat, Code-Qwen, Math-Qwen |
| Qwen2 | 0.5B, 1.5B, 7B, 72B | 57B-A14B | Qwen2-Math(-Instruct) |
| Qwen2.5 | 0.5–72B | Turbo, Plus (hosted) | Qwen2.5-Math, Qwen2.5-Coder, VL, Audio |
| Qwen2.5-1M | 7B, 14B (1M context) | – | – |
| Qwen3 | 0.6–32B | 30B-A3B, 235B-A22B | "Thinking", Embedding, Omni (multimodal) |
| Qwen3-VL/Omni | 2–235B (VL/Omni) | 30B-A3B, 235B-A22B | Audio, Video, GUI, Code, STEM |

Multimodal branches developed in tandem: Qwen-VL, Qwen2-VL, Qwen2.5-VL, and Qwen3-VL for vision-language; Qwen-Audio and Qwen3-Omni for audio-language and all-in-one generative modeling; and Qwen-Image for foundation diffusion architectures supporting advanced image generation and editing.

2. Architectural Principles and Innovations

Core Transformer Design

All Qwen LLMs use decoder-only transformers with:

  • Grouped Query Attention (GQA) to reduce KV-cache size and increase decoding throughput (a minimal sketch follows this list).
  • Rotary Position Embeddings (RoPE), with the base frequency raised to 1 M for long-context support.
  • SwiGLU activations, QKV bias, RMSNorm (pre-norm).
  • YARN and Dual-Chunk Attention for context extrapolation.
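The sketch below is a minimal, illustrative grouped-query attention in PyTorch, not the production Qwen kernel: several query heads share one key/value head, which is what shrinks the KV cache.

```python
# Minimal grouped-query attention sketch (illustrative only): n_kv_heads < n_heads,
# so several query heads share one key/value head, shrinking the KV cache by
# a factor of n_heads / n_kv_heads.
import torch
import torch.nn.functional as F

def grouped_query_attention(x, wq, wk, wv, n_heads, n_kv_heads):
    B, T, D = x.shape
    hd = D // n_heads                                           # per-head dim
    q = (x @ wq).view(B, T, n_heads, hd).transpose(1, 2)        # (B, H, T, hd)
    k = (x @ wk).view(B, T, n_kv_heads, hd).transpose(1, 2)     # (B, Hkv, T, hd)
    v = (x @ wv).view(B, T, n_kv_heads, hd).transpose(1, 2)
    # Repeat each KV head so it serves its whole group of query heads.
    rep = n_heads // n_kv_heads
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(B, T, D)

# Example: 32 query heads sharing 4 KV heads (an 8x KV-cache reduction).
D, H, HKV = 1024, 32, 4
x = torch.randn(2, 16, D)
wq = torch.randn(D, D)
wk = torch.randn(D, HKV * (D // H))
wv = torch.randn(D, HKV * (D // H))
print(grouped_query_attention(x, wq, wk, wv, H, HKV).shape)  # torch.Size([2, 16, 1024])
```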

Mixture-of-Experts (MoE)

In MoE variants, feedforward layers are partitioned into E experts. Routing is accomplished via a gating softmax:

p = \mathrm{softmax}(G(x)), \quad y = \sum_{i \in \mathrm{top}_k(p)} p_i E_i(x)

Typically k = 8 (Qwen2-57B-A14B) and E = 64–128, with expert initialization involving dense upcycling, parameter shuffling, and 50 % re-initialization. A load-balancing loss encourages even expert utilization (Yang et al., 15 Jul 2024, Yang et al., 14 May 2025).
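A minimal top-k routing layer consistent with the gating equation above is sketched below; expert sizes, shared experts, and the auxiliary load-balancing loss are simplified or omitted relative to Qwen's actual MoE implementation.

```python
# Illustrative top-k MoE routing matching the gating equation above.
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    def __init__(self, d_model, d_ff, n_experts=64, k=8):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)   # G(x)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                                  # x: (tokens, d_model)
        p = torch.softmax(self.gate(x), dim=-1)            # routing probabilities
        topp, topi = p.topk(self.k, dim=-1)                # top-k experts per token
        y = torch.zeros_like(x)
        for slot in range(self.k):
            for e in topi[:, slot].unique():
                mask = topi[:, slot] == e                  # tokens routed to expert e
                y[mask] += topp[mask, slot, None] * self.experts[int(e)](x[mask])
        return y

moe = TopKMoE(d_model=64, d_ff=256, n_experts=16, k=4)
print(moe(torch.randn(10, 64)).shape)   # torch.Size([10, 64])
```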

Scaling and Distillation

Qwen models employ empirical scaling laws for data/model-size allocation. Strong-to-weak distillation pipelines (combining off-policy and on-policy stages) enable compact models to approach larger teacher performance at reduced compute by aligning logits in both “thinking” and “non-thinking” modes:

L_{KL} = \mathbb{E}_{x \sim \mathcal{D}}\left[\mathrm{KL}\left(P_T(\cdot \mid x)\,\|\,P_S(\cdot \mid x)\right)\right]

(Yang et al., 14 May 2025)
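The sketch below implements this objective as a token-level KL between teacher and student next-token distributions; the temperature and tensor shapes are illustrative, not Qwen's exact recipe.

```python
# Sketch of the strong-to-weak distillation objective: KL(P_T || P_S) between
# the teacher's and student's next-token distributions on shared prompts.
import torch
import torch.nn.functional as F

def distill_kl_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(P_T || P_S) averaged over tokens; logits have shape (B, T, vocab)."""
    p_t = F.softmax(teacher_logits / temperature, dim=-1)
    log_p_t = F.log_softmax(teacher_logits / temperature, dim=-1)
    log_p_s = F.log_softmax(student_logits / temperature, dim=-1)
    return (p_t * (log_p_t - log_p_s)).sum(-1).mean()

t = torch.randn(2, 8, 1000)   # teacher logits (e.g. a large MoE teacher)
s = torch.randn(2, 8, 1000)   # student logits (e.g. a small dense student)
print(distill_kl_loss(t, s))
```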

3. Training Regimes and Data

Pretraining Corpora

Pretraining aggregates web, domain, code, and math sources. Qwen2.5 incorporates advanced domain re-balancing (e.g., upsampling legal and scientific text), and Qwen3 adds large multilingual data (extracted and labeled with Qwen2.5 and Qwen2.5-VL) plus synthetic STEM, math, and coding data, with total corpora of 18 T and 36 T tokens for the respective dense/MoE model families (Qwen et al., 19 Dec 2024, Yang et al., 14 May 2025).

Alignment and Post-Training

Post-training combines large-scale supervised fine-tuning with preference-based and reinforcement learning stages (e.g., DPO and GRPO). Qwen3 follows a multi-stage pipeline that cold-starts long chain-of-thought behavior, applies reasoning-focused RL, fuses thinking and non-thinking modes, and finishes with general-domain RL (Yang et al., 14 May 2025).

Multimodal Synthesis

Qwen-VL and its successors employ ViT-based visual encoders, while the audio models use Whisper-derived encoders; dynamic RoPE and absolute temporal encoding handle native image/video resolutions and hours-long video. Qwen-Image extracts high-fidelity features through dual VLM-guided and VAE-based encoding, supporting text-to-image (T2I), text-and-image-to-image (TI2I), and image-to-image (I2I) generation and editing (Wu et al., 4 Aug 2025, Bai et al., 19 Feb 2025, Bai et al., 26 Nov 2025).

4. Multimodal Extensions and Unified Modeling

Vision-Language (VL)

Qwen-VL (OpenCLIP + Qwen-7B) introduced a 3-stage pipeline: large-scale image-caption pretraining, task-focused multi-task learning (VQA, OCR, grounding), and multimodal SFT. Successors (Qwen2-VL, Qwen2.5-VL) add Naive Dynamic Resolution, multimodal RoPE (MRoPE) with interleaved/absolute time encoding, windowed attention in the vision transformer, and dynamic sequence packing (Wang et al., 18 Sep 2024, Bai et al., 19 Feb 2025).
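The following sketch illustrates the MRoPE idea on assumed inputs: each token receives a (temporal, height, width) position triple, with text tokens sharing one index across all three axes and vision patches indexed by frame, row, and column. Offsets and axis ordering are simplified relative to the released Qwen2-VL code.

```python
# Hedged sketch of multimodal RoPE (MRoPE) position ids: text tokens share one
# monotonically increasing index on all three axes; vision patches are indexed
# by (frame, row, column) and offset to follow the text prefix.
import torch

def mrope_position_ids(n_text_tokens, n_frames, grid_h, grid_w):
    # Text prefix: identical temporal/height/width components.
    text = torch.arange(n_text_tokens)
    text_pos = torch.stack([text, text, text], dim=0)           # (3, n_text)

    # Vision patches: temporal = frame index, spatial = row/column in the grid.
    t = torch.arange(n_frames).repeat_interleave(grid_h * grid_w)
    h = torch.arange(grid_h).repeat_interleave(grid_w).repeat(n_frames)
    w = torch.arange(grid_w).repeat(grid_h * n_frames)
    vis_pos = torch.stack([t, h, w], dim=0) + n_text_tokens     # offset after text

    return torch.cat([text_pos, vis_pos], dim=1)                # (3, total_tokens)

pos = mrope_position_ids(n_text_tokens=5, n_frames=2, grid_h=3, grid_w=4)
print(pos.shape)   # torch.Size([3, 29]) -> 5 text + 2*3*4 vision tokens
```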

Qwen3-VL pioneers 256 K-token context, interleaved-MRoPE (spectral axis interleaving for spatiotemporal encoding), DeepStack multi-level ViT fusion, and text-based time alignment for robust image/video/text co-reference (Bai et al., 26 Nov 2025).

Audio-Language & Multimodal

Qwen-Audio achieves universal audio modeling via hierarchical tagging and multitask pretraining over 30+ tasks (ASR, S2TT, AAC, SEC, AQA, etc.) in 8 languages with no task adapters, obtaining SOTA zero-shot performance on LibriSpeech, Aishell, Clotho, and others (Chu et al., 2023). Qwen3-Omni, with a Thinker-Talker MoE architecture, unifies text, image, audio, and video, enabling streaming speech with 234 ms first-packet latency and preserving unimodal SOTA on all tasks (Xu et al., 22 Sep 2025).

Image Generation

Qwen-Image (based on Qwen2.5-VL and MMDiT) balances text-conditioned semantic and reconstructive VAE streams for T2I/TI2I/I2I. Data curation includes strict filtering, text rendering augmentation (English/Chinese), and synthetic/real compositional blends, outpacing Imagen-4 and GPT-Image 1 in text fidelity and editability (Wu et al., 4 Aug 2025).
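A hedged text-to-image usage sketch follows, assuming a diffusers release with Qwen-Image pipeline support; the repository name and arguments mirror the usual model-card pattern and should be checked before use.

```python
# Hedged text-to-image sketch for Qwen-Image via diffusers (assumed support);
# verify the repo id, dtype, and pipeline arguments against the model card.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = 'A storefront sign that reads "通义千问 Qwen" in neon, rainy night, photorealistic'
image = pipe(prompt=prompt, num_inference_steps=50).images[0]
image.save("qwen_image_demo.png")
```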

5. Empirical Performance and Benchmarks

Qwen models have consistently set state-of-the-art results among open models and match proprietary models in many domains. Qwen2-72B established strong open-weight baselines across knowledge, code, and math benchmarks (Yang et al., 15 Jul 2024).

Qwen2.5-72B-Instruct surpasses Llama3-70B and approaches Llama3-405B on MMLU-Pro, MATH, and HumanEval (Qwen et al., 19 Dec 2024); Qwen3-235B-A22B advances further, exceeding DeepSeek-V3 and matching OpenAI-o1 and Gemini 2.5 Pro on challenging reasoning and multilingual tasks (Yang et al., 14 May 2025). Qwen3-VL-235B-A22B-Instruct achieves SOTA on MMLongBench-Doc (57.0), MMMU (80.6), and MathVista (85.8) (Bai et al., 26 Nov 2025).

Qwen3-Omni-30B-A3B matches or outperforms single-modal Qwen3 baselines in text, vision, and audio, and achieves SOTA ASR and TTS in large-scale multilingual evaluation, e.g., a Fleurs-19 average WER of 5.33 % (Xu et al., 22 Sep 2025). Qwen2.5-Math-72B-Instruct reaches a MATH pass@1 of 66.8 and outperforms GPT-4o on competition benchmarks such as AMC and AIME as well as on Chinese math evaluations (Yang et al., 18 Sep 2024).

Embedding & Retrieval

Qwen3-Embedding-8B reports the top multilingual MTEB mean task score (70.58), outperforming Gemini embeddings on task- and type-level metrics (Zhang et al., 5 Jun 2025).
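A hedged retrieval sketch follows, using the sentence-transformers pattern shown on the Qwen3-Embedding model cards; it assumes a recent sentence-transformers release with `prompt_name` and `similarity` support.

```python
# Hedged usage sketch for Qwen3-Embedding via sentence-transformers; the
# query-side prompt name follows the model card and is an assumption here.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-8B")

queries = ["What architecture do Qwen MoE models use?"]
documents = [
    "Qwen MoE variants route each token to a subset of feed-forward experts.",
    "Qwen-Image is a diffusion model for text-to-image generation.",
]

# Embed and rank documents by similarity to the query.
q_emb = model.encode(queries, prompt_name="query")   # query-side instruction prompt
d_emb = model.encode(documents)
print(model.similarity(q_emb, d_emb))
```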

6. Specialized Variants and Quantization

Code and Math Models

Code-specialized Qwen2.5-Coder (0.5–32B) is trained on 5.5 T tokens (70:20:10 code:text:math), optimized for fill-in-the-middle (FIM) completion, and reaches Python HumanEval pass@1 of up to 92.7 % (32B), leading on MultiPL-E and on code reasoning and repair (Hui et al., 18 Sep 2024). Math-specialized Qwen2.5-Math uses iterative reward-model-guided self-improvement with synthetic pretraining and RL, a core driver of its chain-of-thought pass@1 above 95 % on English reasoning benchmarks (Yang et al., 18 Sep 2024).
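As an illustration of FIM usage, the sketch below follows the special-token format (`<|fim_prefix|>`, `<|fim_suffix|>`, `<|fim_middle|>`) described in the Qwen2.5-Coder documentation; the token names and repository id should be confirmed against the tokenizer you actually load.

```python
# Hedged fill-in-the-middle (FIM) sketch for Qwen2.5-Coder; the special tokens
# follow the documented format and are an assumption, not verified here.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-Coder-7B"
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prefix = "def mean(xs):\n    "
suffix = "\n    return total / len(xs)\n"
prompt = f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=32)
# Print only the newly generated "middle" span.
print(tok.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```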

Quantization

Qwen2/Qwen2.5/Qwen3 models are robust to weight-only 8-bit quantization (RTN/AWQ/GPTQ), incurring ≤0.1 pt loss on MMLU. At 4 bits, AWQ/GPTQ incur a 3–4 pt loss, while activation quantization below 8 bits or ultra-low weight quantization (<3 bit) causes catastrophic performance collapse, particularly on reasoning tasks (Zheng et al., 4 May 2025).
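The sketch below loads an assumed pre-quantized AWQ checkpoint through transformers; the repository name follows Qwen's usual naming convention (`Qwen/Qwen2.5-7B-Instruct-AWQ`) and the AutoAWQ kernels must be installed for it to run.

```python
# Hedged sketch of running a pre-quantized Qwen checkpoint; repo name assumed,
# requires the autoawq package for the 4-bit kernels.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-7B-Instruct-AWQ"   # assumed 4-bit AWQ variant
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,   # AWQ kernels typically run in fp16
    device_map="auto",
)

messages = [{"role": "user", "content": "Summarize grouped query attention in one sentence."}]
inputs = tok.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt").to(model.device)
print(tok.decode(model.generate(inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```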

Distilled Reasoning: DistilQwen

The DistilQwen series (slow-thinking, adaptive-thinking, and RV/CD-reward variants) offers fast, accurate, controller-driven dynamic CoT generation via knowledge distillation from large Qwen2.5/Qwen3 teachers, reaching over 90 % of 32B-teacher performance at roughly a quarter of the compute cost (Cai et al., 3 Nov 2025). Controllers are trained to adaptively balance output length and difficulty per input.

7. Open-Source Ecosystem and Community Impact

All major Qwen checkpoints (dense and MoE) and variants are released under permissive Apache 2.0 licenses, with integrated support for quantization (4/8-bit), fine-tuning, inference engines (e.g., BladeLLM for 1M-token context), and multi-platform deployment (HuggingFace, ModelScope, Alibaba Cloud Model Studio). Specialized tools enable long-context operations (DCA+YARN), high-throughput MoE inference, and pipeline/kernels tuned for modern accelerators (Yang et al., 26 Jan 2025).

Community contribution is enhanced by full releases of code, training recipes, and evaluation scripts, with academic and industrial benchmarking reinforcing model transparency and reproducibility (Yang et al., 15 Jul 2024, Yang et al., 14 May 2025, Bai et al., 26 Nov 2025).


The Qwen series exemplifies modular, scalable, and extensible open foundation models, pushing the research frontier in large language, vision, audio, and multimodal intelligence, with broad availability, technical rigor, and SOTA empirical standing across tasks and domains.
