Qwen3 is presented as the latest generation of the Qwen LLM family, aiming to advance performance, efficiency, and multilingual capabilities. The series includes both dense and Mixture-of-Experts (MoE) models, ranging from 0.6 billion to 235 billion total parameters (with activated parameters as low as 3 billion in one MoE model). A key innovation is the integration of distinct thinking and non-thinking modes into a single model framework, allowing dynamic switching and adaptive resource allocation via a thinking budget mechanism during inference. This eliminates the need for separate models optimized for different tasks. Qwen3 expands multilingual support significantly, from 29 to 119 languages and dialects, leveraging a massive 36-trillion-token pre-training dataset. All Qwen3 models are released under the Apache 2.0 license.
The architecture of the Qwen3 dense models builds upon Qwen2.5, incorporating features like Grouped Query Attention (GQA), SwiGLU, Rotary Positional Embeddings (RoPE), and RMSNorm with pre-normalization. QKV bias is removed, and QK-Norm is introduced for stable training. Dense models range from 0.6B to 32B parameters. The Qwen3 MoE models (30B-A3B and 235B-A22B) share the core dense architecture but feature 128 total experts with 8 activated per token. Unlike previous versions, they exclude shared experts and adopt a global-batch load-balancing loss to encourage expert specialization. All models use Qwen's BBPE tokenizer with a vocabulary size of 151,669.
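QK-Norm here means applying RMSNorm to the per-head query and key vectors before the attention dot product, used alongside the removal of QKV bias to keep training stable. Below is a minimal PyTorch sketch of that idea; the hidden size, head count, and epsilon are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm without bias, applied over the last (head) dimension."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class QKNormAttention(nn.Module):
    """Multi-head attention with QK-Norm and no QKV bias (illustrative sizes)."""
    def __init__(self, hidden: int = 1024, n_heads: int = 8, head_dim: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)
        self.q_norm = RMSNorm(head_dim)  # QK-Norm: normalize per-head queries...
        self.k_norm = RMSNorm(head_dim)  # ...and keys before the dot product

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)             # stabilizes attention logits
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```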
The pre-training process for Qwen3 is conducted in three stages on a diverse dataset of 36 trillion tokens covering 119 languages:
- General Stage (S1): Trained on over 30 trillion tokens (4,096 sequence length) to build general language proficiency and world knowledge across 119 languages.
- Reasoning Stage (S2): Further training on ~5 trillion higher-quality tokens (4,096 sequence length) with an increased proportion of STEM, coding, reasoning, and synthetic data to enhance reasoning abilities. Learning rate decay is accelerated.
- Long Context Stage: Pre-trained on hundreds of billions of tokens (32,768 sequence length) using high-quality long-context corpora. Techniques like RoPE ABF (base frequency increased from 10,000 to 1,000,000), YaRN, and Dual Chunk Attention (DCA) are used to extend the inference context length up to 128K tokens.
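The RoPE ABF change can be seen directly in the inverse-frequency table: raising the base from 10,000 to 1,000,000 slows the rotation of the low-frequency dimensions so positional phases remain distinguishable over much longer sequences. A small illustrative sketch (the head dimension is a placeholder):

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies for rotary embeddings: theta_i = base^(-2i/d)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128
short_ctx = rope_inv_freq(head_dim, base=10_000.0)     # earlier pre-training stages
long_ctx  = rope_inv_freq(head_dim, base=1_000_000.0)  # long-context stage (ABF)

# With the larger base, the slowest-rotating dimension completes far fewer cycles
# over a 32K-token window, leaving headroom for YaRN/DCA to stretch to 128K at inference.
print(short_ctx[-1].item(), long_ctx[-1].item())  # smallest frequencies; ABF value is much smaller
```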
Pre-training evaluations show that Qwen3 base models achieve state-of-the-art performance compared to other open-source baselines across various benchmarks (General, Math/STEM, Coding, Multilingual).
- Qwen3-235B-A22B-Base, with fewer total (235B vs up to 671B) and activated (22B vs up to 37B) parameters, outperforms leading open-source MoE models like DeepSeek-V3 Base and Llama-4-Maverick Base on most tasks.
- Qwen3 MoE models achieve performance comparable to dense models with significantly fewer activated parameters (e.g., Qwen3-30B-A3B with 3B activated params is competitive with Qwen3-14B/32B-Base).
- Qwen3 dense models (0.6B to 32B) generally match or exceed the performance of larger Qwen2.5 dense models and other open-source models like Gemma-3 and Llama-3/4 at comparable scales, particularly strong in STEM, coding, and reasoning benchmarks.
The post-training pipeline focuses on two objectives: Thinking Control and Strong-to-Weak Distillation. Flagship models undergo a four-stage process (Figure 1):
- Long-CoT Cold Start: Initial supervised fine-tuning (SFT) on a curated dataset of math, code, logical reasoning, and STEM problems with verified answers, filtered to focus on complex problems requiring CoT.
- Reasoning RL: Reinforcement learning using GRPO on challenging query-verifier pairs to further enhance reasoning. Large batch sizes, a high number of rollouts per query, and off-policy training are used, with entropy control for stability. This stage significantly improves reasoning scores such as AIME.
- Thinking Mode Fusion: Continual SFT on the Reasoning RL model using a unified dataset containing both "thinking" and "non-thinking" examples. A chat template with /think and /no_think flags (Table 2) enables dynamic mode switching (see the sketch after this list). An empty thinking block in the non-thinking template ensures format consistency, and the thinking budget mechanism emerges from this stage, allowing partial reasoning.
- General RL: RL across diverse tasks (Instruction Following, Format Following, Preference Alignment, Agent Ability, Specialized Scenarios) using a sophisticated reward system that combines rule-based and model-based rewards (with and without reference answers) to improve overall capabilities and stability.
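To illustrate how the flags are meant to be used in practice, the sketch below appends /no_think to a user turn (the soft switch) and also passes a template-level switch when rendering the prompt. The enable_thinking argument is an assumption based on the released Hugging Face chat template and should be verified against the tokenizer configuration you actually load.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any Qwen3 chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Soft switch: the /no_think flag in the user turn requests non-thinking mode;
# /think (or omitting the flag) requests the default thinking mode.
messages = [{"role": "user", "content": "How many r's are in 'strawberry'? /no_think"}]

# Template-level switch (assumed to be `enable_thinking`, per the released chat template):
# disabling it inserts the empty thinking block described above.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```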
Strong-to-Weak Distillation is applied to smaller models (0.6B to 14B dense, 30B-A3B MoE) using knowledge transfer from larger teacher models (Qwen3-32B or Qwen3-235B-A22B). This involves two phases:
- Off-policy Distillation: Distilling responses from teacher models in both thinking and non-thinking modes to impart basic reasoning and mode-switching skills.
- On-policy Distillation: Student models generate responses on-policy, and the student is then fine-tuned by minimizing the KL divergence between its logits and the teacher's. This approach significantly improves performance and training efficiency compared to direct RL (Table 8), requiring only about 1/10 of the GPU hours while enhancing exploration.
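Conceptually, on-policy distillation minimizes the KL divergence between the teacher's and the student's next-token distributions on sequences sampled from the student itself. A minimal PyTorch sketch of such a loss follows; the KL direction and temperature handling are illustrative assumptions rather than the exact recipe from the report.

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per batch element.

    Both tensors have shape (batch, seq_len, vocab); the token sequences are the
    student's own on-policy generations, scored by a forward pass of both models.
    """
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probs for the input and (log-)probs for the target.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)

# Usage: sample responses from the student, score the same tokens with teacher and
# student, then backpropagate distill_kl_loss through the student parameters only.
```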
Post-training evaluations cover a wide range of benchmarks, including standard tests (MMLU-Redux, C-Eval, LiveBench) and specialized tasks (IFEval, Arena-Hard, AlignBench, Creative Writing, WritingBench, MATH-500, AIME, ZebraLogic, AutoLogi, BFCL, LiveCodeBench, CodeForces, Multi-IF, INCLUDE, MMMLU, MT-AIME, PolyMath, MLogiQA).
- Qwen3-235B-A22B (Thinking) achieves state-of-the-art open-source performance in reasoning tasks (math, coding, agent) and is competitive with closed-source models like OpenAI-o1 and Gemini-2.5-Pro (Table 4).
- Qwen3-235B-A22B (Non-thinking) outperforms other leading open-source non-reasoning models and is competitive with GPT-4o-2024-11-20 across most benchmarks (Table 5).
- Qwen3-32B (Thinking) is the new state-of-the-art dense reasoning model at its size, outperforming QwQ-32B and competing with OpenAI-o3-mini (medium) (Table 6).
- Qwen3-32B (Non-thinking) is remarkably performant, surpassing Qwen2.5-72B-Instruct and other baselines (Table 7).
- Smaller models (30B-A3B, 14B, 8B, 4B, 1.7B, 0.6B) consistently outperform or are competitive with larger open-source models, demonstrating the effectiveness of Strong-to-Weak Distillation in transferring capabilities efficiently (Tables 9-14).
The effectiveness of the Thinking Budget mechanism is demonstrated by performance scaling curves on math, coding, and STEM benchmarks, showing consistent improvement with increased budget (Figure 2). Ablation studies on Qwen3-32B (Table 9) show that Thinking Mode Fusion introduces initial mode switching ability and improves general/instruction-following tasks, while General RL further refines these capabilities and agent performance. Reasoning performance on complex tasks like AIME and LiveCodeBench in thinking mode slightly decreases after these later stages, suggesting a trade-off for broader versatility.
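One plausible way to realize the thinking budget at decoding time is sketched below: if the thinking block has not closed after a user-set number of tokens, an early-stop instruction and the closing tag are appended and the model answers from its partial reasoning. The stop-instruction wording, token names, and helper signature here are hypothetical, not the strings used by Qwen3.

```python
import torch

def generate_with_thinking_budget(model, tokenizer, prompt_ids,
                                  budget_tokens=2048, max_answer_tokens=512):
    """Illustrative thinking-budget decoding loop (not the official implementation).

    Decode up to `budget_tokens` inside the thinking block; if </think> has not
    been produced by then, append an early-stop instruction plus </think> and
    let the model produce the final answer from its partial reasoning.
    """
    think_ids = model.generate(prompt_ids, max_new_tokens=budget_tokens)
    thinking_text = tokenizer.decode(think_ids[0][prompt_ids.shape[-1]:])
    if "</think>" not in thinking_text:
        # Budget exhausted: close the thinking block with a stop instruction
        # (wording is a placeholder) and continue decoding the answer.
        stop = tokenizer("\nTime is up; answering with the reasoning so far.\n</think>\n",
                         return_tensors="pt").input_ids.to(think_ids.device)
        think_ids = torch.cat([think_ids, stop], dim=-1)
    return model.generate(think_ids, max_new_tokens=max_answer_tokens)
```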
Long-context evaluations on RULER (Table 15) show Qwen3 models perform well, exceeding Qwen2.5 models of similar size in non-thinking mode up to 128K context length. Thinking mode performance slightly degrades on these retrieval-heavy tasks. Multilingual capabilities are extensively evaluated across various tasks and languages (Tables 16-23, Belebele Table 24), showcasing strong performance across a wide linguistic spectrum.
In conclusion, Qwen3 represents a significant advancement in open-source LLMs through architectural improvements, massive and diverse pre-training, integrated thinking capabilities, and efficient post-training strategies like strong-to-weak distillation. Future work will focus on improving data quality/diversity, exploring advanced architectures and long-context techniques, and enhancing agent capabilities through RL.