Qwen3 is presented as the latest generation of the Qwen LLM family, aiming to advance performance, efficiency, and multilingual capabilities. The series includes both dense and Mixture-of-Experts (MoE) models, ranging from 0.6 billion to 235 billion total parameters (with activated parameters as low as 3 billion in one MoE model). A key innovation is the integration of distinct thinking and non-thinking modes into a single model framework, allowing dynamic switching and adaptive resource allocation via a thinking budget mechanism during inference. This eliminates the need for separate models optimized for different tasks. Qwen3 expands multilingual support significantly, from 29 to 119 languages and dialects, leveraging a massive 36-trillion-token pre-training dataset. All Qwen3 models are released under the Apache 2.0 license.
The architecture of the Qwen3 dense models builds upon Qwen2.5, incorporating features like Grouped Query Attention (GQA), SwiGLU, Rotary Positional Embeddings (RoPE), and RMSNorm with pre-normalization. QKV bias is removed, and QK-Norm is introduced for stable training. Dense models range from 0.6B to 32B parameters. The Qwen3 MoE models (30B-A3B and 235B-A22B) share the core dense architecture but feature 128 total experts with 8 activated per token. Unlike previous versions, they exclude shared experts and adopt a global-batch load-balancing loss to encourage expert specialization. All models use Qwen's BBPE tokenizer with a vocabulary size of 151,669.
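QK-Norm here means applying RMSNorm to the per-head query and key vectors before the attention dot product, used alongside the removal of QKV bias to keep training stable. Below is a minimal PyTorch sketch of that idea; the hidden size, head count, and epsilon are illustrative assumptions, not the released configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm without bias, applied over the last (head) dimension."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return x * rms * self.weight

class QKNormAttention(nn.Module):
    """Multi-head attention with QK-Norm and no QKV bias (illustrative sizes)."""
    def __init__(self, hidden: int = 1024, n_heads: int = 8, head_dim: int = 128):
        super().__init__()
        self.n_heads, self.head_dim = n_heads, head_dim
        self.q_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.k_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.v_proj = nn.Linear(hidden, n_heads * head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * head_dim, hidden, bias=False)
        self.q_norm = RMSNorm(head_dim)  # QK-Norm: normalize per-head queries...
        self.k_norm = RMSNorm(head_dim)  # ...and keys before the dot product

    def forward(self, x):
        b, t, _ = x.shape
        q = self.q_proj(x).view(b, t, self.n_heads, self.head_dim)
        k = self.k_proj(x).view(b, t, self.n_heads, self.head_dim)
        v = self.v_proj(x).view(b, t, self.n_heads, self.head_dim)
        q, k = self.q_norm(q), self.k_norm(k)             # stabilizes attention logits
        q, k, v = (z.transpose(1, 2) for z in (q, k, v))  # (b, heads, t, head_dim)
        out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.o_proj(out.transpose(1, 2).reshape(b, t, -1))
```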
The pre-training process for Qwen3 is conducted in three stages on a diverse dataset of 36 trillion tokens covering 119 languages:
- General Stage (S1): Trained on over 30 trillion tokens (4,096 sequence length) to build general language proficiency and world knowledge across 119 languages.
- Reasoning Stage (S2): Further training on ~5 trillion higher-quality tokens (4,096 sequence length) with an increased proportion of STEM, coding, reasoning, and synthetic data to enhance reasoning abilities. Learning rate decay is accelerated.
- Long Context Stage: Pre-trained on hundreds of billions of tokens (32,768 sequence length) using high-quality long-context corpora. Techniques like RoPE ABF (base frequency increased from 10,000 to 1,000,000), YaRN, and Dual Chunk Attention (DCA) are used to extend the inference context length up to 128K tokens.
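The RoPE ABF change can be seen directly in the inverse-frequency table: raising the base from 10,000 to 1,000,000 slows the rotation of the low-frequency dimensions so positional phases remain distinguishable over much longer sequences. A small illustrative sketch (the head dimension is a placeholder):

```python
import torch

def rope_inv_freq(head_dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies for rotary embeddings: theta_i = base^(-2i/d)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

head_dim = 128
short_ctx = rope_inv_freq(head_dim, base=10_000.0)     # earlier pre-training stages
long_ctx  = rope_inv_freq(head_dim, base=1_000_000.0)  # long-context stage (ABF)

# With the larger base, the slowest-rotating dimension completes far fewer cycles
# over a 32K-token window, leaving headroom for YaRN/DCA to stretch to 128K at inference.
print(short_ctx[-1].item(), long_ctx[-1].item())  # smallest frequencies; ABF value is much smaller
```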
Pre-training evaluations show that Qwen3 base models achieve state-of-the-art performance compared to other open-source baselines across various benchmarks (General, Math/STEM, Coding, Multilingual).
- Qwen3-235B-A22B-Base, with fewer total (235B vs up to 671B) and activated (22B vs up to 37B) parameters, outperforms leading open-source MoE models like DeepSeek-V3 Base and Llama-4-Maverick Base on most tasks.
- Qwen3 MoE models achieve performance comparable to dense models with significantly fewer activated parameters (e.g., Qwen3-30B-A3B with 3B activated params is competitive with Qwen3-14B/32B-Base).
- Qwen3 dense models (0.6B to 32B) generally match or exceed the performance of larger Qwen2.5 dense models and other open-source models like Gemma-3 and Llama-3/4 at comparable scales, particularly strong in STEM, coding, and reasoning benchmarks.
The post-training pipeline focuses on two objectives: Thinking Control and Strong-to-Weak Distillation. Flagship models undergo a four-stage process (Figure 1):
- Long-CoT Cold Start: Initial supervised fine-tuning (SFT) on a curated dataset of math, code, logical reasoning, and STEM problems with verified answers, filtered to focus on complex problems requiring CoT.
- Reasoning RL: Reinforcement learning using GRPO on challenging query-verifier pairs to further enhance reasoning. Large batch sizes, a high number of rollouts per query, and off-policy training are used, with entropy control for stability. This stage significantly improves reasoning scores such as AIME.
- Thinking Mode Fusion: Continual SFT on the Reasoning RL model using a unified dataset containing both "thinking" and "non-thinking" examples. A chat template with /think and /no_think flags (Table 2) enables dynamic mode switching (see the sketch after this list). An empty thinking block in the non-thinking template ensures format consistency, and the thinking budget mechanism emerges from this stage, allowing partial reasoning.
- General RL: RL across diverse tasks (Instruction Following, Format Following, Preference Alignment, Agent Ability, Specialized Scenarios) using a sophisticated reward system that combines rule-based and model-based rewards (with and without reference answers) to improve overall capabilities and stability.
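To illustrate how the flags are meant to be used in practice, the sketch below appends /no_think to a user turn (the soft switch) and also passes a template-level switch when rendering the prompt. The enable_thinking argument is an assumption based on the released Hugging Face chat template and should be verified against the tokenizer configuration you actually load.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # any Qwen3 chat model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

# Soft switch: the /no_think flag in the user turn requests non-thinking mode;
# /think (or omitting the flag) requests the default thinking mode.
messages = [{"role": "user", "content": "How many r's are in 'strawberry'? /no_think"}]

# Template-level switch (assumed to be `enable_thinking`, per the released chat template):
# disabling it inserts the empty thinking block described above.
text = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```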
Strong-to-Weak Distillation is applied to smaller models (0.6B to 14B dense, 30B-A3B MoE) using knowledge transfer from larger teacher models (Qwen3-32B or Qwen3-235B-A22B). This involves two phases:
- Off-policy Distillation: Distilling responses from teacher models in both thinking and non-thinking modes to impart basic reasoning and mode-switching skills.
- On-policy Distillation: Student models generate responses on-policy, and the student is then fine-tuned by minimizing the KL divergence between its logits and the teacher's. This approach significantly improves performance and training efficiency compared to direct RL (Table 8), requiring only about 1/10 of the GPU hours while enhancing exploration.
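Conceptually, on-policy distillation minimizes the KL divergence between the teacher's and the student's next-token distributions on sequences sampled from the student itself. A minimal PyTorch sketch of such a loss follows; the KL direction and temperature handling are illustrative assumptions rather than the exact recipe from the report.

```python
import torch
import torch.nn.functional as F

def distill_kl_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 1.0) -> torch.Tensor:
    """KL(teacher || student) over the vocabulary, averaged per batch element.

    Both tensors have shape (batch, seq_len, vocab); the token sequences are the
    student's own on-policy generations, scored by a forward pass of both models.
    """
    s_logprobs = F.log_softmax(student_logits / temperature, dim=-1)
    t_logprobs = F.log_softmax(teacher_logits / temperature, dim=-1)
    # kl_div expects log-probs for the input and (log-)probs for the target.
    return F.kl_div(s_logprobs, t_logprobs, log_target=True,
                    reduction="batchmean") * (temperature ** 2)

# Usage: sample responses from the student, score the same tokens with teacher and
# student, then backpropagate distill_kl_loss through the student parameters only.
```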
Post-training evaluations cover a wide range of benchmarks, including standard tests (MMLU-Redux, C-Eval, LiveBench) and specialized tasks (IFEval, Arena-Hard, AlignBench, Creative Writing, WritingBench, MATH-500, AIME, ZebraLogic, AutoLogi, BFCL, LiveCodeBench, CodeForces, Multi-IF, INCLUDE, MMMLU, MT-AIME, PolyMath, MLogiQA).
- Qwen3-235B-A22B (Thinking) achieves state-of-the-art open-source performance in reasoning tasks (math, coding, agent) and is competitive with closed-source models like OpenAI-o1 and Gemini-2.5-Pro (Table 4).
- Qwen3-235B-A22B (Non-thinking) outperforms other leading open-source non-reasoning models and is competitive with GPT-4o-2024-11-20 across most benchmarks (Table 5).
- Qwen3-32B (Thinking) is the new state-of-the-art dense reasoning model at its size, outperforming QwQ-32B and competing with OpenAI-o3-mini (medium) (Table 6).
- Qwen3-32B (Non-thinking) is remarkably performant, surpassing Qwen2.5-72B-Instruct and other baselines (Table 7).
- Smaller models (30B-A3B, 14B, 8B, 4B, 1.7B, 0.6B) consistently outperform or are competitive with larger open-source models, demonstrating the effectiveness of Strong-to-Weak Distillation in transferring capabilities efficiently (Tables 9-14).
The effectiveness of the Thinking Budget mechanism is demonstrated by performance scaling curves on math, coding, and STEM benchmarks, showing consistent improvement with increased budget (Figure 2). Ablation studies on Qwen3-32B (Table 9) show that Thinking Mode Fusion introduces initial mode switching ability and improves general/instruction-following tasks, while General RL further refines these capabilities and agent performance. Reasoning performance on complex tasks like AIME and LiveCodeBench in thinking mode slightly decreases after these later stages, suggesting a trade-off for broader versatility.
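One plausible way to realize the thinking budget at decoding time is sketched below: if the thinking block has not closed after a user-set number of tokens, an early-stop instruction and the closing tag are appended and the model answers from its partial reasoning. The stop-instruction wording, token names, and helper signature here are hypothetical, not the strings used by Qwen3.

```python
import torch

def generate_with_thinking_budget(model, tokenizer, prompt_ids,
                                  budget_tokens=2048, max_answer_tokens=512):
    """Illustrative thinking-budget decoding loop (not the official implementation).

    Decode up to `budget_tokens` inside the thinking block; if </think> has not
    been produced by then, append an early-stop instruction plus </think> and
    let the model produce the final answer from its partial reasoning.
    """
    think_ids = model.generate(prompt_ids, max_new_tokens=budget_tokens)
    thinking_text = tokenizer.decode(think_ids[0][prompt_ids.shape[-1]:])
    if "</think>" not in thinking_text:
        # Budget exhausted: close the thinking block with a stop instruction
        # (wording is a placeholder) and continue decoding the answer.
        stop = tokenizer("\nTime is up; answering with the reasoning so far.\n</think>\n",
                         return_tensors="pt").input_ids.to(think_ids.device)
        think_ids = torch.cat([think_ids, stop], dim=-1)
    return model.generate(think_ids, max_new_tokens=max_answer_tokens)
```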
Long-context evaluations on RULER (Table 15) show Qwen3 models perform well, exceeding Qwen2.5 models of similar size in non-thinking mode up to 128K context length. Thinking mode performance slightly degrades on these retrieval-heavy tasks. Multilingual capabilities are extensively evaluated across various tasks and languages (Tables 16-23, Belebele Table 24), showcasing strong performance across a wide linguistic spectrum.
In conclusion, Qwen3 represents a significant advancement in open-source LLMs through architectural improvements, massive and diverse pre-training, integrated thinking capabilities, and efficient post-training strategies like strong-to-weak distillation. Future work will focus on improving data quality/diversity, exploring advanced architectures and long-context techniques, and enhancing agent capabilities through RL.