Qwen3-30B & Qwen3-235B MoE Models
- Qwen3 models are large-scale Mixture-of-Experts Transformers featuring unified thinking/non-thinking modes and 128 experts per layer for efficient inference.
- They combine sparse top-8 expert routing with long-context support (up to 128K tokens) to reduce computational costs relative to dense models of comparable capacity.
- Empirical benchmarks demonstrate state-of-the-art performance in multilingual, reasoning, and coding tasks with notable stability and latency controls.
Qwen3-30B and Qwen3-235B are large-scale, Mixture-of-Experts (MoE) Transformer LLMs that represent the flagship open-source offerings in the Qwen3 series. These models are architected for efficiency, multilingual breadth, and dual-mode operation—combining explicit chain-of-thought (“thinking”) reasoning with direct response (“non-thinking”) generation under a unified control mechanism. Both models expand the MoE paradigm to 128 experts per layer, support extended context windows (up to 128K tokens in some variants), and introduce mode-switching and thinking-budget controls to balance latency and reasoning quality. They set state-of-the-art benchmarks in multilingual, reasoning, code, and agentic inference tasks within the open-source LLM ecosystem, while enabling substantial inference cost reductions compared to dense-model counterparts (Yang et al., 14 May 2025, Du et al., 26 Jul 2025, Curtò et al., 30 Oct 2025, Oncescu et al., 4 Nov 2025).
1. Model Architectures and Mixture-of-Experts Design
Qwen3-30B-A3B and Qwen3-235B-A22B employ a decoder-only Transformer in which the dense feed-forward sublayer of every block is replaced by an MoE sublayer. Each MoE layer contains 128 experts; for each token, the top-8 experts by routing score are activated. This high degree of sparsity enables large total parameter counts with an active-parameter footprint far smaller than dense models of similar effective size.
Architectural Details:
| Model | Layers | Attention Heads (Q / KV) | Experts per Layer | Active Experts per Token | Hidden / Expert FFN Dim (SwiGLU) | Total Params | Context Window |
|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 48 | 32 / 4 | 128 | 8 | 2048 / 768 | ≈30 B | 128 K |
| Qwen3-235B-A22B | 94 | 64 / 4 | 128 | 8 | 4096 / 1536 | ≈235 B | 128 K |
Key MoE Implementation Features:
- Routing / Gating: A linear router followed by a softmax produces per-expert scores, and the top-8 experts per token are selected. For a token representation $x$, the gate computes $p = \mathrm{softmax}(W_r x)$, keeps the 8 experts with the highest $p_i$, renormalizes their weights, and outputs $y = \sum_{i \in \mathrm{Top8}} \tilde{p}_i\, E_i(x)$, where each $E_i$ is a two-layer SwiGLU expert.
- Global Load-Balancing: A global-batch auxiliary regularizer penalizes imbalance in $f_i$, the fraction of the batch routed to expert $i$, encouraging uniform expert utilization (Yang et al., 14 May 2025).
- Activated Parameters: The "A3B" and "A22B" suffixes denote the approximate number of parameters activated per token (≈3 B and ≈22 B, respectively) out of the ≈30 B and ≈235 B totals. Long-context operation (128 K tokens, extendable further via YaRN-style RoPE scaling) and post-training such as reinforcement learning build on this sparse backbone without altering the routing structure (Du et al., 26 Jul 2025).
This approach (Editors' term: “full-depth blockwise MoE”) provides strong parameter scaling with manageable hardware demand, as only 8/128 experts are used per token per layer, dramatically reducing active memory and FLOPs.
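The routing and load-balancing scheme above can be made concrete with a short sketch. The snippet below is illustrative only: the tensor shapes, the Switch-Transformer-style auxiliary loss, and the renormalization of the top-8 gate weights are assumptions about the general recipe, not the exact Qwen3 implementation.

```python
import torch
import torch.nn.functional as F

def top_k_route(hidden, gate_weight, k=8):
    """Score every expert with a linear gate, keep the top-k per token.

    hidden:      (num_tokens, d_model) token representations
    gate_weight: (num_experts, d_model) router projection
    Returns renormalized top-k gate weights, expert indices, and full probs.
    """
    logits = hidden @ gate_weight.t()                      # (tokens, experts)
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = torch.topk(probs, k, dim=-1)
    topk_probs = topk_probs / topk_probs.sum(dim=-1, keepdim=True)
    return topk_probs, topk_idx, probs

def load_balance_loss(probs, topk_idx, num_experts=128):
    """Switch-style global-batch auxiliary loss (assumed form):
    f_i = fraction of (token, slot) assignments routed to expert i,
    p_i = mean gate probability of expert i over the batch,
    loss = num_experts * sum_i f_i * p_i, minimized when both are uniform.
    """
    f = torch.bincount(topk_idx.reshape(-1), minlength=num_experts).float()
    f = f / topk_idx.numel()
    p = probs.mean(dim=0)
    return num_experts * torch.sum(f * p)

# Example: route a small batch of token vectors through a 128-expert gate.
if __name__ == "__main__":
    tokens = torch.randn(16, 2048)            # 16 tokens, d_model = 2048
    gate_w = torch.randn(128, 2048) * 0.02
    gates, experts, probs = top_k_route(tokens, gate_w, k=8)
    aux = load_balance_loss(probs, experts)
    print(experts.shape, float(aux))
```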
2. Unified Reasoning and Latency Modes
Qwen3 introduces an explicit mode control that enables unified “thinking” (multi-step reasoning, e.g., chain-of-thought) and “non-thinking” (direct answer) behavior within a single model.
- Chat-Template Mode Flags: User messages are tagged with `/think` (reasoning) or `/no_think` (non-reasoning). In thinking mode, model outputs wrap the reasoning in a `<think> ... </think>` block, with the final answer following it (Yang et al., 14 May 2025).
- Thinking Budgets: A user-supplied budget caps the number of tokens emitted inside the `<think>` block. Once the budget is exceeded, the model forcibly closes the block and emits the answer. Empirically, performance on reasoning tasks (math/coding/STEM) scales smoothly with the budget, allowing direct control over latency–quality trade-offs.
- Supervised Fine-Tuning: During SFT ("Stage 3"), samples are mixed between thinking and non-thinking modes using these template flags, leading to robust in-context switching and hybrid operation.
This dual-mode operation eliminates the need for separate chat and reasoning models, and the thinking budget mechanism affords fine-grained runtime controllability of cost versus quality.
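As a rough illustration of the thinking-budget mechanism, the sketch below enforces a token budget on the reasoning block during decoding. It assumes a hypothetical `next_token` sampler operating on string tokens and an assumed `<|im_end|>` end-of-turn marker; the actual Qwen3 chat template and stopping logic differ in detail.

```python
# Minimal sketch of thinking-budget enforcement at decode time. `next_token`
# is a hypothetical callable returning the next (string) token; real systems
# operate on token ids and use the model's chat template.
THINK_OPEN, THINK_CLOSE = "<think>", "</think>"

def generate_with_budget(prompt_tokens, next_token, budget=512, max_new=4096):
    """Count tokens emitted inside the <think> block; once the budget is
    spent, force-close the block so the model moves on to its answer."""
    out, in_think, think_used = [], False, 0
    for _ in range(max_new):
        tok = next_token(prompt_tokens + out)     # hypothetical sampler call
        if tok == THINK_OPEN:
            in_think = True
        elif tok == THINK_CLOSE:
            in_think = False
        elif in_think:
            think_used += 1
            if think_used >= budget:
                out.extend([tok, THINK_CLOSE])    # forcibly end reasoning
                in_think = False
                continue
        out.append(tok)
        if tok == "<|im_end|>":                   # assumed end-of-turn marker
            break
    return out
```

A production implementation would operate on token ids and integrate with the model's stop-token handling rather than relying on string matching.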
3. Evaluation Metrics, Benchmarks, and Reasoning Performance
The Qwen3-30B and Qwen3-235B models have been evaluated extensively under both controlled reasoning benchmarks and downstream task settings.
Summary of Core Results:
| Model | AIME’24/’25 Reasoning (think) | Coding (LiveCodeBench, think) | MMLU Redux (think) | Multilingual (MT-AIME2024, think, 55 langs) |
|---|---|---|---|---|
| Qwen3-235B-A22B | 85.7% / 81.5% | 70.7% | 92.7% | 80.8% |
| Qwen3-30B-A3B | n/a / n/a | 62.6% | 89.5% | n/a |
- For final-answer (“step-by-step to answer”) correctness, Qwen3-235B-A22B leads its 30B counterpart by ~4–7% absolute across most reasoning and coding benchmarks (Yang et al., 14 May 2025).
- In infrastructure-agnostic benchmarks (19/79-problem sets, across multiple domains), Qwen3-235B achieves average final-score similarity 0.529 (19-problem) and 0.487 (79-problem) versus Qwen3-30B’s 0.514 and 0.477 respectively. Notably, Qwen3-30B achieves higher step-accuracy (average semantic similarity of reasoning steps) at scale: 0.513 vs 0.488, reflecting more coherent reasoning chains in some settings (Curtò et al., 30 Oct 2025).
- Both models exhibit exceptional consistency (run-to-run stddev 0.013–0.017), three times better than most peer models, making them suited for applications requiring stable behavior (Curtò et al., 30 Oct 2025).
- Step-accuracy and final correctness are only weakly correlated, indicating that high-quality intermediate reasoning steps do not guarantee a correct final answer, a property relevant for transparency and audit scenarios.
Parameter Efficiency Paradox: The marginal returns to increased parameter count diminish sharply above 70B; e.g., the much smaller dense Phi-4-mini can outperform larger sparse MoE models on some reasoning tasks (Curtò et al., 30 Oct 2025).
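For concreteness, step-accuracy metrics of this kind are typically computed by embedding each reasoning step and scoring its similarity against reference steps; the sketch below shows one plausible formulation and the correlation computation, not the exact procedure used in the cited benchmark.

```python
# Illustrative-only sketch of a step-similarity metric and its correlation
# with final-answer correctness; the benchmark's actual definition may differ.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def step_accuracy(pred_step_embs, ref_step_embs):
    """Mean over predicted steps of the best cosine match to any reference step."""
    scores = [max(cosine(p, r) for r in ref_step_embs) for p in pred_step_embs]
    return float(np.mean(scores)) if scores else 0.0

def step_final_correlation(step_scores, final_correct):
    """Pearson correlation between per-problem step scores and 0/1 correctness."""
    return float(np.corrcoef(step_scores, final_correct)[0, 1])
```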
4. Efficiency, Latency, and Expert Routing Innovations
The MoE design in Qwen3-30B/235B reduces computational load by activating only a small expert subset per token. Additional efficiency gains are realized via opportunistic, batch-aware expert routing (OEA) (Oncescu et al., 4 Nov 2025):
- Memory-Bound Latency Model: For a decode batch of $B$ tokens, each activating $k$ of $N$ experts per layer, the expected number of distinct experts that must be fetched is $\mathbb{E}[D] = N\left(1 - (1 - k/N)^{B}\right)$ under uniform routing, which approaches $N$ quickly as $B$ grows (a numeric sketch follows this list).
- Roofline Latency: In the memory-bound regime, per-layer latency is dominated by expert weight fetches, approximately $T \approx D\, t_{\text{load}} + B k\, t_{\text{compute}}$, with $t_{\text{load}}$ the per-expert weight-load time and $t_{\text{compute}}$ the per-token, per-expert compute time.
- Batch-Aware Routing (OEA): At inference time, a "piggyback" routing pass assigns a reduced baseline number of experts per token, then opportunistically fills the remaining slots from experts already selected elsewhere in the batch, shrinking the set of distinct experts fetched per batch without retraining (a toy illustration closes this section).
- Empirical Gains:
  - Qwen3-30B: OEA reduces MoE-layer latency by 39% (from 175.7 μs to 106.8 μs).
  - Qwen3-235B: 15% reduction (from 119.4 μs to 101.4 μs).
- No statistically significant loss in main benchmarks (AIME, GPQA, LiveCodeBench, MATH500) up to moderate pruning. Quality declines only with aggressive reduction of active experts per token.
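A small numeric sketch of the memory-bound argument above, assuming uniform random routing over $N = 128$ experts with $k = 8$ active per token; the timing constants are placeholders rather than measured values.

```python
# Numeric sketch of the expected distinct-expert count and a roofline-style
# latency estimate under uniform routing. t_load / t_compute are placeholders.
def expected_distinct_experts(N: int, k: int, B: int) -> float:
    """E[D] = N * (1 - (1 - k/N)**B) under independent uniform routing."""
    return N * (1.0 - (1.0 - k / N) ** B)

def moe_layer_latency(N, k, B, t_load=1.0, t_compute=0.01):
    """Distinct-expert weight fetches dominate, plus a smaller per-(token, expert)
    compute term."""
    D = expected_distinct_experts(N, k, B)
    return D * t_load + B * k * t_compute

if __name__ == "__main__":
    for B in (1, 8, 32, 128):
        D = expected_distinct_experts(128, 8, B)
        print(f"B={B:4d}  expected distinct experts ~ {D:6.1f} / 128")
```

At moderate batch sizes the expected distinct-expert count already approaches the full 128-expert set, which is why reducing distinct fetches (rather than per-token compute) drives the latency gains reported above.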
More broadly, the sparse MoE structure yields substantially lower FLOPs and memory costs per token than dense models of equal total size, and correspondingly higher throughput on fixed hardware in production deployments (Yang et al., 14 May 2025, Oncescu et al., 4 Nov 2025).
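The batch-aware "piggyback" routing idea can also be illustrated with a toy re-ranking of the gate probabilities: keep each token's top few experts, then bias its remaining slots toward experts already active somewhere in the batch. This is only an illustration of the principle, not the OEA algorithm from the cited paper.

```python
import torch

def batch_aware_topk(probs, k_base=4, k_total=8, penalty=1e4):
    """Toy batch-aware expert selection: each token keeps its k_base
    highest-probability experts; the remaining k_total - k_base slots are
    biased toward experts already chosen by some token in the batch, which
    shrinks the distinct-expert set that must be fetched from memory.

    probs: (batch, num_experts) softmax routing probabilities.
    Returns (batch, k_total) expert indices (gate weights would need to be
    recomputed for the final mixture).
    """
    base_idx = torch.topk(probs, k_base, dim=-1).indices          # (B, k_base)
    active = torch.zeros(probs.shape[-1], dtype=torch.bool)
    active[base_idx.reshape(-1)] = True                           # batch-wide active set
    biased = probs - penalty * (~active).float()                  # discourage inactive experts
    biased = biased.scatter(1, base_idx, -float("inf"))           # never re-pick base experts
    extra_idx = torch.topk(biased, k_total - k_base, dim=-1).indices
    return torch.cat([base_idx, extra_idx], dim=-1)

if __name__ == "__main__":
    torch.manual_seed(0)
    probs = torch.softmax(torch.randn(32, 128), dim=-1)           # 32 tokens, 128 experts
    naive = torch.topk(probs, 8, dim=-1).indices
    piggyback = batch_aware_topk(probs)
    print("distinct experts, plain top-8:", naive.unique().numel())
    print("distinct experts, batch-aware:", piggyback.unique().numel())
```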
5. Reinforcement Learning for Ultra-Long Reasoning
Qwen3-30B-A3B and Qwen3-235B-A22B support fine-tuning with reinforcement learning for ultra-long reasoning chains (UloRL) (Du et al., 26 Jul 2025):
- Segmented RL Rollouts: Rather than generating 128K-token episodes in full, RL training divides outputs into segments and updates on completed segments in parallel, yielding a 2.06× speedup (e.g., 1240 s → 601 s per training step).
- Dynamic Masking of Mastered Positive Tokens (DMMPTs): When the model already predicts positive tokens with high probability and their entropy falls below a threshold, these tokens are dynamically masked out of the RL objective to prevent entropy collapse during fine-tuning (see the sketch after this list).
- Results: UloRL with DMMPTs elevates Qwen3-30B-A3B from 70.9% to 82.8% on AIME-2025, and with context-window extension (YaRN) to 140K reaches 85.1%, surpassing even Qwen3-235B-A22B at 81.5%. For ultra-long tasks, segmented RL and token-level masking drive more effective learning than scale alone.
- Implications: Targeted RL post-training yields outsized performance improvements without pretraining-scale compute, and entropy stabilization prevents premature overfitting on positive token spans. However, ultra-long generation (>128K context) imposes large memory requirements, and verifier models are needed during training.
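A minimal sketch of the DMMPT idea referenced above: positive-advantage tokens that the policy already predicts with high probability and low entropy are dropped from the RL loss. The probability and entropy thresholds here are placeholders, since the source does not specify them.

```python
import torch

def dmmpt_mask(logits, target_ids, advantages, p_thresh=0.99, h_thresh=0.1):
    """Sketch of Dynamic Masking of Mastered Positive Tokens: drop the
    policy-gradient contribution of positive-advantage tokens the model
    already predicts with high probability and low entropy, to slow
    entropy collapse. Thresholds are illustrative, not the paper's values.

    logits:     (T, V) per-position logits
    target_ids: (T,)   sampled token ids
    advantages: (T,)   per-token advantage estimates (>0 means "positive")
    Returns a bool mask: True = keep the token's contribution to the RL loss.
    """
    probs = torch.softmax(logits, dim=-1)
    p_tok = probs.gather(1, target_ids.unsqueeze(1)).squeeze(1)   # prob of sampled token
    entropy = -(probs * torch.log(probs + 1e-9)).sum(dim=-1)      # per-position entropy
    mastered = (advantages > 0) & (p_tok > p_thresh) & (entropy < h_thresh)
    return ~mastered
```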
6. Multilingual Generalization
Qwen3-235B-A22B is pre-trained on 36 trillion tokens in 119 languages/dialects (Qwen2.5: 29), supporting broad zero-shot and cross-lingual capacity (Yang et al., 14 May 2025). Key outcomes include:
- MMMLU (14 languages): 86.7% (base), 84.3% (post-fine-tune)
- MGSM (8-language math CoT): 83.53%
- MT-AIME2024 (55 langs): 80.8% (post-training, thinking mode)
- INCLUDE (44-language regional): 73.46%
- Polymath (18 langs): 54.7%
- Benchmarks such as Flores-101 and Belebele (122 language variants) confirm strong performance across Indo-European and many lower-resource languages.
This indicates high-quality multilingual pretraining and beneficial knowledge transfer from high-resource to low-resource languages.
7. Limitations and Recommended Use Cases
Both Qwen3-30B and Qwen3-235B, while setting open-source state-of-the-art in multi-domain “thinking” tasks, exhibit several nuanced characteristics:
- Consistency: Outstanding stability (stddev 0.013–0.017) across runs and domains, superior to nearly all contemporaries (Curtò et al., 30 Oct 2025).
- Transparency: The weak link between step-accuracy and final correctness makes these models less suited for settings demanding highly interpretable chains of thought (e.g., education, audits), where models like DeepSeek-R1 (step-accuracy 0.716) hold an edge.
- Resource Footprint: Ultra-long output RL and large context operation require extensive GPU memory and runtime verifier models.
- Scaling Law: Additional parameter count above 70B yields diminished returns; careful data curation and instruction tuning dominate at high scale.
Practical Suitability: These models suit production environments demanding high repeatability and latency-tunable reasoning, less so scenarios prioritizing stepwise explainability or ultra-interpretable proof chains (Curtò et al., 30 Oct 2025, Du et al., 26 Jul 2025).
Collectively, Qwen3-30B-A3B and Qwen3-235B-A22B advance the MoE LLM paradigm by combining unified reasoning/direct modes, expert-activation sparsity, robust multilingual coverage, efficient RL-based extensibility for ultra-long reasoning, and batch-optimized inference, all with consistent, reproducible outcomes across infrastructure and benchmarks.