
Qwen3-30B & Qwen3-235B MoE Models

Updated 9 November 2025
  • Qwen3 models are large-scale Mixture-of-Experts Transformers featuring unified thinking/non-thinking modes and 128 experts per layer for efficient inference.
  • They incorporate adapter modules and advanced routing to support ultra-long context windows (up to 128K tokens) while reducing computational costs.
  • Empirical benchmarks demonstrate state-of-the-art performance in multilingual, reasoning, and coding tasks with notable stability and latency controls.

Qwen3-30B and Qwen3-235B are large-scale, Mixture-of-Experts (MoE) Transformer LLMs that represent the flagship open-source offerings in the Qwen3 series. These models are architected for efficiency, multilingual breadth, and dual-mode operation—combining explicit chain-of-thought (“thinking”) reasoning with direct response (“non-thinking”) generation under a unified control mechanism. Both models expand the MoE paradigm to 128 experts per layer, support extended context windows (up to 128K tokens in some variants), and introduce mode-switching and thinking-budget controls to balance latency and reasoning quality. They set state-of-the-art benchmarks in multilingual, reasoning, code, and agentic inference tasks within the open-source LLM ecosystem, while enabling substantial inference cost reductions compared to dense-model counterparts (Yang et al., 14 May 2025, Du et al., 26 Jul 2025, Curtò et al., 30 Oct 2025, Oncescu et al., 4 Nov 2025).

1. Model Architectures and Mixture-of-Experts Design

Qwen3-30B-A3B and Qwen3-235B-A22B employ a decoder-only Transformer structure with all feed-forward sublayers in every Transformer block replaced by a MoE sublayer. Each MoE layer contains 128 experts; per token, the top-8 experts by routing score are selected. This high-degree sparsity enables large parameter counts with active parameter footprints much lower than dense models of similar effective size.

Architectural Details:

| Model | Layers | Attention Heads | Experts / Layer | Active Experts | MoE Dim per Expert | Total Params | Context Window |
|---|---|---|---|---|---|---|---|
| Qwen3-30B-A3B | 48 | 32Q / 4KV | 128 | 8 | 2048/768 SwiGLU | ≈ 30 B | 128 K |
| Qwen3-235B-A22B | 94/96 | 64Q / 4KV | 128 | 8 | 4096/1536 SwiGLU | ≈ 235 B | 128 K |

Key MoE Implementation Features:

  • Routing / Gating: A linear map $g = W x$ followed by softmax, then top-8 gating per token: for token representation $x$, select the $k = 8$ experts $S$ with the highest $p_i = \mathrm{softmax}(g)_i$ and compute $\mathrm{moe}(x) = \sum_{i \in S} \frac{p_i}{\sum_{j \in S} p_j}\, E_i(x)$, with each $E_i$ a two-layer SwiGLU expert (see the sketch after this list).
  • Global Load-Balancing: A global-batch regularizer $L_{\text{balance}} = \sum_i \mathrm{Load}_i^2$, where $\mathrm{Load}_i$ is the fraction of the batch routed to expert $i$, keeps expert utilization uniform (Yang et al., 14 May 2025).
  • Adapters: The “A3B” and “A22B” suffixes denote lightweight adapter modules (≈3B and ≈22B trainable parameters, respectively) inserted in each block. These adapters allow efficient context extension (≥128K tokens) and facilitate fine-tuning or reinforcement learning without modifying all core weights (Du et al., 26 Jul 2025).
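
To make the gating and balance terms concrete, here is a minimal PyTorch-style sketch of a blockwise MoE layer with top-8 softmax routing, gates renormalized over the selected experts, and the squared-load balance penalty. The expert dimensions, module layout, and token-by-slot dispatch are illustrative assumptions for clarity, not the released Qwen3 implementation.

```python
# Minimal sketch of top-8 softmax routing with renormalized gates and the
# squared-load balance penalty described above. Sizes default to the 30B row
# of the table; module layout is illustrative, not the released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    def __init__(self, d_model: int, d_expert: int):
        super().__init__()
        self.gate_proj = nn.Linear(d_model, d_expert, bias=False)
        self.up_proj = nn.Linear(d_model, d_expert, bias=False)
        self.down_proj = nn.Linear(d_expert, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.down_proj(F.silu(self.gate_proj(x)) * self.up_proj(x))

class MoELayer(nn.Module):
    def __init__(self, d_model: int = 2048, d_expert: int = 768,
                 n_experts: int = 128, top_k: int = 8):
        super().__init__()
        self.router = nn.Linear(d_model, n_experts, bias=False)   # g = W x
        self.experts = nn.ModuleList(
            [SwiGLUExpert(d_model, d_expert) for _ in range(n_experts)])
        self.n_experts, self.top_k = n_experts, top_k

    def forward(self, x: torch.Tensor):
        # x: (tokens, d_model) -> routing probabilities p_i = softmax(W x)_i
        probs = F.softmax(self.router(x), dim=-1)
        top_p, top_idx = probs.topk(self.top_k, dim=-1)            # top-8 per token
        gates = top_p / top_p.sum(dim=-1, keepdim=True)            # renormalize over S
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                             # clarity over speed
            idx = top_idx[:, slot]
            for e in idx.unique().tolist():
                sel = idx == e
                out[sel] += gates[sel, slot].unsqueeze(-1) * self.experts[e](x[sel])
        # Load_i = fraction of (token, slot) assignments routed to expert i.
        load = torch.bincount(top_idx.flatten(), minlength=self.n_experts).float()
        load = load / load.sum()
        balance_loss = (load ** 2).sum()                           # L_balance = sum_i Load_i^2
        return out, balance_loss
```

Calling MoELayer()(torch.randn(4, 2048)) returns the mixed hidden states plus the balance penalty, which would be added to the training loss with a small coefficient.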

This approach (Editors' term: “full-depth blockwise MoE”) provides strong parameter scaling with manageable hardware demand, as only 8/128 experts are used per token per layer, dramatically reducing active memory and FLOPs.

2. Unified Reasoning and Latency Modes

Qwen3 introduces an explicit mode control that enables unified “thinking” (multi-step reasoning, e.g., chain-of-thought) and “non-thinking” (direct answer) behavior within a single model.

  • Chat-Template Mode Flags: User messages are tagged with /think (reasoning) or /no_think (non-reasoning). Reasoning output is correspondingly wrapped in a <think> ... </think> block, with the final answer following (Yang et al., 14 May 2025).
  • Thinking Budgets: A user-supplied budget $B$ (the token count allowed inside the <think> block) caps the number of reasoning tokens; once $B$ is exceeded, the model is forced to emit the answer. Empirically, performance on reasoning tasks (math/coding/STEM) scales smoothly with $B$ (e.g., $B \in \{2\text{K}, 4\text{K}, 8\text{K}, 16\text{K}\}$), allowing direct control over latency–quality trade-offs (see the sketch after this list).
  • Supervised Fine-Tuning: During SFT (“Stage 3”), samples are mixed between thinking and non-thinking modes using these template flags, leading to robust in-context switching and hybrid operation.
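
To illustrate the control flow implied by a thinking budget, the sketch below counts generated tokens inside the <think> block and forcibly closes the block once the budget $B$ is reached. The generate_step callable, the literal tag tokens, and the stopping logic are hypothetical stand-ins, not the Qwen3 chat template or serving stack.

```python
# Illustrative decode loop enforcing a thinking budget B: tokens inside the
# <think> block are counted, and the block is force-closed once the budget is
# spent. `generate_step` is a hypothetical next-token callable, not a Qwen3 API.
def generate_with_thinking_budget(generate_step, prompt: str,
                                  budget: int = 4096, max_new_tokens: int = 32768) -> str:
    output, in_think, think_tokens = [], False, 0
    for _ in range(max_new_tokens):
        token = generate_step(prompt + "".join(output))
        if token == "<think>":
            in_think = True
        elif token == "</think>":
            in_think = False
        elif in_think:
            think_tokens += 1
            if think_tokens >= budget:
                # Budget exhausted: close the reasoning block so that subsequent
                # decoding conditions on </think> and emits the final answer.
                output.extend([token, "</think>"])
                in_think = False
                continue
        output.append(token)
        if token == "<eos>":
            break
    return "".join(output)
```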

This dual-mode operation eliminates the need for separate chat and reasoning models, and the thinking budget mechanism affords fine-grained runtime controllability of cost versus quality.

3. Evaluation Metrics, Benchmarks, and Reasoning Performance

The Qwen3-30B and Qwen3-235B models have been evaluated extensively under both controlled reasoning benchmarks and downstream task settings.

Summary of Core Results:

| Model | AIME’24/’25 Reasoning (think) | Coding (LiveCodeBench, think) | MMLU Redux (think) | Multilingual (MT-AIME2024, think, 55 langs) |
|---|---|---|---|---|
| Qwen3-235B-A22B | 85.7% / 81.5% | 70.7% | 92.7% | 80.8% |
| Qwen3-30B-A3B | n/a / n/a | 62.6% | 89.5% | n/a |
  • For final-answer (“step-by-step to answer”) correctness, Qwen3-235B-A22B leads its 30B counterpart by ~4–7% absolute across most reasoning and coding benchmarks (Yang et al., 14 May 2025).
  • In infrastructure-agnostic benchmarks (19/79-problem sets, across multiple domains), Qwen3-235B achieves average final-score similarity 0.529 (19-problem) and 0.487 (79-problem) versus Qwen3-30B’s 0.514 and 0.477 respectively. Notably, Qwen3-30B achieves higher step-accuracy (average semantic similarity of reasoning steps) at scale: 0.513 vs 0.488, reflecting more coherent reasoning chains in some settings (Curtò et al., 30 Oct 2025).
  • Both models exhibit exceptional consistency (run-to-run stddev 0.013–0.017), three times better than most peer models, making them suited for applications requiring stable behavior (Curtò et al., 30 Oct 2025).
  • Step-accuracy and final correctness are only weakly correlated ($r \approx 0.095$): accurate-looking intermediate reasoning does not guarantee a correct final answer, a relevant caveat for transparency and audit scenarios (the short computation below illustrates the measurement).
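
The weak coupling quoted above is an ordinary Pearson correlation over per-problem scores; the short computation below shows the measurement on made-up placeholder numbers rather than benchmark data.

```python
# Pearson correlation between per-problem step-accuracy and final-answer
# correctness; the arrays are made-up placeholders, not benchmark data.
import numpy as np

step_accuracy = np.array([0.62, 0.48, 0.71, 0.55, 0.66, 0.40])  # semantic similarity of reasoning steps
final_correct = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 1.0])        # final-answer score per problem

r = np.corrcoef(step_accuracy, final_correct)[0, 1]
print(f"Pearson r = {r:.3f}")  # values near 0 mean good-looking steps do not predict correct answers
```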

Parameter Efficiency Paradox: The marginal returns to increased parameter count diminish sharply above 70B; e.g., the 14B dense Phi-4-mini can outperform larger sparse MoE models on some reasoning tasks (Curtò et al., 30 Oct 2025).

4. Efficiency, Latency, and Expert Routing Innovations

The MoE design in Qwen3-30B/235B reduces computational load by activating only a small expert subset per token. Additional efficiency gains are realized via opportunistic, batch-aware expert routing (OEA) (Oncescu et al., 4 Nov 2025):

  • Memory-Bound Latency Model: For batch size $B = 16$ and $k = 8$ active experts per token, the expected number of distinct experts $T(B)$ fetched per MoE layer is

$$T(B) = N \left[ 1 - (1 - k/N)^B \right]$$

For $N = 128$, $k = 8$, $B = 16$: $T \approx 82$ (a numerical check appears after this list).

  • Roofline Latency: Expert weight fetches dominate: $\text{MoE-latency} = b \cdot T + a \cdot (Bk)$, where $b$ is the per-expert weight-load cost and $a$ the per-token compute cost.
  • Batch-Aware Routing (OEA): Inference-time “piggyback” routing assigns baseline experts per token ($k_0 \leq k$), then opportunistically fills the remaining slots with experts already selected elsewhere in the batch, reducing the number of distinct experts fetched per batch without retraining.
  • Empirical Gains (at $B = 16$):
    • Qwen3-30B: OEA reduces MoE-layer latency by 39% (from 175.7 μs to 106.8 μs at $k_0 = 3$).
    • Qwen3-235B: 15% reduction (from 119.4 μs to 101.4 μs at $k_0 = 5$).
    • No statistically significant loss on the main benchmarks (AIME, GPQA, LiveCodeBench, MATH500) up to moderate $k_0$ pruning; quality declines only with aggressive reduction of active experts per token.
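
The latency model above is easy to sanity-check numerically. The sketch below evaluates the closed-form $T(B)$, plugs it into the roofline expression, and runs a small Monte Carlo of a simplified piggyback policy in which each token keeps its top-$k_0$ experts and the remaining slots reuse experts already loaded for the batch. The uniform gate scores, cost constants, and fill rule are illustrative assumptions, not the OEA implementation.

```python
# Sanity-check of the expected-distinct-experts formula, the roofline latency
# model, and a simplified batch-aware "piggyback" policy. Gate scores, cost
# constants, and the fill rule are illustrative assumptions.
import numpy as np

N, K, B = 128, 8, 16          # experts per layer, active experts per token, batch size
rng = np.random.default_rng(0)

def expected_distinct(n: int, k: int, b: int) -> float:
    """Closed form T(B) = N * [1 - (1 - k/N)^B] under uniform independent routing."""
    return n * (1.0 - (1.0 - k / n) ** b)

def roofline_latency(t_distinct: float, b_weight: float = 1.0, a_compute: float = 0.05) -> float:
    """MoE latency = b*T + a*(B*k); the cost constants are arbitrary units."""
    return b_weight * t_distinct + a_compute * (B * K)

def distinct_experts_piggyback(k0: int, trials: int = 2000) -> float:
    """Monte Carlo: each token keeps its top-k0 experts; only the union of those
    baselines must be fetched, since remaining slots reuse already-loaded experts."""
    counts = []
    for _ in range(trials):
        scores = rng.random((B, N))
        top_k0 = np.argpartition(scores, -k0, axis=1)[:, -k0:]
        counts.append(len(np.unique(top_k0)))
    return float(np.mean(counts))

print(f"T(B) closed form : {expected_distinct(N, K, B):.1f}")   # ~82.4
print(f"baseline latency : {roofline_latency(expected_distinct(N, K, B)):.1f}")
for k0 in (3, 5):
    t = distinct_experts_piggyback(k0)
    print(f"OEA-style, k0={k0}: distinct experts ~{t:.1f}, latency {roofline_latency(t):.1f}")
```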

Additional impact stems from the sparse MoE structure, which yields roughly $3\times$–$4\times$ lower FLOPs and memory cost per token than dense models of equal size, and up to $\sim 2\times$ higher throughput on fixed hardware in production deployments (Yang et al., 14 May 2025, Oncescu et al., 4 Nov 2025).

5. Reinforcement Learning for Ultra-Long Reasoning

Qwen3-30B-A3B and Qwen3-235B-A22B support fine-tuning with reinforcement learning for ultra-long reasoning chains (UloRL) (Du et al., 26 Jul 2025):

  • Segmented RL Rollouts: Rather than generating 128K-token episodes in full, RL training divides outputs into $M$ segments and updates on completed segments in parallel, yielding a 2.06× speedup (e.g., 1240 s → 601 s per step at $M = 4$).
  • Dynamic Masking of Mastered Positive Tokens (DMMPTs): When the model reliably predicts positive tokens ($p_\theta(t \mid c) \geq \tau$, with $\tau = 0.99$) and their entropy falls below $\sigma = 0.2$, those tokens are dynamically masked to prevent entropy collapse during RL fine-tuning (a minimal sketch of this mask follows the list).
  • Results: UloRL with DMMPTs lifts Qwen3-30B-A3B from 70.9% to 82.8% on AIME-2025, and with context-window extension via YaRN to 140K tokens reaches 85.1%, surpassing even Qwen3-235B-A22B at 81.5%. For ultra-long tasks, adapters and segment-level RL drive more effective learning than scale alone.
  • Implications: Adapter-tuning yields outsized performance gains without retraining the full parameter core, and entropy stabilization prevents premature overfitting of positive-token spans. However, ultra-long-context operation (>128K tokens) demands substantial GPU memory, and training requires additional verifier models.
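
As a concrete reading of the DMMPT rule, the sketch below builds the token mask from per-token log-probabilities and entropies using the thresholds quoted above; the tensor shapes and the way the mask is wired into the policy-gradient loss are illustrative assumptions.

```python
# Sketch of the DMMPT masking rule: positive tokens that the policy already
# predicts with p >= tau and whose predictive entropy is below sigma are
# excluded from the RL loss. Shapes and loss wiring are illustrative.
import torch

def dmmpt_mask(token_logprobs: torch.Tensor,
               token_entropy: torch.Tensor,
               is_positive: torch.Tensor,
               tau: float = 0.99,
               sigma: float = 0.2) -> torch.Tensor:
    """Return a {0,1} mask with 0 at 'mastered' positive tokens.

    token_logprobs: (seq,) log p_theta(t | c) of the sampled tokens
    token_entropy:  (seq,) entropy of the next-token distribution at each step
    is_positive:    (seq,) bool, tokens drawn from a positively rewarded rollout
    """
    mastered = is_positive & (token_logprobs.exp() >= tau) & (token_entropy < sigma)
    return (~mastered).float()

# Usage: down-weight mastered positive tokens in a policy-gradient style loss.
# loss = -(mask * advantages * token_logprobs).sum() / mask.sum().clamp(min=1)
```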

6. Multilingual Generalization

Qwen3-235B-A22B is pre-trained on 36 trillion tokens spanning 119 languages and dialects (up from 29 in Qwen2.5), supporting broad zero-shot and cross-lingual capacity (Yang et al., 14 May 2025). Key outcomes include:

  • MMMLU (14 languages): 86.7% (base), 84.3% (post-fine-tune)
  • MGSM (8-language math CoT): 83.53%
  • MT-AIME2024 (55 langs): 80.8% (post-training, thinking mode)
  • INCLUDE (44-language regional): 73.46%
  • Polymath (18 langs): 54.7%
  • Benchmarks such as Flores-101 and Belebele (122 variants) confirm strong Indo-European and broader international performance.

This indicates high-quality multilingual pretraining and beneficial knowledge transfer from high-resource to low-resource languages.

7. Consistency, Transparency, and Practical Trade-offs

Both Qwen3-30B and Qwen3-235B, while setting the open-source state of the art in multi-domain “thinking” tasks, exhibit several nuanced characteristics:

  • Consistency: Outstanding stability (stddev 0.013–0.017) across runs and domains, superior to nearly all contemporaries (Curtò et al., 30 Oct 2025).
  • Transparency: The weak link between step-accuracy and final correctness ($r \approx 0.095$) makes these models less suitable for settings that demand highly interpretable chains of thought (e.g., education, audits), where models such as DeepSeek-R1 (step-accuracy 0.716) hold an edge.
  • Resource Footprint: Ultra-long output RL and large context operation require extensive GPU memory and runtime verifier models.
  • Scaling Law: Additional parameter count above 70B yields diminished returns; careful data curation and instruction tuning dominate at high scale.

Practical Suitability: These models suit production environments that demand high repeatability and latency-tunable reasoning, and are a weaker fit for scenarios prioritizing stepwise explainability or highly interpretable proof chains (Curtò et al., 30 Oct 2025, Du et al., 26 Jul 2025).


Collectively, Qwen3-30B-A3B and Qwen3-235B-A22B advance the MoE LLM paradigm by combining unified reasoning/direct modes, expert-activating sparsity, robust multilingual coverage, adapter-driven extensibility, and batch-optimized inference—all with consistent, reproducible outcomes across infrastructure and benchmarks.

