
Qwen3-Max: Advanced Open-Source LLM

Updated 19 December 2025
  • Qwen3-Max is a large language model featuring 235B parameters with a 128-expert MoE transformer architecture for efficient and scalable reasoning.
  • It integrates both 'thinking' and 'non-thinking' modes, allowing users to toggle between detailed chain-of-thought and fast, concise responses.
  • The model employs an adaptive thinking budget to balance accuracy and latency, achieving state-of-the-art results in math, code generation, and multilingual tasks.

Qwen3-Max (Qwen3-235B-A22B) is the flagship model in the Qwen3 LLM series, representing the state-of-the-art in scalable reasoning, efficiency, and multilingual competence in open-source LLMs. It employs a 128-expert Mixture-of-Experts (MoE) transformer architecture with 235 billion parameters and introduces prompt-controllable "thinking" and "non-thinking" modes, dynamic reasoning token budgeting, and robust multilingual support covering 119 languages and dialects. Qwen3-Max attains leading results across competitive benchmarks in mathematics, code generation, agent tasks, and global language understanding while optimizing inference cost through selective expert activation and mode switching (Yang et al., 14 May 2025).

1. Architecture and Model Design

Qwen3-Max leverages an MoE transformer topology, featuring the following core characteristics:

  • Total parameters: 235B, deployed over 94 transformer layers.
  • MoE configuration: 128 experts; 8 activated per token (22B parameters actively routed for a given token; precise dense/expert weight breakdown remains undisclosed).
  • Attention modules: Grouped Query Attention (GQA) with 64 query and 4 key-value heads per layer.
  • Feed-forward modules: SwiGLU activation, RMSNorm pre-normalization, and QK-Norm in attention blocks.
  • Context window: Up to 128,000 tokens, supporting ABF-scaled rotary positional embeddings (RoPE) and long-context enhancements such as YaRN and Dual Chunk Attention.

This design enables both parameter and compute efficiency, with inference cost determined by the number of activated experts, yielding superior throughput relative to similarly scaled dense or MoE counterparts (e.g., DeepSeek-V3, Qwen2.5-Plus) (Yang et al., 14 May 2025).
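
To make the routing arithmetic concrete, below is a minimal, self-contained sketch of top-k MoE routing in the style described above. The layer sizes, module names, and routing details are illustrative assumptions for a scaled-down PyTorch example, not the released Qwen3 implementation; the point is that per-token compute grows with the number of activated experts (top_k), not with the total expert count.

```python
# Scaled-down sketch of top-k Mixture-of-Experts routing (illustrative only;
# not Qwen3's actual code). All sizes here are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """One expert: a SwiGLU feed-forward block, as described in the section."""
    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.gate = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.up = nn.Linear(hidden_dim, ffn_dim, bias=False)
        self.down = nn.Linear(ffn_dim, hidden_dim, bias=False)

    def forward(self, x):
        return self.down(F.silu(self.gate(x)) * self.up(x))

class TopKMoE(nn.Module):
    """Route each token to its top-k experts and mix their outputs."""
    def __init__(self, hidden_dim: int, ffn_dim: int, n_experts: int, top_k: int):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(hidden_dim, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [SwiGLUExpert(hidden_dim, ffn_dim) for _ in range(n_experts)]
        )

    def forward(self, x):                        # x: [n_tokens, hidden_dim]
        logits = self.router(x)                  # [n_tokens, n_experts]
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # normalize over the selected experts
        out = torch.zeros_like(x)
        # Only the k selected experts run for each token, so per-token FLOPs
        # scale with top_k (8 in Qwen3-Max), not with the expert count (128).
        for slot in range(self.top_k):
            for e in idx[:, slot].unique():
                mask = idx[:, slot] == e
                out[mask] += weights[mask, slot].unsqueeze(-1) * self.experts[int(e)](x[mask])
        return out

if __name__ == "__main__":
    layer = TopKMoE(hidden_dim=256, ffn_dim=512, n_experts=16, top_k=2)  # toy sizes
    print(layer(torch.randn(10, 256)).shape)     # torch.Size([10, 256])
```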

2. Unified Reasoning Modes and Control Mechanisms

A central innovation of Qwen3-Max is its integration of both "thinking" (multi-step, detailed chain-of-thought) and "non-thinking" (fast, concise answer) reasoning into a single model and interface:

  • Prompt flags: /think instructs the model to generate a <think>…</think> block with explicit reasoning before the answer; /no_think elicits only the final output (with an empty <think> block).
  • Mode routing: Determined purely by the most recent prompt flag; no learned gating network is required.
  • Query template example:

<|im_start|>user
{query} /think
<|im_end|>
<|im_start|>assistant
<think>…reasoning…</think>
answer
<|im_end|>

This architecture enables dynamic switching between fast-inference and reasoned-inference modalities within the same session, directly through user prompts (Yang et al., 14 May 2025).
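
As a concrete illustration of this prompt-level control, here is a minimal sketch that renders the ChatML-style template above for each mode. The helper name is hypothetical; in practice a Qwen3 serving stack applies this template automatically via its chat template rather than manual string formatting.

```python
# Hypothetical helper that renders the template shown above; it only builds the
# prompt string, and generation itself would be handled by a Qwen3 serving stack.

def build_prompt(query: str, thinking: bool = True) -> str:
    flag = "/think" if thinking else "/no_think"
    return (
        "<|im_start|>user\n"
        f"{query} {flag}\n"
        "<|im_end|>\n"
        "<|im_start|>assistant\n"
    )

# Detailed chain-of-thought: the model is expected to open a <think>...</think>
# block before the final answer.
print(build_prompt("Prove that the sum of two even integers is even.", thinking=True))

# Fast mode: the model emits an empty <think></think> block and answers directly.
print(build_prompt("What is the capital of France?", thinking=False))
```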

3. Thinking Budget and Adaptive Inference

Qwen3-Max introduces the thinking budget mechanism (Editor's term) to regulate the depth and cost of reasoning:

  • Budget allocation: Users set a token budget B that caps the length of the reasoning (<think>) block for a given prompt.
  • Operational logic: Once the reasoning block exceeds B tokens, inference injects a termination phrase and the model proceeds to the answer.
  • Accuracy-latency tradeoff: Empirically, a larger B yields higher accuracy; on AIME’24, for example, accuracy rises from ~70% at B = 4,000 to ~86% at B = 16,000 (Yang et al., 14 May 2025).

The mechanism provides fine-grained control over inference resources versus output quality, particularly advantageous for high-latency or real-time systems.
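
A minimal sketch of how such a budget cutoff could be enforced in a decode loop is shown below, assuming a generic token-by-token generation callback; the function names, termination phrase, and stop token are illustrative assumptions, not the mechanism's actual implementation.

```python
# Illustrative thinking-budget cutoff in a generic decode loop (not Qwen3's code).
from typing import Callable, List

THINK_END = "</think>"
# Hypothetical phrase injected when the reasoning budget B is exhausted.
EARLY_STOP = "\nConsidering the limited budget, I will now give the final answer.\n</think>\n"

def generate_with_budget(
    prompt_tokens: List[str],
    next_token: Callable[[List[str]], str],   # one decoding step of any backend
    budget: int = 4000,                       # B: max tokens inside <think>…</think>
    max_answer_tokens: int = 1024,
) -> List[str]:
    tokens = list(prompt_tokens)
    # Phase 1: reasoning, capped at `budget` tokens.
    for _ in range(budget):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == THINK_END:                  # model closed its reasoning on its own
            break
    else:
        tokens.append(EARLY_STOP)             # budget exhausted: force the transition
    # Phase 2: final answer, bounded separately.
    for _ in range(max_answer_tokens):
        tok = next_token(tokens)
        if tok == "<|im_end|>":
            break
        tokens.append(tok)
    return tokens

if __name__ == "__main__":
    # Dummy backend for demonstration: it "thinks" forever unless cut off.
    print(generate_with_budget(["<think>"], lambda toks: "step", budget=5, max_answer_tokens=3))
```

Raising the budget trades latency for accuracy, matching the AIME’24 trend quoted above.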

4. Pre-training and Fine-tuning Procedures

Qwen3-Max undergoes multi-stage training on a 36T-token multilingual corpus, including:

  • Stage 1 (General, S1): ~30T tokens of broad-domain pre-training at sequence length 4,096.
  • Stage 2 (Reasoning, S2): 5T tokens with STEM/coding emphasis and accelerated learning-rate decay.
  • Stage 3 (Long-Context, S3): Hundreds of billions of tokens at long context lengths (16K–32K).
  • Data sources: Text extracted using Qwen2.5-VL, quality-screened by Qwen2.5, and augmented with synthetic outputs from Qwen2.5-Math and Qwen2.5-Coder.
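
For concreteness, the staged schedule above can be written down as a small configuration sketch; the values are copied from the bullets, while the field names, the dataclass, and the S2 sequence length are assumptions.

```python
# Configuration sketch of the three-stage pre-training schedule described above
# (field names are hypothetical; not Qwen's actual training configuration format).
from dataclasses import dataclass

@dataclass
class PretrainStage:
    name: str
    token_budget: str     # approximate token count, as reported
    seq_len: str          # training sequence length
    emphasis: str

SCHEDULE = [
    PretrainStage("S1 (General)",      "~30T",                 "4,096",           "broad-domain pre-training"),
    PretrainStage("S2 (Reasoning)",    "5T",                   "4,096 (assumed)", "STEM/coding, accelerated LR decay"),
    PretrainStage("S3 (Long-Context)", "hundreds of billions", "16K-32K",         "long-context extension"),
]

for stage in SCHEDULE:
    print(f"{stage.name}: {stage.token_budget} tokens at seq_len {stage.seq_len} ({stage.emphasis})")
```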

Post-training ablations show that stages such as Long-CoT SFT, Reasoning RL, Thinking Fusion, and General RL yield cumulative benchmark improvements. On-policy distillation is found to be more sample- and compute-efficient than RL on AIME’24 (74.4% vs. 67.6% accuracy, at 1,800 vs. 17,920 GPU-hours) (Yang et al., 14 May 2025).

5. Performance on Mathematical Extremal Problems and Reasoning Benchmarks

Qwen3-Max demonstrates advanced mathematical and extremal problem-solving abilities, most notably highlighted in the ExtremBench benchmark (Gao et al., 14 Oct 2025):

  • ExtremBench (93 mathematical optimization problems; maximization/minimization under constraints):
    • Qwen3-235B-A22B-Thinking-2507 attains ~80% average accuracy, outperforming both smaller Qwen3 variants (4B-30B-Thinking: 75–80%) and other large open-source models (GPT-OSS-20B/120B: ~70%; DeepSeek-R1: 50–60%).
    • Chain-of-thought (“Thinking”) variants score +10–15 percentage points above their non-Thinking counterparts.
    • Some error cases persist, especially with non-smooth objectives or boundary analysis (e.g., max{p, q, r} or variables tending toward equality/zero).
  • Other benchmarks (selected):

| Benchmark | Qwen3-235B (Think) | DeepSeek-V3 | Llama-4-Mav |
| -------------- | ------------------ | ----------- | ----------- |
| MMLU (5-shot) | 87.81 | 87.19 | 85.16 |
| GSM8K | 94.39 | 87.57 | 87.72 |
| EvalPlus | 77.60 | 63.75 | 68.38 |
| AIME’24 | 85.7 | 79.8 | – |

Qwen3-Max establishes new state-of-the-art accuracy for open-source MoE models while reducing activated parameter count and compute requirements (Yang et al., 14 May 2025).

6. Multilingual and Agentic Capabilities

  • Multilingual pre-training: 119 languages and dialects, significantly expanding over prior models (e.g., Qwen2.5’s 29).
  • Benchmark coverage: Tasks spanning Multi-IF, INCLUDE, MMMLU, MT-AIME2024, PolyMath, and MLogiQA.
  • Example performance (Spanish, Qwen3-235B-Thinking): Multi-IF 74.2, INCLUDE 89.1, MMMLU 86.7, MT-AIME2024 86.7 (all percentages).
  • Cross-family comprehension (Qwen3-32B; Belebele 80-language): Scores from 84.8% (Afro-Asiatic) to 91.3% (Uralic) (Yang et al., 14 May 2025).

These results demonstrate robust polyglot reasoning and generalization, relevant for global deployments and multilingual agent applications.

7. Comparative Analysis, Limitations, and Directions

  • Comparison to Qwen2.5 and DeepSeek-V3: Qwen3-Max surpasses Qwen2.5 by +1.7 points on MMLU, +7.7 on MATH, and +11.7 on EvalPlus (code), and achieves SOTA performance with substantially fewer activated parameters than DeepSeek-V3 (22B vs. 37B).
  • Extremal problem competency: "Thinking" mode, not raw scale, is the key driver for ExtremBench gains; however, even the strongest models plateau at ~80%, indicating a persistent gap in optimization-specific reasoning (Gao et al., 14 Oct 2025).
  • Design implications: Extremal problem-solving appears separable from general mathematical skill; AIME’25 scores above 90% for some models do not imply similar proficiency in constrained optimization.
  • Future work: Authors propose explicit curriculum learning for optimization, RL fine-tuning on extremal instances, and specialized handling of boundary/corner cases to overcome current performance plateaus (Gao et al., 14 Oct 2025).

References

  • Yang et al., 14 May 2025 (Qwen3 technical report).
  • Gao et al., 14 Oct 2025 (ExtremBench).