Qwen3-Max: Flagship MoE Transformer
- Qwen3-Max is a large-scale Mixture-of-Experts transformer model with 235 billion total parameters, 22 billion of which are activated per token, designed for high-capacity reasoning and rapid responses.
- It employs a 94-layer sparse MoE architecture featuring grouped query attention and dual-mode operation to balance chain-of-thought reasoning with instant answer modes.
- The model supports robust multilingual and multimodal tasks across fields like medicine, code intelligence, and agentic reasoning while significantly improving memory and FLOPs efficiency.
Qwen3-Max (Qwen3-235B-A22B) is a large-scale, open-source Mixture-of-Experts (MoE) transformer LLM developed as the flagship model of the Qwen3 family. With approximately 235 billion parameters and a per-token activation of 22 billion parameters, Qwen3-Max is engineered to unify high-capacity reasoning (“thinking mode”), rapid context-driven responses (“non-thinking mode”), ultra-long context support, and strong multilingual and multimodal capabilities. This model is a focal point for research on scalable LLM design, efficient MoE architectures, robust alignment, and high-stakes applications in domains such as medicine, code intelligence, and agentic reasoning.
1. Model Architecture and Design
Qwen3-235B-A22B (“Qwen3-Max”) is constructed as a 94-layer transformer leveraging a sparse MoE variant: every second feed-forward block is replaced with an MoE block containing 128 expert MLPs, each with a standard SwiGLU structure. At inference, 8 experts are activated per token, yielding 22 billion active parameters per forward pass (Yang et al., 14 May 2025, Bai et al., 26 Nov 2025). Grouped query attention is applied (64 query heads, 4 key-value heads). The maximum supported context length is 128,000 tokens for the base model and up to 256,000 tokens within the Qwen3-VL vision-language extension.
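The per-token expert selection can be sketched as follows. This is a generic top-k routing sketch, not Qwen3's actual implementation; the expert MLPs here are placeholder linear maps and dimensions are toy-sized:

```python
import numpy as np

def topk_moe_forward(x, router_w, experts, k=8):
    """Route one token through a sparse MoE block (illustrative only).

    x:        (d,) token hidden state
    router_w: (num_experts, d) router projection
    experts:  list of callables, one per expert MLP
    k:        number of experts activated per token (8 for Qwen3-Max)
    """
    logits = router_w @ x                       # (num_experts,) router scores
    top = np.argsort(logits)[-k:]               # indices of the k best experts
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                          # softmax over the selected k
    # Only k of num_experts expert MLPs execute -> 22B of 235B params active.
    return sum(g * experts[i](x) for g, i in zip(gate, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 128
router = rng.normal(size=(n_experts, d))
# Placeholder "experts": each is just a random linear map here.
experts = [lambda x, W=rng.normal(size=(d, d)) / d: W @ x for _ in range(n_experts)]
y = topk_moe_forward(rng.normal(size=d), router, experts, k=8)
```

Because only the selected experts run, per-token compute scales with k rather than with the full expert count, which is the source of the activated-parameter figure quoted above.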
Key design choices include the integration of “thinking mode” (for chain-of-thought and multi-step reasoning) and “non-thinking mode” (for rapid, direct answers) in a unified checkpoint. Mode selection is controlled via query flags or internal gating. The “thinking budget” mechanism allows adjustable allocation of tokens for in-context reasoning, tuning speed-accuracy tradeoffs.
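A minimal sketch of mode selection via in-query flags, assuming the `/think` and `/no_think` markers mentioned in the post-training description; the exact serving-side syntax (chat template, tokenizer options) may differ in real deployments:

```python
def build_query(user_msg: str, thinking: bool = True) -> str:
    """Append a mode flag to a user query (illustrative; the /think and
    /no_think markers come from Qwen3's mode-fusion SFT data)."""
    return f"{user_msg} {'/think' if thinking else '/no_think'}"

q_slow = build_query("Prove that sqrt(2) is irrational.", thinking=True)
q_fast = build_query("What is the capital of France?", thinking=False)
```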
Parameter breakdown is as follows:
| Attribute | Value |
|---|---|
| Total parameters | 235 billion |
| Activated per token | 22 billion |
| MoE experts per block | 128 |
| Activated experts per token | 8 |
| Transformer layers | 94 |
| Attention heads | 64 Q / 4 KV |
| Context window | 128K–256K tokens |
(Yang et al., 14 May 2025, Bai et al., 26 Nov 2025)
2. Training Regimen and Data
Qwen3-Max is pretrained on 36 trillion tokens covering 119 languages and multiple domains, including STEM, code, books, dialogue, and synthetic data. Pretraining employs staged sequence length growth (4096 → 32K tokens) and domain specialization phases (STEM/code). Post-training involves a 4-stage pipeline: supervised fine-tuning (SFT) on verified chain-of-thought (CoT) data, RL-based reasoning optimization (GRPO), mode-fusion SFT (mixing /think and /no_think templates), and multi-task RL for general skill acquisition.
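The staged schedule above can be summarized as a configuration sketch. The stage names and per-stage data mixes here are hypothetical labels for the phases described in the text, not published hyperparameters:

```python
# Hypothetical summary of the staged pretraining / post-training pipeline
# described above; token counts and mixture ratios per stage are not public.
PRETRAIN_STAGES = [
    {"name": "general",      "seq_len": 4096,  "focus": "web, books, dialogue"},
    {"name": "domain",       "seq_len": 4096,  "focus": "STEM, code, synthetic"},
    {"name": "long-context", "seq_len": 32768, "focus": "long documents"},
]

POST_TRAIN_PIPELINE = [
    "SFT on verified chain-of-thought data",
    "GRPO reasoning RL",
    "mode-fusion SFT (/think + /no_think templates)",
    "multi-task RL for general skills",
]
```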
The learning objective is a combined loss:

$$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{LM}} + \lambda\,\mathcal{L}_{\text{aux}},$$

where $\mathcal{L}_{\text{LM}}$ is the next-token cross-entropy and $\mathcal{L}_{\text{aux}}$ encourages balanced expert utilization (Yang et al., 14 May 2025).
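A common form of such a load-balancing auxiliary loss (the Switch/GShard formulation; the exact Qwen3 variant is not specified in this summary) penalizes the product of each expert's routed-token fraction and its mean router probability:

```python
import numpy as np

def load_balance_loss(router_logits, k=8):
    """Switch-Transformer-style auxiliary loss (a common form; the exact
    Qwen3 formulation is an assumption here).

    router_logits: (tokens, num_experts) raw router scores
    Returns ~1.0 under perfectly uniform routing, larger when imbalanced.
    """
    t, n = router_logits.shape
    probs = np.exp(router_logits - router_logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)             # softmax per token
    topk = np.argsort(router_logits, axis=1)[:, -k:]      # chosen experts
    f = np.bincount(topk.ravel(), minlength=n) / (t * k)  # load fraction f_i
    p = probs.mean(axis=0)                                # mean router prob P_i
    return n * float(f @ p)                               # n * sum_i f_i * P_i

rng = np.random.default_rng(1)
loss = load_balance_loss(rng.normal(size=(64, 128)), k=8)
```

Minimizing this term pushes the router toward spreading tokens evenly across the 128 experts, avoiding expert collapse.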
Inference exploits high parallelism (FlashAttention-style kernels, tensor/pipeline parallelism, up to 8×A100 GPUs). Quantization (INT8 via GPTQ) enables a 2×–3× memory reduction with negligible quality penalty.
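To make the quantization step concrete, here is a minimal per-tensor symmetric INT8 round-trip. This is a simplified stand-in for GPTQ, which additionally corrects quantization error layer by layer using calibration data:

```python
import numpy as np

def int8_quantize(w):
    """Per-tensor symmetric INT8 quantization (simplified; GPTQ adds
    error correction on top of a scheme like this)."""
    scale = float(np.abs(w).max()) / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(2)
w = rng.normal(size=(256, 256)).astype(np.float32)
q, s = int8_quantize(w)
w_hat = int8_dequantize(q, s)
err = float(np.abs(w - w_hat).max())   # bounded by half a quantization step
saving = w.nbytes / q.nbytes           # 4.0: FP32 -> INT8
```

The 4× storage saving from FP32 shrinks to roughly 2× when the baseline is BF16/FP16 weights, consistent with the 2×–3× figure quoted above once activations and KV cache are included.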
3. Multimodal and Vision-Language Capabilities
Qwen3-Max is the backbone for Qwen3-VL, supporting interleaved text, image, and video processing with a native 256,000-token context (Bai et al., 26 Nov 2025). The multimodal pipeline incorporates:
- Interleaved-MRoPE: rotary positional encoding interleaved across temporal, horizontal, and vertical axes, enhancing spatio-temporal modeling.
- DeepStack fusion: multi-resolution ViT feature extraction fused into the transformer’s early layers via an MLP and residual connections.
- Text-based time alignment: explicit timestamp tokens for fine-grained grounding in video analysis.
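The interleaving idea in Interleaved-MRoPE can be illustrated with position-index assignment. This sketch only shows the round-robin allocation of rotary frequency pairs to the temporal/height/width axes; the real Qwen3-VL channel allocation and rotation math are defined in the technical report:

```python
def interleaved_mrope_positions(t, h, w, n_freq_pairs=12):
    """Assign each rotary frequency pair to one of the (t, h, w) axes in a
    round-robin (interleaved) pattern, rather than in contiguous blocks,
    so every axis spans both high- and low-frequency bands.
    Illustrative only; not the actual Qwen3-VL allocation."""
    axes = (t, h, w)
    return [axes[i % 3] for i in range(n_freq_pairs)]

# A video patch at time step 2, row 5, column 7:
pos = interleaved_mrope_positions(2, 5, 7, n_freq_pairs=6)
# -> [2, 5, 7, 2, 5, 7]
```

Compared with block-wise allocation (all temporal channels first, then spatial), interleaving is intended to give each axis access to the full frequency spectrum, which is the stated motivation for the spatio-temporal modeling gains.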
Qwen3-VL uses a top-k router for expert assignment (typically k=2 at inference) and enforces load-balancing via an auxiliary loss to maintain expert utilization uniformity.
4. Empirical Performance Across Benchmarks
Qwen3-235B-A22B consistently attains or approaches state-of-the-art results across a wide range of text, code, reasoning, and multimodal benchmarks. Key reported metrics include:
| Benchmark | Score (Qwen3-Max) | Reference |
|---|---|---|
| MMLU-Redux | 92.7 (thinking) | (Yang et al., 14 May 2025) |
| GSM8K | 94.39 (think) | (Yang et al., 14 May 2025) |
| MATH | 71.84 (think) | (Yang et al., 14 May 2025) |
| LiveCodeBench v5 | 70.7 (think) | (Yang et al., 14 May 2025) |
| MMMU (MM) | 80.6 (think, VL) | (Bai et al., 26 Nov 2025) |
| MathVista-mini (MM) | 85.8 (think, VL) | (Bai et al., 26 Nov 2025) |
On PEDIASBench, Qwen3-Max demonstrates >90% accuracy on basic pediatric knowledge (single-choice, licensing-level) but exhibits a ∼15 point drop on complex integrative reasoning and a moderate decline from initial diagnosis (T1) to clinical management (T2) (Zhu et al., 17 Nov 2025). In the OBJEX(MT) benchmark for LLM-as-a-judge under jailbreaks, Qwen3-Max’s overall extraction accuracy is 0.441, with severe calibration issues (mean confidence 0.888; wrong@0.90 = 52.4%, i.e., over half of its highest-confidence judgments are incorrect), indicating overconfidence on adversarial or obfuscated transcripts (Kim et al., 23 Aug 2025).
Memory and latency efficiency are hallmark traits. The MoE structure ensures 2–3× reduction in per-token FLOPs versus a dense 235B model. With INT8 and vLLM’s PagedAttention, memory usage decreases by ≈40% versus dense counterparts at the same parameter scale (Bai et al., 26 Nov 2025).
5. Model Compression: MoBE
The Mixture-of-Basis-Experts (MoBE) paradigm yields a 24% reduction in Qwen3-Max’s parameter count (to ≈179B) with a minimal 0.6-point (≈0.7% relative) drop in average accuracy across diverse tasks (Chen et al., 7 Aug 2025). MoBE factorizes each expert’s up/gate weight matrix as $W_i \approx A_i B_i$, where $A_i$ is expert-specific and $B_i = \sum_{j=1}^{32} \alpha_{ij} B_j$ is a convex combination of 32 shared basis matrices ($\alpha_{ij} \ge 0$, $\sum_j \alpha_{ij} = 1$). The factorization is fit by minimizing mean squared reconstruction error under these simplex constraints on the mixing weights. Empirical evidence shows MoBE halves the reconstruction error compared to SVD-based compression. MoBE†, a variant with fewer activated experts, further aligns memory/compute with the original model at some additional (<2 point) accuracy loss.
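Assuming the structure just described (an expert-specific factor times a convex mixture of bases shared across experts in a layer), the reconstruction step can be sketched with toy dimensions; the real shapes and fitting procedure are in Chen et al. (7 Aug 2025):

```python
import numpy as np

def mobe_reconstruct(A, alphas, bases):
    """Rebuild one expert's up/gate matrix under a MoBE-style factorization
    W_i ≈ A_i @ B_i, with B_i a convex combination of shared bases.
    Shapes here are illustrative, not the published ones."""
    assert np.all(alphas >= 0) and abs(alphas.sum() - 1.0) < 1e-6  # simplex
    B = np.tensordot(alphas, bases, axes=1)   # (r, d): mixed basis matrix
    return A @ B                              # (d, d): reconstructed weights

rng = np.random.default_rng(3)
d, r, n_bases = 64, 16, 32
bases = rng.normal(size=(n_bases, r, d))      # shared within a layer
alphas = rng.random(n_bases)
alphas /= alphas.sum()                        # project onto the simplex
A = rng.normal(size=(d, r))                   # expert-specific factor
W_hat = mobe_reconstruct(A, alphas, bases)
```

The savings come from storage: per expert, only $A_i$ (d×r) and 32 mixing weights are kept, while the basis matrices are amortized across all 128 experts in the layer.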
6. Applications and Domain-Specific Evaluations
Clinical and Medical Reasoning
On PEDIASBench, Qwen3-Max achieves:
- Single-choice accuracy (by physician level): 91.2% (resident), 90.3% (junior), 89.1% (intermediate), 88.75% (senior)
- Multiple-choice integrative reasoning: 15-point drop from simple to complex questions
- Ethics/safety: 89.4% accuracy (91% in clinical ethics)
- Dynamic diagnosis (case reasoning): mean score 0.56 (12-model average: 0.54); drop from T1 to T2 (0.58→0.54)
Qwen3-Max is not recommended for unsupervised, autonomous clinical use; its optimal role is as a decision support or educational assistant, especially when augmented with retrieval and multimodal extensions (Zhu et al., 17 Nov 2025).
Evaluation as an Automated Judge
On the OBJEX(MT) benchmark, Qwen3-Max demonstrates:
- Tied objective extraction accuracy with GPT-4.1: 0.441
- Inferior calibration versus claude-sonnet-4 (ECE 0.447, Brier 0.441)
- Consistently high overconfidence across datasets (wrong@0.90 ≈ 52.4%)
- Sharp performance variability (e.g., SafeMT Attack_600: 0.210, MHJ_local: 0.733)
These results emphasize the operational risk of relying solely on the model for adversarial evaluation settings; explicit objectives and robust uncertainty management protocols are required (Kim et al., 23 Aug 2025).
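The ECE figure cited above is the standard binned calibration metric; a minimal implementation (binning details in the benchmark may differ) shows how high confidence paired with mediocre accuracy produces a large value:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Binned ECE: |accuracy - mean confidence| per bin, weighted by bin
    size. Illustrates the metric reported for OBJEX(MT); the benchmark's
    exact binning scheme is an assumption here."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf >= lo) & (conf <= hi) if lo == 0 else (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# An overconfident judge: 0.9 stated confidence, 50% actual accuracy.
conf = np.array([0.9, 0.9, 0.9, 0.9])
correct = np.array([1, 0, 0, 1])
ece = expected_calibration_error(conf, correct)   # |0.5 - 0.9| = 0.4
```

A well-calibrated judge would have ECE near zero; values like the 0.447 reported above indicate that stated confidence carries little information about correctness.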
Multimodal and Agentic Workflows
Qwen3-VL (with Qwen3-Max backbone) achieves leading performance on image/video reasoning (MathVista, MathVision, VideoMMMU) and complex agentic tasks (ScreenSpot Pro, GUI control). Ultra-long context enables cross-referencing in extensive documents and videos, positioning the model for agentic decision-making and UI→code generation (Bai et al., 26 Nov 2025).
7. Strengths, Limitations, and Future Directions
Strengths
- State-of-the-art performance on knowledge-intensive, reasoning, and multimodal benchmarks
- Highly efficient FLOPs/memory due to MoE sparsity and compression (MoBE)
- Robust training for both chain-of-thought reasoning and high-throughput inference
- Flexible deployment with tunable thinking budget, open-source release, and comprehensive multilingual support
Limitations
- Marked overconfidence and calibration issues in adversarial/jailbreak evaluation
- Nontrivial drop in integrative/complex reasoning tasks
- Lacks full humanlike empathy and “humanistic sensitivity” in open-ended, high-stakes domains (e.g., pediatric care)
- Interpretability and trustworthiness remain limited without further retrieval augmentation, fine-tuning, or multimodal extension
Directions for Improvement
- Multimodal integration with physiological time series and imaging to bolster real-world utility
- Retrieval-augmented generation for rare or ambiguous cases
- Clinical feedback loops and expert-in-the-loop refinement
- Kernel-level support for MoBE-style compression to unlock further deployment efficiency
References
- Qwen3 Technical Report (Yang et al., 14 May 2025)
- ObjexMT Benchmark (Kim et al., 23 Aug 2025)
- MoBE: Mixture-of-Basis-Experts (Chen et al., 7 Aug 2025)
- Can LLMs Function as Qualified Pediatricians? (Zhu et al., 17 Nov 2025)
- Qwen3-VL Technical Report (Bai et al., 26 Nov 2025)