Qwen3-30B-A3B: Advanced Multimodal MoE Transformer
- Qwen3-30B-A3B is a modular Mixture-of-Experts (MoE) Transformer architecture that integrates text, vision, audio, and video modalities.
- It employs a dual-path design with a 'Thinker' for language and a 'Talker' for speech synthesis, enabling efficient streaming generation and ultra-low latency.
- By unifying multimodal perception with dynamic nested depth and adaptive token routing, it achieves state-of-the-art performance across diverse benchmarks.
Qwen3-30B-A3B denotes a flagship 30-billion-parameter modular Mixture-of-Experts (MoE) Transformer configuration within the Qwen3-Omni model series, supporting unified multimodal perception (text, vision, audio, video) and real-time streaming generation. At this scale, the MoE "Thinker" component (30B total parameters with roughly 3B activated per token, hence the "A3B" suffix) and the specialized "Talker" speech decoder together provide state-of-the-art performance on a diverse range of language, audio, and vision benchmarks with efficient token routing and ultra-low first-packet latency. Dynamic Nested Depth (DND) further improves downstream accuracy by selectively re-processing critical tokens through a second pass governed by adaptive thresholding.
1. Architectural Overview
Qwen3-30B-A3B is built upon a sparsely activated Mixture-of-Experts (MoE) Transformer architecture. The model comprises 48 decoder layers, with each MoE block employing token-level top-k expert routing. For text and multimodal tasks, the "Thinker" is a 30B-parameter MoE Transformer with roughly 3B parameters activated per token (hence "A3B"), while speech generation is handled by the "Talker", a 3B-parameter MoE Transformer with roughly 0.3B activated parameters ("A0.3B"). Both employ learned gating networks that assign each token representation $x$ to a selection set $\mathcal{S}(x)$ of experts, so that

$$y = \sum_{i \in \mathcal{S}(x)} g_i(x)\, E_i(x),$$

where $g_i(x)$ is the gate weight for expert $E_i$, each layer contains $N$ experts, and only $k \ll N$ experts are active per token (Xu et al., 22 Sep 2025).
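For concreteness, here is a minimal PyTorch sketch of token-level top-k expert routing of this form; the expert MLP shape and the plain softmax-then-top-k gate are illustrative assumptions rather than the released Qwen3 implementation (the hidden size and expert counts echo the Thinker configuration table below).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoELayer(nn.Module):
    """Illustrative token-level top-k MoE block (not the released Qwen3 code)."""

    def __init__(self, d_model: int = 2048, n_experts: int = 48, k: int = 8, d_ff: int = 4096):
        super().__init__()
        self.k = k
        self.gate = nn.Linear(d_model, n_experts, bias=False)  # learned gating network g(x)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:      # x: [num_tokens, d_model]
        probs = F.softmax(self.gate(x), dim=-1)               # gating probabilities over N experts
        topk_p, topk_idx = probs.topk(self.k, dim=-1)         # selection set S(x): k experts per token
        topk_p = topk_p / topk_p.sum(dim=-1, keepdim=True)    # renormalize over the selected experts
        y = torch.zeros_like(x)
        for slot in range(self.k):                            # accumulate y = sum_i g_i(x) * E_i(x)
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e
                if mask.any():
                    y[mask] += topk_p[mask, slot].unsqueeze(-1) * expert(x[mask])
        return y
```

Production MoE kernels dispatch tokens to experts in batched fashion rather than looping as above; the loop only keeps the routing arithmetic explicit.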
Key configuration parameters (Thinker):
| Layers | Hidden Size | Attention Heads | MoE Experts | Activated Experts | Sequence Length |
|---|---|---|---|---|---|
| 48 | 2048 | 32 | 48 | 8 (training), 3 (inference) | 16,384 (training), 32K (long-context) |
The model backbone is shared across modalities via Time-aligned Multimodal Rotary Position Embedding (TM-RoPE), enabling synchronized token processing with temporal anchoring at 80ms intervals (Xu et al., 22 Sep 2025).
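To make the 80 ms temporal anchoring concrete, the sketch below quantizes token timestamps onto a shared temporal grid so that co-occurring audio and video tokens receive the same temporal index; the function and its interface are hypothetical, and the full TM-RoPE formulation additionally factorizes positions over modality-specific axes (e.g., height and width for vision).

```python
# Hedged sketch of temporal anchoring only; not the full TM-RoPE formulation.
FRAME_MS = 80  # temporal anchor interval used by TM-RoPE

def temporal_position(timestamp_ms: float) -> int:
    """Quantize a token's timestamp to its 80 ms temporal position index."""
    return int(timestamp_ms // FRAME_MS)

# An audio token at 400 ms and a video frame at 430 ms share temporal index 5,
# so their rotary phases along the time axis stay synchronized.
assert temporal_position(400) == temporal_position(430) == 5
```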
2. Unified Multimodal Perception and Generation
Qwen3-30B-A3B natively integrates text, image, audio, and video modalities:
- Text: Uses byte-level BPE with a 151,643-token vocabulary; supports 119 writing systems.
- Vision: Employs a 543M-parameter SigLIP2-So400M ViT encoder for images and video frames.
- Audio: Utilizes a 650M-parameter Audio Transformer trained on 20M hours of supervised audio, with mel-spectrogram inputs downsampled to a 12.5 Hz representation.
- Video: Frames are temporally sampled and aligned to the shared 80 ms temporal grid.
Modality features are encoded and fed into the same Transformer stack, allowing tasks such as cross-modal reasoning, image captioning, audio transcription, and agentic response generation within a common architecture (Xu et al., 22 Sep 2025).
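As a schematic of this shared-backbone design, the sketch below interleaves per-modality token embeddings into one sequence for a common Transformer stack; the encoder outputs are random placeholders standing in for the SigLIP2 vision encoder and the Audio Transformer described above.

```python
import torch

def build_unified_sequence(text_emb: torch.Tensor,
                           vision_emb: torch.Tensor,
                           audio_emb: torch.Tensor) -> torch.Tensor:
    """Concatenate per-modality token embeddings ([n_i, d_model] each) into one
    sequence for the shared decoder; in the real model, token ordering and
    position ids follow the TM-RoPE temporal alignment described above."""
    return torch.cat([text_emb, vision_emb, audio_emb], dim=0)

d_model = 2048  # Thinker hidden size from the configuration table
sequence = build_unified_sequence(torch.randn(12, d_model),   # placeholder text tokens
                                  torch.randn(64, d_model),   # placeholder image patches
                                  torch.randn(25, d_model))   # placeholder audio frames
print(sequence.shape)  # torch.Size([101, 2048])
```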
3. Streaming Speech Synthesis and Latency Optimization
For real-time agentic interaction, Qwen3-30B-A3B implements a streaming speech synthesis stack that replaces computationally intensive block-wise diffusion with a causal ConvNet decoder. The "Talker" autoregressively predicts a multi-codebook discrete codec representation of speech, as sketched after the following list:
- The Talker backbone predicts the "zeroth" codebook of each codec frame.
- A Multi-Token Prediction (MTP) module predicts all residual codebooks of that frame in parallel.
- The Code2Wav causal ConvNet synthesizes the corresponding 80 ms audio waveform chunk.
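A minimal sketch of the streaming loop implied by these three stages; `talker_step`, `mtp_predict`, and `code2wav_chunk` are hypothetical stand-ins for the Talker backbone, the MTP module, and the Code2Wav decoder, respectively.

```python
from typing import Callable, Iterator, List

def stream_speech(
    talker_step: Callable[[List[List[int]]], int],   # autoregressive zeroth-codebook prediction
    mtp_predict: Callable[[int], List[int]],         # parallel residual-codebook prediction
    code2wav_chunk: Callable[[List[int]], bytes],    # causal ConvNet: codes -> 80 ms waveform chunk
    num_frames: int,
) -> Iterator[bytes]:
    """Emit one 80 ms audio chunk per codec frame as soon as its codes exist."""
    history: List[List[int]] = []
    for _ in range(num_frames):
        c0 = talker_step(history)          # step 1: backbone predicts the zeroth codebook
        residuals = mtp_predict(c0)        # step 2: MTP fills in all residual codebooks in parallel
        frame = [c0, *residuals]
        history.append(frame)
        yield code2wav_chunk(frame)        # step 3: decode and stream the chunk immediately
```

Because each chunk is decoded as soon as its codes are available, first-packet latency is bounded by the compute for a single frame, which is what the budget below decomposes.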
Latency decomposition (cold-start, single concurrency):
- Preprocessing: 72ms
- Thinker to first token: 88ms
- Talker to first token: 57ms
- MTP pass: 14ms
- Decoder: 3ms
- Total theoretical first-packet latency: 234ms (audio), 547ms (video) (Xu et al., 22 Sep 2025).
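The audio-path total is simply the sum of the listed components:

$$72 + 88 + 57 + 14 + 3 = 234\ \text{ms}$$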
MoE scaling maintains low latency under concurrent multi-stream serving (4–6 simultaneous streams).
4. Dynamic Nested Depth (DND) Enhancement
Dynamic Nested Depth (DND) is integrated into the Qwen3-30B-A3B MoE backbone to selectively reprocess critical tokens within individual Transformer layers. In layers 4–43 (a code sketch follows the list below):
- A simple router network produces a selection probability $p_i$ for each token $i$.
- Tokens with $p_i > \tau$ are routed for a nested second pass through the same layer.
- Output fusion: for selected tokens, $h_i^{\text{out}} = h_i + \alpha\,\tilde{h}_i$, where $h_i$ is the vanilla (first-pass) output, $\tilde{h}_i$ is the nested (second-pass) output, and $\alpha$ is a learnable weight.
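A hedged PyTorch sketch of this per-layer mechanism; the linear-plus-sigmoid router, additive fusion, and fixed default threshold are assumptions consistent with the description above, not the reference DND implementation.

```python
import torch
import torch.nn as nn

class DNDLayer(nn.Module):
    """Wraps a Transformer layer with Dynamic Nested Depth (illustrative only)."""

    def __init__(self, layer: nn.Module, d_model: int, tau: float = 0.5):
        super().__init__()
        self.layer = layer                            # the wrapped Transformer layer
        self.router = nn.Linear(d_model, 1)           # lightweight selection router
        self.alpha = nn.Parameter(torch.zeros(1))     # learnable fusion weight
        self.tau = tau                                # threshold, adapted during training (see below)

    def forward(self, x: torch.Tensor) -> torch.Tensor:       # x: [num_tokens, d_model]
        h = self.layer(x)                                      # vanilla first pass for every token
        p = torch.sigmoid(self.router(h)).squeeze(-1)          # selection probabilities p_i
        selected = p > self.tau                                # tokens sent through a nested second pass
        if selected.any():
            h_nested = self.layer(h[selected])                 # reprocess only the selected tokens
            h = h.clone()
            h[selected] = h[selected] + self.alpha * h_nested  # fuse vanilla and nested outputs
        return h
```

In the released configuration this wrapping is applied to layers 4–43 only, and the threshold is not fixed but adapted as described next.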
DND employs two novel losses:
- A score-dispersion loss that maximizes entropy among the selected scores.
- A distribution-preservation loss, an MSE regularizer that prevents saturation at the sigmoid boundary.
The threshold $\tau$ is dynamically controlled via buffer-proportional updates and EMA synchronization to maintain a target selection ratio (Chen et al., 13 Oct 2025).
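A minimal sketch of such a threshold controller, assuming a proportional update smoothed by an exponential moving average; the target ratio, step size, and EMA constant are placeholder values, not those of the paper.

```python
class ThresholdController:
    """Nudges the DND threshold so the selection ratio tracks a target (illustrative)."""

    def __init__(self, target_ratio: float = 0.25, step: float = 0.01, ema: float = 0.99):
        self.target = target_ratio   # desired fraction of tokens taking the nested pass
        self.step = step             # proportional update size
        self.ema = ema               # smoothing constant for the observed ratio
        self.tau = 0.5               # current selection threshold
        self.ratio = target_ratio    # EMA estimate of the observed selection ratio

    def update(self, num_selected: int, num_total: int) -> float:
        batch_ratio = num_selected / max(num_total, 1)
        self.ratio = self.ema * self.ratio + (1.0 - self.ema) * batch_ratio
        # Raise tau when too many tokens are selected, lower it when too few,
        # proportionally to the deviation from the target ratio.
        self.tau += self.step * (self.ratio - self.target)
        self.tau = min(max(self.tau, 0.0), 1.0)
        return self.tau
```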
This DND integration yields:
- +0.87 SFT accuracy improvement across 17 benchmarks
- Parameter overhead: +0.03M (for routers and gates)
- FLOPs overhead: +6.3% relative to baseline (7.52% per DND layer, applied to 40 out of 48 layers)
Largest gains are observed on coding, reasoning, and alignment tasks (e.g., +2.05 on BFCL v3 and +1.83 on C-Eval) (Chen et al., 13 Oct 2025).
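The overall overhead figure is consistent with spreading the per-layer cost over the full 48-layer stack:

$$7.52\% \times \frac{40}{48} \approx 6.3\%$$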
5. Empirical Performance Across Benchmarks
Qwen3-30B-A3B constitutes the backbone for Qwen3-Omni-30B-A3B variants, achieving:
- Text → Text: The Instruct variant outperforms Qwen3-235B-A22B and GPT-4o-0327 on GPQA, AIME25, ZebraLogic, WritingBench, and PolyMath.
- Vision → Text: Matches or exceeds closed-source models (e.g., on MMMU-Pro, MathVista-mini, and MATH-Vision).
- Audio → Text (ASR, S2TT): Achieves open-source SOTA on 32 of 36 benchmarks; e.g., LibriSpeech WER 1.22/2.48 (clean/other) and CommonVoice WER 5.33.
- Cross-modal reasoning: The "Thinking" variant shows +4.4-point gains on STEM vision-language reasoning; on audio-visual benchmarks it matches or surpasses Gemini-Pro and GPT-4o-Transcribe.
- Music understanding and AV tasks: Approaches or leads the previous open-source SOTA (e.g., RUL-MuchoMusic micro-F1 52.0; WorldSense AV understanding 54.0 vs. prior 47.1) (Xu et al., 22 Sep 2025).
6. Specialized Variants and Training Schedule
The Qwen3-30B-A3B infrastructure enables multiple specialized models:
- Thinking: With chain-of-thought tuning and strong-to-weak distillation for multimodal reasoning.
- Captioner: Fine-tuned for low-hallucination, high-descriptiveness audio captioning.
- Instruct: For default assistant and agent tasks.
All are open-sourced under Apache 2.0 (Xu et al., 22 Sep 2025).
Training employs a three-stage pipeline over ~2 trillion tokens:
- Encoder alignment (LLM frozen; adapters trained on text-paired data)
- Joint multimodal training (all parameters unfrozen; chunk length 8K)
- Long-context training (token length up to 32K)
The audio encoder is trained from scratch on 20M hours of audio. The "Thinker" receives SFT, two-phase distillation, and RLHF; the "Talker" undergoes sequential multimodal pretraining, DPO, and speaker fine-tuning.
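For reference, the schedule above can be summarized as a configuration sketch; the field names are ad hoc and simply restate what the text specifies.

```python
# Illustrative summary of the training pipeline described above (~2T tokens total).
TRAINING_STAGES = [
    {"name": "encoder_alignment",
     "trainable": ["vision_adapter", "audio_adapter"],   # LLM kept frozen
     "data": "text-paired multimodal data"},
    {"name": "joint_multimodal",
     "trainable": "all parameters",
     "chunk_length_tokens": 8_192},                      # "8K" chunk length
    {"name": "long_context",
     "trainable": "all parameters",
     "max_sequence_tokens": 32_768},                     # up to "32K" token length
]

POST_TRAINING = {
    "thinker": ["SFT", "two-phase distillation", "RLHF"],
    "talker": ["sequential multimodal pretraining", "DPO", "speaker fine-tuning"],
}
```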
7. Context, Impact, and Release
Qwen3-30B-A3B demonstrates that multimodal MoE Transformers at this scale can match or surpass same-sized single-modal models without measurable trade-offs on unimodal benchmarks, while supporting seamless cross-modal interaction, robust agentic planning, and streaming generation at real-time latency. The architecture scales robustly under high concurrency and is released under an open-source license (Apache 2.0), facilitating further research and practical deployment in cross-domain applications (Xu et al., 22 Sep 2025, Chen et al., 13 Oct 2025).