
Qwen3 Transformer Architecture

Updated 17 March 2026
  • Qwen3 Transformer Architecture is a scalable family of transformer models that integrates dense and sparse Mixture-of-Experts methods for efficient processing.
  • It employs expert-sparse transformer blocks, adaptive reasoning control, and dual-stack design to balance latency and high performance.
  • The multimodal Qwen3-Omni variant uses specialized encoders and a two-stage speech processing strategy to excel in text, vision, audio, and video tasks.

Qwen3 Transformer Architecture defines a scalable family of large language and multimodal transformer models, ranging from 0.6 billion to 235 billion parameters, that integrate innovations in model sparsity, reasoning/response control, and multimodal alignment. Qwen3 introduces both dense (fully-activated feedforward) and sparse Mixture-of-Experts (MoE) variants, supports chain-of-thought and non-thinking response modes, and, in its multimodal Qwen3-Omni instantiation, achieves open-source state-of-the-art across text, vision, audio, and video benchmarks. Central architectural contributions include expert-sparse transformer blocks, adaptive reasoning token budgets, and a dual-stack “Thinker-Talker” division to streamline multimodal perception and low-latency generation (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025, Bai et al., 26 Nov 2025).

1. Model Variants and Transformer Backbone

The Qwen3 family comprises dense and Mixture-of-Experts (MoE) models, all sharing a pre-layer-normalization (pre-LN) transformer-decoder stack. Dense variants use conventional 2-layer MLP feedforwards; MoE variants substitute these with sparse, top-$k$-gated expert networks to reduce computational load per token while matching or exceeding dense model performance.

Dense model configurations range from 0.6B (28 layers, 16 Q-heads) to 32B parameters (64 layers, 64 Q-heads). MoE configurations include Qwen3-30B-A3B (48 layers, 32/4 Q/KV heads, 128 experts with top-8 gating) and Qwen3-235B-A22B (94 layers, 64/4 Q/KV heads, similar expert routing). All models employ rotary positional encodings (RoPE), Grouped Query Attention (GQA), QK-Normalization, and SwiGLU MLP activations. Context length support scales with model size, from 32K (low-end) up to 256K tokens in vision-LLMs (Yang et al., 14 May 2025, Bai et al., 26 Nov 2025).
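Under GQA, each key/value head serves a whole group of query heads (e.g., the 32/4 Q/KV split above), shrinking the KV cache relative to full multi-head attention. A minimal numpy sketch of this head sharing, with hypothetical head counts and dimensions and causal masking omitted:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped Query Attention: n_q query heads share n_kv KV heads.

    q: (n_q, T, d); k, v: (n_kv, T, d), with n_q % n_kv == 0.
    Each KV head is repeated across its group of query heads.
    Causal masking is omitted for brevity.
    """
    n_q, T, d = q.shape
    n_kv = k.shape[0]
    group = n_q // n_kv
    # Broadcast each KV head to its query-head group.
    k = np.repeat(k, group, axis=0)                   # (n_q, T, d)
    v = np.repeat(v, group, axis=0)                   # (n_q, T, d)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)    # (n_q, T, T)
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)         # row-wise softmax
    return weights @ v                                # (n_q, T, d)

rng = np.random.default_rng(0)
out = gqa_attention(rng.normal(size=(8, 5, 16)),
                    rng.normal(size=(2, 5, 16)),
                    rng.normal(size=(2, 5, 16)))
print(out.shape)  # (8, 5, 16)
```

Only the `n_kv` KV heads need to be cached during decoding, which is why the large Q-to-KV head ratio matters for long-context serving.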

MoE transformer blocks are structured as follows:

  • LayerNorm → Multi-Head Self-Attention → residual add
  • LayerNorm → MoE-MLP → residual add

In MoE layers, a top-$k$ gating mechanism selects sparse expert routes for each token, and a global batch-level load balancing loss is used to ensure uniform expert utilization (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025).
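The two residual sublayers above can be sketched in a few lines of numpy; this is a minimal stand-in (Qwen3 applies RMSNorm for the pre-normalization, and the attention and MoE sublayers here are placeholder callables, not real implementations):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # Pre-normalization; Qwen3 uses RMSNorm before each sublayer.
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def pre_ln_block(x, attn, moe_mlp):
    """One pre-LN decoder block: norm -> sublayer -> residual add, twice.

    `attn` and `moe_mlp` stand in for multi-head self-attention and the
    top-k MoE feedforward; only the residual wiring is shown here.
    """
    x = x + attn(rms_norm(x))      # LayerNorm -> self-attention -> residual
    x = x + moe_mlp(rms_norm(x))   # LayerNorm -> MoE-MLP -> residual
    return x

x = np.ones((4, 8))
y = pre_ln_block(x, attn=lambda h: 0.1 * h, moe_mlp=lambda h: 0.1 * h)
print(y.shape)  # (4, 8)
```

Pre-norm residual wiring keeps the skip path identity-clean, which is what makes deep stacks (up to 94 layers here) trainable without warmup tricks.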

2. Mixture-of-Experts (MoE) Formulation

Qwen3 MoE blocks generalize the standard feedforward by dispatching each token, after LayerNorm, to a select subset of expert MLPs. For each input $x \in \mathbb{R}^d$, the gating network computes $g(x) = \mathrm{Softmax}(W_g x) \in \mathbb{R}^E$ over $E$ experts, retaining only the top-$k$ entries (typically $k = 2$ for Qwen3-Omni). Each expert is a two-layer MLP, such that

$$E_e(x) = W_{2,e} \cdot \phi(W_{1,e} x + b_{1,e}) + b_{2,e}$$

where $\phi$ is GELU. The output is a gate-weighted sum across the active experts,

$$z = \sum_{e=1}^{E} g_e(x)\, E_e(x)$$

with $g_e(x) = 0$ for pruned experts. A per-batch load-balancing loss,

$$\mathcal{L}_{\text{load}} = \lambda\, \mathbb{E}_e [G_e]^2 + \mu\, \mathbb{E}_e [U_e]^2$$

where $G_e = \mathbb{E}_{x \sim \text{batch}}[g_e(x)]$ is the mean gate probability and $U_e = \mathbb{E}_x[\mathbb{I}[e \in \mathrm{TopK}(x)]]$ is the fraction of tokens routed to expert $e$, regularizes both assignment probabilities and token allocations toward uniformity. Qwen3 models feature fine-grained expert segmentation with no shared experts and perform top-$k$ routing per token (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025).
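The gating, gate-weighted expert sum, and the statistics $G_e$ and $U_e$ can be sketched in numpy. This is a toy illustration, not the training code: it fixes $\lambda = \mu = 1$ and takes one plausible reading of the squared-moment penalty (mean of squares over experts):

```python
import numpy as np

def softmax(z):
    z = z - z.max(-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(-1, keepdims=True)

def moe_forward(X, W_g, experts, k=2):
    """Top-k MoE: route each token to k experts, sum gate-weighted outputs.

    X: (T, d) tokens; W_g: (E, d) gating weights; experts: list of E callables.
    Returns outputs and a toy load-balancing loss built from G_e (mean gate
    probability) and U_e (fraction of tokens routed to each expert).
    """
    gates = softmax(X @ W_g.T)                        # (T, E), g(x) per token
    topk = np.argsort(-gates, axis=-1)[:, :k]         # active expert indices
    Z = np.zeros_like(X)
    usage = np.zeros(gates.shape[1])
    for t in range(X.shape[0]):
        for e in topk[t]:
            Z[t] += gates[t, e] * experts[e](X[t])    # gate-weighted sum
            usage[e] += 1.0
    G = gates.mean(0)                                 # G_e over the batch
    U = usage / X.shape[0]                            # U_e over the batch
    load_loss = (G ** 2).mean() + (U ** 2).mean()     # lambda = mu = 1 here
    return Z, load_loss

rng = np.random.default_rng(0)
E, d, T = 4, 8, 16
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)) * 0.1)
           for _ in range(E)]
Z, loss = moe_forward(rng.normal(size=(T, d)), rng.normal(size=(E, d)), experts)
print(Z.shape)  # (16, 8)
```

Because $U_e$ counts hard token assignments, a perfectly uniform router minimizes both terms simultaneously, which is the uniformity pressure described above.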

3. Reasoning Modes and Dynamic Control

Qwen3 integrates “thinking mode” (explicit chain-of-thought reasoning) and “non-thinking mode” (rapid, template-driven answers) within a single model. Mode switching is controlled via prompt-level tokens (“/think” or “/no_think”), with internal reasoning demarcated by `<think> ... </think>` blocks. The core transformer weights remain unchanged; mode affects only output token content and length.

A user-specified thinking budget $B$ bounds the reasoning phase: when the number of generated reasoning tokens $t_{\text{think}} \geq B$, the model halts the `<think>` block and returns the answer. This strategy enables users to balance latency and performance at inference by strictly limiting compute spent on chain-of-thought generation (Yang et al., 14 May 2025).

The design contrasts with dual-graph approaches and does not require content- or gradient-based dynamic routing; all decisions are driven by prompt and control tokens.
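The budget check reduces to a simple condition in the decoding loop. A schematic sketch, where the `step` callable and the single-character tokens are hypothetical stand-ins rather than the released inference code:

```python
def generate_with_budget(step, budget, max_tokens=64):
    """Schematic decoding loop with a thinking budget B: once the number of
    reasoning tokens reaches B, the <think> block is force-closed and
    decoding continues with the answer. `step` maps the tokens emitted so
    far to the next token.
    """
    out = ["<think>"]
    n_think = 0
    thinking = True
    while len(out) < max_tokens:
        tok = step(out)
        if thinking:
            if tok == "</think>" or n_think >= budget:
                out.append("</think>")   # halt reasoning at the budget
                thinking = False
                continue
            n_think += 1
        out.append(tok)
        if tok == "<eos>":
            break
    return out

# Toy stand-in model: emits "r" (reasoning filler) while thinking, then "a".
toy = lambda out: "r" if "</think>" not in out else "a"
trace = generate_with_budget(toy, budget=3, max_tokens=10)
print(trace)
```

The point is that the budget is enforced purely at decode time, with no change to the weights, matching the prompt-and-control-token design described above.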

4. Multimodal Integration in Qwen3-Omni

Qwen3-Omni extends Qwen3 to process and generate outputs across text, image, audio, and video, matching or exceeding the performance of single-modal Qwen3 variants. The architecture adopts a dual-stack approach:

  • Thinker: a 48-layer MoE transformer with $d_{\text{model}} = 5120$, 32 heads, and 64 experts per layer, unified across modalities. Specialized encoders preprocess each stream: a byte-pair embedding for text, the AuT encoder for audio (12.5 Hz output, 0.6B params), and SigLIP2-So400M for vision/video.
  • Talker: a 16-layer MoE transformer ($d_{\text{model}} = 2048$, 16 heads, 16 experts per layer) that predicts discrete speech codebooks for streaming synthesis.

Temporal-Multimodal RoPE (TM-RoPE) provides rotary embeddings factorized across (time, height, width) for precise spatiotemporal modeling, with absolute temporal IDs (80 ms units) enabling alignment between modalities (Xu et al., 22 Sep 2025).
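A sketch of the temporal-ID assignment for a video patch grid; the frame-rate-to-tick mapping and grid layout here are illustrative assumptions, with only the 80 ms tick size taken from the description above:

```python
def tm_rope_ids(n_frames, grid_h, grid_w, tick_ms=80.0, fps=2.0):
    """Assign (time, height, width) position-ID triples to video patches.

    The temporal ID is an absolute timestamp in 80 ms ticks (tick_ms), so
    streams sampled at different rates land on a shared time axis; the two
    spatial IDs index the patch grid of each frame.
    """
    ids = []
    for f in range(n_frames):
        t_id = int(round(f * (1000.0 / fps) / tick_ms))  # 80 ms units
        for h in range(grid_h):
            for w in range(grid_w):
                ids.append((t_id, h, w))
    return ids

# Two frames at 2 fps: frame 1 sits 500 ms later, i.e. ~6 ticks of 80 ms.
ids = tm_rope_ids(n_frames=2, grid_h=2, grid_w=2)
print(ids[0], ids[4])  # (0, 0, 0) (6, 0, 0)
```

Each of the three ID components would then drive its own partition of the rotary embedding channels, which is the factorization TM-RoPE refers to.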

All encoded tokens are temporally concatenated and input to the Thinker stack. A lightweight “Thinking module” at the top applies four additional layerwise MLPs to generate cross-modal reasoning logits, supporting advanced tasks such as multimodal chain-of-thought and audio captioning.

5. Real-Time Speech Processing and Streaming Synthesis

Qwen3-Omni introduces a two-stage, low-latency speech generation strategy for audio outputs:

  • Multi-codebook quantization: Each 80 ms audio frame is encoded as four residual vector quantizer tokens (each with 1024 entries). The Talker stack predicts $y_{t,0}$ directly and uses a subsequent dense Multi-Token Prediction (MTP) transformer to autoregressively generate $y_{t,1:C-1}$.
  • Causal ConvNet (“Code2Wav”): A small causal 1D convolutional network reconstructs waveform samples from the codebook embeddings at 12.5 Hz, initiating streaming output immediately after the first token is available.

This approach replaces the blockwise diffusion vocoder (e.g., DiT, ≥2 s latency in the base Qwen3) with streaming capabilities and achieves a theoretical end-to-end first-packet latency of 234 ms in cold-start scenarios (Xu et al., 22 Sep 2025).
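The per-frame control flow of the two-stage pipeline can be sketched as follows; all three callables are hypothetical stand-ins (the real modules are the Talker, the MTP transformer, and the Code2Wav ConvNet), so only the ordering and streaming structure are meaningful:

```python
def stream_frames(n_frames, talker, mtp, code2wav, n_codebooks=4):
    """Schematic two-stage streaming synthesis: per 80 ms frame, the Talker
    emits the first RVQ token y_{t,0}, the MTP module fills in the remaining
    residual tokens y_{t,1:C-1}, and Code2Wav converts the completed frame to
    audio immediately, so playback can begin after the first frame rather
    than after the whole utterance.
    """
    audio = []
    history = []                        # frames synthesized so far
    for t in range(n_frames):
        y0 = talker(history)                                # y_{t,0}
        rest = [mtp(y0, c) for c in range(1, n_codebooks)]  # y_{t,1:C-1}
        frame_codes = [y0] + rest
        history.append(frame_codes)
        audio.append(code2wav(frame_codes))                 # streamed chunk
    return audio

# Toy stand-ins just to exercise the control flow.
audio = stream_frames(
    3,
    talker=lambda hist: len(hist),
    mtp=lambda y0, c: y0 * 10 + c,
    code2wav=lambda codes: sum(codes),
)
print(audio)
```

Because the waveform head is a causal ConvNet rather than a blockwise diffusion model, the first audio chunk depends only on the first frame's codes, which is what makes the low first-packet latency possible.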

6. Innovations Relative to Predecessors and Contemporary Models

Qwen3 introduces multiple innovations relative to previous models such as Qwen2.5 and Qwen2.5-VL:

  • Grouped Query Attention (GQA): More query heads than KV heads, shrinking the KV cache and making attention scale more efficiently.
  • QK-Normalization: Replaces QKV bias for more stable self-attention.
  • Pre-RMSNorm: Layer normalization is applied before attention/MLP (pre-norm residuals).
  • Sparse MoE throughout: All MLP blocks in Qwen3-Omni are replaced with top-$k$ MoE sub-networks for throughput and efficiency under long-context loads. Fine-grained expert segmentation and batch-global balancing further distinguish Qwen3 MoE from prior implementations.
  • Augmented encoders: Audio (AuT, trained from scratch) and vision (SigLIP2-So400M, replacing prior adapters) serve as unified embedding interfaces across tasks.
  • Flexible mode-controlled reasoning: Unified support for explicit, token-bounded chain-of-thought without separate networks.

These advances allow Qwen3-Omni to maintain single-modal SOTA performance across all target modalities and outperform large closed-source models such as Gemini-2.5-Pro and GPT-4o-Transcribe on a broad battery of benchmarks (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025, Bai et al., 26 Nov 2025).

7. Resource Footprint and Model Scaling

Qwen3’s parameter scaling is summarized as follows for leading configurations:

| Model | Layers | $d_{\text{model}}$ | MoE experts/layer | Total params (B) | Activated params/token (B) |
|---|---|---|---|---|---|
| Qwen3-30B (dense) | 48–64 | 4096–6144 | — | ~30 | ~30 |
| Qwen3-30B-A3B (MoE) | 48 | 4096–5120 | 64–128 | ~30–35 | 2–3 |
| Qwen3-235B-A22B (MoE) | 94 | 8192 | 128 | ~235 | 22 |
| Qwen3-Omni-30B-A3B | 48+16 | 5120/2048 | 64/16 | ~35 | 2.3 |

Plausible values of $d_{\text{model}}$ are inferred where not quoted in the technical reports.

Total parameters are the aggregate over all encoders and the Thinker/Talker division. Only a small fraction are active per token due to MoE sparsity. Qwen3-VL models similarly support large context windows and high activation throughput via expert parallelism (Yang et al., 14 May 2025, Xu et al., 22 Sep 2025, Bai et al., 26 Nov 2025).
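The activated-parameter column follows directly from the routing sparsity. A back-of-the-envelope helper, using the routing figures from the table above (the exact split between attention, embedding, and expert weights is not quoted, so this covers only the expert-MLP fraction):

```python
def active_fraction(n_experts, top_k):
    """Fraction of expert-MLP weights a single token activates
    under top-k routing."""
    return top_k / n_experts

# Qwen3-235B-A22B-style routing: 128 experts, 8 active per token,
# so each token touches 1/16 of the expert-MLP weights.
frac = active_fraction(128, 8)
print(frac)  # 0.0625
```

Shared components (attention, embeddings, norms) are always active, which is why activated parameters per token (~22B) exceed a naive $235 \times 8/128$ estimate.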


Qwen3 thus establishes a unified, expert-sparse transformer architecture supporting efficient scaling, adaptive reasoning, and high-performance multimodal alignment. Architectural and algorithmic contributions substantiate its state-of-the-art standing on open-source language and multimodal benchmarks.
