ZAYA1 Model Architecture

Updated 25 November 2025
  • ZAYA1 model architecture is a mixture-of-experts transformer that integrates MI300X-aware tuning, custom convolutional attention, and expert routing to optimize large-scale training.
  • The design incorporates per-layer residual scaling, rotary embeddings, and specialized AMD-specific kernels to maximize throughput and minimize latency.
  • It achieves a competitive balance between dense and MoE components, yielding strong evaluation results across tasks like reasoning, mathematics, and coding.

The ZAYA1 model architecture is a mixture-of-experts (MoE) transformer designed for large-scale training on AMD MI300X GPUs with Pollara interconnect. ZAYA1-base incorporates a suite of systems and modeling innovations tailored to the AMD hardware stack, including MI300X-aware dimensioning, custom convolutional attention mechanisms, per-layer residual scaling, and expert routing. The architecture achieves a competitive balance of training throughput and inference latency with strong evaluation results across tasks, establishing the maturity of AMD’s distributed compute environment for state-of-the-art pretraining (Anthony et al., 21 Nov 2025).

1. Overall Model Structure

ZAYA1-base is built with $L = 40$ transformer layers and an embedding dimension of $h = 2048$. The vocabulary size is $v = 262{,}272$, chosen to be divisible by 64 for optimized device throughput. Each transformer layer contains an MoE block comprising $E = 16$ experts, with a top-$k = 1$ expert selected per token at each routing step. This yields $8.3$ billion total parameters (counting all experts) but an "active" parameter count of $760$ million (the dense backbone plus the one expert on each token's path).
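For concreteness, these dimensions can be gathered into one configuration object. The sketch below is illustrative only; the class and field names (Zaya1Config and its attributes) are assumptions, not identifiers from the released code.

```python
# Hypothetical config sketch collecting the ZAYA1-base dimensions reported above.
from dataclasses import dataclass

@dataclass
class Zaya1Config:
    n_layers: int = 40          # L
    d_model: int = 2048         # h
    vocab_size: int = 262_272   # v, divisible by 64
    n_experts: int = 16         # E
    top_k: int = 1              # experts activated per token
    n_query_heads: int = 8      # a_q
    n_kv_heads: int = 2         # g
    head_dim: int = 128         # d_h = h / a
    router_dim: int = 256       # D
    expert_ff: int = 4096       # f
    expert_ff_out: int = 2048   # f_o = f / 2

cfg = Zaya1Config()
assert cfg.vocab_size % 64 == 0 and cfg.d_model % 64 == 0
```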

The forward path through each transformer layer $\ell$ follows this sequence:

  1. Residual-scaled RMSNorm $\rightarrow$ Compressed Convolutional Attention (CCA) $\rightarrow$ residual add
  2. Residual-scaled RMSNorm $\rightarrow$ ZAYA1 Router gating $\rightarrow$ expert MLP (MoE) $\rightarrow$ residual add
  3. Final RMSNorm

Residual scaling is implemented on every residual path via per-channel learnable gates.
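The layer wiring above can be sketched in a few lines of PyTorch. This is a minimal, hedged illustration, assuming PyTorch 2.4 or newer for nn.RMSNorm; the module name Zaya1Layer is invented, the attention and MoE blocks are passed in as opaque modules, and the placement of the per-channel gate on the skip path is one plausible reading of the description, not a confirmed detail.

```python
# Minimal sketch of the per-layer forward path described above: pre-norm blocks
# with learnable per-channel residual gates. The bias term of the residual
# scaling (Section 6) is omitted here for brevity.
import torch
import torch.nn as nn

class Zaya1Layer(nn.Module):
    def __init__(self, h: int, attn: nn.Module, moe: nn.Module):
        super().__init__()
        self.norm_attn = nn.RMSNorm(h, eps=1e-5)
        self.norm_moe = nn.RMSNorm(h, eps=1e-5)
        self.attn, self.moe = attn, moe
        # Per-channel residual gates (see "Residual Scaling" in Section 6).
        self.gate_attn = nn.Parameter(torch.ones(h))
        self.gate_moe = nn.Parameter(torch.ones(h))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # One plausible gate placement: scale the skip connection, then add
        # the block output. The paper's exact placement may differ.
        x = self.gate_attn * x + self.attn(self.norm_attn(x))   # step 1: CCA
        x = self.gate_moe * x + self.moe(self.norm_moe(x))      # step 2: MoE
        return x
```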

2. Transformer Layer Components

Attention and Token Path

CCA attention receives input $x_\ell \in \mathbb{R}^{B \times S \times h}$ and projects it to queries, keys, and values with the following details:

  • $a = 16$ total attention heads, each with head dimension $d_h = h / a = 128$
  • Query heads: $a_q = 8$ ($c_q = 1/2$)
  • Key/value heads: $g = 2$ ($c_{kv} = 1/8$)

Projections:

  • $W_Q \in \mathbb{R}^{h \times (a_q \cdot d_h)}$
  • $W_K, W_{V1}, W_{V2}$ with analogous dimensions (with $V_1$ and $V_2$ each handling half of the key/value width)

CCA then applies a convolutional stage:

  • Depthwise conv1d ($k_0 = 2$) plus grouped conv1d (groups $= a_q + g$, $k_1 = 2$) along the sequence dimension
  • FlashAttention operates in a compressed latent space of size $a_q \cdot d_h$
  • Rotary position embeddings (RoPE) are applied to half of each head's channels, supporting 4k–1M context extension

Outputs are projected back via $W_O \in \mathbb{R}^{(a_q \cdot d_h) \times h}$, followed by RMSNorm (with $\epsilon = 10^{-5}$) and a per-head key temperature.
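A shape-level sketch of this token path is given below, using standard PyTorch building blocks (nn.Conv1d, F.scaled_dot_product_attention) in place of the fused MI300X CCA kernel. The grouped conv1d, the V1/V2 split, RoPE, and the per-head key temperature are omitted, and the depthwise convolution is applied to the queries only; treat this as an illustration of the dimensions, not the actual implementation.

```python
# Shape-level sketch of the CCA projections (queries compressed 2x, keys/values
# 8x) followed by grouped-query attention in the compressed latent space.
import torch
import torch.nn as nn
import torch.nn.functional as F

B, S, h = 2, 1024, 2048
a_q, g, d_h = 8, 2, 128                        # query heads, kv heads, head dim

w_q = nn.Linear(h, a_q * d_h, bias=False)      # 2048 -> 1024
w_k = nn.Linear(h, g * d_h, bias=False)        # 2048 -> 256
w_v = nn.Linear(h, g * d_h, bias=False)        # 2048 -> 256
w_o = nn.Linear(a_q * d_h, h, bias=False)      # 1024 -> 2048

# Depthwise conv1d with kernel size 2 along the sequence (causal left padding).
conv_q = nn.Conv1d(a_q * d_h, a_q * d_h, kernel_size=2, groups=a_q * d_h)

x = torch.randn(B, S, h)
q = conv_q(F.pad(w_q(x).transpose(1, 2), (1, 0))).transpose(1, 2)
k, v = w_k(x), w_v(x)

q = q.view(B, S, a_q, d_h).transpose(1, 2)     # (B, a_q, S, d_h)
k = k.view(B, S, g, d_h).transpose(1, 2)       # (B, g, S, d_h)
v = v.view(B, S, g, d_h).transpose(1, 2)
k = k.repeat_interleave(a_q // g, dim=1)       # grouped-query sharing of K/V
v = v.repeat_interleave(a_q // g, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
y = w_o(out.transpose(1, 2).reshape(B, S, a_q * d_h))   # back to (B, S, h)
```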

MoE and Routing

MoE routing in ZAYA1 involves the following operations for each token:

  • Down-projection: $W_{\text{down}} \in \mathbb{R}^{h \times D}$, where $D = 256$
  • Exponential Depth Averaging (EDA): $r_\ell = W_{\text{down}} x_\ell + \gamma \cdot r_{\ell-1}$ (with a learned scalar $\gamma$)
  • The result feeds a 3-layer MLP (GeLU activations) whose final projection $W_{\text{gate},\ell} \in \mathbb{R}^{D \times E}$ yields the expert logits
  • Post-softmax, each token's expert is selected as $e_{\text{idx}} = \arg\max_j (s_\ell + b_\ell)_j$, with a per-expert bias vector $b_\ell \in \mathbb{R}^E$ (see the routing sketch after this list)
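The routing path can be sketched as follows. The LayerNorm-before-GeLU placement follows the normalization note in Section 6; the module names, the initialization of $\gamma$, and how the bias $b_\ell$ is updated are illustrative assumptions rather than details from the paper.

```python
# Hedged sketch of the ZAYA1 routing path: down-projection, exponential depth
# averaging (EDA) across layers, a small 3-layer MLP producing expert logits,
# and biased top-1 selection.
import torch
import torch.nn as nn

h, D, E = 2048, 256, 16

class Zaya1Router(nn.Module):
    def __init__(self):
        super().__init__()
        self.down = nn.Linear(h, D, bias=False)
        self.gamma = nn.Parameter(torch.tensor(0.5))   # learned EDA scalar (init is a placeholder)
        self.mlp = nn.Sequential(                      # 3-layer router MLP, LayerNorm before GeLU
            nn.Linear(D, D), nn.LayerNorm(D), nn.GELU(),
            nn.Linear(D, D), nn.LayerNorm(D), nn.GELU(),
            nn.Linear(D, E),                            # expert logits (W_gate)
        )
        self.bias = nn.Parameter(torch.zeros(E))        # per-expert bias b_l (update rule not specified here)

    def forward(self, x, r_prev=None):
        r = self.down(x)
        if r_prev is not None:
            r = r + self.gamma * r_prev                 # EDA recurrence over layers
        scores = torch.softmax(self.mlp(r), dim=-1)
        expert_idx = torch.argmax(scores + self.bias, dim=-1)   # top-1 selection
        return expert_idx, scores, r                    # r is passed to the next layer's router
```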

The chosen expert’s MLP has the following weights (a sketch of the expert block follows this list):

  • First FC: $W_{e1} \in \mathbb{R}^{h \times f}$, with $f = 4096$ (hidden expansion factor $\alpha = 2$)
  • Activation: SwiGLU across the pre-activation width $f$
  • Second FC: $W_{e2} \in \mathbb{R}^{f_o \times h}$, where $f_o = f / 2 = 2048$
  • Followed by residual addition and RMSNorm
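A minimal PyTorch rendering of one expert, following the dimensions above ($h = 2048$, $f = 4096$, $f_o = 2048$); bias-free linear layers are an assumption, and the residual add and RMSNorm live outside this module.

```python
# Sketch of a single expert MLP: one fused up-projection of width f = 4096,
# a SwiGLU gate that halves the width to f_o = 2048, and a down-projection
# back to h. Illustrative PyTorch, not the MI300X kernel.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zaya1Expert(nn.Module):
    def __init__(self, h: int = 2048, f: int = 4096):
        super().__init__()
        self.fc1 = nn.Linear(h, f, bias=False)        # W_e1: h x f
        self.fc2 = nn.Linear(f // 2, h, bias=False)   # W_e2: f_o x h, f_o = f / 2

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.fc1(x).chunk(2, dim=-1)    # SwiGLU over the width-f pre-activation
        return self.fc2(F.silu(gate) * value)
```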

3. MI300X-Aware Sizing Principles

The architecture’s sizing rules and GEMM shapes are directly informed by MI300X hardware characteristics:

  • All core dimensions ($h, d_h, D, f, f_o$) are set as multiples of 64, maximizing rocBLAS/hipBLASLt performance
  • The microbatch product $b \cdot s \cdot h$ is divisible by 64, and $(b \cdot a) / t$ is an integer, avoiding padding overhead
  • The MLP expansion factor is fixed ($f = 2h$, $f_o = h$)
  • MoE per-layer parameter count: $h \cdot f + f_o \cdot h$
  • Convolutional and attention kernel/GEMM shapes, e.g., $2048 \times 1024$, are chosen from MI300X TFLOPs heatmaps to maximize utilization

These practices are derived from explicit MI300X benchmarking, targeting “hot” performance regions for compute and memory transfers.
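These sizing rules reduce to simple divisibility checks, as in the plain-Python sketch below; the microbatch values ($b$, $s$, $t$) are made-up examples, not configurations reported in the paper.

```python
# Quick check that the core ZAYA1 dimensions satisfy the multiple-of-64 rule
# and that example microbatch settings avoid padding overhead.
dims = {"h": 2048, "d_h": 128, "D": 256, "f": 4096, "f_o": 2048, "v": 262_272}
assert all(d % 64 == 0 for d in dims.values()), "all core dims should be multiples of 64"

b, s, a, t = 4, 4096, 16, 8        # example microbatch, sequence, heads, tensor-parallel degree
assert (b * s * dims["h"]) % 64 == 0   # microbatch product divisible by 64
assert (b * a) % t == 0                # (b * a) / t must be an integer
```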

4. AMD-Specific Kernels and Communication

The model stack incorporates several AMD-specific optimizations:

| Component | Optimization / Detail |
|---|---|
| CCA conv kernels | Tuned for MI300X HBM3 bandwidth and warp size |
| Custom HIP kernels | Multi-tensor Muon optimizer kernels; fused residual-add + RMSNorm kernels (two-stage) |
| Communication | Gradient-fusion buffer sizes chosen to saturate Pollara 400 Gbps at break-even; ZeRO-1/context-parallel worlds aligned to xGMI hardware node boundaries |

The optimization of collective communication primitives (all-reduce, reduce-scatter, all-gather, broadcast) as well as kernel fusion is critical for training throughput on MI300X + Pollara platforms.
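To illustrate why fusion-buffer sizing matters, the sketch below applies a standard alpha-beta communication model to a 400 Gbps link. The fixed per-collective overhead (ALPHA_S) is an assumed placeholder rather than a measured Pollara value, and the paper does not prescribe this model; the point is only that buckets must be large enough for bandwidth, not launch overhead, to dominate.

```python
# Back-of-the-envelope alpha-beta estimate of gradient-fusion bucket efficiency
# on a 400 Gbps link. ALPHA_S is an assumption, not a value from the paper.
LINK_GBPS = 400
BYTES_PER_SEC = LINK_GBPS * 1e9 / 8
ALPHA_S = 20e-6                      # assumed fixed per-collective overhead (20 us)

def efficiency(bucket_bytes: float) -> float:
    """Fraction of peak link bandwidth achieved for one fused bucket."""
    transfer = bucket_bytes / BYTES_PER_SEC
    return transfer / (ALPHA_S + transfer)

for mb in (1, 4, 16, 64, 256):
    print(f"{mb:4d} MiB bucket -> {efficiency(mb * 2**20):.1%} of peak")
```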

5. Parameter and Compute Profile

Per-layer parameter and FLOPs breakdown, with $t = 1$ (no tensor/data parallelism) (Anthony et al., 21 Nov 2025):

| Component | Parameter Count (per layer) | FLOPs per token (approx.) |
|---|---|---|
| Attention Q, K, V, O | ≈ 5.2 M | ≈ 9.5k |
| CCA convs + RoPE | | ≈ 0.02 GFLOPs |
| Router down-proj | ≈ 0.52 M | ≈ 1.05k |
| Router MLP (2 layers) | ≈ 0.13 M | ≈ 0.13k |
| Router logits | ≈ 0.004 M | ≈ 4.1k |
| Expert FC1 | ≈ 8.39 M | ≈ 16.8k |
| Expert FC2 | ≈ 4.19 M | ≈ 8.4k |
| Residual scaling | ~0.004 M (negligible) | ~0.1k |

Total per-layer parameters: ≈ 18.4 M. Total per-layer FLOPs per token: ≈ 36k. A forward pass over $S = 1024$ tokens with $b = 1$ totals ≈ 37 M FLOPs per layer; across all 40 layers this gives ≈ 1.5 G FLOPs per sample. Inference latency is dominated by the expert MLPs (60%), attention kernels (30%), and routing/norms (10%).
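The parameter figures in the table can be reproduced directly from the dimensions in Sections 1 and 2, as in the plain-Python sketch below; the small CCA convolution and residual-scaling terms are omitted, so the computed total lands at roughly 18.5 M rather than exactly the reported ≈ 18.4 M.

```python
# Reproduce the per-layer parameter counts from h=2048, a_q=8, g=2, d_h=128,
# D=256, E=16, f=4096, f_o=2048.
h, a_q, g, d_h = 2048, 8, 2, 128
D, E, f, f_o = 256, 16, 4096, 2048

params = {
    "attention Q,K,V,O": h * (a_q * d_h) + h * (g * d_h) + h * (g * d_h) + (a_q * d_h) * h,
    "router down-proj":  h * D,
    "router MLP (2)":    2 * D * D,
    "router logits":     D * E,
    "expert FC1":        h * f,
    "expert FC2":        f_o * h,
}
for name, p in params.items():
    print(f"{name:18s} {p / 1e6:6.2f} M")
print(f"{'total (approx.)':18s} {sum(params.values()) / 1e6:6.2f} M")   # ~18.5 M
```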

6. Special Architectural Components

  • Embeddings: Token embeddings $E_{\text{tok}} \in \mathbb{R}^{v \times h}$, tied with the LM head.
  • Normalization: All RMSNorm (no learnable bias); router MLP uses standard LayerNorm before GeLU.
  • Activation Functions: GeLU in router blocks; SwiGLU within expert MLPs.
  • Rotary Embeddings: RoPE is applied to half of each head’s channels only, supporting long-context extrapolation.
  • Residual Scaling: Per-layer, parameterized by $\alpha, \beta \in \mathbb{R}^h$ and a bias $b_\ell$ (a minimal sketch follows this list):

$\text{ResScale}_\alpha(x) = \alpha \odot x + b_\ell$

  • CCA Compression: Query compression $2\times$, key/value compression $8\times$, denoted "CCGQA" in model documentation.
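The sketch below gives minimal reference implementations of the residual-scaling gate and of RoPE restricted to half of each head's channels, as listed above. The rotary helper is a generic rotate-half implementation, not the ZAYA1 kernel, and the $\beta$ parameter mentioned for residual scaling is omitted since the formula above uses only $\alpha$ and $b_\ell$.

```python
# Minimal sketches of per-channel residual scaling and half-channel RoPE.
import torch
import torch.nn as nn

class ResScale(nn.Module):
    """ResScale(x) = alpha * x + b, with per-channel alpha and bias."""
    def __init__(self, h: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(h))
        self.bias = nn.Parameter(torch.zeros(h))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.alpha * x + self.bias

def rope_half(x: torch.Tensor, base: float = 10_000.0) -> torch.Tensor:
    """Apply rotary embeddings to the first half of each head's channels only.

    Expects x of shape (batch, heads, seq, head_dim); a generic sketch, not the
    ZAYA1 long-context scheme.
    """
    _, _, S, d_h = x.shape
    rot, keep = x[..., : d_h // 2], x[..., d_h // 2 :]
    half = rot.shape[-1] // 2
    freqs = base ** (-torch.arange(0, half, dtype=x.dtype) / half)
    angles = torch.arange(S, dtype=x.dtype)[:, None] * freqs[None, :]   # (S, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
    return torch.cat([rotated, keep], dim=-1)   # untouched half passes through
```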

7. Comparative Performance and Context

ZAYA1-base achieves performance at or above leading models of similar and larger active scale (Qwen3-4B, Gemma3-12B) and outperforms Llama-3-8B and OLMoE on benchmarks targeting reasoning, mathematics, and coding. The empirical findings suggest that the combination of tailored architecture and hardware-aware engineering enables the AMD stack to match or exceed the competitiveness of established foundation model pretraining environments (Anthony et al., 21 Nov 2025).
