ZAYA1 Model Architecture
- ZAYA1 model architecture is a mixture-of-experts transformer that integrates MI300X-aware tuning, custom convolutional attention, and expert routing to optimize large-scale training.
- The design incorporates per-layer residual scaling, rotary embeddings, and specialized AMD-specific kernels to maximize throughput and minimize latency.
- It achieves a competitive balance between dense and MoE components, yielding strong evaluation results across tasks like reasoning, mathematics, and coding.
The ZAYA1 model architecture is a mixture-of-experts (MoE) transformer designed for large-scale training on AMD MI300X GPUs with Pollara interconnect. ZAYA1-base incorporates a suite of systems and modeling innovations tailored to the AMD hardware stack, including MI300X-aware dimensioning, custom convolutional attention mechanisms, per-layer residual scaling, and expert routing. The architecture achieves a competitive balance of training throughput and inference latency with strong evaluation results across tasks, establishing the maturity of AMD’s distributed compute environment for state-of-the-art pretraining (Anthony et al., 21 Nov 2025).
1. Overall Model Structure
ZAYA1-base is built from 40 transformer layers, with the embedding dimension and vocabulary size chosen to be divisible by 64 for optimized device throughput. Each transformer layer contains an MoE block comprising a set of routed experts, with a single top-1 expert selected per token at each routing step. This yields $8.3$ billion total parameters (counting all experts) but an “active” parameter count of $760$ million (the dense backbone plus one expert per token path).
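The total-versus-active split follows from counting every expert toward the total but only the selected expert toward the per-token path. The snippet below illustrates that accounting with toy numbers; the dense size, expert size, and expert count are placeholders, not ZAYA1's actual configuration.

```python
def moe_param_counts(dense_params: int, expert_params: int,
                     n_experts: int, experts_per_token: int = 1):
    """Total vs. active parameter accounting for an MoE model (illustrative)."""
    total = dense_params + n_experts * expert_params
    active = dense_params + experts_per_token * expert_params
    return total, active

# Toy numbers only; the same accounting explains ZAYA1's 8.3B-total / 760M-active split.
total, active = moe_param_counts(dense_params=400_000_000,
                                 expert_params=50_000_000,
                                 n_experts=16, experts_per_token=1)
print(f"total = {total / 1e9:.2f}B, active = {active / 1e9:.2f}B")
```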
The forward path through each transformer layer follows this sequence:
- Residual-scaled RMSNorm → Compressed Convolutional Attention (CCA) → residual add
- Residual-scaled RMSNorm → ZAYA1 router gating → expert MLP (MoE) → residual add
- Final RMSNorm
Residual scaling is implemented on every residual path via per-channel learnable gates.
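For illustration, the per-layer flow can be written as a minimal sketch, assuming a standard pre-norm skeleton in which the gates act as a per-channel gain and bias on each branch output; the class names (GatedResidual, Zaya1Layer) and this exact gating form are illustrative assumptions, not the reference implementation.

```python
import torch
import torch.nn as nn

class GatedResidual(nn.Module):
    """Per-channel learnable gate on a residual branch (illustrative form)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(d_model))   # per-channel scale
        self.bias = nn.Parameter(torch.zeros(d_model))  # per-channel bias

    def forward(self, residual: torch.Tensor, branch: torch.Tensor) -> torch.Tensor:
        # residual, branch: [batch, seq, d_model]
        return residual + self.gain * branch + self.bias

class Zaya1Layer(nn.Module):
    """Sketch of one ZAYA1-style layer: norm -> CCA -> gated add,
    then norm -> router/MoE -> gated add. The attention and MoE blocks
    are passed in as modules; here they are stand-ins, not the real blocks."""
    def __init__(self, d_model: int, cca: nn.Module, moe: nn.Module):
        super().__init__()
        self.norm_attn = nn.RMSNorm(d_model)
        self.norm_moe = nn.RMSNorm(d_model)
        self.cca = cca
        self.moe = moe
        self.res_attn = GatedResidual(d_model)
        self.res_moe = GatedResidual(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.res_attn(x, self.cca(self.norm_attn(x)))
        x = self.res_moe(x, self.moe(self.norm_moe(x)))
        return x

# Example wiring with nn.Identity stand-ins, just to exercise the data flow.
layer = Zaya1Layer(d_model=512, cca=nn.Identity(), moe=nn.Identity())
y = layer(torch.randn(2, 16, 512))
```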
2. Transformer Layer Components
Attention and Token Path
CCA attention receives the layer input and projects it to queries, keys, and values with the following structure:
- A fixed number of attention heads, each with the same head dimension
- Query heads outnumber the shared key/value heads in a grouped-query arrangement (the "CCGQA" scheme noted in Section 6)
- Projection matrices map the embedding dimension into a compressed latent space, with the key and value projections sized analogously
CCA then applies a convolutional stage:
- A depthwise conv1d combined with a grouped conv1d applied along the sequence dimension
- FlashAttention operating in the compressed latent space rather than at full embedding width
- Rotary position embeddings (RoPE) applied to half the channels of each head, supporting 4k–1M context extension
Outputs are projected back to the embedding dimension, followed by RMSNorm and a learned per-head key temperature.
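As a concrete illustration of applying RoPE to only half of each head's channels, the following sketch rotates the first half and passes the second half through untouched; it is a minimal reference formulation with assumed head sizes and base frequency, not the tuned MI300X kernel.

```python
import torch

def partial_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply RoPE to the first half of each head's channels; leave the rest as-is.
    x: [batch, seq, heads, head_dim] with head_dim divisible by 4."""
    b, s, h, d = x.shape
    rot, keep = x[..., : d // 2], x[..., d // 2 :]     # rotate only half the channels

    half = rot.shape[-1] // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype) / half)      # [half]
    angles = torch.arange(s, dtype=x.dtype)[:, None] * freqs[None]   # [seq, half]
    cos = angles.cos()[None, :, None, :]                             # [1, seq, 1, half]
    sin = angles.sin()[None, :, None, :]

    r1, r2 = rot[..., :half], rot[..., half:]
    rotated = torch.cat([r1 * cos - r2 * sin, r1 * sin + r2 * cos], dim=-1)
    return torch.cat([rotated, keep], dim=-1)

# Example with assumed sizes: 2 sequences of 16 tokens, 8 heads, head_dim 64.
q = torch.randn(2, 16, 8, 64)
assert partial_rope(q).shape == q.shape
```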
MoE and Routing
MoE routing in ZAYA1 involves the following operations for each token (a sketch follows this list):
- Down-projection: the token representation is projected to a much smaller router dimension
- Exponential Depth Averaging (EDA): the down-projected representation is blended with the router state carried over from the previous layer via a learned scalar, forming an exponential moving average over depth
- The averaged representation is passed through a 3-layer MLP with GeLU activations, yielding one logit per expert
- Post-softmax, each token's expert is selected as the argmax of the routing scores offset by a learned per-expert bias vector
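The sketch below assembles the routing steps above under stated assumptions: the router width, expert count, EDA placement, and the exact use of the bias at selection time are illustrative, and the EDA update is written as a learned-scalar exponential moving average consistent with its name.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Zaya1Router(nn.Module):
    """Illustrative ZAYA1-style router: down-projection, exponential depth
    averaging (EDA), a 3-layer GeLU MLP producing per-expert logits, and
    top-1 selection with a per-expert bias. All sizes are assumptions."""
    def __init__(self, d_model: int, d_router: int, n_experts: int):
        super().__init__()
        self.down = nn.Linear(d_model, d_router, bias=False)
        self.alpha = nn.Parameter(torch.tensor(0.5))           # learned EDA scalar
        self.mlp = nn.Sequential(
            nn.Linear(d_router, d_router), nn.GELU(),
            nn.Linear(d_router, d_router), nn.GELU(),
            nn.Linear(d_router, n_experts),                     # one logit per expert
        )
        self.expert_bias = nn.Parameter(torch.zeros(n_experts))

    def forward(self, x, prev_state=None):
        # x: [tokens, d_model]; prev_state: EDA state from the previous layer.
        u = self.down(x)
        if prev_state is not None:
            u = self.alpha * prev_state + (1.0 - self.alpha) * u    # EDA blend
        probs = F.softmax(self.mlp(u), dim=-1)
        expert_idx = torch.argmax(probs + self.expert_bias, dim=-1)  # top-1 per token
        return expert_idx, probs, u   # u is carried to the next layer's router

# Example with assumed sizes: 4 tokens, d_model=512, router width 128, 8 experts.
router = Zaya1Router(512, 128, 8)
idx, probs, state = router(torch.randn(4, 512))
```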
The chosen expert’s MLP has the following structure (sketched after this list):
- First FC: an up-projection from the embedding dimension to the expert hidden width, using the fixed hidden expansion factor of Section 3
- Activation: SwiGLU applied across the pre-activation width
- Second FC: a down-projection from the expert hidden width back to the embedding dimension
- Followed by residual addition and RMSNorm
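A minimal sketch of one expert, assuming the common SwiGLU convention in which the first projection produces a gate and a value branch across the pre-activation width; the expansion factor shown is a placeholder, not ZAYA1's value.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUExpert(nn.Module):
    """Sketch of a single expert MLP: up-projection, SwiGLU gating, down-projection."""
    def __init__(self, d_model: int, expansion: int = 4):
        super().__init__()
        d_hidden = expansion * d_model
        # One fused FC produces the gate and value halves of the pre-activation width.
        self.fc1 = nn.Linear(d_model, 2 * d_hidden, bias=False)
        self.fc2 = nn.Linear(d_hidden, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.fc1(x).chunk(2, dim=-1)
        return self.fc2(F.silu(gate) * value)   # SwiGLU: silu(gate) * value

# Example: route 4 tokens of width 512 through one expert (assumed sizes).
expert = SwiGLUExpert(d_model=512)
assert expert(torch.randn(4, 512)).shape == (4, 512)
```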
3. MI300X-Aware Sizing Principles
The architecture’s sizing rules and GEMM shapes are directly informed by MI300X hardware characteristics:
- All core model dimensions are set as multiples of 64, maximizing rocBLAS/hipBLASLt performance
- The microbatch token product (microbatch size × sequence length) is divisible by 64, and related per-device ratios are kept integer, to avoid padding overhead
- The MLP hidden expansion factor is fixed across all experts
- MoE per-layer parameter count follows directly from the expert count, the expansion factor, and the embedding dimension
- Convolutional and attention kernel sizes are chosen from MI300X TFLOPs heatmaps to maximize utilization
These practices are derived from explicit MI300X benchmarking, targeting “hot” performance regions for compute and memory transfers.
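These divisibility rules are straightforward to encode; the helper below pads a proposed dimension up to the next multiple of 64, mirroring the stated sizing rule (the example values are illustrative, not ZAYA1's).

```python
def pad_to_multiple(dim: int, multiple: int = 64) -> int:
    """Round a model dimension up to the next multiple (64 for MI300X-friendly GEMMs)."""
    return ((dim + multiple - 1) // multiple) * multiple

# Illustrative values only: pad a raw tokenizer vocab and a hidden width.
print(pad_to_multiple(50257))   # -> 50304, divisible by 64
print(pad_to_multiple(760))     # -> 768
```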
4. AMD-Specific Kernels and Communication
The model stack incorporates several AMD-specific optimizations:
| Component | Optimization/Detail |
|---|---|
| CCA conv kernels | Tuned for MI300X HBM3 bandwidth and wavefront size |
| Custom HIP kernels | Multi-tensor Muon optimizer kernels; fused residual-add + RMSNorm kernels (two-stage) |
| Communication | Gradient-fusion buffer sizes chosen to saturate Pollara 400 Gbps links at the break-even message size; ZeRO-1 and context-parallel process groups aligned to xGMI hardware node boundaries |
The optimization of collective communication primitives (all-reduce, reduce-scatter, all-gather, broadcast) as well as kernel fusion is critical for training throughput on MI300X + Pollara platforms.
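For reference, the computation performed by the fused residual-add + RMSNorm kernels can be written unfused in a few lines. The sketch below shows only the math of the two stages, not the HIP kernel itself; the epsilon is an assumed value.

```python
import torch

def residual_add_rmsnorm(residual: torch.Tensor,
                         branch: torch.Tensor,
                         weight: torch.Tensor,
                         eps: float = 1e-6):
    """Unfused reference for a fused residual-add + RMSNorm step.
    Returns the updated residual stream and its normalized view."""
    x = residual + branch                                   # stage 1: residual add
    rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
    return x, x * rms * weight                              # stage 2: RMSNorm

# Example with assumed shapes: [batch=2, seq=4, d_model=512].
d_model = 512
res, normed = residual_add_rmsnorm(torch.randn(2, 4, d_model),
                                   torch.randn(2, 4, d_model),
                                   torch.ones(d_model))
```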
5. Parameter and Compute Profile
Per-layer parameter and FLOPs breakdown for a single device, with no tensor/data parallelism (Anthony et al., 21 Nov 2025):
| Component | Parameter Count (per layer) | FLOPs per token (approx.) |
|---|---|---|
| Attention Q, K, V, O | — | — |
| CCA convs + RoPE | — | — |
| Router down-proj | — | — |
| Router MLP (2) | — | — |
| Router logits | $0.004$ M | $4.1$k |
| Expert FC1 | — | — |
| Expert FC2 | — | — |
| Residual scaling | negligible (0.004 M) | 0.1k |
Summing the rows gives the per-layer parameter count and per-token FLOPs; multiplying by the number of tokens in a sample and by all 40 layers gives the total forward FLOPs per sample. Inference latency is dominated by expert MLPs (60%), attention kernels (30%), and routing/norms (10%).
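This kind of breakdown can be reproduced with the standard GEMM accounting of roughly 2 × (input width) × (output width) forward FLOPs per token for each dense projection; the sketch below applies that rule to illustrative (non-ZAYA1) sizes.

```python
def linear_flops_per_token(d_in: int, d_out: int) -> int:
    """Approximate forward FLOPs per token for a dense projection:
    one multiply and one add per weight, i.e. 2 * d_in * d_out."""
    return 2 * d_in * d_out

# Illustrative sizes only: d_model=512, router width 128, 8 experts, expansion 4.
d_model, d_router, n_experts, expansion = 512, 128, 8, 4
breakdown = {
    "router down-proj": linear_flops_per_token(d_model, d_router),
    "router logits": linear_flops_per_token(d_router, n_experts),
    "expert FC1 (SwiGLU gate+value)": linear_flops_per_token(d_model, 2 * expansion * d_model),
    "expert FC2": linear_flops_per_token(expansion * d_model, d_model),
}
for name, flops in breakdown.items():
    print(f"{name}: {flops / 1e3:.1f} kFLOPs/token")
```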
6. Special Architectural Components
- Embeddings: Token embedding matrix of size vocabulary × embedding dimension, tied with the LM head.
- Normalization: All RMSNorm (no learnable bias); router MLP uses standard LayerNorm before GeLU.
- Activation Functions: GeLU in router blocks; SwiGLU within expert MLPs.
- Rotary Embeddings: RoPE is applied to half of each head’s channels only, supporting long-context extrapolation.
- Residual Scaling: Each residual path is parameterized per layer by a per-channel gain vector and a bias vector (the learnable gates described in Section 1).
- CCA Compression: Separate compression factors are applied to queries and to keys/values, a scheme denoted "CCGQA" in the model documentation.
7. Comparative Performance and Context
ZAYA1-base achieves performance at or above that of leading models of comparable and larger active parameter scale (Qwen3-4B, Gemma3-12B) and outperforms Llama-3-8B and OLMoE on benchmarks targeting reasoning, mathematics, and coding. These empirical findings suggest that the combination of tailored architecture and hardware-aware engineering enables the AMD stack to match or exceed the competitiveness of established foundation model pretraining environments (Anthony et al., 21 Nov 2025).