MI300X Transformer Sizing Rules
- The paper presents a systematic methodology that derives transformer and MoE sizing rules from microbenchmark results, kernel roofline analyses, and hardware constraints.
- Using concrete metrics like head-dimension multiples of 64 and FLOP-size thresholds, the guidelines ensure that each GEMM and attention operation runs in its high-efficiency regime.
- These rules integrate strategies for batch–sequence tiling, MLP/MoE expansion, and interconnect configuration to avoid underutilization bottlenecks and maximize throughput.
MI300X-aware transformer sizing rules are systematic guidelines for parameterizing transformer and mixture-of-experts (MoE) models to maximize throughput and efficiency on AMD Instinct MI300X GPUs. These rules are directly motivated by microbenchmark results, kernel roofline analyses, and hardware characteristics (GEMM throughput, HBM bandwidth, interconnect topology) observed in large-scale pretraining deployments on pure AMD clusters. MI300X-aware sizing covers every dimension of transformer and MoE layer parameterization—including head count, hidden width, MLP expansion, expert counts, and batch/sequence tile sizes—to exploit MI300X’s full-stack hardware and networking features and systematically avoid underutilization bottlenecks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).
1. Attention Block Sizing Principles
Transformer attention on MI300X is governed by three primary constraints: head dimension alignment, GEMM size, and batch–sequence tiling. These constraints ensure high kernel utilization for both GEMM-based attention and projection operations.
- Head dimension divisibility: For transformer hidden size $d_{\text{model}}$, number of attention heads $n_{\text{heads}}$, and per-head dimension $d_{\text{head}} = d_{\text{model}} / n_{\text{heads}}$, the constraint $64 \mid d_{\text{head}}$ ensures the QKV and output-projection GEMMs hit MI300X's high-performance regime. A $d_{\text{head}}$ that is too small or not 64-aligned results in severely reduced matrix-multiply utilization (Ambati et al., 31 Oct 2025).
- GEMM FLOP-size constraint: For a single GPU, maximum throughput is achieved when each projection GEMM performs at least the threshold FLOP count, i.e. $2\,b\,s\,d_{\text{model}}^2 \gtrsim 200$ GFLOP (with $b$ the micro-batch size, $s$ the sequence length per GPU, and $d_{\text{model}} = n_{\text{heads}} \cdot d_{\text{head}}$), ensuring that the QKV and output projections reach MI300X's measured $200$ TFLOP/s peak. For example, $b=2$, $s=4096$, $d_{\text{model}}=4096$ gives $2 \cdot 8192 \cdot 4096^2 \approx 2.75 \times 10^{11}$ FLOPs per projection, comfortably above the threshold.
- Batch–sequence tiling: The grouped token dimension $b \cdot s$ must be divisible by $64$ (ideally $128$), aligning matrix shapes to rocBLAS/hipBLASLt's preferred tiles and preventing inefficient memory access patterns (Anthony et al., 21 Nov 2025).
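The three attention constraints above can be collected into a small validation helper. This is a sketch: the $64$-alignment rules and the $200$ GFLOP threshold come from the text, while the function name and the exact projection-GEMM FLOP formula are illustrative assumptions.

```python
def check_attention_sizing(d_model, n_heads, micro_batch, seq_len,
                           flop_threshold=200e9):
    """Return a list of violated MI300X attention sizing rules.

    flop_threshold defaults to the document's 200 GFLOP GEMM threshold.
    """
    issues = []
    if d_model % n_heads != 0:
        issues.append("d_model not divisible by n_heads")
    d_head = d_model // n_heads
    if d_head % 64 != 0:
        issues.append(f"d_head={d_head} not a multiple of 64")
    bs = micro_batch * seq_len          # grouped batch-sequence dimension
    if bs % 64 != 0:
        issues.append(f"grouped dim b*s={bs} not divisible by 64")
    # FLOPs of one (b*s, d_model) x (d_model, d_model) projection GEMM
    proj_flops = 2 * bs * d_model * d_model
    if proj_flops < flop_threshold:
        issues.append(f"projection GEMM only {proj_flops / 1e9:.1f} GFLOP")
    return issues

# A shape satisfying every rule returns an empty list.
print(check_attention_sizing(d_model=4096, n_heads=32, micro_batch=2, seq_len=4096))
```

With $d_{\text{model}}=4096$ and $32$ heads, $d_{\text{head}}=128$ and the projection GEMM is about $275$ GFLOP, so no issues are reported.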
2. MLP and MoE Block Sizing
Fully Connected (MLP) and MoE layers follow a set of expansion, tiling, and FLOP-scale rules directly informed by MI300X microbenchmarks.
- Feedforward expansion: Use a lower expansion ratio for the pre-activation width than the traditional $4\times$ (e.g., the standard SwiGLU choice $d_{ff} = \tfrac{8}{3}\, d_{\text{model}}$, rounded to a $64$-aligned value). This maintains model quality while keeping GEMM sizes above the $200$ GFLOP threshold on MI300X, optimizing throughput (Anthony et al., 21 Nov 2025).
- Expert MLP tiling: For $E$ experts per layer and $T$ local tokens per expert, the per-expert up- and down-projection GEMM shapes ($T \times d_{\text{model}}$ against $d_{\text{model}} \times d_{ff}$, and its counterpart) must meet or exceed the FLOP threshold to maintain kernel efficiency.
- MoE expert count and widths: Empirically, a moderate expert count paired with top-$k$ routing (top-$1$ in the reference configuration) provides maximal kernel occupancy and minimizes all-to-all shuffle volume. Each expert's pre-activation and post-activation widths mirror the dense MLP rule to retain architectural and FLOP parity with dense blocks.
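The expansion and per-expert tiling rules can be sketched in code. The $\tfrac{8}{3}$ ratio is the standard SwiGLU convention assumed here (not a value confirmed by the cited papers), and the helper names are hypothetical.

```python
def swiglu_ff_width(d_model, ratio=8 / 3, align=64):
    """Pre-activation width for a SwiGLU MLP, rounded up to a
    tile-aligned multiple per the document's 64-alignment rule."""
    raw = int(d_model * ratio)
    return ((raw + align - 1) // align) * align

def expert_gemm_gflops(tokens_per_expert, d_model, d_ff):
    """GFLOPs of one expert's up-projection GEMM:
    (T x d_model) @ (d_model x d_ff)."""
    return 2 * tokens_per_expert * d_model * d_ff / 1e9

d_ff = swiglu_ff_width(4096)   # 8/3 * 4096 = 10922.67, rounded to 10944
print(d_ff, expert_gemm_gflops(2048, 4096, d_ff))
```

For $d_{\text{model}}=4096$ this yields a $64$-aligned width of $10944$; at $2048$ tokens per expert the up-projection is roughly $184$ GFLOP, so the token count per expert would need to grow to clear the $200$ GFLOP threshold.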
3. Hardware-Driven Constraints and Interconnect Sizing
Sizing rules are deeply influenced by MI300X’s hardware features—HBM bandwidth, capacity, and interconnects.
- HBM bandwidth: Sustained on-device memory bandwidth is roughly $1$–$1.1$ TB/s (measured via PyTorch memcpy). Fused kernels (Conv1d, RMSNorm, Muon, FlashAttention) must be sized so their per-kernel memory traffic stays within this sustained-bandwidth budget; otherwise HBM-bound slowdowns occur (Anthony et al., 21 Nov 2025).
- InfinityFabric / intra-node collective bandwidth: Per-GPU intra-node (xGMI) bandwidth is fixed by the $8$-GPU node topology, favoring parallelism schemes sharded across full nodes so that all-to-all communication stays on the high-bandwidth intra-node fabric (Anthony et al., 21 Nov 2025).
- Pollara interconnect: Each GPU has a dedicated Pollara NIC, giving eight NICs per node. AllReduce performance plateaus once collective message sizes are large enough to saturate the bus-bandwidth curve, so a sufficiently large gradient-fusion buffer saturates the link and balances communication–computation overlap.
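As a back-of-the-envelope helper for the interconnect budget: the $400$ Gb/s per-NIC figure (typical of the AMD Pensando Pollara 400) and $8$ NICs per node are assumed platform values here, not measurements from the cited work, and the transfer-time model deliberately ignores latency and algorithm factors.

```python
def node_nic_bandwidth_GBps(gpus_per_node=8, nic_gbps=400):
    """Aggregate NIC bandwidth per node in GB/s.
    400 Gb/s per NIC and 8 GPUs/node are assumed platform values."""
    return gpus_per_node * nic_gbps / 8   # Gb/s -> GB/s

def allreduce_time_ms(message_bytes, bus_GBps):
    """First-order transfer-time estimate; only meaningful for large,
    bandwidth-saturating messages (latency terms are omitted)."""
    return message_bytes / (bus_GBps * 1e9) * 1e3

print(node_nic_bandwidth_GBps())   # aggregate GB/s per node
```

Such an estimate is only useful on the flat part of the bus-bandwidth curve, which is exactly the regime the gradient-fusion buffer is meant to reach.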
4. HBM Capacity, Model Size, and Activation Footprints
Total model and activation footprints are limited by MI300X’s HBM capacity.
- Parameter and activation budgeting: For $L$ layers and hidden size $d_{\text{model}}$, the total footprint must satisfy $N_p\, b_p + A\, b_a \le 192$ GB, where $N_p$ is the total number of model parameters, $A$ the activation footprint, and $b_p$, $b_a$ the bytes per parameter/activation (FP16/FP32); HBM capacity is $192$ GB on MI300X (Ambati et al., 31 Oct 2025).
- Working set for memory bandwidth: The activation working set per layer, roughly $W = b \cdot s \cdot d_{\text{model}} \cdot b_a$, should reach at least $64$ MB to saturate the measured $1$–$1.1$ TB/s bandwidth.
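The capacity rule can be checked with a short budgeting helper. $192$ GB is the MI300X HBM3 capacity; the sketch below omits optimizer state, gradients, and framework overhead, all of which shrink the real budget.

```python
HBM_BYTES = 192 * 1024**3   # MI300X HBM3 capacity (192 GB)

def fits_in_hbm(n_params, act_elems, bytes_per_param=2, bytes_per_act=2,
                hbm_bytes=HBM_BYTES):
    """Check the parameter + activation budget against MI300X HBM.
    Optimizer state and gradients are omitted from this sketch."""
    footprint = n_params * bytes_per_param + act_elems * bytes_per_act
    return footprint <= hbm_bytes

# A 70B-parameter FP16 model plus 10B activation elements fits;
# a 120B-parameter FP16 model does not.
print(fits_in_hbm(70e9, 10e9), fits_in_hbm(120e9, 10e9))
```

In practice the usable budget is well below $192$ GB once optimizer and gradient state are included, so the check should be applied to the full training footprint.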
5. Practical Sizing Workflow and Algorithmic Summary
A practical MI300X-centric sizing workflow synthesizes the prior constraints into actionable steps:
- Choose a hidden size $d_{\text{model}}$ divisible by $64$ (e.g., $4096$, $6144$, $8192$).
- Set the feedforward width per the reduced-expansion rule above, leveraging SwiGLU if possible.
- Select an expert count $E$ that keeps per-expert GEMMs above the FLOP threshold, and enable top-$1$ routing.
- Verify that each critical GEMM (attention projections and MLP/expert matrices) meets the $200$ GFLOP threshold.
- Align all major dimensions to multiples of $64$ (preferably $128$) to match ROCm BLAS tile-size preferences.
- Use fused kernels for layer norm and attention to remain compute-bound, staying within the $\sim 1$ TB/s HBM bandwidth limit.
- Size the gradient-fusion buffer large enough to reach Pollara's bus-bandwidth saturation point, balancing communication–computation overlap.
Collectively, this process ensures that every block and operation remains in the high-efficiency operational regimes pinpointed by empirical MI300X microbenchmarks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).
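The workflow above can be condensed into a single configuration check. This is a sketch: the thresholds follow the document, but the function name and GEMM-FLOP formulas are illustrative assumptions.

```python
def validate_mi300x_config(d_model, n_heads, d_ff, micro_batch, seq_len,
                           gemm_gflop_threshold=200, align=64):
    """Run the sizing-workflow checks in order; returns a dict of
    rule name -> pass/fail."""
    tokens = micro_batch * seq_len
    return {
        "hidden_aligned": d_model % align == 0,
        "head_dim_aligned": (d_model // n_heads) % align == 0,
        "ff_aligned": d_ff % align == 0,
        "tile_aligned_tokens": tokens % align == 0,
        # (tokens x d_model) @ (d_model x d_model) attention projection
        "proj_gemm_big_enough":
            2 * tokens * d_model**2 / 1e9 >= gemm_gflop_threshold,
        # (tokens x d_model) @ (d_model x d_ff) MLP up-projection
        "mlp_gemm_big_enough":
            2 * tokens * d_model * d_ff / 1e9 >= gemm_gflop_threshold,
    }

print(validate_mi300x_config(4096, 32, 10944, 2, 4096))
```

A configuration passes only when every entry is true; a failed entry names the rule to revisit.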
6. Model Size, Batch, and Sequence Scaling Trade-offs
Scaling rules on MI300X are characterized by quadratic costs in hidden size and thresholds on working-set memory:
- Model size: $N_p \propto L\, d_{\text{model}}^2$. Doubling the hidden size increases memory requirements roughly $4\times$.
- Sequence length: The per-GPU token budget $b \cdot s$ is capped by activation memory; since per-token activation bytes scale with $L\, d_{\text{model}}\, b_a$, the maximum token count is approximately the free HBM divided by the per-token activation footprint.
- Batch size tuning: For full HBM bandwidth, select the micro-batch size so that per-layer activation working sets reach the saturation threshold (at least roughly $64$ MB).
- Utilization: Keep each GEMM shape above the FLOP threshold to maintain high matrix-engine utilization, per the empirical utilization curves.
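The quadratic scaling law can be checked directly. This is a minimal sketch of the $N_p \propto L\, d_{\text{model}}^2$ relationship stated above; the helper name is hypothetical.

```python
def rel_param_cost(d_model, d_model_ref):
    """Relative parameter (and hence memory) cost of changing the hidden
    size, under the quadratic scaling N_p ~ L * d_model**2 with L fixed."""
    return (d_model / d_model_ref) ** 2

# Doubling the hidden size quadruples the parameter footprint.
print(rel_param_cost(8192, 4096))   # -> 4.0
```

The same ratio bounds how much batch or sequence length must shrink to keep the combined parameter-plus-activation footprint within HBM.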
7. Implications, Limitations, and Ongoing Research
The MI300X-aware sizing rules described by these research groups (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025) demonstrate that MI300X's compute and memory systems support transformer and MoE design choices qualitatively distinct from those tuned to legacy GPU platforms, notably by incentivizing highly regular, tile-aligned layouts and lower expansion ratios. The rules reflect a shift toward tightly hardware–software codesigned architectures, offering throughput and latency competitive with state-of-the-art base models (e.g., Qwen3, Gemma3, Llama-3, OLMoE) at comparable or smaller parameter scales. A plausible implication is that future model and compiler designs targeting MI300X will further reduce per-layer irregularities and harness cross-node collectives for scalable pretraining and inference. The current sizing rules are also constrained by specific kernel and interconnect implementations; continued advances in the ROCm software stack and collective-communication libraries are likely to shift the practical optima on coming generations of AMD server hardware.