
MI300X Transformer Sizing Rules

Updated 24 November 2025
  • The paper presents a systematic methodology that derives transformer and MoE sizing rules from microbenchmark results, kernel roofline analyses, and hardware constraints.
  • Using concrete thresholds, such as head dimensions in multiples of 64 and minimum GEMM FLOP sizes, the guidelines keep each GEMM and attention operation in its high-throughput regime.
  • These rules integrate strategies in batch–sequence tiling, MLP/MoE expansion, and interconnect configuration to avoid underutilization bottlenecks and maximize throughput.

MI300X-aware transformer sizing rules are systematic guidelines for parameterizing transformer and mixture-of-experts (MoE) models to maximize throughput and efficiency on AMD Instinct MI300X GPUs. These rules are directly motivated by microbenchmark results, kernel roofline analyses, and hardware characteristics (GEMM throughput, HBM bandwidth, interconnect topology) observed in large-scale pretraining deployments on pure AMD clusters. MI300X-aware sizing covers every dimension of transformer and MoE layer parameterization—including head count, hidden width, MLP expansion, expert counts, and batch/sequence tile sizes—to exploit MI300X’s full-stack hardware and networking features and systematically avoid underutilization bottlenecks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).

1. Attention Block Sizing Principles

Transformer attention on MI300X is governed by three primary constraints: head dimension alignment, GEMM size, and batch–sequence tiling. These constraints ensure high kernel utilization for both GEMM-based attention and projection operations.

  • Head dimension divisibility: For transformer hidden size $H$, number of attention heads $a$, and per-head dimension $d_h = H/a$, the constraint:

$$d_h \;\text{must be a multiple of 64, typically}\; d_h \ge 128$$

ensures the $QK^T$ and output-projection GEMMs hit MI300X's high-performance regime (outer dim $\ge 512$, inner dim $\ge 1024$). Too small a $d_h$ severely reduces matrix-multiply efficiency (utilization $U = 0.15$ for $d_h = 64$ versus $U = 0.30$ for $d_h = 128$) (Ambati et al., 31 Oct 2025).

  • GEMM FLOP-size constraint: For a single GPU, maximum throughput is achieved when

$$(b\,a\,s\,d_h) \gtrsim 2\times10^{11}\ \text{FLOPs}$$

(with $b$ the micro-batch size, $s$ the per-GPU sequence length, $a$ the head count, and $d_h$ the per-head dimension), ensuring that the $QK^T$ and $O$-projection GEMMs reach MI300X's $>200$ TFLOP/s peak regime. Example: $a=16$, $d_h=128$, $s=4096$, $b=5 \implies 2.1\times10^{11}$ FLOPs.

  • Batch–sequence tiling: The grouped dimension $(b\,s)\cdot d_h$ must be divisible by 64 (ideally 128), aligning matrix shapes to rocBLAS/hipBLASLt's preferred tiles and preventing inefficient memory access patterns (Anthony et al., 21 Nov 2025).
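The head-dimension and tiling constraints above reduce to simple divisibility checks, sketched below under the stated thresholds; the function and key names are ours, not from the papers. The GEMM FLOP-size rule is best verified with a profiler, since its exact FLOP accounting is not spelled out here.

```python
def check_attention_sizing(H: int, a: int, b: int, s: int) -> dict:
    """Check MI300X head-dimension and batch-sequence tiling rules for
    hidden size H, attention heads a, micro-batch b, per-GPU sequence s."""
    d_h = H // a
    return {
        "d_h": d_h,
        # d_h must divide H evenly, be a multiple of 64, and be >= 128
        "head_dim_ok": H % a == 0 and d_h % 64 == 0 and d_h >= 128,
        # (b*s)*d_h must be divisible by 64 (ideally 128) for tile alignment
        "tiling_ok": (b * s * d_h) % 64 == 0,
        "tiling_preferred": (b * s * d_h) % 128 == 0,
    }

# Example configuration from the text: H=2048, a=16 -> d_h=128
print(check_attention_sizing(H=2048, a=16, b=5, s=4096))
```

A configuration such as $H=1024$, $a=16$ ($d_h = 64$) fails the head-dimension check, matching the low-utilization regime described above.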

2. MLP and MoE Block Sizing

Fully Connected (MLP) and MoE layers follow a set of expansion, tiling, and FLOP-scale rules directly informed by MI300X microbenchmarks.

  • Feedforward expansion: Use a $2\times$ expansion for the pre-activation width (rather than the traditional $4\times$), i.e.,

$$d_{ff} = 2H, \quad f_o = H$$

for SwiGLU activation. This maintains model quality while keeping GEMM sizes above the 200 GFLOP threshold on MI300X, optimizing throughput (Anthony et al., 21 Nov 2025).

  • Expert MLP tiling: For $E$ experts per layer and local tokens per expert $\tau_i = s/E$,

$$\tau_i \ge 256$$

ensures that the $(\tau_i, H)\times(H, 2H)$ and $(\tau_i, H)\times(H, H)$ GEMM shapes meet or exceed $2\times10^{11}$ FLOPs, maintaining kernel efficiency. For example, $s=4096$, $E=16$ yields $\tau_i = 256$.

  • MoE expert count and widths: Empirically,

$$E = 16, \quad k = 1$$

(where $k$ is the top-$k$ routing fanout) provides maximal kernel occupancy and minimizes all-to-all shuffle volume. Each expert's pre-activation and post-activation widths mirror the MLP rule to retain architectural and FLOP parity with dense blocks.
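The expert-tiling rule above is plain integer arithmetic; a minimal sketch (function name is ours):

```python
def expert_tiling(s: int, E: int, min_tokens: int = 256):
    """Tokens routed to each local expert (tau_i = s/E) and whether the
    tau_i >= 256 tiling rule holds."""
    tau_i = s // E
    return tau_i, tau_i >= min_tokens

# Worked example from the text: s=4096, E=16 -> tau_i = 256
tau_i, ok = expert_tiling(4096, 16)
print(tau_i, ok)  # 256 True
```

Doubling the expert count to $E=32$ at the same sequence length drops $\tau_i$ to 128 and violates the rule, which is why the recommended configuration stops at $E=16$.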

3. Hardware-Driven Constraints and Interconnect Sizing

Sizing rules are deeply influenced by MI300X’s hardware features—HBM bandwidth, capacity, and interconnects.

  • HBM bandwidth: Sustained on-device memory bandwidth is $B_{\mathrm{HBM}} \approx 1.0$–$1.1$ TB/s (PyTorch memcpy). Fused kernels (Conv1d, RMSNorm, Muon, FlashAttention) must be sized so their per-kernel memory traffic does not exceed $0.8\,B_{\mathrm{HBM}}$; otherwise HBM-bound slowdowns occur, especially at $s \le 16\,\mathrm{k}$ (Anthony et al., 21 Nov 2025).
  • InfinityFabric / intra-node collective bandwidth: Per-GPU intra-node (xGMI) bandwidth is $B_{\text{perGPU}} \approx 450\,\mathrm{GB/s}$ ($n = 8$ GPUs/node, $7 \times 64$ GB/s links), favoring parallelism sharded only across full nodes to keep all-to-all communication collective (Anthony et al., 21 Nov 2025).
  • Pollara interconnect: Each GPU NIC provides $400\,\mathrm{Gbps} \approx 50\,\mathrm{GB/s}$, giving $400\,\mathrm{GB/s}$ per node. AllReduce performance plateaus at collective message sizes $M \gtrsim 8$–$16\,\mathrm{MiB}$. A gradient fusion buffer of $F \approx 16\,\mathrm{MiB}$ saturates the bus-bandwidth curve and balances communication–computation overlap.
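The quoted interconnect figures follow from simple unit conversions; a sketch reproducing them (constants are the values from the text, variable names are ours):

```python
# Per-GPU Pollara NIC: 400 Gbps -> GB/s (divide by 8 bits per byte)
nic_gb_s = 400 / 8                 # 50 GB/s per GPU
node_nic_gb_s = nic_gb_s * 8       # 8 GPUs per node -> 400 GB/s per node

# Intra-node xGMI: 7 links x 64 GB/s per GPU
xgmi_per_gpu_gb_s = 7 * 64         # 448 GB/s, i.e. the ~450 GB/s quoted

# AllReduce saturates around 8-16 MiB messages; fusion buffer target
fusion_buffer_mib = 16

print(nic_gb_s, node_nic_gb_s, xgmi_per_gpu_gb_s, fusion_buffer_mib)
```

The gap between 448 GB/s intra-node and 400 GB/s inter-node is small, which is consistent with the advice to shard parallelism across full nodes rather than splitting collectives within a node.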

4. HBM Capacity, Model Size, and Activation Footprints

Total model and activation footprints are limited by MI300X’s HBM capacity.

  • Parameter and activation budgeting: For $L$ layers, hidden size $d$, sequence length $l$, and batch size $b$,

$$P(d,L) = 12\,d^2\,L, \qquad A(d,l,b) = 2\,l\,b\,d, \qquad 4\left[ P(d,L) + A(d,l,b) \right] \le M_{\text{HBM}}$$

where $P(d,L)$ is the total number of model parameters, $A(d,l,b)$ the activation footprint, and 4 bytes are budgeted per parameter/activation (FP16/FP32). $M_{\text{HBM}} = 192\,\mathrm{GB}$ on MI300X (Ambati et al., 31 Oct 2025).

  • Working set for memory bandwidth: The activation working set per layer,

$$W = \frac{2 \times 4\,\mathrm{B} \times b \times l \times d}{2^{20}}\ \mathrm{MiB},$$

should reach at least 64–$128\,\mathrm{MiB}$ to saturate the measured $\sim 4.0$–$4.3\,\mathrm{TB/s}$ bandwidth.
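Both budgeting formulas above can be evaluated directly. The sketch below assumes the 4-byte-per-element accounting stated in the text; the helper names and the example layer count are ours.

```python
M_HBM_BYTES = 192 * 10**9  # MI300X HBM capacity

def hbm_fit(d: int, L: int, l: int, b: int) -> bool:
    """4 * [P(d,L) + A(d,l,b)] <= M_HBM, with P = 12 d^2 L and A = 2 l b d."""
    P = 12 * d**2 * L      # total parameter count
    A = 2 * l * b * d      # activation footprint (elements)
    return 4 * (P + A) <= M_HBM_BYTES

def working_set_mib(b: int, l: int, d: int) -> float:
    """W = (2 * 4B * b * l * d) / 2^20, target 64-128 MiB per layer."""
    return (2 * 4 * b * l * d) / 2**20

# Illustrative config (L=32 is our assumption): ~6.4B parameters fit easily,
# and b=1, l=4096, d=4096 lands exactly on the 128 MiB working-set target.
print(hbm_fit(d=4096, L=32, l=4096, b=1))      # True
print(working_set_mib(b=1, l=4096, d=4096))    # 128.0
```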

5. Practical Sizing Workflow and Algorithmic Summary

A practical MI300X-centric sizing workflow synthesizes the prior constraints into actionable steps:

  1. Choose a hidden size $H$ divisible by $a \times 64$ (e.g., $H=2048$, $a=16$, $d_h=128$).
  2. Set the feedforward width $d_{ff} = 2H$, leveraging SwiGLU if possible.
  3. Select the expert count $E$ so that $\tau_i = s/E \ge 256$, and enable top-1 routing.
  4. Verify that each critical GEMM operation ($b\,s\,a \times d_h$ and $\tau_i H \times 2H$) meets the $\ge 200$ GFLOP threshold.
  5. Align all major dimensions to multiples of 64 (preferably 128) to match ROCm BLAS tile-size preferences.
  6. Use fused kernels for layer norm and attention to remain compute-bound, avoiding the $\sim 1$ TB/s HBM bandwidth limit.
  7. Set the gradient fusion buffer to $\sim 16$ MiB for communication–computation overlap at Pollara's saturation point.
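Steps 1, 2, 3, 5, and 7 above are mechanical and can be condensed into a single checklist; steps 4 and 6 (GEMM FLOP verification and kernel fusion) require profiling and are omitted. A hedged sketch with this section's thresholds (function and key names are ours):

```python
def mi300x_sizing_report(H: int, a: int, s: int, E: int) -> dict:
    """Condensed checklist for the mechanical steps of the sizing workflow."""
    d_h = H // a
    d_ff = 2 * H                   # step 2: SwiGLU feedforward width
    tau_i = s // E                 # step 3: tokens per local expert
    return {
        "step1_hidden_ok": H % (a * 64) == 0,   # H divisible by a*64
        "step2_d_ff": d_ff,
        "step3_expert_ok": tau_i >= 256,        # with top-1 routing
        "step5_aligned_64": all(x % 64 == 0 for x in (H, d_h, d_ff)),
        "step7_fusion_mib": 16,                 # Pollara saturation point
    }

# The example configuration from step 1: H=2048, a=16, s=4096, E=16
print(mi300x_sizing_report(H=2048, a=16, s=4096, E=16))
```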

Collectively, this process ensures that every block and operation remains in the high-efficiency operational regimes pinpointed by empirical MI300X microbenchmarks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).

6. Model Size, Batch, and Sequence Scaling Trade-offs

Scaling rules on MI300X are characterized by quadratic costs in hidden size and thresholds on working-set memory:

  • Model size: $P \propto d^2 L$. Doubling the hidden size $d$ increases memory requirements $4\times$.
  • Sequence length:

$$l_{\max}(b,d) = \left\lfloor \frac{M_{\text{HBM}}}{8\,b\,d} \right\rfloor$$

e.g., $b=1$, $d=4096$ gives $l \approx 6000$ tokens in FP16.

  • Batch size tuning: For full HBM bandwidth, select $b \approx 128\,\mathrm{MiB} / (8\,\mathrm{B} \cdot l \cdot d)$.
  • Utilization: Keep each GEMM dimension $\ge 1\,\mathrm{k}$ to maintain $U \ge 0.6$, targeting $T(d,h) \ge 1{,}000\,\mathrm{TFLOPS}$ per the empirical utilization curves.
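The sequence-length and batch-tuning expressions above are easy to evaluate literally; note that the memory argument to $l_{\max}$ should be whatever budget remains for activations, not raw HBM capacity. The 200 MiB budget below is a purely illustrative assumption of ours, chosen to land near the text's worked example.

```python
from math import floor

MIB = 2**20

def l_max(mem_budget_bytes: int, b: int, d: int) -> int:
    """l_max(b, d) = floor(M / (8 b d)), per the expression above."""
    return floor(mem_budget_bytes / (8 * b * d))

def batch_for_bandwidth(l: int, d: int, target_mib: int = 128) -> float:
    """b ~ 128 MiB / (8 B * l * d) from the batch-tuning rule."""
    return (target_mib * MIB) / (8 * l * d)

# Hypothetical 200 MiB activation budget (our assumption, not from the text):
print(l_max(200 * MIB, b=1, d=4096))        # 6400, near the ~6000 quoted
print(batch_for_bandwidth(l=4096, d=4096))  # 1.0
```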

7. Implications, Limitations, and Ongoing Research

The MI300X-aware sizing rules, as described by leading research groups (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025), demonstrate that MI300X's compute and memory systems support transformer and MoE design choices qualitatively distinct from those on legacy GPU platforms, notably by incentivizing highly regular, tile-aligned layouts and lower expansion ratios. The rules reflect a shift toward tightly hardware–software codesigned architectures, offering throughput and latency competitive with state-of-the-art base models (e.g., Qwen3, Gemma3, Llama-3, OLMoE) at comparable or smaller parameter scales. A plausible implication is that future model and compiler designs targeting MI300X will further reduce per-layer irregularities and harness cross-node collectives for scalable pretraining and inference. The current sizing rules are also constrained by specific kernel and interconnect implementations; continued advances in the ROCm software stack and collective-communication libraries are likely to shift the practical optima in coming generations of AMD server hardware.
