
MI300X Transformer Sizing Rules

Updated 24 November 2025
  • The paper presents a systematic methodology that derives transformer and MoE sizing rules from microbenchmark results, kernel roofline analyses, and hardware constraints.
  • Using concrete thresholds, such as head dimensions in multiples of 64 and minimum GEMM FLOP sizes, the guidelines keep each GEMM and attention operation in its high-throughput regime.
  • These rules integrate strategies in batch–sequence tiling, MLP/MoE expansion, and interconnect configuration to avoid underutilization bottlenecks and maximize throughput.

MI300X-aware transformer sizing rules are systematic guidelines for parameterizing transformer and mixture-of-experts (MoE) models to maximize throughput and efficiency on AMD Instinct MI300X GPUs. These rules are directly motivated by microbenchmark results, kernel roofline analyses, and hardware characteristics (GEMM throughput, HBM bandwidth, interconnect topology) observed in large-scale pretraining deployments on pure AMD clusters. MI300X-aware sizing covers every dimension of transformer and MoE layer parameterization—including head count, hidden width, MLP expansion, expert counts, and batch/sequence tile sizes—to exploit MI300X’s full-stack hardware and networking features and systematically avoid underutilization bottlenecks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).

1. Attention Block Sizing Principles

Transformer attention on MI300X is governed by three primary constraints: head dimension alignment, GEMM size, and batch–sequence tiling. These constraints ensure high kernel utilization for both GEMM-based attention and projection operations.

  • Head dimension divisibility: For transformer hidden size $H$, number of attention heads $a$, and per-head dimension $d_h = H/a$, the constraint:

$$d_h \;\text{must be a multiple of 64, typically}\; d_h \ge 128$$

ensures the $QK^T$ and output-projection GEMMs hit MI300X's high-performance regime (outer dim $\ge 512$, inner dim $\ge 1024$). Too small a $d_h$ severely reduces matrix-multiply efficiency (utilization $U = 0.15$ for $d_h = 64$ versus $U = 0.30$ for $d_h = 128$) (Ambati et al., 31 Oct 2025).

  • GEMM FLOP-size constraint: For a single GPU, maximum throughput is achieved when

$$(b\,a\,s\,d_h) \gtrsim 2\times10^{11}\ \text{FLOPs}$$

(with $b$ the micro-batch size, $s$ the per-GPU sequence length, $a$ the head count, and $d_h$ the per-head dimension), ensuring that the $QK^T$ and $O$-projection GEMMs reach MI300X's $>200$ TFLOP/s peak regime. Example: $a=16$, $d_h=128$, $s=4096$, $b=5 \implies 2.1\times10^{11}$ FLOPs.

  • Batch–sequence tiling: The grouped dimension $(b\,s)\cdot d_h$ must be divisible by 64 (ideally 128), aligning matrix shapes to rocBLAS/hipBLASLt's preferred tiles and preventing inefficient memory access patterns (Anthony et al., 21 Nov 2025).
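The head-dimension and tiling constraints above reduce to simple divisibility checks, sketched below under the stated thresholds; the function and key names are ours, not from the papers. The GEMM FLOP-size rule is best verified with a profiler, since its exact FLOP accounting is not spelled out here.

```python
def check_attention_sizing(H: int, a: int, b: int, s: int) -> dict:
    """Check MI300X head-dimension and batch-sequence tiling rules for
    hidden size H, attention heads a, micro-batch b, per-GPU sequence s."""
    d_h = H // a
    return {
        "d_h": d_h,
        # d_h must divide H evenly, be a multiple of 64, and be >= 128
        "head_dim_ok": H % a == 0 and d_h % 64 == 0 and d_h >= 128,
        # (b*s)*d_h must be divisible by 64 (ideally 128) for tile alignment
        "tiling_ok": (b * s * d_h) % 64 == 0,
        "tiling_preferred": (b * s * d_h) % 128 == 0,
    }

# Example configuration from the text: H=2048, a=16 -> d_h=128
print(check_attention_sizing(H=2048, a=16, b=5, s=4096))
```

A configuration such as $H=1024$, $a=16$ ($d_h = 64$) fails the head-dimension check, matching the low-utilization regime described above.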

2. MLP and MoE Block Sizing

Fully Connected (MLP) and MoE layers follow a set of expansion, tiling, and FLOP-scale rules directly informed by MI300X microbenchmarks.

  • Feedforward expansion: Use a $2\times$ expansion for the pre-activation width (rather than the traditional $4\times$), i.e.,

$$d_{ff} = 2H, \quad f_o = H$$

for SwiGLU activation. This maintains model quality while keeping GEMM sizes above the 200 GFLOP threshold on MI300X, optimizing throughput (Anthony et al., 21 Nov 2025).

  • Expert MLP tiling: For $E$ experts per layer and local tokens per expert $\tau_i = s/E$,

$$\tau_i \ge 256$$

ensures that the $(\tau_i, H)\times(H, 2H)$ and $(\tau_i, H)\times(H, H)$ GEMM shapes meet or exceed $2\times10^{11}$ FLOPs, maintaining kernel efficiency. For example, $s=4096$, $E=16$ yields $\tau_i = 256$.

  • MoE expert count and widths: Empirically,

$$E = 16, \quad k = 1$$

(where $k$ is the top-$k$ routing fanout) provides maximal kernel occupancy and minimizes all-to-all shuffle volume. Each expert's pre-activation and post-activation widths mirror the MLP rule to retain architectural and FLOP parity with dense blocks.
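The expert-tiling rule above is plain integer arithmetic; a minimal sketch (function name is ours):

```python
def expert_tiling(s: int, E: int, min_tokens: int = 256):
    """Tokens routed to each local expert (tau_i = s/E) and whether the
    tau_i >= 256 tiling rule holds."""
    tau_i = s // E
    return tau_i, tau_i >= min_tokens

# Worked example from the text: s=4096, E=16 -> tau_i = 256
tau_i, ok = expert_tiling(4096, 16)
print(tau_i, ok)  # 256 True
```

Doubling the expert count to $E=32$ at the same sequence length drops $\tau_i$ to 128 and violates the rule, which is why the recommended configuration stops at $E=16$.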

3. Hardware-Driven Constraints and Interconnect Sizing

Sizing rules are deeply influenced by MI300X’s hardware features—HBM bandwidth, capacity, and interconnects.

  • HBM bandwidth: Sustained on-device memory bandwidth is $B_{\mathrm{HBM}} \approx 1.0$–$1.1$ TB/s (PyTorch memcpy). Fused kernels (Conv1d, RMSNorm, Muon, FlashAttention) must be sized so their per-kernel memory traffic does not exceed $0.8\,B_{\mathrm{HBM}}$; otherwise HBM-bound slowdowns occur, especially at $s \le 16\,\mathrm{k}$ (Anthony et al., 21 Nov 2025).
  • InfinityFabric / intra-node collective bandwidth: Per-GPU intra-node (xGMI) bandwidth is $B_{\text{perGPU}} \approx 450\,\mathrm{GB/s}$ ($n = 8$ GPUs/node, $7 \times 64$ GB/s links), favoring parallelism sharded only across full nodes to keep all-to-all communication collective (Anthony et al., 21 Nov 2025).
  • Pollara interconnect: Each GPU NIC provides $400\,\mathrm{Gbps} \approx 50\,\mathrm{GB/s}$, giving $400\,\mathrm{GB/s}$ per node. AllReduce performance plateaus at collective message sizes $M \gtrsim 8$–$16\,\mathrm{MiB}$. A gradient fusion buffer of $F \approx 16\,\mathrm{MiB}$ saturates the bus-bandwidth curve and balances communication–computation overlap.
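The quoted interconnect figures follow from simple unit conversions; a sketch reproducing them (constants are the values from the text, variable names are ours):

```python
# Per-GPU Pollara NIC: 400 Gbps -> GB/s (divide by 8 bits per byte)
nic_gb_s = 400 / 8                 # 50 GB/s per GPU
node_nic_gb_s = nic_gb_s * 8       # 8 GPUs per node -> 400 GB/s per node

# Intra-node xGMI: 7 links x 64 GB/s per GPU
xgmi_per_gpu_gb_s = 7 * 64         # 448 GB/s, i.e. the ~450 GB/s quoted

# AllReduce saturates around 8-16 MiB messages; fusion buffer target
fusion_buffer_mib = 16

print(nic_gb_s, node_nic_gb_s, xgmi_per_gpu_gb_s, fusion_buffer_mib)
```

The gap between 448 GB/s intra-node and 400 GB/s inter-node is small, which is consistent with the advice to shard parallelism across full nodes rather than splitting collectives within a node.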

4. HBM Capacity, Model Size, and Activation Footprints

Total model and activation footprints are limited by MI300X’s HBM capacity.

  • Parameter and activation budgeting: For $L$ layers, hidden size $d$, sequence length $l$, and batch size $b$,

$$P(d,L) = 12\,d^2\,L, \qquad A(d,l,b) = 2\,l\,b\,d, \qquad 4\left[ P(d,L) + A(d,l,b) \right] \le M_{\text{HBM}}$$

where $P(d,L)$ is the total number of model parameters, $A(d,l,b)$ the activation footprint, and 4 bytes are budgeted per parameter/activation (FP16/FP32). $M_{\text{HBM}} = 192\,\mathrm{GB}$ on MI300X (Ambati et al., 31 Oct 2025).

  • Working set for memory bandwidth: The activation working set per layer,

$$W = \frac{2 \times 4\,\mathrm{B} \times b \times l \times d}{2^{20}}\ \mathrm{MiB},$$

should reach at least 64–$128\,\mathrm{MiB}$ to saturate the measured $\sim 4.0$–$4.3\,\mathrm{TB/s}$ bandwidth.
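Both budgeting formulas above can be evaluated directly. The sketch below assumes the 4-byte-per-element accounting stated in the text; the helper names and the example layer count are ours.

```python
M_HBM_BYTES = 192 * 10**9  # MI300X HBM capacity

def hbm_fit(d: int, L: int, l: int, b: int) -> bool:
    """4 * [P(d,L) + A(d,l,b)] <= M_HBM, with P = 12 d^2 L and A = 2 l b d."""
    P = 12 * d**2 * L      # total parameter count
    A = 2 * l * b * d      # activation footprint (elements)
    return 4 * (P + A) <= M_HBM_BYTES

def working_set_mib(b: int, l: int, d: int) -> float:
    """W = (2 * 4B * b * l * d) / 2^20, target 64-128 MiB per layer."""
    return (2 * 4 * b * l * d) / 2**20

# Illustrative config (L=32 is our assumption): ~6.4B parameters fit easily,
# and b=1, l=4096, d=4096 lands exactly on the 128 MiB working-set target.
print(hbm_fit(d=4096, L=32, l=4096, b=1))      # True
print(working_set_mib(b=1, l=4096, d=4096))    # 128.0
```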

5. Practical Sizing Workflow and Algorithmic Summary

A practical MI300X-centric sizing workflow synthesizes the prior constraints into actionable steps:

  1. Choose a hidden size $H$ divisible by $a \times 64$ (e.g., $H=2048$, $a=16$, $d_h=128$).
  2. Set the feedforward width $d_{ff} = 2H$, leveraging SwiGLU if possible.
  3. Select the expert count $E$ so that $\tau_i = s/E \ge 256$, and enable top-1 routing.
  4. Verify that each critical GEMM operation ($b\,s\,a \times d_h$ and $\tau_i H \times 2H$) meets the $\ge 200$ GFLOP threshold.
  5. Align all major dimensions to multiples of 64 (preferably 128) to match ROCm BLAS tile-size preferences.
  6. Use fused kernels for layer norm and attention to remain compute-bound, avoiding the $\sim 1$ TB/s HBM bandwidth limit.
  7. Set the gradient fusion buffer to $\sim 16$ MiB for communication–computation overlap at Pollara's saturation point.
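Steps 1, 2, 3, 5, and 7 above are mechanical and can be condensed into a single checklist; steps 4 and 6 (GEMM FLOP verification and kernel fusion) require profiling and are omitted. A hedged sketch with this section's thresholds (function and key names are ours):

```python
def mi300x_sizing_report(H: int, a: int, s: int, E: int) -> dict:
    """Condensed checklist for the mechanical steps of the sizing workflow."""
    d_h = H // a
    d_ff = 2 * H                   # step 2: SwiGLU feedforward width
    tau_i = s // E                 # step 3: tokens per local expert
    return {
        "step1_hidden_ok": H % (a * 64) == 0,   # H divisible by a*64
        "step2_d_ff": d_ff,
        "step3_expert_ok": tau_i >= 256,        # with top-1 routing
        "step5_aligned_64": all(x % 64 == 0 for x in (H, d_h, d_ff)),
        "step7_fusion_mib": 16,                 # Pollara saturation point
    }

# The example configuration from step 1: H=2048, a=16, s=4096, E=16
print(mi300x_sizing_report(H=2048, a=16, s=4096, E=16))
```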

Collectively, this process ensures that every block and operation remains in the high-efficiency operational regimes pinpointed by empirical MI300X microbenchmarks (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025).

6. Model Size, Batch, and Sequence Scaling Trade-offs

Scaling rules on MI300X are characterized by quadratic costs in hidden size and thresholds on working-set memory:

  • Model size: $P \propto d^2 L$. Doubling the hidden size $d$ increases memory requirements $4\times$.
  • Sequence length:

$$l_{\max}(b,d) = \left\lfloor \frac{M_{\text{HBM}}}{8\,b\,d} \right\rfloor$$

e.g., $b=1$, $d=4096$ gives $l \approx 6000$ tokens in FP16.

  • Batch size tuning: For full HBM bandwidth, select $b \approx 128\,\mathrm{MiB} / (8\,\mathrm{B} \cdot l \cdot d)$.
  • Utilization: Keep each GEMM dimension $\ge 1\,\mathrm{k}$ to maintain $U \ge 0.6$, targeting $T(d,h) \ge 1{,}000\,\mathrm{TFLOPS}$ per the empirical utilization curves.
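The sequence-length and batch-tuning expressions above are easy to evaluate literally; note that the memory argument to $l_{\max}$ should be whatever budget remains for activations, not raw HBM capacity. The 200 MiB budget below is a purely illustrative assumption of ours, chosen to land near the text's worked example.

```python
from math import floor

MIB = 2**20

def l_max(mem_budget_bytes: int, b: int, d: int) -> int:
    """l_max(b, d) = floor(M / (8 b d)), per the expression above."""
    return floor(mem_budget_bytes / (8 * b * d))

def batch_for_bandwidth(l: int, d: int, target_mib: int = 128) -> float:
    """b ~ 128 MiB / (8 B * l * d) from the batch-tuning rule."""
    return (target_mib * MIB) / (8 * l * d)

# Hypothetical 200 MiB activation budget (our assumption, not from the text):
print(l_max(200 * MIB, b=1, d=4096))        # 6400, near the ~6000 quoted
print(batch_for_bandwidth(l=4096, d=4096))  # 1.0
```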

7. Implications, Limitations, and Ongoing Research

The MI300X-aware sizing rules, as described by leading research groups (Anthony et al., 21 Nov 2025, Ambati et al., 31 Oct 2025), demonstrate that MI300X's compute and memory systems support transformer and MoE design choices qualitatively distinct from those on legacy GPU platforms, notably by incentivizing highly regular, tile-aligned layouts and lower expansion ratios. The rules reflect a shift toward tightly hardware–software codesigned architectures, offering throughput and latency competitive with state-of-the-art base models (e.g., Qwen3, Gemma3, Llama-3, OLMoE) at comparable or smaller parameter scales. A plausible implication is that future model and compiler designs targeting MI300X will further reduce per-layer irregularities and harness cross-node collectives for scalable pretraining and inference. The current sizing rules are also constrained by specific kernel and interconnect implementations; continued advances in the ROCm software stack and collective-communication libraries are likely to shift the practical optima in coming generations of AMD server hardware.
