Papers
Topics
Authors
Recent
2000 character limit reached

Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design (2511.17127v1)

Published 21 Nov 2025 in cs.CL, cs.AI, and cs.DC

Abstract: We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware, utilizing both MI300X GPUs with Pollara interconnect. We distill practical guidance for both systems and model design. On the systems side, we deliver a comprehensive cluster and networking characterization: microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast) across message sizes and GPU counts on Pollara. To our knowledge, this is the first at this scale. We further provide MI300X microbenchmarks on kernel sizing and memory bandwidth to inform model design. On the modeling side, we introduce and apply MI300X-aware transformer sizing rules for attention and MLP blocks and justify MoE widths that jointly optimize training throughput and inference latency. We describe our training stack in depth, including often-ignored utilities such as fault-tolerance and checkpoint-reshaping, as well as detailed information on our training recipe. We also provide a preview of our model architecture and base model - ZAYA1 (760M active, 8.3B total parameters MoE) - which will be further improved upon in forthcoming papers. ZAYA1-base achieves performance comparable to leading base models such as Qwen3-4B and Gemma3-12B at its scale and larger, and outperforms models including Llama-3-8B and OLMoE across reasoning, mathematics, and coding benchmarks. Together, these results demonstrate that the AMD hardware, network, and software stack are mature and optimized enough for competitive large-scale pretraining.

Summary

  • The paper presents a comprehensive exploration of AMD’s full-stack platform for LLM pretraining, covering hardware, software, and networking innovations.
  • It introduces ZAYA1, a novel MoE Transformer with compressed convolutional attention and advanced MLP routing that enhances expert specialization.
  • Empirical benchmarks demonstrate competitive performance against larger models by optimizing kernel fusion, fault tolerance, and communication.

Training Foundation Models on a Full-Stack AMD Platform: Systems, Architecture, and Algorithmic Advances

Systems Overview and Cluster Topology

This work presents a comprehensive case paper of high-scale LLM (LM) pretraining on a full-stack AMD platform, encompassing MI300X GPUs and Pollara networking. The hardware substrate features eight MI300X GPUs per node, each with dedicated Pollara 400Gbps NICs and substantial DDR5 memory. The cluster employs a two-level rails-only network topology, optimizing both cost and communication locality while leveraging the rails-only design for reduced switch requirements and simplified cabling (Figure 1). Figure 1

Figure 1

Figure 1: Architecture of the Zyphra pretraining cluster, showing rails-only switch layout and dedicated storage/management fabric separation.

The software stack is implemented from bare-metal through high-level frameworks, starting at HIP and Composable Kernel (CK), scaling up through ROCm libraries (rocBLAS, RCCL), custom kernel development, and culminating in a forked Megatron-LM enhanced for AMD backends (Figure 2). Figure 2

Figure 2: Full-stack AMD software ecosystem with language/library demarcations by stack layer.

Hardware Characterization: Bandwidth, Compute, and Networking

The authors conduct rigorous microbenchmarking of all major hardware bottlenecks. HBM bandwidth on MI300X is empirically profiled using PyTorch+ROCm, achieving near theoretical bandwidth in sustained realistic training scenarios (Figure 3). Detailed GEMM sweeps reveal significant performance variance as a function of problem shape and kernel selection, with large KK producing smooth throughput scaling and smaller KK requiring larger batch dimensions for equivalent efficiency (Figure 4). Figure 3

Figure 3: Empirical HBM throughput for PyTorch/ROCm, capturing real DL workload efficiency.

Figure 4

Figure 4

Figure 4: BF16 GEMM throughput heatmaps reveal optimal shape-dependent performance for MI300X.

RCCL collective operations—AllReduce, AllGather, ReduceScatter—are systematically benchmarked, both intra-node (InfinityFabric) and inter-node (Pollara), to expose bus bandwidth and latency scaling dynamics under various message sizes and node counts (Figures 5–7). These results directly inform parallelism and optimizer communication strategies for the pretraining stack. Figure 5

Figure 5

Figure 5

Figure 5

Figure 5

Figure 5

Figure 5: RCCL collective bandwidth and latency profiles for intra-node (InfinityFabric) communication.

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6

Figure 6: Aggregate RCCL scaling for inter-node operations, illustrating network saturation and optimal fusion buffer configuration.

ZAYA1 Model Architecture: Innovations in Attention and Expert Routing

The core modeling contribution centers on ZAYA1, an MoE Transformer architecture integrating Compressed Convolutional Attention (CCA), an expressive MLP-based router with Exponential Depth Averaging (EDA), and learned residual scaling. The detailed block schematic (Figure 7) demonstrates the replacement of vanilla linear routers with a multi-stage, depth-aware MLP routing mechanism, enhancing expert specialization and load balancing with negligible parameter overhead. Figure 7

Figure 7: ZAYA1 architecture showing CCA-based attention and advanced MLP router for expert selection.

CCA compresses all sequence-mixing into a low-dimensional latent space, enabling drastic reductions in both forward compute and inference KV-cache footprint (Figure 8). The MLP router with EDA and PID-inspired bias balancing eliminates the need for high top-kk expert selection or residual experts—contradicting prevailing MoE conventions and producing highly confident, low-entropy routing decisions. Figure 8

Figure 8: Compressed Convolutional Attention pipeline enabling efficient sequence mixing and KV-cache reduction.

Training Strategy, Parallelism, and System Engineering

ZAYA1-base (760M active, 8.3B total parameters) is pretrained over 12T tokens in three curriculum-informed phases (Figure 9). Data mixtures are dynamically reweighted to enhance STEM, code, and long-context capacity. Training leverages ZeRO-1 data-parallelism and context parallelism via ring-attention (Figure 10), with activation and memory reductions from CCA allowing context extension up to 32k without performance degradation. Figure 9

Figure 9: Three-phase pretraining schedule integrating curriculum, context extension, and specialized data upweighting.

Figure 10

Figure 10: Context parallelism sharding, load-balanced for long context with ring-based point-to-point communication.

A custom SendRecv pattern is implemented for the Muon optimizer step to ensure memory efficiency under a sharded parameter regime (Figure 11), eliminating the costly memory spikes of AllGather. Figure 11

Figure 11: Memory-efficient SendRecv strategy for Muon optimizer under ZeRO-1 sharding.

Training infrastructure is bolstered by the Aegis fault-tolerance system (Figure 12), supporting dynamic failure recovery, node triage, and high uptime throughout long runs. Distributed checkpointing is optimized for rapid persistence and world-size reshaping. Figure 12

Figure 12: Four-stage Aegis control plane for automated artifact monitoring, error classification, and dynamic recovery orchestration.

Kernel Fusion and Performance Engineering

Custom HIP kernels are developed for RMSNorm/LayerNorm (fused add/stats/norm/affine), as well as multi-tensor fused Muon steps with Newton-Schulz iterations. The training iteration profile (Figure 13) shows most compute is dominated by GEMM operations, with optimizer and communication overhead substantially reduced via kernel specialization. Figure 13

Figure 13

Figure 13: Operation breakdowns demonstrate dominant GEMM costs and minor optimizer/comm overhead post-fusion.

Empirical Results and Model Evaluation

ZAYA1-base exhibits benchmark results competitive with or exceeding much larger models: it matches Qwen3-4B and Gemma3-12B on MMLU-pro and mathematics/coding benchmarks, and outperforms Llama-3-8B and OLMoE (Figure 14). Pass@64 reasoning results are strong, approaching SOTA in base model capabilities without post-training. The model's design demonstrates high efficiency per active parameter, especially in data-scarce and compute-constrained environments. Figure 14

Figure 14: ZAYA1-base performance scaling compared against leading base models on TTFT and MMLU-pro.

Practical and Theoretical Implications

The work establishes that AMD MI300X and Pollara network stacks are fully mature and robust for production-scale LLM training, with ROCm-based software ecosystems supporting direct migration of existing training pipelines. The architectural advances in ZAYA1—CCA-driven attention, expressive routing, and residual scaling—raise the achievable efficiency frontier in MoE and suggest a trend away from excessive expert proliferation and residual expert reliance. The divergence from high top-kk in expert routing, enabled by sufficient router expressivity, contradicts prevailing MoE dogma.

The detailed methodology and open insights into systems, hardware benchmarking, and training infrastructure provide a practical guide for future AMD-based foundation model development. Emerging theoretical opportunities include further optimization of expert specialization, router architecture, and context extension mechanisms suited to high-bandwidth, large-memory GPU clusters.

Conclusion

This paper demonstrates that large-scale foundation model training is feasible and performant on AMD MI300X/Pollara platforms, matching or exceeding leading benchmarks with efficient, novel architecture design and thorough systems engineering. The presented techniques for kernel fusion, fault-tolerance, and communication optimization are instrumental for scaling robustness. ZAYA1's architectural choices and empirical results highlight new directions for MoE specialization, sparse inference, and compute/data-efficient pretraining—substantially advancing both practical deployment and future research in high-throughput AI systems.

Reference: "Training Foundation Models on a Full-Stack AMD Platform: Compute, Networking, and System Design" (2511.17127)

Slide Deck Streamline Icon: https://streamlinehq.com

Whiteboard

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Explain it Like I'm 14

What is this paper about?

This paper is like a field report from engineers who trained a large AI LLM using only AMD hardware (AMD MI300X GPUs and AMD’s Pollara network). They didn’t just train a model—they carefully measured how fast the hardware, memory, and network really are for this kind of work, tuned their software to match, and built a model called ZAYA1 that performs competitively with well-known models. Their goal was to show that the “full-stack” AMD setup is ready for serious, large-scale AI training.

What questions were they trying to answer?

Here are the main questions the team set out to answer:

  • Can we train big, modern AI models efficiently using a complete AMD setup (GPUs, network, software)?
  • How fast are the key parts—GPU compute, GPU memory, and networking—when used the way AI training actually uses them?
  • What model shapes and settings (like layer sizes) work best on AMD MI300X GPUs?
  • What tricks and tools (like fault tolerance and smart saving) are needed to keep long training runs stable and efficient?
  • Can a model trained on this setup match or beat other strong models?

How did they do it?

To make this clear, think of the whole training system as a city with highways (networks), power plants (GPUs), and city planners (software). The authors:

  • Built the “city”: Each computer (node) has 8 AMD MI300X GPUs. Each GPU has its own super-fast network cable (a 400 Gbps Pollara NIC), like giving every car its own lane onto the highway. Inside a node, GPUs talk over a fast local link (InfinityFabric). There are two separate networks: one for training “chatter” between GPUs (the Pollara fabric) and one for file I/O like loading data and saving checkpoints (like having a separate road for delivery trucks so they don’t block commuters).
  • Timed everything with “stopwatches” (microbenchmarks):
    • Memory bandwidth: How fast can the GPU read/write its HBM memory? They used realistic tests (not just special show-off tests) to match real training behavior.
    • Compute speed: They tested many sizes of matrix multiplications (GEMMs), which are the bread-and-butter math of AI models, and picked the fastest algorithms for each size.
    • Network “group calls” (collectives): They measured how fast standard operations like all-reduce and all-gather run when many GPUs share information. They varied message sizes and number of GPUs to find the sweet spots.
  • Designed the model to fit the hardware (like tailoring a suit):
    • They created ZAYA1, a mixture-of-experts (MoE) transformer. Think of MoE like a team of specialists: for each token, the model picks the best expert to handle it rather than making everyone work on everything.
    • They used CCA (Compressed Convolutional Attention), which does attention in a smaller “compressed” space, like taking notes instead of copying entire pages. This cuts compute and memory, especially for long inputs.
    • They improved the “router” (the part that picks which expert handles which token) by using a small neural network (an MLP) with a memory of past choices (EDA). This helps specialists truly specialize and stay balanced.
    • They added residual scaling, like giving each layer a volume knob to control how much information passes through.
    • They tuned sizes (like hidden dimensions and head counts) to the shapes that AMD hardware likes best.
  • Kept training stable and efficient:
    • They used Megatron-LM adapted to AMD’s ROCm stack, plus tools from AMD’s Primus.
    • They used the Muon optimizer, which is memory-friendlier than AdamW and works well at large batch sizes.
    • They used simple data parallel training first, and later “context parallelism” (splitting the long input across GPUs) for very long sequences.
    • They built practical tools for fault tolerance and “checkpoint reshaping” (think game save files that still work even if you change how the game is split across GPUs).

What did they find?

Below are the highlights of what mattered and what worked best:

  • AMD end-to-end works at scale: The AMD GPUs (MI300X), networking (Pollara), and software (ROCm stack) are good enough today for competitive large-scale training.
  • Memory matters, and they measured it realistically: They showed what sustained HBM memory bandwidth looks like for real AI workloads, not just perfect lab tests. This helps predict actual training speed.
  • Bigger, well-shaped matrix multiplies are faster: On MI300X, large GEMMs reach peak speed more easily, especially when the “inner size” (K) is bigger. They created lookup tables so the fastest algorithm is chosen automatically for each size.
  • Network efficiency depends on message size:
    • Small messages are slow because of fixed overhead (latency), like waiting at many traffic lights.
    • Bigger messages fill the “pipe” and hit max bandwidth.
    • During training, they set the “fusion buffer” (how many gradients to bundle together) right at the point where bandwidth saturates—big enough to be fast, but not so big it’s hard to overlap with compute.
  • Inside a node, AMD’s GPU link (xGMI) likes full participation: You get the best speed when all 8 GPUs in a node join a collective together. This influenced how they split the work (parallelism) so they avoid patterns that waste bandwidth.
  • CCA attention helped a lot: Because CCA compresses attention, they cut both compute and memory, making long-context training (up to 32k tokens here) much cheaper. Training at 32k was almost as efficient as at 4k.
  • Simple parallelism was enough, thanks to big GPU memory: Each MI300X has 192 GB HBM, so they could train with straightforward data parallelism for 4k context. For longer context, they added context parallelism without needing heavy tensor/expert parallelism across nodes.
  • MoE design choices paid off:
    • They used many small experts (fine-grained) with top-1 routing (pick one best expert), guided by a smarter router.
    • This improved training speed and inference latency (responses are faster when only one expert runs).
    • The router improvements made expert balancing easier and improved performance.
  • ZAYA1 performed strongly:
    • ZAYA1-base has about 760 million “active” parameters per token (because of top-1 experts) and 8.3 billion total parameters overall.
    • It matches or beats models like Qwen3-4B and Gemma3-12B at similar or larger scales, and outperforms Llama-3-8B and OLMoE on reasoning, math, and coding tests.

Why does this matter?

This work is important for three big reasons:

  • More choice in AI hardware: It shows that AMD’s full stack—GPUs, networking, and software—is ready for large-scale AI training, giving companies and researchers a real alternative to NVIDIA-based systems.
  • Practical playbook for builders: The paper shares hard-won lessons on how to measure real performance, how to size model layers for the hardware, how to tune network message sizes, and how to keep long training jobs stable. This helps others avoid common pitfalls and get better speed for the same cost.
  • Better models, cheaper and faster: Designs like CCA and a smarter MoE router let models handle long inputs and tough tasks more efficiently. That means faster training, quicker responses at inference time, and potentially lower costs—useful for everything from coding assistants to long-document understanding.

In short, the team shows that with smart engineering and model design, AMD hardware can train competitive AI models at scale. They provide clear measurements, practical rules of thumb, and a promising model (ZAYA1) that benefits from these choices—and that paves the way for even longer contexts and better performance in future work.

Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Knowledge Gaps

Unresolved gaps, limitations, and open questions

Below is a concise list of concrete gaps the paper leaves open. Each item highlights what is missing or uncertain and points to actionable next steps for future work.

  • Reproducibility artifacts are not provided: the internal Megatron-LM fork, ROCm patches, RCCL configuration, fused kernels (RMSNorm, optimizer), microbenchmark code, static GEMM tuning tables, and exact training configs (optimizer hyperparameters, fusion buffer sizes, RCCL env vars) are not released.
  • Dataset composition and processing are under-specified: precise mixture ratios per phase, deduplication, filtering, quality scoring, multilingual balancing, contamination control with evaluation sets, and licensing/ethics safeguards are not described.
  • Evaluation details are incomplete: the paper claims competitive performance but does not provide a full, reproducible benchmark suite (task lists, prompts, decoding params, seeds), statistical variance, or ablation-backed evidence across reasoning, math, coding, multilingual, and long-context tasks.
  • Long-context capabilities are asserted but not validated: no quantitative evaluation at 32k (or further) is shown; the path to 128k is discussed but not empirically demonstrated; effects of RoPE base extension to 1M on retention and stability are unreported.
  • Lack of end-to-end throughput/efficiency numbers: tokens/sec per GPU, iteration-time breakdowns with overlap, GPU utilization, and overall compute efficiency (vs hardware peaks) are not reported, making it hard to verify “competitive pretraining” claims.
  • Energy/cost-efficiency unreported: no measurements of power draw, energy per token, thermal behavior, or TCO comparisons versus NVIDIA platforms.
  • MoE capacity management unspecified: capacity factor, token dropping policy (and frequency), overflow handling, and their impact on quality/efficiency are not described for top-1 routing.
  • Router ablations are missing: there is no quantitative comparison of the ZAYA1 router (MLP + EDA + PID-style balancing) against linear routers across scales, nor sensitivity to router dimension R, depth, EDA weight γ, PID hyperparameters, or balancing time-constants.
  • Expert specialization evidence is limited: the paper asserts increased specialization but provides no metrics (expert entropy, mutual information, specialization clustering, expert drift over training, specialization persistence across domains).
  • Stability under non-stationary data mixes is unexplored: how the router’s PID balancing reacts to rapid distribution shifts, curriculum transitions across phases, or long-context phases is unknown.
  • MoE scalability without expert-parallel all-to-all is untested: ZAYA1 uses local experts (no inter-node all-to-all). It remains unclear how AMD Pollara/RCCL perform for expert-parallel all-to-all at larger MoE scales or higher top-k.
  • Context-parallel (CP) design leaves training-time limitations unresolved: ring attention induces intra-node P2P traffic that underutilizes xGMI; a collective-friendly CP for training (analogous to tree attention) is not presented or benchmarked.
  • Inference performance is not quantified: latency/throughput vs batch size, prompt length, and output rate; KV-cache memory footprint; top-1 vs top-2 routing trade-offs; serving stack details; dynamic batching; and quantization support on MI300X are not provided.
  • CCA design space is underexplored: no systematic sweep over query/KV compression ratios, kernel sizes, grouped/depthwise conv choices, or head fraction choices; trade-offs in quality vs speed/KV compression across sequence lengths are not shown.
  • CCA failure modes are uncharacterized: impact on in-context learning, long-range recall beyond 32k, cross-attention behavior under extreme sparsity, and sensitivity to token distributions remain open.
  • GEMM tuning robustness is unclear: static lookup tables may be brittle across ROCm/firmware versions and shape drift due to MoE imbalance; there’s no evaluation of dynamic tuning overhead, cold-start behavior, or fallback performance.
  • HBM bandwidth measurement generality is limited: benchmarks focus on contiguous memcpy-like patterns; non-contiguous, strided, or scatter/gather access patterns common in modern kernels (e.g., FlashAttention, fused ops) are not profiled.
  • RCCL microbenchmarks omit all-to-all and tail behavior: only core collectives (all-reduce, reduce-scatter, all-gather, broadcast) are shown; all-to-all performance, tail latency under load, multi-job interference, and failure-induced retransmissions are not evaluated.
  • Rails-only topology trade-offs are not quantified end-to-end: the penalty vs Clos under real training workloads, sensitivity to cross-rail traffic, congestion hotspots, and job placement strategies are not analyzed.
  • Fault tolerance lacks quantitative validation: checkpoint size/throughput, reshape latency, recovery time objective (RTO), failure injection experiments (e.g., GPU/node loss mid-step), checkpoint integrity checks, and metadata/versioning resilience are not reported.
  • ZeRO stage choices are narrow: only ZeRO-1 is used; trade-offs versus ZeRO-2/3 on MI300X (with lower HBM or larger models) and their RCCL traffic patterns are untested.
  • Global batch scheduling is not ablated: increasing from 16M to 30M tokens improves throughput, but effects on generalization, optimization stability (with Muon), and per-expert gradient noise scale are not quantified.
  • Muon optimizer analysis is incomplete: computational overhead of 5 Newton–Schulz iterations, sensitivity to iteration count, numerical stability in bf16, memory/computation savings vs AdamW, and quality/efficiency trade-offs across tasks and scales are not provided.
  • Precision modes are limited: no results with FP8 (supported on MI300X), mixed-precision variants, or stochastic rounding—potentially leaving significant performance on the table.
  • Portability to smaller-memory GPUs is not addressed: the design depends on 192 GB HBM (ZeRO-1 only, large microbatches). Guidance for 80–120 GB-class GPUs (e.g., MI250, A100/H100) is missing.
  • Multi-tenant and heterogeneous cluster behavior is untested: Pollara/RCCL performance under co-scheduled jobs, mixed message sizes, and overlapping training/inference workloads is not assessed.
  • Software maturity caveats are not enumerated: ROCm/RCCL “nuances that materially affect throughput” are mentioned but not listed concretely (e.g., required env vars, known bugs/workarounds, topology hints), limiting transferability.
  • Data I/O pipeline lacks measurement: end-to-end dataloader throughput, prefetch depth, NVMe/OS cache effects, compression formats, CPU pinning, and the achieved overlap with GPU compute are not reported.
  • Security/compliance considerations for data are absent: PII handling, redaction, and auditability of dataset curation are not documented.
  • Scaling laws and compute–data trade-offs are not studied: no empirical scaling curves vs parameters/tokens; no analysis of small-MoE vs dense models at fixed FLOPs on this hardware.
  • Generality across domains and languages is unclear: although multilingual data is included, there is no domain- or language-specific evaluation to confirm benefits or detect regressions.
  • Comparison to NVIDIA baselines is missing: without side-by-side end-to-end training/inference and network comparisons (e.g., H100 + IB 400G), it is hard to contextualize the “mature and optimized” claim for the AMD stack.
  • Kernel availability and integration gaps remain: it is unclear whether all kernels (FlashAttention variants, convs for CCA, fused norms, fused optimizer ops) are production-ready in ROCm TE or require custom forks.
  • Expert count/topology design rules are incomplete: guidance for expert count per layer, local vs global experts, and how to transition to expert-parallel all-to-all as model size grows is not provided.
Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Practical Applications

Practical, real-world applications derived from the paper

Below are actionable applications grounded in the paper’s findings, methods, and system innovations. Each item includes sector linkages, potential tools/products/workflows, and key assumptions or dependencies.

Immediate Applications

  • AMD-first LLM pretraining pipelines for industry AI teams
    • Sector: software, cloud/AI infrastructure
    • What: Port NVIDIA workflows to ROCm by replicating the paper’s end-to-end training stack (Megatron-LM fork + AMD Primus components, RCCL communication, hipBLASLt/rocBLAS GEMM tuning, Muon optimizer).
    • Tools/workflows: ROCm, RCCL, Primus (monitoring + kernels), PyTorch TunableOp, TransformerEngine, hipBLASLt-bench, fused RMSNorm/LN, hardening fault-tolerance and checkpoint reshaping services.
    • Assumptions/dependencies: Access to MI300X GPUs and Pollara networking; ROCm maturity; internal engineering capacity for kernel selection/tuning; robust data pipelines.
  • Cost-efficient rails-only network design for large-scale training clusters
    • Sector: data center/HPC, cloud infrastructure
    • What: Adopt the two-level rails-only topology with per-GPU Pollara NICs to reduce switch cost and wiring complexity, separating training fabric from VPC/storage networks to avoid IO contention.
    • Tools/workflows: Pensando Pollara 400Gbps per-GPU NICs, dual fabric (training vs VPC/storage), GPU-direct RDMA, cluster placement that minimizes cross-rail traffic.
    • Assumptions/dependencies: Accurate mapping of parallelism topology to rails-only fabric, sufficient message sizes to saturate bandwidth, appropriate collective algorithm selection.
  • Performance tuning via AMD-specific collective and memory microbenchmarks
    • Sector: software, HPC
    • What: Use the paper’s microbenchmarks to calibrate fusion buffer sizes (e.g., gradients) to the bandwidth “saturation point,” select two-level collective algorithms, and avoid small-message latency regimes on Pollara/InfinityFabric.
    • Tools/workflows: RCCL collectives (all-reduce, reduce-scatter, all-gather), intra-node xGMI-aware collectives, PyTorch memcpy kernel tuning, BabelStream-like memory access benchmarking.
    • Assumptions/dependencies: Awareness of xGMI constraints (full-node participation for max bandwidth); accurate fusion-buffer sizing; observability of iteration breakdowns.
  • Hardware-aware transformer sizing and kernel shape optimization
    • Sector: software, robotics (onboard compute), energy (edge analytics), finance (low-latency inference)
    • What: Apply MI300X-aware sizing (GEMM shapes, divisibility constraints, batch scheduling) to maximize BF16 throughput and training stability; target ~200 GFLOPs per GEMM for peak throughput.
    • Tools/workflows: Static lookup tables for GEMM algorithm selection (hipBLASLt/rocBLAS), sizing rules for hidden/head dimensions, vocabulary divisibility, batch×sequence constraints, minimizing tensor/expert parallel degrees.
    • Assumptions/dependencies: Stable static tuning tables; predictable MoE balancing bands; adherence to divisibility rules to avoid kernel performance cliffs.
  • Deploy small-MoE, top-1 routing for faster, cheaper inference
    • Sector: software/SaaS, consumer apps, customer support
    • What: Use the paper’s “MoE++” design (fine-grained experts, top-1 routing) to cut inference latency and cost versus larger top-k MoEs or dense models while maintaining quality.
    • Tools/workflows: ZAYA1 router (MLP + EDA + PID balancing), removal of residual experts, lightweight residual scaling for stability.
    • Assumptions/dependencies: Router implementation quality; robust balancing across microbatches; minimal expert skew; availability of optimized top-1 dispatch kernels.
  • Long-context training and inference using CCA + context parallelism
    • Sector: education (paper assistants), legal (document analysis), healthcare (clinical notes), enterprise knowledge management
    • What: Train/infer at 32k+ context with Compressed Convolutional Attention (CCA) to reduce prefill compute and KV cache size; pair with context parallelism for predictable scaling; use tree attention for long-context inference on xGMI.
    • Tools/workflows: CCA (query/kv compression, conv blocks), ring/tree attention, RoPE base extension (10k→1M), staged context extension schedule, CP tuned to CCA.
    • Assumptions/dependencies: Quality CCA implementation; careful CP/message shaping to avoid xGMI bottlenecks; appropriate long-context datasets (books, legal briefs); scheduler changes for stability.
  • Fault-tolerance and checkpoint reshaping at scale
    • Sector: cloud/AI ops (MLOps, AIOps)
    • What: Adopt the paper’s practical fault-tolerance setup and “reshape service” to switch parallelism regimes mid-run; accelerate checkpoint writes to reduce IO-induced training stalls.
    • Tools/workflows: Checkpoint reshaping between ZeRO-1 and CP; accelerated PyTorch checkpoint IO; separate VPC fabric for storage; monitoring of iteration-time breakdown (compute vs comms vs IO).
    • Assumptions/dependencies: Reshape tooling integration with training framework; consistent optimizer state sharding; storage node bandwidth (200Gbps SmartNIC/DPUs).
  • Domain fine-tuning on ZAYA1-base for math, coding, and reasoning
    • Sector: software, finance (quant research), education (tutoring), robotics (task planning via language)
    • What: Use ZAYA1-base (8.3B total / 760M active parameters) as a cost-effective base for domain-specific adapters/fine-tunes given its strong math, reasoning, and coding performance.
    • Tools/workflows: LoRA/adapters atop MoE++ architecture; curated domain datasets; post-training SFT/RL; evaluation on domain benchmarks.
    • Assumptions/dependencies: Licensing/availability of ZAYA1-base; continued performance parity in downstream tasks; robust evaluation and safety guardrails.
  • Procurement and policy guidance for vendor diversification
    • Sector: public sector, national labs, enterprise procurement
    • What: Use the case paper to inform AMD-based AI infrastructure purchases, stressing separated fabrics (training vs storage), per-GPU NICs, and rails-only topologies for cost-performance balance.
    • Tools/workflows: RFP templates that include ROCm readiness, collective benchmark requirements, and cluster topology constraints; SLAs covering microbenchmark-backed throughput targets.
    • Assumptions/dependencies: Supply chain availability of MI300X/Pollara; organization’s ROCm readiness; staffing/training for AMD stack operations.

Long-Term Applications

  • Automated hardware-aware model sizing toolchains
    • Sector: software tooling, cloud AI platforms
    • What: Productize the paper’s sizing rules into automated tools that generate model configs (hidden sizes, head dimensions, vocab divisibility) and static GEMM tuning tables for a given hardware/cluster topology.
    • Tools/workflows: Integration with PyTorch, HIP/ROCm; autotuning services that learn optimal algorithms per shape; CI pipelines to refresh tuning as ROCm evolves.
    • Assumptions/dependencies: Stable APIs across ROCm releases; broad hardware coverage (AMD consumer/server GPUs); continuous benchmarking infrastructure.
  • Standardized ROCm-first collective algorithms and CP libraries
    • Sector: open-source ecosystems, HPC
    • What: Develop collective algorithms purpose-built for xGMI constraints (full-node participation), plus tree-attention libraries optimized for AMD, making long-context inference robust without bespoke tuning.
    • Tools/workflows: RCCL enhancements, tree-attention primitives adapted to AMD fabrics, message-pipelining strategies for Pollara saturation.
    • Assumptions/dependencies: Community adoption; upstreaming into PyTorch/Transformers; cooperation with AMD on firmware/drivers.
  • Cloud services offering managed AMD LLM training/inference
    • Sector: cloud providers, AI platforms
    • What: Commercial offerings that package MI300X clusters with Pollara networking, preloaded ROCm stacks, and “MoE++” recipes (CCA, ZAYA1 router, residual scaling, Muon) for turnkey training at scale.
    • Tools/workflows: Managed Megatron-LM forks, Primus-based monitoring, prebuilt fault-tolerance/reshape services, curated long-context datasets.
    • Assumptions/dependencies: Market demand; SLAs around performance and availability; multi-tenant isolation for rails-only topologies.
  • Energy-efficient, low-latency inference via small-MoE on edge/embedded
    • Sector: robotics, IoT, autonomous systems
    • What: Translate small-MoE top-1 routing and residual scaling to embedded AMD GPUs or future edge accelerators for real-time language reasoning and planning.
    • Tools/workflows: Lightweight MoE dispatch kernels; quantization/post-training compression; latency-aware batching; cache management using CCA-like compression.
    • Assumptions/dependencies: Edge hardware support for ROCm-like stacks; efficient MoE runtimes; memory bandwidth sufficiency on embedded devices.
  • Long-context assistants for professional workflows
    • Sector: legal, healthcare, finance, education
    • What: Assistants that handle book-length inputs (case law, medical records, financial filings, textbooks) by leveraging CCA + CP + tree attention; integrate domain-specific fine-tuning and safety filtering.
    • Tools/workflows: 32k–128k context models; retrieval-augmentation aligned with compressed KV caches; audit trails and explainability for regulated domains.
    • Assumptions/dependencies: Reliable long-context generalization; robust factuality and privacy controls; regulatory approvals in sensitive sectors.
  • Muon-first optimizers for ultra-large-batch training at scale
    • Sector: cloud/HPC AI training
    • What: Broad adoption of second-order-inspired optimizers (e.g., Muon) with Newton–Schulz iterations to push batch sizes upward, improving utilization on very large clusters.
    • Tools/workflows: Unified LR schemes across parameter types; optimizer-specific kernel fusion (matrix–matrix transpose); automated stability diagnostics.
    • Assumptions/dependencies: Continued evidence of convergence/stability; wide support in frameworks; handling optimizer state memory in extreme-scale runs.
  • Vendor-neutral benchmark suites and procurement standards
    • Sector: policy, standards bodies, enterprise IT
    • What: Formalize collective/memory/GEMM microbenchmarks, iteration breakdown reporting, and rails-only topology best practices into vendor-neutral standards for AI infrastructure evaluation.
    • Tools/workflows: Public benchmark repositories; certification programs; reporting templates for throughput/latency across message sizes and node counts.
    • Assumptions/dependencies: Cross-vendor collaboration; updates as hardware/software stacks evolve; alignment with public-sector procurement needs.
  • Cross-fabric algorithm co-design (compute–network aware models)
    • Sector: research, hardware-software co-design
    • What: New training/inference algorithms that explicitly optimize for the compute/memory/network rooflines (e.g., dynamic adjustments to top-k, expert widths, attention compression) based on real-time cluster telemetry.
    • Tools/workflows: Runtime schedulers that adjust model knobs per iteration; telemetry-driven collective selection; adaptive fusion-buffer sizing.
    • Assumptions/dependencies: High-fidelity monitoring; safe dynamic reconfiguration; guarantees on training stability.
  • Open, reproducible AMD training curricula for academia
    • Sector: academia, workforce development
    • What: University courses and labs built around ROCm, RCCL, GEMM tuning, CP/tree attention, and MoE++ architectures to train the next generation of system-aware ML engineers.
    • Tools/workflows: Teaching materials, open datasets, lab exercises replicating the paper’s methodology (HBM bandwidth measurement, collective scaling, sizing sweeps).
    • Assumptions/dependencies: Access to AMD hardware or cloud credits; sustained community support; modular course content.
  • Government and enterprise resilience through hardware diversification
    • Sector: public sector, large enterprises
    • What: Strategic diversification away from single-vendor dependence by establishing AMD-based AI centers of excellence, reducing supply-chain risk while maintaining competitive model performance.
    • Tools/workflows: Programmatic investment, ROCm enablement training, standardized migration playbooks, benchmarking KPIs for continuous improvement.
    • Assumptions/dependencies: Budget and organizational will; vendor support; measurable ROI compared to incumbent stacks.
Ai Generate Text Spark Streamline Icon: https://streamlinehq.com

Glossary

  • Activation memory: The portion of memory used to store intermediate activations during forward/backward passes, often a bottleneck at long contexts. Example: "The substantial reductions in attention FLOPs and activation memory that CCA provided also made context extension significantly easier than for alternative attention methods"
  • All-gather: A collective communication operation that gathers data from all processes to all processes. Example: "microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast)"
  • All-reduce: A collective communication that aggregates values (e.g., sums) across processes and distributes the result back to all. Example: "microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast)"
  • BF16 (bfloat16): A 16-bit floating point format with a wide exponent for training efficiency on modern accelerators. Example: "Most of a model's iteration time is spent performing BF16 GEMM kernels (see Figure~\ref{fig:iter-breakdowns})."
  • Broadcast: A collective operation that sends data from one process to all others. Example: "microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast)"
  • Bus bandwidth: The effective data transfer rate measured across a communication bus in benchmarks. Example: "(a) AllReduce Bus Bandwidth"
  • CCA (Compressed Convolutional Attention): An attention mechanism that performs sequence-mixing in a compressed latent space to reduce compute and KV-cache size. Example: "Compressed Convolutional Attention (CCA) \citep{cca}, which executes attention in a compressed latent space to reduce both prefill compute and KV-cache size;"
  • CCGQA: A combined/variant attention configuration that integrates compression with grouped-query techniques. Example: "For attention, we utilized CCGQA with a query compression rate of 2×2\times and a KV compression rate of 8×8\times."
  • Clos topology: A multi-stage, fat-tree network topology used for high-bandwidth, non-blocking switching. Example: "Instead of the classical Clos \citep{clos} topology, individual nodes are interconnected in a rails-only topology"
  • Composable Kernel (CK): An AMD library of specialized GPU kernels for high-performance workloads. Example: "languages and libraries such as Heterogeneous-Compute Interface for Portability (HIP) \citep{amd_hip}, or Composable Kernel (CK) \citep{amd_composable_kernel}."
  • Context parallelism (CP): A parallelization strategy that partitions the sequence/context across devices to scale long-context training/inference. Example: "Beyond the core model architecture, our training run used context parallelism (CP) (Section \ref{sec:context-parallelism}) tailored to CCA"
  • Data parallelism: A distributed training approach that replicates the model across devices and splits batches of data among them. Example: "We pretrained solely using data-parallelism with the ZeRO-1 distributed optimizer"
  • DPU (Data Processing Unit): A programmable network processor offloading networking/storage tasks from CPUs/GPUs. Example: "The VPC network connects each node via Pensando DSC SmartNIC/DPUs to a separate leaf--spine fabric"
  • EDA (Exponential Depth Averaging): A mechanism that mixes routing information across layers with a learned coefficient to stabilize routing. Example: "using a mechanism we call Exponential Depth Averaging (EDA), since it represents an improvement to Depth-Weighted Averaging"
  • Expert-parallelism: Parallelization that partitions Mixture-of-Experts across devices, routing tokens to different experts. Example: "such as expert- and tensor-parallelism \citep{gshard,megatron,anthony2024comms}"
  • Flash Attention: An IO-aware attention algorithm optimized to reduce memory traffic and improve speed. Example: "Even IO-aware attention algorithms like Flash Attention \citep{dao2023flashattention2} are bottlenecked by the bandwidth of their HBM accesses until they reach higher sequence lengths."
  • FLOPs: Floating-point operations, a measure of compute used or required by a model. Example: "CCA also improves training speed vs GQA and MLA and significantly reduces the FLOPs required for prefill"
  • Fusion buffer: A buffer that aggregates many small tensors (e.g., gradients) into a larger message to improve communication efficiency. Example: "Distributed training frameworks such as ours provide a fusion buffer to batch gradient tensors into, and then communicate the fused buffer."
  • GEMM (General Matrix Multiplication): A core linear algebra operation central to deep learning compute. Example: "boil down those ops into their component set of memory accesses and General Matrix Multiplications (GEMMs)"
  • GQA (Grouped-Query Attention): An attention variant that shares key/value across groups of query heads to reduce memory and compute. Example: "such as MLA and GQA \citep{ainslie2023gqa,deepseekv3}"
  • GPU-direct RDMA: A technology enabling direct GPU memory access over the network, reducing CPU involvement and latency. Example: "software that heavily relies upon GPU-direct RDMA."
  • HBM (High Bandwidth Memory): On-package memory providing very high bandwidth for GPUs. Example: "MI300X GPUs have 192\,GB HBM"
  • HIP (Heterogeneous-Compute Interface for Portability): AMD’s CUDA-like API for GPU programming on ROCm. Example: "languages and libraries such as Heterogeneous-Compute Interface for Portability (HIP) \citep{amd_hip}"
  • hipBLASLt: A ROCm linear algebra backend providing tunable GEMM algorithms. Example: "ROCm provides multiple GEMM backends (rocBLAS and hipBLASLt), each containing many algorithms."
  • InfinityFabric: AMD’s high-bandwidth intra-node GPU interconnect. Example: "Each node contains 8 MI300X GPUs interconnected with InfinityFabric."
  • KV-cache: Cached key/value tensors from prior tokens to avoid recomputation during autoregressive inference. Example: "The KV-cache is a purely inference-time structure that stores the KV states of prior tokens so that they do not require recomputation for each subsequent token."
  • Latency: The fixed time overhead before data transfer or computation proceeds, dominating small-message operations. Example: "(d) AllReduce Latency"
  • Leaf–spine fabric: A two-tier data center network architecture providing predictable bandwidth and scale. Example: "to a separate leaf--spine fabric"
  • Megatron-LM: A large-scale transformer training framework optimized for model/data parallelism. Example: "Our core training framework is a forked internal version of Megatron-LM adapted for the AMD stack."
  • MLA: An attention variant referenced alongside GQA as a contemporary baseline. Example: "with MLA or GQA attention"
  • MoE (Mixture-of-Experts): An architecture that routes tokens to specialized expert sub-networks to improve parameter efficiency. Example: "We report on the first large-scale mixture-of-experts (MoE) pretraining study on pure AMD hardware"
  • MPI (Message Passing Interface): A standardized communication protocol for parallel computing. Example: "collective communication libraries (e.g. RCCL/NCCL/MPI) to design two-level collective algorithms"
  • Muon optimizer: A second-order-inspired optimizer using matrix preconditioning with Newton–Schulz updates. Example: "ZAYA1-base is trained using the Muon optimizer \citep{jordan6muon}."
  • NCCL (NVIDIA Collective Communications Library): NVIDIA’s library for multi-GPU collectives. Example: "collective communication libraries (e.g. RCCL/NCCL/MPI)"
  • Newton–Schulz iterations: Iterative method for approximating matrix inverses used in Muon’s preconditioning. Example: "We performed five Newton-Schulz iterations per gradient step."
  • NIC (Network Interface Card): Hardware that connects a device to a network fabric. Example: "Each GPU is assigned its own Pollara 400Gbps NIC"
  • NVLink: NVIDIA’s high-bandwidth GPU interconnect. Example: "such as AMD InfinityFabric \citep{amd2024cdna3} or NVIDIA NVLINK \citep{nvidia_nvlink}."
  • NVSwitch: A switch enabling all-to-all high-bandwidth connectivity among GPUs within a node. Example: "Most DL software makes the assumption of a switched intra-node topology like NVSwitch."
  • PID controller: A feedback-control mechanism using proportional, integral, and derivative terms, adapted here for load balancing. Example: "inspired by proportional–integral–derivative (PID) \citep{aastrom2006pid} controllers from classical control theory."
  • Pipeline parallelism: A model-parallel method that partitions layers across devices and pipelines microbatches. Example: "pipeline parallelism \citep{rajbhandari2020zero, gpipe}"
  • Pollara interconnect: AMD Pensando high-speed network fabric for GPU clusters. Example: "utilizing both MI300X GPUs with Pollara interconnect."
  • Prefill: The initial inference phase processing the prompt before generating the first token. Example: "We use the term prefill to denote the first phase of inference, where the model must first ingest the entire input sequence and generate the first token."
  • Primus framework: AMD’s ROCm-based tooling and kernels for large-model training. Example: "we integrated and utilized several components from AMD's Primus framework \citep{primus2025}"
  • Rails-only topology: A cost-efficient cluster network design trading path diversity for simpler wiring and aligned communication patterns. Example: "individual nodes are interconnected in a rails-only topology"
  • RCCL (Radeon Collective Communications Library): AMD’s collective communication library analogous to NCCL. Example: "RCCL collective operations performance across InfinityFabric within a node."
  • Reduce-scatter: A collective that reduces data across processes and scatters partitions of the result to each. Example: "microbenchmarks for all core collectives (all-reduce, reduce-scatter, all-gather, broadcast)"
  • Residual scaling: Learned per-layer scaling and biasing of the residual stream to control information flow. Example: "The final architectural innovation in ZAYA1-base is residual scaling."
  • Ring attention: A sequence-parallel attention algorithm using ring-style communication. Example: "ZeRO-1 + context parallelism (ring attention \citep{liu2023ring}) when scaling context up to 32768."
  • RMSnorm: A normalization layer that scales activations by their root-mean-square without centering. Example: "a fused RMSnorm/LN kernel"
  • RoPE (Rotary Positional Embeddings): A positional encoding scheme rotating query/key vectors by position-dependent phases. Example: "We applied RoPE \citep{su2023rotary} to half the channels in each head, leaving the other half without position embeddings."
  • Roofline model: A performance model relating compute and memory bandwidth limits to identify bottlenecks. Example: "The crossover point is dependent on the particular GPU's roofline"
  • Sharded optimizer states: Partitioning optimizer state tensors across devices to reduce memory usage. Example: "with the ZeRO-1 distributed optimizer, where only the optimizer states are sharded."
  • SmartNIC: A NIC with onboard processing (e.g., offloads), often used for storage/management traffic isolation. Example: "The VPC network connects each node via Pensando DSC SmartNIC/DPUs to a separate leaf--spine fabric"
  • Tensor-parallelism: Splitting tensor dimensions of layers across devices to increase model size without replication. Example: "such as expert- and tensor-parallelism \citep{gshard,megatron,anthony2024comms}"
  • Top-k routing: Selecting the k highest-scoring experts per token in MoE routers. Example: "The scores are then used to select the chosen expert via top-k operation:"
  • Tree attention: A communication-efficient attention method organizing exchanges in a tree rather than a ring. Example: "Long-context inference relies heavily on tree attention \citep{shyam2024tree} to avoid the bandwidth bottleneck of xGMI."
  • Two-level collective algorithms: Collective strategies that exploit fast intra-node links to improve inter-node scaling. Example: "collective communication libraries (e.g. RCCL/NCCL/MPI) to design two-level collective algorithms that rely heavily on the intra-node fabric"
  • VPC network: A logically isolated network used here for storage and management traffic. Example: "while a VPC network manages dataset I/O, checkpoints, and cluster management."
  • xGMI: AMD’s GPU-to-GPU interconnect protocol used within InfinityFabric for intra-node communication. Example: "AMD's InfinityFabric uses xGMI \citep{xgmi}, which requires all GPUs participate in a given collective operation to achieve full bandwidth"
  • ZeRO-1: The first stage of the Zero Redundancy Optimizer that shards optimizer states across data-parallel ranks. Example: "our parallelism topology for ZAYA1 is just the ZeRO-1 distributed optimizer during initial pretraining at 4096 sequence length"
Dice Question Streamline Icon: https://streamlinehq.com

Open Problems

We found no open problems mentioned in this paper.

List To Do Tasks Checklist Streamline Icon: https://streamlinehq.com

Collections

Sign up for free to add this paper to one or more collections.

X Twitter Logo Streamline Icon: https://streamlinehq.com

Tweets

Sign up for free to view the 8 tweets with 326 likes about this paper.

Youtube Logo Streamline Icon: https://streamlinehq.com