High-performance and cost-efficient datacenter network architecture for large-scale LLM training

Determine a datacenter network architecture that simultaneously delivers high performance and cost-efficiency for large-scale large language model training workloads.

Background

The paper surveys common datacenter and HPC interconnect topologies (Clos, 3D Torus, Dragonfly, and Fugaku's Tofu) and argues that none is ideally suited to the distinctive, locality-heavy and collective-communication–intensive traffic patterns of large-scale LLM training. Clos offers flexibility but is expensive due to its extensive use of high-performance switches and optical modules; 3D Torus and similar torus-based designs reduce cost but provide lower NPU-to-NPU bandwidth and struggle with complex collectives such as all-to-all; Dragonfly trims some of the cost but remains expensive and performs poorly on LLM training communication patterns such as P2P and AllReduce.
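To make the cost trade-off concrete, the back-of-the-envelope sketch below (not from the paper; the function names, the switch radix k = 64, and the assumption that every inter-switch link is optical are illustrative choices) compares the standard textbook counts for a 3-tier k-ary fat-tree Clos against a 3D torus hosting the same number of NPUs.

```python
def fat_tree_cost(k: int):
    """Counts for a standard 3-tier k-ary fat-tree built from k-port switches.

    Hosts supported: k^3/4; switches: 5k^2/4 (k^2/4 core + k^2/2 agg + k^2/2 edge);
    inter-switch links (assumed optical in this sketch): k^3/2.
    """
    npus = k ** 3 // 4
    switches = 5 * k ** 2 // 4
    inter_switch_links = k ** 3 // 2
    return {"npus": npus, "switches": switches, "optical_links": inter_switch_links}


def torus_3d_cost(num_npus: int):
    """A 3D torus wires each NPU directly to 6 neighbors; every link is shared
    by two NPUs, so the total is 3 * N links and no switches at all."""
    return {"npus": num_npus, "switches": 0, "neighbor_links": 3 * num_npus}


if __name__ == "__main__":
    k = 64  # illustrative switch radix, not a figure from the paper
    clos = fat_tree_cost(k)
    torus = torus_3d_cost(clos["npus"])
    print("Fat-tree (Clos):", clos)
    print("3D torus       :", torus)
```

For k = 64 the fat-tree serves 65,536 NPUs with 5,120 switches and 131,072 inter-switch optical links, whereas the torus needs no switches and only direct neighbor cabling; the flip side is that each torus NPU sees just six fixed-bandwidth neighbor links, which is exactly the bandwidth and collective-communication limitation noted above.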

Motivated by these trade-offs, the authors identify a gap: no existing architecture achieves both high performance and cost-efficiency for large-scale LLM training. The paper explicitly frames this gap as an open problem, and it is what motivates their proposal of UB-Mesh.

References

"In summary, how to design a high performance and cost-efficient datacenter network architecture for large-scale LLM training is still an open problem."

UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture (arXiv:2503.20377, Liao et al., 26 Mar 2025), end of Subsection 2.3 "Datacenter Network Architectures" (Section 2).