High-performance and cost-efficient datacenter network architecture for large-scale LLM training

Design a datacenter network architecture that simultaneously delivers high performance and cost efficiency for large-scale large language model (LLM) training workloads.

Background

The paper surveys common datacenter and HPC interconnect topologies (Clos, 3D Torus, Dragonfly, and Fugaku's Tofu) and argues that none is well suited to the distinctive, locality-heavy, collective-communication-intensive traffic of large-scale LLM training. Clos offers flexibility but is expensive due to its extensive use of high-performance switches and optical modules; 3D Torus and similar torus-based designs reduce cost but provide lower NPU-to-NPU bandwidth and struggle with complex collectives such as all-to-all; Dragonfly trims some of the cost but remains expensive and performs poorly on LLM training communication patterns such as point-to-point (P2P) transfers and AllReduce.
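
To make the cost comparison concrete, here is a rough back-of-envelope sketch in Python (not from the paper) that counts switches and switch-to-switch links, the components that typically require expensive optics, for a standard k-ary fat-tree Clos fabric versus a direct-connect 3D torus. The formulas are the textbook ones; the concrete sizes are hypothetical, and real deployments (oversubscription, rail-optimized designs, cable reach) will differ.

def clos_fat_tree(k):
    """k-ary fat-tree: k^3/4 hosts, 5k^2/4 switches, k^3/2 switch-to-switch links."""
    hosts = k ** 3 // 4
    switches = 5 * k ** 2 // 4    # k^2 edge + aggregation switches plus k^2/4 core
    optical_links = k ** 3 // 2   # edge-agg plus agg-core links, typically optical
    return hosts, switches, optical_links

def torus_3d(x, y, z):
    """Direct-connect 3D torus: x*y*z nodes, 6 ports per node, 3N links, no switches."""
    nodes = x * y * z
    links = 3 * nodes             # each node "owns" 3 of its 6 torus links
    return nodes, 0, links

for k in (16, 32, 48):
    h, s, o = clos_fat_tree(k)
    print(f"fat-tree k={k}: hosts={h}, switches={s}, optical links~{o}")

for dims in ((8, 8, 16), (16, 16, 16)):
    n, s, l = torus_3d(*dims)
    print(f"3D torus {dims}: nodes={n}, switches={s}, links={l}")

Under these counts, the Clos fabric's switch cost grows roughly as k^2 and its optics cost as k^3, while the torus is switch-free with mostly short, local links; that is the cost asymmetry the survey describes, and the torus pays for it with lower per-pair bandwidth and awkward all-to-all collectives.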

Motivated by these trade-offs, the authors identify a gap: a network architecture that achieves both high performance and cost-efficiency specifically for large-scale LLM training remains unresolved. This gap motivates their proposal of UB-Mesh, and the paper explicitly marks it as an open problem in the text.

References

"In summary, how to design a high performance and cost-efficient datacenter network architecture for large-scale LLM training is still an open problem."

UB-Mesh: a Hierarchically Localized nD-FullMesh Datacenter Network Architecture (arXiv:2503.20377, Liao et al., 26 Mar 2025), end of Subsection 2.3, "Datacenter Network Architectures" (Section 2).