Scaling Intelligence: Designing Data Centers for Next-Gen Language Models (2506.15006v1)

Published 17 Jun 2025 in cs.AR, cs.AI, cs.DC, cs.ET, and cs.PF

Abstract: The explosive growth of LLMs - such as GPT-4 with 1.8 trillion parameters - demands a radical rethinking of data center architecture to ensure scalability, efficiency, and cost-effectiveness. Our work provides a comprehensive co-design framework that jointly explores FLOPS, HBM bandwidth and capacity, multiple network topologies (two-tier vs. FullFlat optical), the size of the scale-out domain, and popular parallelism/optimization strategies used in LLMs. We introduce and evaluate FullFlat network architectures, which provide uniform high-bandwidth, low-latency connectivity between all nodes, and demonstrate their transformative impact on performance and scalability. Through detailed sensitivity analyses, we quantify the benefits of overlapping compute and communication, leveraging hardware-accelerated collectives, wider scale-out domains, and larger memory capacity. Our study spans both sparse (mixture of experts) and dense transformer-based LLMs, revealing how system design choices affect Model FLOPS Utilization (MFU = Model flops per token x Observed tokens per sec / Peak flops of the hardware) and overall throughput. For the co-design study, we extended and validated a performance modeling tool capable of predicting LLM runtime within 10% of real-world measurements. Our findings offer actionable insights and a practical roadmap for designing AI data centers that can efficiently support trillion-parameter models, reduce optimization complexity, and sustain the rapid evolution of AI capabilities.

Summary

  • The paper presents a comprehensive co-design framework that jointly optimizes hardware and software to meet the demands of multi-trillion parameter language models.
  • It demonstrates that FullFlat networks enable 50-70x speedups over current two-tier systems and sustain over 70% Model FLOPS Utilization at up to 65,536 GPUs, while reducing the need for complex performance tuning.
  • The analysis, validated with real system measurements, offers balanced recommendations across compute, memory, and network to optimize performance and TCO.

The paper "Scaling Intelligence: Designing Data Centers for Next-Gen LLMs" (2506.15006) addresses the critical need for rethinking data center architecture to support the immense scale and complexity of next-generation LLMs, such as multi-trillion parameter models like GPT-4. The core problem is that current data center designs often lead to suboptimal performance (low Model FLOPS Utilization - MFU) and struggle with the computational, memory, and communication demands of these models, making training excessively expensive and time-consuming.

To tackle this, the authors propose a comprehensive co-design framework that jointly explores hardware factors (compute FLOPS, High Bandwidth Memory (HBM) capacity and bandwidth, network topology, and scale-out domain size) and software factors: parallelism strategies such as Data Parallelism (DP), Pipeline Parallelism (PP), Tensor Parallelism (TP), Expert Parallelism (EP), and Expert Sharding (ES), along with optimization techniques such as activation recomputation, optimizer sharding, and compute-communication overlap.
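As a rough illustration of the configuration space such a framework sweeps, the sketch below encodes the parallelism degrees and the constraint that they must factor the cluster size. The class and its layout rule (expert parallelism nested inside the data-parallel dimension) are assumptions for illustration, not the paper's actual tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelConfig:
    dp: int  # data parallelism
    pp: int  # pipeline parallelism
    tp: int  # tensor parallelism
    ep: int  # expert parallelism (MoE only), assumed here to divide dp

    def num_gpus(self) -> int:
        # The non-expert parallelism degrees multiply out to the cluster size.
        return self.dp * self.pp * self.tp

    def is_valid(self, cluster_gpus: int) -> bool:
        return self.num_gpus() == cluster_gpus and self.dp % self.ep == 0

# Hypothetical configuration for an 8192-GPU cluster and a 16-expert MoE model.
cfg = ParallelConfig(dp=128, pp=8, tp=8, ep=16)
assert cfg.is_valid(8192)
```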

A key contribution is the introduction and evaluation of a futuristic network architecture called "FullFlat". Unlike traditional two-tier networks with distinct high-bandwidth (scale-up/HBD) and low-bandwidth (scale-out/LBD) domains, FullFlat leverages advanced co-packaged optics (CPO) to provide uniform high-bandwidth, low-latency connectivity between all nodes in the cluster. The paper analyzes this against current (TwoTier-HBD8) and near-future (TwoTier-HBD64/128) two-tiered architectures.

The analysis is performed using a detailed, Python-based analytical performance modeling tool, which extends the open-source Calculon (2303.17783) framework. This extended tool specifically adds capabilities for modeling Mixture of Experts (MoE) models (including expert parallelism, dynamic routing, SwiGLU, expert communication) and various optimization techniques. The tool was validated against real system measurements (Megatron models on A100, Mistral models on H100) with an error margin of less than 10%, making it suitable for early-stage co-design exploration. The paper evaluates performance on clusters up to 65,536 GPUs using models like GPT4-1.8T (16 experts), GPT4-29T (128 experts), and GPT3-175B (dense).
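The extended Calculon tool itself is not reproduced here, but the following toy function sketches the general style of such analytical models: per-layer compute is roofline-bounded by either math throughput or HBM bandwidth, and communication is either overlapped with compute or serialized after it. All names and numbers are illustrative assumptions.

```python
def step_time_estimate(layer_flops: float, peak_flops: float,
                       hbm_bytes: float, hbm_bw: float,
                       comm_bytes: float, net_bw: float,
                       overlap_comm: bool = True) -> float:
    """Coarse per-layer time estimate in the spirit of Calculon-style analytical
    models (not the actual tool): compute time is the larger of the math-bound
    and HBM-bound estimates; communication either hides behind compute or adds
    to it."""
    compute_t = max(layer_flops / peak_flops, hbm_bytes / hbm_bw)
    comm_t = comm_bytes / net_bw
    return max(compute_t, comm_t) if overlap_comm else compute_t + comm_t

# Example: the same layer with and without compute-communication overlap.
t_overlap = step_time_estimate(2e14, 2e15, 1e11, 3e12, 5e9, 450e9, overlap_comm=True)
t_serial  = step_time_estimate(2e14, 2e15, 1e11, 3e12, 5e9, 450e9, overlap_comm=False)
print(t_overlap, t_serial)
```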

Practical Implementation Details and Insights:

  1. Network Topology Impact:
    • Strong Scaling: The paper shows that future systems (TwoTier-HBD64+ and FullFlat) can achieve 50-70x speedups for models like GPT-1.8T compared to current TwoTier-HBD8 systems. FullFlat consistently demonstrates the best strong scaling, maintaining performance gains up to 65,536 GPUs, while two-tiered systems may see performance plateau or degrade due to communication bottlenecks.
    • FullFlat Advantages: FullFlat networks significantly reduce communication bottlenecks and improve MFU (demonstrated to reach 70%+). They enable applying tensor parallelism across node boundaries, which is typically inefficient in two-tiered systems. A major practical benefit is the reduced sensitivity to software optimizations like compute-communication overlap and hardware-accelerated collectives. The paper shows that with FullFlat, the performance gap between optimal and suboptimal configurations (top 5,000) is only about 5%, compared to up to 80% in two-tiered systems (Figure 1), easing performance tuning efforts for practitioners. FullFlat is also anticipated to offer TCO/power savings (20%+) and improved reliability due to the properties of CPO and a flatter topology.
  2. Sensitivity Analysis (using 8192 GPUs as an example):
    • HBD Size: For MoE models, increasing the HBD size dramatically improves performance up to the point where expert communication fits within the HBD. Beyond this, the benefit diminishes, as remaining communication is primarily data/pipeline parallelism over the slower scale-out network. This highlights the need to size the HBD appropriately based on the target MoE models (e.g., HBD=64 helps GPT-1.8T (16 experts), HBD=512 helps GPT-29T (128 experts) with SO=100 GB/s); a rough sizing sketch follows this list.
    • Scale-up (SU) Bandwidth: Doubling SU bandwidth from 450 GB/s to 900 GB/s provides significant gains (up to 1.6x). Further increases continue to help, especially when TP/expert communication fits within the HBD. However, diminishing returns suggest balancing SU investment with SO bandwidth and HBD size.
    • Scale-out (SO) Bandwidth: Critically important when communication happens outside the HBD (e.g., expert communication in large MoE models that don't fit the HBD, or DP/PP communication). Doubling SO bandwidth provides noticeable gains (up to 1.36x).
    • Compute FLOPS: Increasing FLOPS per GPU improves performance, and this benefit is amplified by larger HBDs and higher bandwidths, underscoring the need for a balanced system design.
    • HBM Bandwidth: High HBM bandwidth is crucial for performance (up to 4.5x gain for GPT-1.8T with 16x BW increase).
    • HBM Capacity: Sufficient HBM capacity per GPU is vital. It reduces the need for aggressive parallelism (TP, ES) and techniques like recomputation or offloading, simplifying implementation and boosting performance (up to 4.9x gain for GPT-1.8T with sufficient capacity). The analysis suggests targeting ~1.3 TB/GPU for future data centers to efficiently run large models.
  3. Optimization Impact:
    • Compute/Communication Overlap: Overlapping TP and DP communication with computation provides performance benefits, especially as GPU count increases. Missing this can cause slowdowns of up to 15% in two-tiered networks but less in FullFlat (4-5%), indicating FullFlat's resilience.
    • Hardware Collectives: Hardware-accelerated collectives (like those in NVSwitch or InfiniBand/Ethernet smart switches) are important, providing over 16% performance improvement at scale by reducing communication overhead. FullFlat is less sensitive to missing this optimization (10-13% slowdown).
  4. Sparse vs. Dense Models: While sparse MoE models have higher communication intensity due to expert routing, the paper found that the dense GPT-3 model was surprisingly more sensitive to missing software optimizations (compute-comm overlap, hardware collectives) than the MoE models. This suggests optimizing software for diverse workloads remains important, though FullFlat mitigates sensitivity across the board.
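To make the HBD-sizing observation in item 2 concrete, here is a hypothetical check for whether an MoE model's expert communication stays inside the high-bandwidth domain. The gpus_per_expert factor (covering tensor/expert sharding of each expert) and the example values are assumptions for illustration, not figures from the paper.

```python
def expert_comm_fits_in_hbd(num_experts: int, gpus_per_expert: int, hbd_size: int) -> bool:
    # Expert all-to-all traffic stays on the fast scale-up fabric only if the
    # whole expert-parallel group fits inside one high-bandwidth domain.
    return num_experts * gpus_per_expert <= hbd_size

print(expert_comm_fits_in_hbd(16, 4, 64))     # GPT-1.8T-like model, HBD=64  -> True
print(expert_comm_fits_in_hbd(128, 4, 64))    # GPT-29T-like model,  HBD=64  -> False
print(expert_comm_fits_in_hbd(128, 4, 512))   # GPT-29T-like model,  HBD=512 -> True
```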

Recommendations for Next-Gen AI Data Centers:

Based on the co-design analysis, the paper suggests future data centers for LLMs should target:

  • Network: High-radix, low-latency networks, ideally FullFlat with co-packaged optics providing uniform high bandwidth; two-tiered systems should provide at least 1.6 TB/s of scale-up and 200 GB/s of scale-out bandwidth, with an HBD large enough to accommodate common MoE expert configurations.
  • Compute: Target around 20 FP16 PF/s per GPU.
  • Memory: Ample HBM capacity (target ~1.3 TB/GPU) and Tier-2 memory (1-2 TB/GPU). High HBM bandwidth (>= 30 TB/s) and Tier-2 bandwidth (>= 256 GB/s).
  • Hardware Features: Support for hardware-accelerated collective operations.
  • Software: Optimized software stacks capable of selecting optimal parallelism strategies, maximizing compute-communication overlap, and leveraging hardware features. Utilizing tools for performance prediction and configuration search (like the extended Calculon) is recommended.
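For convenience, the recommended per-GPU targets from the list above can be collected into a single configuration dictionary, for example as input to an analytical model or configuration search; the dictionary keys are illustrative.

```python
# Recommended per-GPU targets, restated from the list above.
next_gen_node_targets = {
    "peak_fp16_pflops": 20,           # ~20 FP16 PF/s per GPU
    "hbm_capacity_tb": 1.3,           # ~1.3 TB HBM per GPU
    "hbm_bandwidth_tbps": 30,         # >= 30 TB/s
    "tier2_capacity_tb": (1, 2),      # 1-2 TB Tier-2 memory per GPU
    "tier2_bandwidth_gbps": 256,      # >= 256 GB/s
    "scale_up_bandwidth_tbps": 1.6,   # >= 1.6 TB/s (two-tiered systems)
    "scale_out_bandwidth_gbps": 200,  # >= 200 GB/s (two-tiered systems)
    "hw_accelerated_collectives": True,
}
```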

In conclusion, the paper provides a practical roadmap derived from extensive co-design exploration, emphasizing that a balanced approach across compute, memory, and revolutionary network architectures like FullFlat is essential for efficiently scaling intelligence to support future trillion-parameter LLMs. The FullFlat network, in particular, appears transformative, not only boosting raw performance but also significantly reducing optimization complexity.
