- The paper introduces a cost-effective Fire-Flyer AI-HPC system that attains 80% DGX-A100 performance at 60% of the cost while cutting energy use by 40%.
- It leverages a synergistic software-hardware co-design approach, including HFReduce for efficient allreduce and HaiScale for optimized parallel deep learning training.
- The architecture employs a two-layer Fat-Tree network and a high-performance 3FS distributed file system to enhance data communication and system reliability.
Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning
The research paper "Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning" presents an approach to designing high-performance computing (HPC) systems that meet the growing demands of deep learning (DL) and large language models (LLMs). The paper addresses the cost and energy challenges of building AI infrastructure by proposing a synergistic hardware-software co-design framework, realized in the Fire-Flyer AI-HPC architecture.
Key Contributions
Fire-Flyer 2 AI-HPC Architecture
The Fire-Flyer 2 AI-HPC cluster utilizes 10,000 PCIe A100 GPUs, achieving performance comparable to NVIDIA's DGX-A100 systems at reduced cost and energy consumption. The architecture relies on several cost-effective design choices:
- GPU Node Design: The system implements nodes composed of 8 NVIDIA A100 PCIe GPUs and 1 Mellanox CX6 200Gbps InfiniBand (IB) NIC. This design choice avoids the high costs associated with NVIDIA's SXM architecture by utilizing PCIe GPUs without significantly compromising performance.
- Network Topology: A Two-Layer Fat-Tree network architecture integrates storage and computation networks, enabling efficient data communication and reducing network congestion. The system utilizes service-level (SL) differentiation and static routing to manage network traffic effectively.
The architecture demonstrates significant cost and energy efficiency:
- Cost Reduction: The architecture achieves 80% of the performance of DGX-A100 at 60% of the cost.
- Energy Efficiency: Energy consumption is reduced by 40%, lowering both operational costs and CO2 emissions.
Software-Hardware Co-Design
The paper emphasizes the importance of optimizing both software and hardware components to address the specific needs of deep learning workloads. Key optimizations include:
- HFReduce: A custom library for efficient allreduce operations in large-scale DL training. HFReduce outperforms NVIDIA's NCCL by leveraging asynchronous GPU-to-CPU transfers and CPU-based reductions, which reduces PCIe bandwidth consumption and avoids GPU kernel overhead (a simplified sketch follows this list).
- HaiScale: A framework that implements optimized parallel strategies for DL and LLM training, including Data Parallelism (DP), Pipeline Parallelism (PP), and Tensor Parallelism (TP), among others. HaiScale overlaps computation with communication to maximize resource utilization (see the overlap sketch after this list).
- 3FS Distributed File System: A high-performance file system designed to handle the massive I/O demands of big data AI tasks. 3FS integrates closely with the network architecture to prevent congestion and ensure high throughput.
- HAI Platform: A scheduling platform that enhances resource utilization through time-sharing, task scheduling, fault handling, and disaster recovery mechanisms.
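The core idea behind HFReduce can be illustrated with a short sketch. The snippet below is a simplified illustration, assuming PyTorch with a Gloo process group standing in for HFReduce's intra-node CPU reduction and inter-node InfiniBand allreduce; the function name is hypothetical, not the actual HFReduce API. Gradients are copied asynchronously into pinned host memory, reduced on the CPU, and copied back, so no reduction kernel runs on the GPU and PCIe traffic per gradient is limited to one device-to-host and one host-to-device transfer.

```python
import torch
import torch.distributed as dist


def hfreduce_style_allreduce(grad_gpu: torch.Tensor) -> torch.Tensor:
    """Illustrative CPU-mediated allreduce; not the real HFReduce API."""
    # 1) Asynchronous GPU-to-CPU copy into pinned host memory, so no
    #    reduction kernel ever has to run on the GPU.
    grad_cpu = torch.empty(grad_gpu.shape, dtype=grad_gpu.dtype,
                           device="cpu", pin_memory=True)
    grad_cpu.copy_(grad_gpu, non_blocking=True)
    torch.cuda.current_stream().synchronize()

    # 2) Reduce across ranks on the CPU (Gloo here stands in for the
    #    intra-node CPU reduction plus inter-node IB allreduce).
    dist.all_reduce(grad_cpu, op=dist.ReduceOp.SUM)

    # 3) Return the averaged gradient to the GPU.
    grad_gpu.copy_(grad_cpu, non_blocking=True)
    grad_gpu.div_(dist.get_world_size())
    return grad_gpu
```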
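Similarly, the computation/communication overlap in HaiScale's data-parallel path can be sketched with per-parameter gradient hooks: each allreduce is launched asynchronously as soon as backward produces that gradient, so communication hides behind the computation of earlier layers. This is a minimal PyTorch sketch (it assumes PyTorch 2.1+ for register_post_accumulate_grad_hook and omits the gradient bucketing a production library would use); it is not HaiScale's actual interface.

```python
import torch
import torch.distributed as dist


def attach_overlap_hooks(model: torch.nn.Module, pending: list) -> None:
    """Start an async allreduce for each gradient as soon as backward
    has accumulated it, overlapping communication with computation."""
    def hook(param: torch.Tensor) -> None:
        # async_op=True returns immediately; backward keeps computing
        # gradients for earlier layers while this allreduce is in flight.
        pending.append(dist.all_reduce(param.grad, async_op=True))

    for param in model.parameters():
        if param.requires_grad:
            param.register_post_accumulate_grad_hook(hook)


# Per training step (sketch):
#   pending.clear(); loss.backward()
#   for work in pending: work.wait()              # drain in-flight allreduces
#   for p in model.parameters():
#       if p.grad is not None:
#           p.grad.div_(dist.get_world_size())    # sum -> mean
#   optimizer.step()
```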
Stability and Robustness
Ensuring stability in large-scale HPC systems is paramount. The Fire-Flyer 2 AI-HPC architecture incorporates several mechanisms to enhance stability and minimize downtime:
- Checkpoint Manager: Implements efficient checkpointing to save and restore large DL model states, minimizing the work lost to hardware failures (a minimal sketch follows this list).
- Validator Utility: Regularly checks the hardware for potential issues, ensuring that only healthy nodes are utilized.
- Hardware Failure Insights: Detailed analysis of GPU and network failures, with insights into the most common issues (e.g., NVLink and ECC errors) and their mitigation.
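To illustrate the checkpointing idea, here is a minimal sketch assuming PyTorch; the function names, directory, and interval are illustrative, not the paper's Checkpoint Manager interface. Saving every N steps bounds the work lost to a failure, and the write-then-rename pattern keeps a crash mid-write from corrupting the latest usable checkpoint.

```python
import os
import torch


def save_checkpoint(step, model, optimizer, path="/ckpt"):
    # Write to a temporary file first, then rename atomically so a crash
    # mid-write never corrupts the newest usable checkpoint.
    state = {
        "step": step,
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
    }
    tmp = os.path.join(path, f"ckpt_{step}.pt.tmp")
    final = os.path.join(path, f"ckpt_{step}.pt")
    torch.save(state, tmp)
    os.replace(tmp, final)


def maybe_checkpoint(step, model, optimizer, interval=1000):
    # Checkpointing every `interval` steps caps the recomputation after a
    # node failure at `interval` steps of training.
    if step % interval == 0:
        save_checkpoint(step, model, optimizer)
```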
Practical and Theoretical Implications
This paper contributes significantly to the field of HPC for AI by presenting a feasible alternative to expensive high-end solutions like NVIDIA's DGX systems. The cost-effective strategies proposed, specifically the use of PCIe GPUs and optimized network designs, can be applied to large-scale AI infrastructure to reduce costs and energy consumption while maintaining high performance.
On the theoretical side, the research on software-hardware co-design provides valuable insights into optimizing DL and LLM training at scale. The introduction of HFReduce and HaiScale showcases how computation and communication overlap can be achieved, leading to more efficient training processes.
Future Developments
The paper hints at future developments aimed at enhancing the architecture further:
- Next-Generation PCIe Architecture: The paper outlines plans for a new architecture aimed at Mixture-of-Experts (MoE) LLM training, with a 1:1 GPU-to-NIC ratio and a multi-plane network design. This would significantly improve all-to-all communication performance, which is critical for MoE training (see the sketch after this list).
- Integration with RoCE: Exploring the adoption of RoCE (RDMA over Converged Ethernet) switches to further reduce network costs while maintaining performance.
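As an illustration of why all-to-all bandwidth dominates MoE training, the sketch below uses PyTorch's torch.distributed.all_to_all to exchange expert-bound tokens between ranks. It assumes equal split sizes per rank for simplicity and is not the paper's planned implementation.

```python
import torch
import torch.distributed as dist


def dispatch_tokens(tokens_per_rank: list[torch.Tensor]) -> list[torch.Tensor]:
    # Each rank holds one tensor of tokens destined for every other rank
    # (one expert group per rank). A single all_to_all exchanges them all,
    # so its latency and bandwidth directly bound the MoE step time.
    # Simplification: assumes every rank sends/receives equally sized chunks.
    world = dist.get_world_size()
    received = [torch.empty_like(tokens_per_rank[r]) for r in range(world)]
    dist.all_to_all(received, tokens_per_rank)
    return received
```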
Conclusion
The Fire-Flyer AI-HPC architecture offers a compelling cost-effective solution for the growing computational demands of DL and LLMs, achieving high performance and energy efficiency through intelligent hardware-software co-design. The insights and methodologies presented in the paper can serve as a valuable reference for researchers and industry practitioners aiming to construct efficient and economical AI-HPC systems.