
Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models (2102.02344v3)

Published 3 Feb 2021 in cs.LG

Abstract: Driven by the tremendous effort in researching novel deep learning (DL) algorithms, the training cost of developing new models increases staggeringly in recent years. We analyze GPU cluster usage statistics from a top research institute for more insights into the hardware efficiency achieved by typical DL training jobs. Our study reveals that single-accelerator training jobs can dominate the cluster-wide resource consumption when launched repetitively (e.g., for hyper-parameter tuning) while severely under-utilizing the hardware. Fortunately, we observe that such workloads have the following unique characteristics: (i) the models among jobs often have the same types of operators with the same shapes, and (ii) the inter-model horizontal fusion of such operators is mathematically equivalent to other already well-optimized operators. Thus, to help DL researchers and practitioners effectively improve the hardware utilization of their novel DL training workloads, we propose Horizontally Fused Training Array (HFTA). HFTA is a new DL framework extension library that horizontally fuses the models from different repetitive jobs deeply down to operators and then trains them simultaneously on a shared accelerator. To show the generality of our solution, we apply HFTA to six DL models training on state-of-the-art accelerators (GPUs and TPUs). Our results indicate that HFTA is highly effective in improving hardware utilization and achieves up to $15.1 \times$ higher training throughput vs. the standard practice of running each job on a separate accelerator.

Citations (19)

Summary

  • The paper demonstrates that HFTA can fuse operators of the same type and shape across multiple DL jobs to enhance hardware utilization.
  • Experimental results reveal up to 15.1× speedup and significant GPU hour savings across various modern accelerators.
  • The study integrates HFTA with hyper-parameter tuning, offering a scalable solution to optimize deep learning training costs.

An Overview of Horizontally Fused Training Array for Efficient Deep Learning Model Training

The paper "Horizontally Fused Training Array: An Effective Hardware Utilization Squeezer for Training Novel Deep Learning Models" presents an innovative approach to optimizing deep learning (DL) model training workflows on accelerators, such as GPUs and TPUs. The primary motivation for this research stems from the observed inefficiencies in resource utilization when DL training jobs are executed on a single accelerator. These inefficiencies are particularly pronounced in repetitive training scenarios, such as hyper-parameter tuning, where numerous models with slight variations are trained independently across multiple jobs.

Key Contributions

The research makes several significant contributions:

  1. Cluster Resource Utilization Analysis: The paper provides an empirical analysis of GPU cluster usage statistics, revealing a predominance of single-accelerator training jobs that utilize hardware resources inefficiently. Approximately 46.2% of the cluster-wide GPU hours are consumed by such jobs, underscoring the need for a more efficient resource management solution.
  2. Proposal of HFTA: The core contribution is the introduction of the Horizontally Fused Training Array (HFTA), a novel DL framework extension. HFTA leverages the observation that many training jobs consist of models that share identical operator types and dimensions. By horizontally fusing these operators across jobs, HFTA trains multiple models simultaneously on a single accelerator, thereby enhancing hardware utilization (a minimal sketch of this fusion follows this list).
  3. Experimental Validation: The effectiveness of HFTA is demonstrated through extensive experiments on various DL models across state-of-the-art accelerators. The results show a significant improvement in training throughput, achieving up to 15.1× enhancement compared to conventional single-job-per-accelerator practices.
  4. Integration with Tuning Algorithms: The paper also introduces Horizontally Fused Hyper-parameter Tuning (HFHT), a lightweight framework that integrates HFTA into existing hyper-parameter tuning workflows. HFHT optimizes the total GPU hour cost by enabling model fusion during the tuning process.
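
To make the fusion idea in contribution 2 concrete, the sketch below uses plain PyTorch (not the HFTA library's API; the model count B and the convolution shape are chosen only for illustration) to show that B Conv2d operators with identical shapes from B independent jobs are mathematically equivalent to a single grouped convolution with groups=B:

```python
import torch
import torch.nn as nn

B = 4                       # number of fused models (e.g., hyper-parameter trials)
C_in, C_out, K = 3, 16, 3   # per-model convolution shape, chosen only for illustration

# B separate convolutions, as they would appear in B independent training jobs.
convs = [nn.Conv2d(C_in, C_out, K, padding=1) for _ in range(B)]

# One fused convolution that covers all B models at once (groups=B keeps them independent).
fused = nn.Conv2d(B * C_in, B * C_out, K, padding=1, groups=B)
with torch.no_grad():
    fused.weight.copy_(torch.cat([c.weight for c in convs], dim=0))
    fused.bias.copy_(torch.cat([c.bias for c in convs], dim=0))

# The same input batch, replicated along the channel dimension, one copy per model.
x = torch.randn(8, C_in, 32, 32)
x_fused = x.repeat(1, B, 1, 1)                          # shape: (8, B*C_in, 32, 32)

y_separate = torch.cat([c(x) for c in convs], dim=1)
y_fused = fused(x_fused)
print(torch.allclose(y_separate, y_fused, atol=1e-5))   # True, up to floating-point error
```

Because the grouped convolution is a single, already well-optimized operator, the accelerator sees one large kernel launch instead of B small ones, which is where the utilization gain comes from.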

Methodology and Results

HFTA is built on the principle of deep fusion of model operators, allowing multiple models to be trained in parallel through shared computation. By transforming multiple models into a single fused model that remains mathematically equivalent to the originals, HFTA avoids the per-process runtime and memory overheads incurred by traditional hardware-sharing approaches such as Multi-Process Service (MPS) and Multi-Instance GPU (MIG).
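
As a rough sketch of what a fused training step can look like (generic PyTorch with assumed shapes and a hand-written SGD update, not the library's actual interface), B small linear models can be stacked into batched weight tensors, evaluated with one batched matmul, and updated with per-model learning rates; giving each fused model its own hyper-parameters is also the mechanism that HFHT-style tuning would exploit:

```python
import torch

B, N, D_in, D_out = 4, 64, 32, 10                       # fused models, batch size, feature dims
W = torch.randn(B, D_in, D_out, requires_grad=True)     # per-model weights, stacked
bias = torch.zeros(B, 1, D_out, requires_grad=True)     # per-model biases
lrs = torch.tensor([1e-2, 3e-3, 1e-3, 3e-4])            # one learning rate per fused model

x = torch.randn(B, N, D_in)                             # per-model input batches
target = torch.randn(B, N, D_out)

# Fused forward pass: one batched matmul replaces B separate linear layers.
y = torch.baddbmm(bias, x, W)                           # shape: (B, N, D_out)

# Summing the per-model losses keeps each model's gradients independent,
# since model b's output depends only on W[b], bias[b], and x[b].
loss = ((y - target) ** 2).mean(dim=(1, 2)).sum()
loss.backward()

# Manual per-model SGD step with a different learning rate for each model.
with torch.no_grad():
    W -= lrs.view(B, 1, 1) * W.grad
    bias -= lrs.view(B, 1, 1) * bias.grad
    W.grad.zero_()
    bias.grad.zero_()
```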

The paper's experimental results are compelling across various hardware setups, including NVIDIA's V100, RTX6000, and A100 GPUs, as well as Google's TPU v3. Notably, on the A100 GPU, HFTA achieves up to 11.5× speedup for specific workloads. Additionally, the paper shows that HFTA can operate effectively without specialized device-specific operator implementations, maintaining generality and ease of integration with existing DL frameworks.

Implications and Future Developments

The implications of this research are both practical and theoretical. Practically, HFTA offers a substantial reduction in training costs, enabling researchers and practitioners to better utilize existing computational resources. Theoretically, HFTA presents a paradigm shift in approaching DL training optimization, moving away from model-centric performance tweaks toward job-centric resource management solutions.

Future research directions could explore further automation of the fusion process, extend the approach to more complex model architectures, and investigate adaptive fusion strategies tailored to particular hardware and workload characteristics. Moreover, integrating HFTA with other optimization techniques may provide additional performance gains.

In conclusion, the Horizontally Fused Training Array offers a significant advancement in addressing the pervasive issue of hardware under-utilization in DL model training, paving the way for more efficient and cost-effective deployment of DL workloads at scale.
