Benchmarking TPU, GPU, and CPU Platforms for Deep Learning (1907.10701v4)

Published 24 Jul 2019 in cs.LG, cs.PF, and stat.ML

Abstract: Training deep learning models is compute-intensive and there is an industry-wide trend towards hardware specialization to improve performance. To systematically benchmark deep learning platforms, we introduce ParaDnn, a parameterized benchmark suite for deep learning that generates end-to-end models for fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. Along with six real-world models, we benchmark Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU platform. We take a deep dive into TPU architecture, reveal its bottlenecks, and highlight valuable lessons learned for future specialized system design. We also provide a thorough comparison of the platforms and find that each has unique strengths for some types of models. Finally, we quantify the rapid performance improvements that specialized software stacks provide for the TPU and GPU platforms.

Authors (3)
  1. Yu Emma Wang (9 papers)
  2. Gu-Yeon Wei (54 papers)
  3. David Brooks (204 papers)
Citations (248)

Summary

Benchmarking TPU, GPU, and CPU Platforms for Deep Learning: An Analysis

The paper presents a comprehensive benchmarking analysis of deep learning hardware platforms, focusing on Google's Cloud TPU v2/v3, NVIDIA's V100 GPU, and an Intel Skylake CPU. A novel parameterized benchmark suite, ParaDnn, is introduced to generate a wide range of deep learning models, facilitating an extensive evaluation of these platforms. This analysis addresses critical aspects of hardware specialization in deep learning, providing insights into architectural designs and software optimizations crucial for advancing deep learning infrastructure.

Overview and Methodology

Deep learning has transformed a wide range of application domains, driving growing demand for hardware that can efficiently train increasingly complex models. ParaDnn was developed to address the limitations of existing benchmarks by enabling systematic exploration of the design space across fully connected (FC), convolutional (CNN), and recurrent (RNN) neural networks. The suite generates models spanning nearly six orders of magnitude in parameter count, allowing a comprehensive evaluation of platform performance under varied deep learning workloads.
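The idea of sweeping hyperparameters to cover that parameter-count range can be sketched as follows. This is a minimal illustration, not ParaDnn's actual generator; the specific input/output dimensions and sweep ranges are invented:

```python
from itertools import product

def fc_param_count(input_dim, layers, nodes, output_dim):
    """Parameter count of a fully connected net: weights plus biases per layer."""
    dims = [input_dim] + [nodes] * layers + [output_dim]
    return sum(d_in * d_out + d_out for d_in, d_out in zip(dims, dims[1:]))

# Sweep depth and width, as ParaDnn sweeps FC hyperparameters
# (these particular ranges are made up for illustration).
counts = [fc_param_count(2000, layers, nodes, 1000)
          for layers, nodes in product([4, 16, 64], [32, 512, 8192])]
print(f"{min(counts):.1e} to {max(counts):.1e} parameters")
```

Even this small grid spans roughly four orders of magnitude; widening the sweep ranges stretches the span further.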

The paper benchmarks the performance of TPU, GPU, and CPU platforms using both parameterized and real-world deep learning models, including well-known architectures such as ResNet-50 and Transformer. The analysis employs metrics like examples/second and platform-specific speedups to evaluate the capability of each platform to handle different model architectures and configurations.
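The two metrics are simple to compute; the runs below are invented for illustration and are not the paper's measurements:

```python
def throughput(batch_size, batches, elapsed_s):
    """Examples/second: total examples processed per unit wall-clock time."""
    return batch_size * batches / elapsed_s

def speedup(exps_a, exps_b):
    """Relative speedup of platform A over platform B on the same model."""
    return exps_a / exps_b

# Hypothetical runs of one model on two platforms (numbers invented):
tpu_exps = throughput(batch_size=1024, batches=100, elapsed_s=20.0)  # 5120 ex/s
gpu_exps = throughput(batch_size=256, batches=100, elapsed_s=16.0)   # 1600 ex/s
print(f"TPU speedup over GPU: {speedup(tpu_exps, gpu_exps):.1f}x")   # 3.2x
```

Note that examples/second depends on batch size, which is why the paper reports per-platform speedups at comparable configurations rather than raw throughput alone.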

Key Findings

  1. TPU Architecture: The TPU is shown to be highly optimized for CNN and large-batch computations. However, it suffers from memory bandwidth bottlenecks and inter-chip communication overheads. The analysis reveals that memory-bound operations constitute a significant fraction of workload execution time, highlighting the need for memory optimization in future hardware iterations.
  2. Cross-Platform Comparison: Each platform exhibits unique strengths:
    • The TPU is advantageous for large-batch and large CNN workloads due to its highly parallel architecture.
    • The GPU excels in scenarios involving small-batch computations and complex non-MatMul operations due to its flexible memory management and high bandwidth.
    • The CPU remains crucial for very large FC models due to its high memory capacity, despite lower throughput compared to specialized hardware.
  3. Software Stack Influence: The performance improvements delivered by specialized software stacks for both TPU and GPU platforms are quantitatively significant. The paper notes dramatic speedups achieved by successive TensorFlow and CUDA releases, suggesting that compiler optimizations can substantially impact deep learning performance.
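The memory-bound versus compute-bound distinction behind the first finding is commonly framed with a roofline model: an operation's attainable performance is capped by either peak compute or by memory bandwidth times its arithmetic intensity. The peak figures below are illustrative assumptions, not the paper's measured TPU specifications:

```python
def attainable_flops(flops, bytes_moved, peak_flops, peak_bw):
    """Roofline model: an op is capped by either compute or memory bandwidth."""
    intensity = flops / bytes_moved          # FLOPs per byte moved
    return min(peak_flops, intensity * peak_bw)

PEAK_FLOPS = 45e12   # assumed peak compute, FLOP/s (illustrative)
PEAK_BW = 600e9      # assumed peak memory bandwidth, bytes/s (illustrative)
ridge = PEAK_FLOPS / PEAK_BW   # intensity above which an op is compute-bound

# A large float32 matmul (high intensity) vs. an elementwise add (low intensity)
n = 4096
matmul_perf = attainable_flops(2 * n**3, 3 * n**2 * 4, PEAK_FLOPS, PEAK_BW)
add_perf = attainable_flops(n**2, 3 * n**2 * 4, PEAK_FLOPS, PEAK_BW)
print(f"ridge: {ridge:.0f} FLOPs/byte")
print(f"matmul: {matmul_perf/1e12:.0f} TFLOP/s, add: {add_perf/1e9:.0f} GFLOP/s")
```

The elementwise add lands far below the ridge point and runs at a tiny fraction of peak, which is why workloads dominated by such memory-bound operations leave the matrix units idle.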

Implications and Future Directions

The insights derived from this paper suggest several architectural and software-level considerations for future hardware designs:

  • Memory Bandwidth Enhancements: Given the significant time consumed by memory-bound operations, increasing memory bandwidth and optimizing memory access patterns are crucial for improving performance across platforms.
  • Compiler and Software Optimizations: Continued advances in compiler technology, as demonstrated by the substantial TPU gains delivered by successive TensorFlow releases, underscore how much additional performance software optimization can extract from fixed hardware.
  • Scalability and Model Support: Enabling broader model parallelism and pipelining could allow TPUs and GPUs to support larger models that they currently cannot handle due to memory constraints.
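The model-parallelism idea in the last point can be sketched by sharding a layer's weight matrix across devices, so no single device needs memory for the full matrix. This is a toy pure-Python sketch of column-wise sharding, not the paper's or any framework's actual implementation:

```python
def matmul(a, b):
    """Naive dense matmul on nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def split_cols(m, parts):
    """Split a matrix's columns into equal shards, one per device."""
    n = len(m[0]) // parts
    return [[row[i * n:(i + 1) * n] for row in m] for i in range(parts)]

def sharded_matmul(x, shards):
    """Each device multiplies against its shard; outputs are concatenated."""
    outs = [matmul(x, w) for w in shards]
    return [sum((o[r] for o in outs), []) for r in range(len(x))]

x = [[1, 2], [3, 4]]
W = [[1, 0, 2, 0], [0, 1, 0, 2]]
shards = split_cols(W, 2)               # "2 devices", 2 columns each
assert sharded_matmul(x, shards) == matmul(x, W)
```

Each shard holds only 1/parts of the weights, which is the memory saving that would let an otherwise too-large model fit; the cost is the cross-device communication the paper identifies as a TPU bottleneck.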

As deep learning models continue to evolve, the need for adaptive and versatile computing platforms becomes evident. While ParaDnn provides a robust framework for benchmarking current hardware, continuous updates and expansions to the benchmark suite will be essential to accommodate new model architectures and workloads. Future research could explore multi-node configurations, further optimizing performance for distributed deep learning systems.

Overall, this paper highlights the complexities and opportunities inherent in designing and evaluating hardware for deep learning. The findings provide a foundation for future innovations in both hardware architecture and software environments tailored to the rapidly advancing field of artificial intelligence.
