TVM: An Automated End-to-End Optimizing Compiler for Deep Learning (1802.04799v3)

Published 12 Feb 2018 in cs.LG, cs.AI, and cs.PL

Abstract: There is an increasing need to bring machine learning to a wide diversity of hardware devices. Current frameworks rely on vendor-specific operator libraries and optimize for a narrow range of server-class GPUs. Deploying workloads to new platforms -- such as mobile phones, embedded devices, and accelerators (e.g., FPGAs, ASICs) -- requires significant manual effort. We propose TVM, a compiler that exposes graph-level and operator-level optimizations to provide performance portability to deep learning workloads across diverse hardware back-ends. TVM solves optimization challenges specific to deep learning, such as high-level operator fusion, mapping to arbitrary hardware primitives, and memory latency hiding. It also automates optimization of low-level programs to hardware characteristics by employing a novel, learning-based cost modeling method for rapid exploration of code optimizations. Experimental results show that TVM delivers performance across hardware back-ends that are competitive with state-of-the-art, hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPUs. We also demonstrate TVM's ability to target new accelerator back-ends, such as the FPGA-based generic deep learning accelerator. The system is open sourced and in production use inside several major companies.

An Insightful Overview of TVM: An Automated End-to-End Optimizing Compiler for Deep Learning

The paper presents TVM, an automated compiler designed to optimize deep learning workloads across diverse hardware back-ends, addressing the limitations of existing frameworks that rely heavily on vendor-specific libraries optimized primarily for server-class GPUs. The proposed compiler supports a wide range of devices, including mobile phones, embedded systems, and accelerators like FPGAs and ASICs, by automating key optimizations that were previously handled manually.

Key Contributions and Methodologies

TVM introduces several novel techniques to enhance deep learning model deployment:

  1. Tensor Expression Language and Schedule Space: TVM extends Halide's compute/schedule separation principle to support optimizations specific to deep learning and to new hardware back-ends. By decoupling what a program computes from how it computes it, TVM can generate many variants of the same program under different optimizations and map each efficiently to a specific hardware architecture (a minimal sketch of this separation appears after this list).
  2. Automated Program Optimization Framework: To avoid manually tuning operators across the many combinations of memory access patterns, threading strategies, and novel hardware primitives, TVM employs an ML-based cost model (gradient tree boosting in the paper) that predicts the relative running time of candidate low-level programs from features of their loop structure. This model guides the automated search for fast operator implementations without requiring explicit hardware information, and it is retrained as new on-device measurements arrive (the search loop is sketched below as well).
  3. Performance and Portability: Through its automated compiler framework, TVM delivers performance that is competitive with state-of-the-art hand-tuned libraries across various hardware platforms. It achieves speedups ranging from 1.2x to 3.8x across different devices, including server-class GPUs and embedded processors.
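
The compute/schedule separation from item 1 can be made concrete in a few lines of TVM's Python tensor expression API. This is a minimal sketch rather than code from the paper, and the `te` module paths follow later open-source TVM releases (recent versions move scheduling to TensorIR), so treat it as illustrative:

```python
import tvm
from tvm import te

# Declare *what* to compute: elementwise vector addition.
n = 1024
A = te.placeholder((n,), name="A", dtype="float32")
B = te.placeholder((n,), name="B", dtype="float32")
C = te.compute((n,), lambda i: A[i] + B[i], name="C")

# Declare *how* to compute it: schedule primitives rewrite the loop
# structure without changing the result.
s = te.create_schedule(C.op)
outer, inner = s[C].split(C.op.axis[0], factor=64)  # tile the loop
s[C].vectorize(inner)                               # map inner loop to SIMD lanes

# The same declaration can target different back-ends by swapping the
# target string (e.g. "llvm" for CPU, "cuda" for NVIDIA GPUs).
fadd = tvm.build(s, [A, B, C], target="llvm")
```

Because the algorithm and the schedule are separate objects, the optimizer can explore many schedules for a single declaration, and that space of schedules is exactly what the learned cost model navigates.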
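
The ML-guided search from item 2 can likewise be illustrated with a self-contained toy. Everything here (the `CostModel` class, the synthetic cost surface, the sampling sizes) is hypothetical scaffolding for exposition; TVM's actual optimizer extracts loop-level features and trains gradient-boosted trees, but the fit-rank-measure-refit loop has the same shape:

```python
import random

class CostModel:
    """Toy stand-in for the paper's learned cost model."""
    def __init__(self):
        self.history = []

    def fit(self, history):
        self.history = list(history)

    def predict(self, cfg):
        # With no measurements yet, rank candidates randomly (pure exploration).
        if not self.history:
            return random.random()
        # Toy heuristic: prefer configs near the best one measured so far.
        best = min(self.history, key=lambda x: x[1])[0]
        return sum((a - b) ** 2 for a, b in zip(cfg, best))

def measure_on_device(cfg):
    """Stand-in for running one candidate schedule on real hardware."""
    tile, unroll = cfg
    return abs(tile - 32) + abs(unroll - 4)  # synthetic cost surface

def autotune(space, n_trials=64, batch=8):
    model, history = CostModel(), []
    for _ in range(n_trials // batch):
        # Rank a random sample by *predicted* cost (cheap), measure only
        # the most promising candidates (expensive), then retrain.
        ranked = sorted(random.sample(space, 64), key=model.predict)[:batch]
        history += [(cfg, measure_on_device(cfg)) for cfg in ranked]
        model.fit(history)
    return min(history, key=lambda x: x[1])[0]

space = [(tile, unroll) for tile in range(1, 65) for unroll in (1, 2, 4, 8)]
print(autotune(space))  # converges toward the synthetic optimum (32, 4)
```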

Evaluation and Results

The paper provides comprehensive experimental results demonstrating TVM's capability to deliver portable performance optimizations:

  • Server-Class GPUs: On server-class GPUs, TVM surpassed existing DL frameworks such as MXNet and TensorFlow, with the largest gains on operators that vendor libraries cover poorly, such as depthwise convolution.
  • Embedded Devices: Evaluation on ARM-based CPUs and GPUs showed that TVM generates code that outperforms the hand-optimized operators in existing frameworks. Its suitability for mobile and embedded deployment was further demonstrated on low-precision computation, where TVM outperformed the optimized implementations provided by frameworks like Caffe2.
  • FPGA-Based Accelerators: Beyond conventional processors, TVM successfully optimized workloads for an FPGA-based accelerator. By exploiting its flexible tensorization and latency-hiding capabilities, the compiler achieved high utilization of the accelerator's compute and memory resources (a toy illustration of latency hiding follows this list).
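
The latency hiding mentioned in the last bullet rests on decoupled access-execute pipelines synchronized by explicit dependency tokens, so that memory transfers overlap with computation. The following pure-Python toy illustrates that idea under double buffering; none of these names are TVM or accelerator APIs, and the sleeps stand in for real DMA and compute latencies:

```python
from queue import Queue
from threading import Thread
import time

NUM_TILES, NUM_BUFFERS = 8, 2          # double buffering: 2 on-chip buffers
load_done, buffer_free = Queue(), Queue()

def load_stage():
    for t in range(NUM_TILES):
        if t >= NUM_BUFFERS:
            buffer_free.get()          # token: wait for an on-chip buffer to free up
        time.sleep(0.01)               # stand-in for a DMA transfer of tile t
        load_done.put(t)               # token: tile t is now resident on-chip

def compute_stage():
    for t in range(NUM_TILES):
        load_done.get()                # token: wait for tile t's data
        time.sleep(0.01)               # stand-in for computing on tile t
        buffer_free.put(t)             # token: release tile t's buffer

threads = [Thread(target=load_stage), Thread(target=compute_stage)]
start = time.time()
for th in threads:
    th.start()
for th in threads:
    th.join()
# Pipelined: ~0.09 s, versus ~0.16 s if each load ran strictly before its
# compute. The overlap is the "latency hiding".
print(f"elapsed: {time.time() - start:.2f} s")
```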

Implications and Future Directions

The introduction of TVM marks a significant step toward automating the deployment of deep learning models across a vast array of hardware architectures, contributing to the democratization of ML by enabling efficient execution on platforms beyond high-performance servers. As machine learning models grow more complex and hardware continues to diversify, TVM could inspire advances in adaptive compiler technologies that dynamically optimize DL workloads. Future directions may explore more refined cost models or incorporate reinforcement learning to further enhance the automation and adaptability of the ML-based optimizer, while ensuring scalability to emerging and unconventional hardware.

In conclusion, TVM represents a notable shift in how deep learning workload optimization is managed, bridging the gap between high-level model specification and efficient hardware execution. The systems community will find this research valuable in furthering the development and practical deployment of AI technologies across a wide range of hardware platforms.

Authors (12)
  1. Tianqi Chen
  2. Thierry Moreau
  3. Ziheng Jiang
  4. Lianmin Zheng
  5. Eddie Yan
  6. Meghan Cowan
  7. Haichen Shen
  8. Leyuan Wang
  9. Yuwei Hu
  10. Luis Ceze
  11. Carlos Guestrin
  12. Arvind Krishnamurthy