
Learning to Optimize Tensor Programs (1805.08166v4)

Published 21 May 2018 in cs.LG and stat.ML

Abstract: We introduce a learning-based framework to optimize tensor programs for deep learning workloads. Efficient implementations of tensor operators, such as matrix multiplication and high dimensional convolution, are key enablers of effective deep learning systems. However, existing systems rely on manually optimized libraries such as cuDNN where only a narrow range of server class GPUs are well-supported. The reliance on hardware-specific operator libraries limits the applicability of high-level graph optimizations and incurs significant engineering costs when deploying to new hardware targets. We use learning to remove this engineering burden. We learn domain-specific statistical cost models to guide the search of tensor operator implementations over billions of possible program variants. We further accelerate the search by effective model transfer across workloads. Experimental results show that our framework delivers performance competitive with state-of-the-art hand-tuned libraries for low-power CPU, mobile GPU, and server-class GPU.

Authors (8)
  1. Tianqi Chen (77 papers)
  2. Lianmin Zheng (34 papers)
  3. Eddie Yan (5 papers)
  4. Ziheng Jiang (23 papers)
  5. Thierry Moreau (11 papers)
  6. Luis Ceze (38 papers)
  7. Carlos Guestrin (58 papers)
  8. Arvind Krishnamurthy (37 papers)
Citations (372)

Summary

Learning to Optimize Tensor Programs: A Machine Learning-Based Approach

The paper "Learning to Optimize Tensor Programs" by Chen et al. proposes a framework that uses machine learning to optimize tensor programs for deep learning workloads. The work addresses the substantial engineering effort required to produce efficient tensor operator implementations by using statistical cost models to automate the optimization search. The proposed framework supports a broad range of hardware backends while delivering performance comparable to state-of-the-art hand-tuned libraries.

Problem Statement

Tensor operators, such as matrix multiplication and high-dimensional convolution, form the computational core of deep learning models. Traditionally, optimizing these operators has relied on manually tuned libraries such as cuDNN, which support only a limited range of GPUs. This reliance incurs significant engineering costs when adapting to new hardware targets and constrains the scope and quality of high-level graph optimizations. The paper tackles the problem of efficiently optimizing tensor programs across varied hardware by leveraging machine learning techniques.

Methodology

The authors introduce statistical cost models that predict the runtime of candidate program implementations. These models guide the search through the expansive space of possible program transformations, directing it toward fast implementations. The optimization process is further accelerated by transferring learned models across workloads. Two model families are explored: gradient boosted trees (GBTs) and TreeGRU, a neural model that encodes the abstract syntax trees (ASTs) of low-level programs.
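The core loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `cost_model` and `measure` interfaces are hypothetical stand-ins for a learned predictor (e.g. a GBT) and a hardware measurement harness.

```python
def guided_search(candidates, cost_model, measure, n_rounds=5, top_k=8):
    """Sketch of cost-model-guided search (hypothetical interfaces):
    rank candidate implementations by predicted cost, measure only the
    most promising ones on real hardware, then retrain the model on the
    new measurements and repeat."""
    history = []      # (candidate, measured_runtime) pairs
    tried = set()     # avoid re-measuring the same candidate
    for _ in range(n_rounds):
        # Rank the remaining space cheaply with the learned cost model.
        ranked = sorted((c for c in candidates if c not in tried),
                        key=cost_model.predict)
        # Measure only the top-k candidates on hardware -- the expensive step.
        for cand in ranked[:top_k]:
            history.append((cand, measure(cand)))
            tried.add(cand)
        # Refit the model on all measurements collected so far.
        cost_model.fit(history)
    # Return the fastest implementation found.
    return min(history, key=lambda pair: pair[1])[0]
```

The key design point is the asymmetry of costs: a model prediction takes microseconds, while compiling and benchmarking a candidate on hardware takes seconds, so the model filters billions of variants down to a handful worth measuring.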

A notable aspect of the methodology is the adoption of a transfer learning strategy, allowing the framework to learn from prior experience across different workloads. To ensure scalability and efficiency, the search space for program optimization is meticulously defined using domain-specific languages (DSLs) and transformation primitives to encapsulate hardware-aware optimizations.
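One simple way to realize such transfer, sketched below under assumed interfaces: map configurations from both workloads into a shared, workload-invariant feature space and pool their measurements, up-weighting the target workload's samples. The `featurize` function, `transfer_fit` name, and weighting scheme are illustrative assumptions, not the paper's exact method.

```python
def transfer_fit(model, source_data, target_data, featurize, target_weight=4.0):
    """Sketch of cross-workload transfer (hypothetical interfaces):
    pool measurements from a previously tuned workload with the few
    measurements available for a new one, mapping both into a shared
    feature space and up-weighting the target samples."""
    feats, times, weights = [], [], []
    for config, runtime in source_data:
        feats.append(featurize(config))
        times.append(runtime)
        weights.append(1.0)            # prior experience: low weight
    for config, runtime in target_data:
        feats.append(featurize(config))
        times.append(runtime)
        weights.append(target_weight)  # new workload: high weight
    model.fit(feats, times, sample_weight=weights)
    return model
```

Any regressor that accepts per-sample weights in `fit` (for example, scikit-learn's `GradientBoostingRegressor`) could slot in as `model` here.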

Key Findings

Experimental results show that the proposed framework matches, and sometimes exceeds, the performance of hand-tuned libraries. It achieves end-to-end speedups of 1.2× to 3.8× across platforms including low-power CPUs and mobile GPUs. These findings underscore the efficacy of learned cost models for tensor program optimization.

Implications and Future Directions

The implications of this research are extensive, particularly as deep learning continues to expand across diverse hardware architectures. The approach provides a scalable mechanism that minimizes manual engineering effort while enabling optimizations that accommodate emerging operator configurations and data types. It has far-reaching potential in settings that require optimized deployment of deep learning models, from cloud-based services to edge devices in IoT systems.

Future avenues for research may involve refining the cost models to improve prediction accuracy or developing more comprehensive transfer learning strategies that accelerate optimization across heterogeneous hardware environments. Additionally, exploring reinforcement learning techniques to autonomously refine the search space definition could further improve tensor program implementations.

Conclusion

The paper presents a significant step forward in automating the optimization of tensor programs for deep learning workloads by deploying machine learning-based models. This work highlights the viability of statistical learning approaches in addressing the complexities of system deployments on varied hardware platforms, presenting a comprehensive framework that can potentially reshape deep learning system development paradigms.
