
Ansor: Generating High-Performance Tensor Programs for Deep Learning (2006.06762v5)

Published 11 Jun 2020 in cs.LG, cs.NE, cs.PF, cs.PL, and stat.ML

Abstract: High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning systems rely on vendor-provided kernel libraries or various search strategies to get performant tensor programs. These approaches either require significant engineering effort to develop platform-specific optimization code or fall short of finding high-performance programs due to restricted search space and ineffective exploration strategy. We present Ansor, a tensor program generation framework for deep learning applications. Compared with existing search strategies, Ansor explores many more optimization combinations by sampling programs from a hierarchical representation of the search space. Ansor then fine-tunes the sampled programs with evolutionary search and a learned cost model to identify the best programs. Ansor can find high-performance programs that are outside the search space of existing state-of-the-art approaches. In addition, Ansor utilizes a task scheduler to simultaneously optimize multiple subgraphs in deep neural networks. We show that Ansor improves the execution performance of deep neural networks relative to the state-of-the-art on the Intel CPU, ARM CPU, and NVIDIA GPU by up to $3.8\times$, $2.6\times$, and $1.7\times$, respectively.

Authors (12)
  1. Lianmin Zheng (34 papers)
  2. Chengfan Jia (1 paper)
  3. Minmin Sun (3 papers)
  4. Zhao Wu (28 papers)
  5. Cody Hao Yu (13 papers)
  6. Ameer Haj-Ali (8 papers)
  7. Yida Wang (62 papers)
  8. Jun Yang (357 papers)
  9. Danyang Zhuo (33 papers)
  10. Koushik Sen (49 papers)
  11. Joseph E. Gonzalez (167 papers)
  12. Ion Stoica (177 papers)
Citations (327)

Summary

An In-depth Analysis of Ansor: A Framework for Generating High-Performance Tensor Programs

The paper presents Ansor, a framework designed to generate the high-performance tensor programs essential for efficient execution of deep learning applications. Unlike conventional approaches that depend on vendor-provided kernel libraries or require significant engineering effort to optimize programs for specific hardware, Ansor employs a novel tensor program generation methodology backed by a more comprehensive search strategy.
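
As context for the discussion below: Ansor has been upstreamed into Apache TVM as its auto_scheduler module. The snippet here is a minimal usage sketch in the spirit of TVM's public tutorials, tuning a matmul workload for a CPU target; exact signatures may differ across TVM releases.

```python
import tvm
from tvm import te, auto_scheduler

# Describe the computation in TVM's tensor expression language.
@auto_scheduler.register_workload
def matmul(N, M, K, dtype):
    A = te.placeholder((N, K), name="A", dtype=dtype)
    B = te.placeholder((K, M), name="B", dtype=dtype)
    k = te.reduce_axis((0, K), name="k")
    C = te.compute((N, M), lambda i, j: te.sum(A[i, k] * B[k, j], axis=k), name="C")
    return [A, B, C]

target = tvm.target.Target("llvm")
task = auto_scheduler.SearchTask(
    func=matmul, args=(1024, 1024, 1024, "float32"), target=target
)

log_file = "matmul.json"
tune_option = auto_scheduler.TuningOptions(
    num_measure_trials=200,  # how many candidate programs to measure on hardware
    measure_callbacks=[auto_scheduler.RecordToFile(log_file)],
)
task.tune(tune_option)                 # run the sketch + annotation + evolutionary search
sch, args = task.apply_best(log_file)  # recover the best schedule found
func = tvm.build(sch, args, target)    # compile it into a runnable module
```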

Technical Contributions and Methodology

Ansor introduces a hierarchical approach to explore a significantly larger optimization space than existing strategies. It leverages a two-level hierarchical search space consisting of sketches and annotations:

  1. Sketch Generation: This top-level abstraction defines the high-level structure of the tensor programs. The paper delineates how Ansor employs a set of derivation rules to construct candidate sketches automatically for any given computation definition. These rules steer the search through a wide variety of tile structures and fusion strategies, handling transformations such as multi-level tiling and aggressive inlining, and thereby cover optimizations usually beyond the scope of traditional methods.
  2. Random Annotation: Within each sketch, Ansor explores billions of possible low-level choices (e.g., tile sizes, parallelization schemes) through random annotation, producing complete and diverse programs; a simplified illustration of this two-level sampling follows this list.
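
To make the two-level search space concrete, here is a greatly simplified, hypothetical Python sketch of the sampling process. The helper names (derivation rules with an apply method, undecided_knobs, and so on) are illustrative placeholders, not Ansor's actual API:

```python
import random

def generate_sketches(compute_dag, derivation_rules):
    """Enumerate high-level program structures (sketches) by applying
    derivation rules such as multi-level tiling, fusion, and inlining."""
    sketches = [compute_dag.initial_state()]  # start from the naive program
    for rule in derivation_rules:
        expanded = []
        for sketch in sketches:
            expanded.extend(rule.apply(sketch))  # a rule may yield several variants
        sketches = expanded or sketches
    return sketches

def random_annotate(sketch):
    """Fill a sketch with concrete low-level choices: tile sizes,
    parallel/vectorize/unroll annotations, and so on."""
    program = sketch.copy()
    for knob in program.undecided_knobs():
        program.set(knob, random.choice(knob.candidates()))
    return program

def sample_initial_population(compute_dag, derivation_rules, n=2048):
    """Sample complete programs: pick a sketch, then annotate it randomly."""
    sketches = generate_sketches(compute_dag, derivation_rules)
    return [random_annotate(random.choice(sketches)) for _ in range(n)]
```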

The exploration strategy combines evolutionary search with a learned cost model: the cost model cheaply ranks large numbers of candidate programs, only the most promising candidates are measured on real hardware, and the measurements in turn refine the cost model. This loop makes the search robust enough to identify high-performance tensor programs that outperform those found by existing frameworks.
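
A hypothetical sketch of this fine-tuning loop follows; cost_model, measure_on_hardware, and the mutation step are illustrative placeholders, and the real system adds further mutation rules and measurement batching:

```python
import random

def evolutionary_search(initial_population, cost_model, measure_on_hardware,
                        num_rounds=4, population_size=2048, measures_per_round=64):
    population = list(initial_population)
    measured = {}  # program -> measured latency on the target device
    for _ in range(num_rounds):
        # 1. Rank the whole population cheaply with the learned cost model.
        ranked = sorted(population, key=cost_model.predict)
        # 2. Measure only the top candidates on real hardware (the expensive step).
        for program in ranked[:measures_per_round]:
            measured[program] = measure_on_hardware(program)
        # 3. Retrain the cost model on the newly collected measurements.
        cost_model.update(measured)
        # 4. Build the next generation by mutating promising programs.
        parents = ranked[: population_size // 2]
        population = [mutate(random.choice(parents)) for _ in range(population_size)]
    return min(measured, key=measured.get)  # best program found so far

def mutate(program):
    """Placeholder: perturb tile sizes, annotations, or computation locations."""
    return program.perturbed()
```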

Empirical Results

The paper's empirical section presents comprehensive benchmarks across various levels:

  • Single-Operator Benchmarks: Evaluated against baselines such as PyTorch, the Halide auto-scheduler, FlexTensor, and AutoTVM, Ansor consistently matches or surpasses them in performance. The advantage is particularly evident for operators such as matrix multiplication and convolution, where the standard baselines are less effective.
  • Subgraph and Network Benchmarks: The results demonstrate Ansor’s ability to significantly outperform vendor-specific libraries and established search frameworks. For instance, it improves performance on Intel CPUs and NVIDIA GPUs by up to 3.8× and 1.7×, respectively, over the best alternative approaches.

Implications and Future Directions

Ansor signifies a shift towards large-scale automated program generation that effectively bridges the gap between diverse hardware architectures and optimal execution of AI workloads. By extending the boundaries of known optimizations and incorporating a task scheduler for resource allocation, Ansor demonstrates not just theoretical innovation but practical applicability across multiple hardware backends and AI models, including modern architectures such as ResNet, MobileNet, and BERT.
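
The task scheduler mentioned above divides a fixed tuning budget into time slices and repeatedly grants the next slice to the subgraph whose improvement is expected to reduce end-to-end latency the most. A simplified, hypothetical allocation loop might look like this (the task attributes and helpers are illustrative, not Ansor's API):

```python
def schedule_tasks(tasks, total_slices, tune_one_slice):
    """tasks: subgraph tuning tasks; each tracks its current best latency and
    its weight (how often the subgraph appears in the target networks).
    tune_one_slice(task): spends one tuning round on the task and returns the
    best latency found for it so far."""
    def expected_gain(task):
        # Estimated end-to-end benefit of giving this task the next slice,
        # e.g. extrapolated from its recent rate of improvement.
        return task.weight * (task.best_latency - task.predicted_next_latency())

    for _ in range(total_slices):
        chosen = max(tasks, key=expected_gain)
        chosen.best_latency = min(chosen.best_latency, tune_one_slice(chosen))
```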

One critical limitation acknowledged in the paper is the reliance on static computation graphs, which poses challenges for the dynamic workloads prevalent in some AI applications. Future work could address this gap by adding support for dynamic shapes and extending coverage to sparse operators, broadening Ansor's utility in dynamic computational environments and for models such as graph-based neural networks.

Conclusion

The paper convincingly positions Ansor as a valuable tool for researchers and practitioners in the domain of deep learning, offering path-breaking performance improvements without significant manual intervention. With its innovative exploration strategies and efficient task scheduling capabilities, Ansor fosters the development of scalable, high-performance tensor programs apt for the ever-evolving landscape of AI and computational hardware. As the technology matures and is integrated with broader ecosystem components, it is poised to set a new standard in automated performance optimization for deep learning models.