- The paper introduces a novel task-mapping paradigm that embeds scheduling into tensor programs, enabling advanced optimizations like double buffering.
- It leverages post-scheduling fusion to decouple operator fusion from operator scheduling, improving resource utilization on modern hardware.
- Experimental results demonstrate up to 20x faster tuning compared to schedule-search frameworks such as AutoTVM and Ansor, alongside lower inference latency than PyTorch, ONNX Runtime, and TVM.
An Expert Overview of "Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs"
The paper "Hidet: Task-Mapping Programming Paradigm for Deep Learning Tensor Programs" presents a novel approach to optimizing tensor programs for deep learning systems, primarily targeting modern accelerators such as GPUs. The authors identify limitations in existing loop-oriented scheduling mechanisms found in state-of-the-art deep learning compilers like Apache TVM, which are unable to efficiently express complex optimizations required for peak performance on hardware accelerators.
Key Contributions
The main contribution of this work is the introduction of the task-mapping programming paradigm, which allows finer-grained control when optimizing tensor programs. This paradigm replaces traditional loop-oriented scheduling with a direct embedding of scheduling decisions into tensor programs via task mappings. A task mapping defines which tasks are assigned to each parallel processing unit and in what order they execute, enabling optimizations that loop-oriented schedules cannot readily express.
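To make the idea concrete, the sketch below models a task mapping as a function from a worker (e.g., thread) index to an ordered list of 2-D tasks, with larger mappings built by composing simple "spatial" and "repeat" mappings. This is a minimal, framework-independent illustration; the class and function names are hypothetical and are not Hidet's actual API.

```python
from itertools import product

class TaskMapping:
    """Toy model of a task mapping: assigns each worker an ordered list of 2-D tasks.
    Hypothetical names for illustration only -- not Hidet's real API."""

    def __init__(self, num_workers, task_shape, worker_to_tasks):
        self.num_workers = num_workers   # number of parallel workers (e.g., threads)
        self.task_shape = task_shape     # extent of the 2-D task grid
        self._fn = worker_to_tasks       # worker id -> list of (i, j) tasks

    def tasks(self, worker):
        return self._fn(worker)

    def __mul__(self, other):
        """Compose two mappings: every task of `self` is refined by the task grid of `other`."""
        shape = (self.task_shape[0] * other.task_shape[0],
                 self.task_shape[1] * other.task_shape[1])
        workers = self.num_workers * other.num_workers

        def composed(worker):
            outer = self.tasks(worker // other.num_workers)
            inner = other.tasks(worker % other.num_workers)
            return [(i1 * other.task_shape[0] + i2, j1 * other.task_shape[1] + j2)
                    for (i1, j1) in outer for (i2, j2) in inner]

        return TaskMapping(workers, shape, composed)


def spatial(m, n):
    """Each of the m*n workers handles exactly one task, laid out in row-major order."""
    return TaskMapping(m * n, (m, n), lambda w: [(w // n, w % n)])


def repeat(m, n):
    """A single worker iterates over all m*n tasks sequentially."""
    return TaskMapping(1, (m, n), lambda w: list(product(range(m), range(n))))


# A 2x2 grid of workers, each sequentially covering a 2x2 tile of a 4x4 task grid:
mapping = spatial(2, 2) * repeat(2, 2)
for w in range(mapping.num_workers):
    print(f"worker {w} -> {mapping.tasks(w)}")
```

Because assignment and execution order are explicit, the compiler (or developer) can reason directly about which data each worker touches, which is what later enables optimizations such as double buffering.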
Task-Mapping Programming Paradigm
The paper details how the task-mapping paradigm facilitates the expression of optimizations like double buffering, which are crucial for keeping both memory and computational units busy during matrix operations. By leveraging task mappings, developers can directly assign computations to processing units, gaining finer-grained control over execution order and resource utilization.
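As an illustration of the double-buffering pattern itself, the following simplified Python sketch mimics the structure a GPU matmul kernel would use: while the current tile is being consumed by the compute step, the next tile is prefetched into the other buffer. On a GPU the prefetch would be an asynchronous copy overlapped with computation; here the same prefetch/compute/swap structure runs sequentially on the CPU purely to show the pattern (this is not Hidet's generated code).

```python
import numpy as np

def tiled_matmul_double_buffered(a, b, tile_k=32):
    """CPU simulation of double buffering over the K dimension of a matmul."""
    m, k = a.shape
    _, n = b.shape
    c = np.zeros((m, n), dtype=a.dtype)

    # Stage 0: preload the first K-tile into buffer 0.
    buf = [(a[:, 0:tile_k].copy(), b[0:tile_k, :].copy()), None]
    cur = 0
    for k0 in range(0, k, tile_k):
        nxt = 1 - cur
        k1 = k0 + tile_k
        if k1 < k:
            # "Prefetch" the next tile while the current one is (conceptually) in use.
            buf[nxt] = (a[:, k1:k1 + tile_k].copy(), b[k1:k1 + tile_k, :].copy())
        a_tile, b_tile = buf[cur]
        c += a_tile @ b_tile   # compute on the current buffer
        cur = nxt              # swap buffers for the next iteration
    return c

a = np.random.rand(64, 128).astype(np.float32)
b = np.random.rand(128, 96).astype(np.float32)
assert np.allclose(tiled_matmul_double_buffered(a, b), a @ b, atol=1e-3)
```

Expressing this pattern requires decoupling the producer (data movement) and consumer (compute) loops, which is awkward in purely loop-oriented schedules but direct under task mappings.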
Post-Scheduling Fusion
Hidet's architecture also introduces post-scheduling fusion, which decouples the scheduling of composite operator graphs from individual operator scheduling: developers schedule each key operator in isolation, and the compiler automatically fuses surrounding operators into the already-scheduled program afterwards. This simplifies development while retaining the performance benefits of fusion.
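A rough way to picture this idea (a conceptual Python sketch under my own naming, not Hidet's implementation): only the central operator is scheduled and tuned, and later operators such as a bias add or ReLU are injected into its store path as an epilogue, so no intermediate tensor is materialized.

```python
import numpy as np

def scheduled_matmul(a, b, epilogue=lambda i, j, v: v):
    """A toy pre-scheduled matmul kernel. Only this operator is scheduled/tuned;
    `epilogue` is the hook where later operators are fused in after scheduling."""
    m, k = a.shape
    _, n = b.shape
    c = np.empty((m, n), dtype=a.dtype)
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i, p] * b[p, j]
            # Post-scheduling fusion: the fused epilogue runs right before the store.
            c[i, j] = epilogue(i, j, acc)
    return c

# Fuse "bias add + ReLU" into the already-scheduled matmul:
bias = np.random.rand(8).astype(np.float32)
fused = lambda i, j, v: max(v + bias[j], 0.0)

a = np.random.rand(4, 6).astype(np.float32)
b = np.random.rand(6, 8).astype(np.float32)
out = scheduled_matmul(a, b, epilogue=fused)
ref = np.maximum(a @ b + bias, 0.0)
assert np.allclose(out, ref, atol=1e-5)
```

The key point is that the matmul schedule never needs to know which operators will be fused around it, so the schedule space does not blow up with every new operator combination.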
Hardware-Centric Schedule Space
To address inefficiencies in schedule spaces, the authors propose a hardware-centric approach, focusing on the capabilities of the underlying hardware rather than specific input sizes. This method significantly reduces tuning time and ensures robust performance across varying input dimensions.
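One way to picture a hardware-centric schedule space: candidate tile configurations are enumerated and filtered using device limits (threads per block, shared memory capacity) rather than the shapes of a particular input, so the candidate set does not have to be re-enumerated for every input size. The sketch below is a simplified, hypothetical illustration of that filtering step; the limits and tile sizes are illustrative numbers, not values from the paper.

```python
from itertools import product

# Hypothetical hardware limits (illustrative, roughly GPU-scale numbers).
SHARED_MEM_BYTES = 48 * 1024
MAX_THREADS_PER_BLOCK = 1024
BYTES_PER_ELEM = 4  # fp32

def hardware_centric_schedules():
    """Enumerate matmul tile schedules that are valid for the *device*, independent of
    any particular input shape; boundary cases are handled later by predication."""
    candidates = []
    for block_m, block_n, block_k in product([32, 64, 128], [32, 64, 128], [8, 16, 32]):
        for warps in [4, 8]:
            threads = warps * 32
            # Shared memory holds one tile of A (block_m x block_k) and one of B
            # (block_k x block_n), double-buffered (hence the factor of 2).
            smem = 2 * (block_m * block_k + block_k * block_n) * BYTES_PER_ELEM
            if threads <= MAX_THREADS_PER_BLOCK and smem <= SHARED_MEM_BYTES:
                candidates.append(dict(block_m=block_m, block_n=block_n,
                                       block_k=block_k, threads=threads))
    return candidates

space = hardware_centric_schedules()
print(f"{len(space)} device-valid schedules to tune, regardless of input size")
```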
Experimental Findings
The paper presents an extensive experimental evaluation showing that Hidet consistently outperforms existing frameworks such as PyTorch, ONNX Runtime, TVM's AutoTVM, and Ansor on inference latency. The gains are largely attributed to Hidet's ability to express advanced optimizations like double buffering efficiently. Hidet's tuning process is also notably quicker, with reported reductions in tuning time of up to 20x compared to schedule-search-based compilers.
Implications and Future Directions
The implications of this research are significant in both practical applications and theoretical explorations of compiler design for deep learning. Practically, Hidet provides a framework that improves inference latency and resource utilization on accelerators. Theoretically, it opens avenues for further refining scheduling paradigms and exploring hybrid approaches that balance expressiveness and ease of development.
Future developments could include extending Hidet to hardware platforms beyond GPUs and incorporating additional advanced optimizations. Applying Hidet to emerging AI workloads and model architectures would also show how well the task-mapping paradigm and the hardware-centric schedule space scale and adapt.
Conclusion
Hidet marks a substantial advancement in deep learning compiler technology, addressing the limitations of current methods with a novel task-mapping programming paradigm that enhances both performance and developer usability. Its contributions lay the groundwork for more efficient execution of deep learning models on modern compute architectures, offering pathways to future innovations in AI infrastructure.