
Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators (1809.04070v2)

Published 10 Sep 2018 in cs.DC

Abstract: We show that DNN accelerator micro-architectures and their program mappings represent specific choices of loop order and hardware parallelism for computing the seven nested loops of DNNs, which enables us to create a formal taxonomy of all existing dense DNN accelerators. Surprisingly, the loop transformations needed to create these hardware variants can be precisely and concisely represented by Halide's scheduling language. By modifying the Halide compiler to generate hardware, we create a system that can fairly compare these prior accelerators. As long as proper loop blocking schemes are used, and the hardware can support mapping replicated loops, many different hardware dataflows yield similar energy efficiency with good performance. This is because the loop blocking can ensure that most data references stay on-chip with good locality and the processing units have high resource utilization. How resources are allocated, especially in the memory system, has a large impact on energy and performance. By optimizing hardware resource allocation while keeping throughput constant, we achieve up to 4.2X energy improvement for Convolutional Neural Networks (CNNs), 1.6X and 1.8X improvement for Long Short-Term Memories (LSTMs) and multi-layer perceptrons (MLPs), respectively.

Authors (12)
  1. Xuan Yang (49 papers)
  2. Mingyu Gao (22 papers)
  3. Qiaoyi Liu (4 papers)
  4. Jeff Ou Setter (1 paper)
  5. Jing Pu (7 papers)
  6. Ankita Nayak (5 papers)
  7. Steven Emberton Bell (1 paper)
  8. Kaidi Cao (26 papers)
  9. Heonjae Ha (1 paper)
  10. Priyanka Raina (11 papers)
  11. Christos Kozyrakis (31 papers)
  12. Mark Horowitz (21 papers)
Citations (207)

Summary

Analyzing DNN Accelerators through Halide's Scheduling Language

This paper presents a comprehensive framework for analyzing and designing DNN accelerators by leveraging Halide's scheduling language. The authors propose a methodology that connects the loop transformations intrinsic to DNN computations with the corresponding accelerator micro-architectures, enabling systematic exploration of the design space. The approach is formalized through a taxonomy that represents all existing dense DNN accelerators as variations in loop order and hardware parallelism, with Halide's scheduling language expressing these transformations precisely and concisely.

Loop-Based Approach to DNN Accelerator Design

The central thesis of this work is that the design space of DNN accelerators can be characterized by how they manipulate the nested loops at the heart of DNN computations, particularly convolution. Through loop transformations such as blocking, reordering, and parallelizing, diverse micro-architectures can be described in terms of the dataflows and resource allocations they support. Key contributions include extending Halide's scheduling language to express necessary hardware distinctions, such as local data propagation in systolic arrays, and building a toolchain that generates hardware from these extended schedules.
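To make the loop-nest framing concrete, the following is a minimal Python sketch (not the paper's code) of the seven nested loops of a dense convolution layer that the taxonomy is built around: batch, output channel, input channel, two output spatial dimensions, and two filter dimensions. The function name and layout conventions are assumptions for illustration.

```python
import numpy as np

def conv_layer(inp, weights):
    """Naive dense convolution written as the seven nested loops
    (batch, output channel, input channel, output row/col, filter
    row/col) whose orderings and blockings define the paper's
    accelerator taxonomy. Stride 1, 'valid' padding assumed."""
    N, C, H, W = inp.shape           # batch, input channels, height, width
    K, _, FY, FX = weights.shape     # output channels, filter height/width
    OY, OX = H - FY + 1, W - FX + 1  # output spatial size
    out = np.zeros((N, K, OY, OX))
    for n in range(N):                      # 1: batch
        for k in range(K):                  # 2: output channel
            for c in range(C):              # 3: input channel
                for oy in range(OY):        # 4: output row
                    for ox in range(OX):    # 5: output column
                        for fy in range(FY):        # 6: filter row
                            for fx in range(FX):    # 7: filter column
                                out[n, k, oy, ox] += (
                                    inp[n, c, oy + fy, ox + fx]
                                    * weights[k, c, fy, fx]
                                )
    return out
```

Every reordering, blocking, or hardware unrolling of these seven loops corresponds to a point in the dataflow/mapping space the paper enumerates.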

Impact of Loop Transformations

One of the notable insights provided by the authors is that energy efficiency across dataflow choices converges once optimal loop blocking strategies are employed. For architectures with similar resources, the observed performance and energy differences among dataflows are minimal, underscoring that loop blocking matters more than the choice of dataflow. Furthermore, supporting replicated loops in the mapping substantially improves compute-unit utilization, which is crucial for sustaining high throughput.
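The effect of loop blocking can be illustrated with a tiled variant of the same convolution (again an illustrative sketch, not the paper's code; the tile sizes `ty` and `tx` are hypothetical parameters). Blocking the output spatial loops means the inner loops touch only a small input patch at a time, which is what lets most references stay in a small on-chip buffer:

```python
import numpy as np

def conv_blocked(inp, weights, ty=2, tx=2):
    """Convolution with the two output spatial loops blocked (tiled).
    The computation is identical to the untiled loop nest; only the
    iteration order changes. Each (ty x tx) output tile reads a
    (ty+FY-1) x (tx+FX-1) input patch, so that patch can be held in a
    small on-chip buffer and reused across the filter loops."""
    N, C, H, W = inp.shape
    K, _, FY, FX = weights.shape
    OY, OX = H - FY + 1, W - FX + 1
    out = np.zeros((N, K, OY, OX))
    for oy0 in range(0, OY, ty):            # outer tile loops
        for ox0 in range(0, OX, tx):
            for n in range(N):
                for k in range(K):
                    for c in range(C):
                        for oy in range(oy0, min(oy0 + ty, OY)):
                            for ox in range(ox0, min(ox0 + tx, OX)):
                                for fy in range(FY):
                                    for fx in range(FX):
                                        out[n, k, oy, ox] += (
                                            inp[n, c, oy + fy, ox + fx]
                                            * weights[k, c, fy, fx]
                                        )
    return out
```

In Halide's scheduling language this corresponds to a `tile`/`reorder` directive applied to a fixed algorithm; the paper's point is that such one-line schedule changes, not the algorithm, are what distinguish accelerator dataflows.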

Optimizing Memory Hierarchies

The work identifies memory resource allocation as a pivotal factor in the overall efficiency of DNN accelerators. A sub-optimal allocation, particularly at the register-file level, can significantly inflate energy consumption. When each level of the memory hierarchy is sized carefully, smaller register files combined with a deeper hierarchy yield substantial improvements in energy efficiency. The analysis suggests that balancing data allocation across the hierarchy is essential, particularly for memory-bound layers with limited data reuse opportunities.
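The reasoning behind hierarchy sizing can be sketched with a simple data-movement energy model: total energy is the sum over levels of access count times per-access cost, so shifting references from expensive levels (DRAM) to cheap ones (register file) dominates the budget. The per-access energies and access counts below are assumed illustrative numbers, not figures from the paper:

```python
def access_energy(accesses_per_level, energy_per_access):
    """Total data-movement energy: sum over hierarchy levels of
    (accesses at that level) * (energy per access at that level)."""
    return sum(a * e for a, e in zip(accesses_per_level, energy_per_access))

# Hypothetical per-access energies in pJ for [register file, buffer, DRAM].
e = [0.1, 1.0, 200.0]

# With good blocking, most of the ~1e9 operand references hit the
# register file and few spill to DRAM; with poor blocking, more
# references fall through to the expensive levels.
good = access_energy([1e9, 1e7, 1e5], e)   # well-blocked mapping
poor = access_energy([1e9, 1e8, 1e6], e)   # poorly-blocked mapping
```

Under these assumed numbers the well-blocked mapping spends roughly a third of the poorly blocked one, which mirrors the paper's observation that memory allocation, more than dataflow, drives the energy gap.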

Theoretical and Practical Implications

The theoretical framework laid out in this paper paves the way for more systematic and efficient exploration of DNN accelerator designs. By unifying the understanding of loop transformations and hardware mapping, this model provides a foundation for both academics and industry practitioners to evaluate and compare novel architectures. Practically, the synthesis toolchain derived from this framework aids in rapidly prototyping new accelerator designs, streamlining the design process through reusable scheduling primitives.

Future Developments in AI Accelerator Research

As AI model complexities continue to increase, the adaptability of accelerator designs becomes more pivotal. Future research might expand upon this framework to incorporate adaptive scheduling languages that automatically fine-tune loop transformations to emerging workload demands. Additionally, the integration of increasingly heterogeneous computing resources, including FPGAs and novel memory technologies, could be explored to further expand the design space captured by this work.

In conclusion, "Interstellar: Using Halide's Scheduling Language to Analyze DNN Accelerators" offers a structured approach to understanding and creating DNN accelerators, shedding light on the intricate balance of computational loops and memory architectures that drive energy and performance efficiency. The methods and insights detailed in this paper are poised to play a significant role in guiding future developments in the field of AI hardware acceleration.