
Stream-HLS: Towards Automatic Dataflow Acceleration (2501.09118v1)

Published 15 Jan 2025 in cs.AR

Abstract: High-level synthesis (HLS) has enabled the rapid development of custom hardware circuits for many software applications. However, developing high-performance hardware circuits using HLS is still a non-trivial task requiring expertise in hardware design. Further, the hardware design space, especially for multi-kernel applications, grows exponentially. Therefore, several HLS automation and abstraction frameworks have been proposed recently, but many issues remain unresolved. These issues include: 1) relying mainly on hardware directives (pragmas) to apply hardware optimizations without exploring loop scheduling opportunities. 2) targeting single-kernel applications only. 3) lacking automatic and/or global design space exploration. 4) missing critical hardware optimizations, such as graph-level pipelining for multi-kernel applications. To address these challenges, we propose a novel methodology and framework on top of the popular multi-level intermediate representation (MLIR) infrastructure called Stream-HLS. Our framework takes a C/C++ or PyTorch software code and automatically generates an optimized dataflow architecture along with host code for field-programmable gate arrays (FPGAs). To achieve this, we developed an accurate analytical performance model for global scheduling and optimization of dataflow architectures. Stream-HLS is evaluated using various standard HLS benchmarks and real-world benchmarks from transformer models, convolution neural networks, and multilayer perceptrons. Stream-HLS designs outperform the designs of prior state-of-the-art automation frameworks and manually-optimized designs of abstraction frameworks by up to $79.43\times$ and $10.62\times$ geometric means respectively. Finally, the Stream-HLS framework is modularized, extensible, and open-sourced at \url{https://github.com/UCLA-VAST/Stream-HLS} (\url{https://doi.org/10.5281/zenodo.14585909}).

Summary

  • The paper introduces an automated framework that converts multi-kernel application code into efficient FPGA dataflow architectures using MLIR optimizations.
  • It employs a precise analytical performance model and global scheduling via MINLP to achieve speedups of up to 79.43x over existing frameworks.
  • The approach significantly reduces the expertise barrier in hardware design, streamlining high-performance accelerator development for AI and machine learning.

Stream-HLS: Towards Automatic Dataflow Acceleration

The advent of High-Level Synthesis (HLS) has significantly transformed hardware circuit development, elevating the abstraction level from traditional Hardware Description Languages (HDLs) like Verilog to higher-level languages such as C/C++. Despite these advancements, designing high-performance, efficient hardware circuits remains challenging, particularly for complex multi-kernel applications whose design space grows exponentially. This paper introduces Stream-HLS, a framework that addresses unresolved gaps in HLS automation. Stream-HLS operates atop the multi-level intermediate representation (MLIR) infrastructure and automates the synthesis of dataflow architectures from source code written in C/C++ or PyTorch.

Stream-HLS distinguishes itself through a robust methodology that automatically transforms software applications into dataflow architectures optimized for FPGAs. It leverages a precise analytical performance model that enables global scheduling and optimization. The evaluations highlight the efficacy of Stream-HLS: the framework achieves speedups of up to 79.43× (geometric mean) over state-of-the-art automated frameworks and up to 10.62× over manually-optimized designs of abstraction frameworks.
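The headline results are reported as geometric means over per-benchmark speedups, the standard aggregation for ratio-valued measurements. As a reminder of what that computes, here is a minimal sketch (the input numbers are illustrative, not taken from the paper):

```python
import math

def geometric_mean(speedups):
    """Geometric mean of per-benchmark speedup ratios.

    Unlike the arithmetic mean, this treats a 2x speedup and a
    2x slowdown as canceling out, which is why it is preferred
    for aggregating ratios across benchmarks.
    """
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Illustrative only: two benchmarks, 2x and 8x faster.
print(geometric_mean([2.0, 8.0]))  # -> 4.0
```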

Key Contributions

Stream-HLS introduces several distinct advancements:

  1. Automated Dataflow Design: The framework fully automates the conversion process, transforming multi-kernel application code into a dataflow architecture with streaming capabilities.
  2. MLIR Optimization Framework: Stream-HLS incorporates a dedicated MLIR library that performs optimizations, including dataflow canonicalization and buffer-to-FIFO conversion.
  3. Comprehensive Performance Model: A critical contribution is accurate performance modeling of inter-task communication through both shared-buffer and FIFO interfaces, foundational to reasoning about resource allocation and pipelining strategies.
  4. Global Scheduling Optimization: Through the deployment of mixed integer-nonlinear programming (MINLP), Stream-HLS optimizes task scheduling, enhancing the parallel execution of multi-kernel applications at a global scale.
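To give intuition for why modeling shared-buffer versus FIFO communication matters, the following is a minimal sketch of standard dataflow latency reasoning, not the paper's actual analytical model. With a shared buffer, a consumer task starts only after the producer finishes; with a streaming FIFO, the two tasks overlap, so end-to-end latency is roughly governed by the slower stage's initiation interval times the token count, plus a pipeline-fill term:

```python
def shared_buffer_latency(producer_cycles, consumer_cycles):
    # Shared-buffer handoff: the consumer waits for the producer
    # to finish writing the whole buffer, so latencies add up.
    return producer_cycles + consumer_cycles

def fifo_latency(producer_ii, consumer_ii, tokens, fill_depth):
    # Streaming FIFO: the stages overlap. Throughput is limited by
    # the slower initiation interval (II); fill_depth approximates
    # the cycles needed to prime the pipeline.
    return max(producer_ii, consumer_ii) * tokens + fill_depth

# Illustrative comparison for 1000 tokens at II=1 per stage:
sequential = shared_buffer_latency(1000, 1000)  # 2000 cycles
streamed = fifo_latency(1, 1, 1000, 4)          # 1004 cycles
print(sequential, streamed)
```

A global scheduler built on such a model can then trade off which task pairs to stream versus buffer, which is the kind of decision Stream-HLS formulates as an MINLP.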

Evaluation and Practical Implications

The evaluation benchmarked Stream-HLS against standard HLS benchmarks and real-world applications including transformer models, CNNs, and multilayer perceptrons. The results illustrate Stream-HLS's ability to automate hardware accelerator design while improving both efficiency and performance. By surpassing manually-tuned designs by such notable margins, Stream-HLS represents a significant step toward lowering the expertise barrier for developing high-performance hardware circuits.

The implications are vast, especially in domains requiring rapid yet efficient hardware adaptation to new algorithms and applications, such as AI and machine learning. Automating the optimization of design space exploration as exemplified by Stream-HLS can lead to more agile and cost-effective deployment of programmable hardware accelerators.

Future Directions

The framework sets the stage for ongoing innovations, particularly in enhancing the exploration of design spaces under different paradigms, such as extending the capabilities for multi-kernel optimizations. Future expansions may include incorporating more comprehensive support for stencils and streaming transformations, further simplifying the process of hardware accelerator development in broader scientific and engineering applications.

In conclusion, Stream-HLS exemplifies a major advancement in HLS, offering an automated and robust methodology to design innovative hardware accelerators. By merging performance modeling and global scheduling optimization, it emerges as a valuable asset in the toolkit for researchers and engineers facing the complexities of modern hardware design. The framework is open-sourced, promoting further academic and industry collaboration to refine and adopt these methodologies widely.
