Guac: Energy-Aware and SSA-Based Generation of Coarse-Grained Merged Accelerators from LLVM-IR (2402.13513v1)

Published 21 Feb 2024 in cs.AR

Abstract: Designing accelerators for resource- and power-constrained applications is a daunting task. High-Level Synthesis (HLS) addresses these constraints through resource sharing, an optimization at the HLS binding stage that maps multiple operations to the same functional unit. However, resource sharing is often limited to reusing instructions within a basic block. Instead of searching globally for the best control and dataflow graphs (CDFGs) to combine, it is constrained by existing instruction mappings and schedules. Coarse-grained function merging (CGFM) at the intermediate representation (IR) level can reuse control and dataflow patterns without dealing with the post-scheduling complexity of mapping operations onto functional units, wires, and registers. The merged functions produced by CGFM can be translated to RTL by HLS, yielding Coarse-Grained Merged Accelerators (CGMAs). CGMAs are especially profitable across applications with similar data- and control-flow patterns. Prior work has used CGFM to generate CGMAs without regard for which CGFM algorithms best optimize area, power, and energy costs. We propose Guac, an energy-aware and SSA-based (static single assignment) CGMA generation methodology. Guac implements a novel ensemble of cost models for efficient CGMA generation. We also show that CGFM algorithms using SSA form to merge control- and dataflow graphs outperform prior non-SSA CGFM designs. We demonstrate significant area, power, and energy savings with respect to the state of the art. In particular, Guac more than doubles energy savings with respect to the closest related work while using a strong resource-sharing baseline.
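
As a rough, hypothetical sketch of the coarse-grained function merging idea the abstract describes (this example is not taken from the paper): two kernels that share a similar loop and accumulation structure can be merged into a single function whose behavior is chosen by a selector argument. An HLS tool can then synthesize the merged function once, so the shared control and datapath are reused instead of duplicated.

```c
/* Hypothetical illustration of coarse-grained function merging (CGFM).
 * kernel_a and kernel_b have similar control/data flow; kernel_merged
 * reuses the shared loop and accumulator, with a selector picking the
 * per-element operation. HLS could map the merged function to one
 * shared datapath, i.e. a coarse-grained merged accelerator (CGMA). */

#include <stddef.h>

/* Kernel A: multiply-accumulate over two vectors. */
int kernel_a(const int *x, const int *y, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += x[i] * y[i];
    return acc;
}

/* Kernel B: sum of absolute differences over two vectors. */
int kernel_b(const int *x, const int *y, size_t n) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int d = x[i] - y[i];
        acc += (d < 0) ? -d : d;
    }
    return acc;
}

/* Merged kernel: shared loop/accumulation structure, diverging only in
 * the per-element computation selected by select_b. */
int kernel_merged(const int *x, const int *y, size_t n, int select_b) {
    int acc = 0;
    for (size_t i = 0; i < n; ++i) {
        int v;
        if (!select_b) {
            v = x[i] * y[i];            /* behavior of kernel A */
        } else {
            int d = x[i] - y[i];        /* behavior of kernel B */
            v = (d < 0) ? -d : d;
        }
        acc += v;
    }
    return acc;
}
```

The general intuition, consistent with the abstract, is that merging at the IR level trades a small amount of selection logic (a multiplexer in hardware) for reuse of the larger shared control and dataflow structure; the paper's contribution is choosing which functions to merge, and how, using SSA-based merging and an ensemble of area/power/energy cost models rather than the ad-hoc selection of prior work.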

