STRELA: STReaming ELAstic CGRA Accelerator for Embedded Systems (2404.12503v1)
Abstract: Reconfigurable computing offers a good balance between flexibility and energy efficiency. When combined with software-programmable devices such as CPUs, it can deliver higher performance by spatially distributing the parallelizable sections of an application across the reconfigurable device while the CPU handles the control-intensive sections. This work introduces an elastic Coarse-Grained Reconfigurable Architecture (CGRA) integrated into an energy-efficient RISC-V-based SoC designed for the embedded domain. The CGRA microarchitecture supports conditionals and irregular loops, making it adaptable to domain-specific applications. Additionally, we propose specific mapping strategies that enable efficient utilization of the CGRA both for simple applications, where the fabric is configured only once (one-shot kernels), and for more complex ones, where the CGRA must be reconfigured multiple times to complete them (multi-shot kernels). Large kernels also benefit from the independent memory nodes incorporated to streamline data accesses. Integrating the CGRA as an accelerator for the RISC-V processor yields a versatile and efficient framework that provides adaptability, processing capacity, and overall performance across various applications. The design has been implemented in TSMC 65 nm technology, achieving a maximum frequency of 250 MHz. It reaches a peak performance of 1.22 GOPs on one-shot kernels and 1.17 GOPs on multi-shot kernels. The best energy efficiency is 72.68 MOPs/mW for one-shot kernels and 115.96 MOPs/mW for multi-shot kernels. The design integrates power-gating and clock-gating techniques to tailor the architecture to the embedded domain while maintaining performance. The best speed-ups are 17.63x and 18.61x for one-shot and multi-shot kernels, respectively. The best energy savings in the SoC are 9.05x and 11.10x for one-shot and multi-shot kernels, respectively.