Enabling Efficient Hybrid Systolic Computation in Shared L1-Memory Manycore Clusters (2402.12986v2)
Abstract: Systolic arrays and shared-L1-memory manycore clusters are commonly used architectural paradigms that offer different trade-offs to accelerate parallel workloads. While the first excel with regular dataflow at the cost of rigid architectures and complex programming models, the second are versatile and easy to program but require explicit dataflow management and synchronization. This work aims at enabling efficient systolic execution on shared-L1-memory manycore clusters. We devise a flexible architecture where small and energy-efficient RISC-V cores act as the systolic array's processing elements (PEs) and can form diverse, reconfigurable systolic topologies through queues mapped in the cluster's shared memory. We introduce two low-overhead RISC-V ISA extensions for efficient systolic execution, namely Xqueue and Queue-linked registers (QLRs), which support queue management in hardware. The Xqueue extension enables single-instruction access to shared-memory-mapped queues, while QLRs allow implicit and autonomous access to them, relieving the cores of explicit communication instructions. We demonstrate Xqueue and QLRs in MemPool, an open-source shared-memory cluster with 256 PEs, and analyze the hybrid systolic-shared-memory architecture's trade-offs on several DSP kernels with diverse arithmetic intensity. For an area increase of just 6%, our hybrid architecture can double MemPool's compute unit utilization, reaching up to 73%. In typical conditions (TT/0.80 V/25 °C), in a 22 nm FDX technology, our hybrid architecture runs at 600 MHz with no frequency degradation and is up to 65% more energy efficient than the shared-memory baseline, achieving up to 208 GOPS/W, with up to 63% of power spent in the PEs.
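The idea of cores acting as PEs that communicate through queues in shared memory can be illustrated with a small software model. The sketch below is not the paper's implementation and does not use the Xqueue or QLR instructions; it is a minimal Python simulation, assuming a linear chain of PEs joined by bounded FIFOs, where each PE holds one polynomial coefficient and applies one Horner step to every token that passes through. The names `BoundedQueue` and `run_horner_chain`, the queue capacity, and the choice of kernel are all hypothetical and chosen only for illustration.

```python
from collections import deque


class BoundedQueue:
    """Software stand-in for a FIFO placed in shared memory between two PEs.

    The capacity is an assumed parameter, not a value from the paper.
    """

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.items = deque()

    def can_push(self):
        return len(self.items) < self.capacity

    def can_pop(self):
        return len(self.items) > 0

    def push(self, value):
        assert self.can_push(), "push on a full queue"
        self.items.append(value)

    def pop(self):
        assert self.can_pop(), "pop on an empty queue"
        return self.items.popleft()


def run_horner_chain(coeffs, xs, capacity=2):
    """Evaluate a polynomial at every x in xs on a simulated linear PE chain.

    coeffs lists the coefficients from highest to lowest degree. PE i holds
    coeffs[i], pops a token (x, acc) from its input queue, and pushes
    (x, acc * x + coeffs[i]) to its output queue.
    """
    n = len(coeffs)
    # queues[i] feeds PE i; queues[n] collects the results of the last PE.
    queues = [BoundedQueue(capacity) for _ in range(n + 1)]
    pending = deque(xs)
    results = []
    while len(results) < len(xs):
        # Drain whatever the last PE has produced so far.
        while queues[n].can_pop():
            results.append(queues[n].pop()[1])
        # Step the PEs from last to first, so a token moves one hop per step.
        for i in reversed(range(n)):
            if queues[i].can_pop() and queues[i + 1].can_push():
                x, acc = queues[i].pop()
                queues[i + 1].push((x, acc * x + coeffs[i]))
        # Inject at most one new token per step at the head of the chain.
        if pending and queues[0].can_push():
            queues[0].push((pending.popleft(), 0))
    return results


if __name__ == "__main__":
    # 2*x^2 + 1 evaluated at x = 0, 1, 2, 3
    print(run_horner_chain([2, 0, 1], [0, 1, 2, 3]))
```

In this model every `push` and `pop` is an explicit call made by the PE. Per the abstract, Xqueue reduces such an access to a single instruction, and QLRs make it implicit by linking queues to registers; the sketch only mirrors the dataflow, not those mechanisms or their cost.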
- Sergio Mazzola
- Samuel Riedel
- Luca Benini