SARIS: Accelerating Stencil Computations on Energy-Efficient RISC-V Compute Clusters with Indirect Stream Registers (2404.05303v1)

Published 8 Apr 2024 in cs.MS and cs.AR

Abstract: Stencil codes are performance-critical in many compute-intensive applications, but suffer from significant address calculation and irregular memory access overheads. This work presents SARIS, a general and highly flexible methodology for stencil acceleration using register-mapped indirect streams. We demonstrate SARIS for various stencil codes on an eight-core RISC-V compute cluster with indirect stream registers, achieving significant speedups of 2.72x, near-ideal FPU utilizations of 81%, and energy efficiency improvements of 1.58x over an RV32G baseline on average. Scaling out to a 256-core manycore system, we estimate an average FPU utilization of 64%, an average speedup of 2.14x, and up to 15% higher fractions of peak compute than a leading GPU code generator.
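To make the address-calculation overhead mentioned in the abstract concrete, here is a minimal software sketch of a 5-point stencil on a flattened 2D grid, written two ways. It is an illustration only, not the SARIS method or the hardware interface of indirect stream registers: the function names, the 5-point shape, and the coefficient layout are all assumptions chosen for the example. The first version spells out every neighbor address in the inner loop; the second fixes a table of index offsets once and reads the neighbors as a stream of base-plus-offset accesses, which is the general idea behind indirection-based access.

```python
def stencil_direct(grid, nx, ny, coeffs):
    """5-point stencil with explicit address arithmetic for every neighbor."""
    out = list(grid)  # boundary points are copied through unchanged
    for y in range(1, ny - 1):
        for x in range(1, nx - 1):
            c = y * nx + x  # flattened index of the center point
            out[c] = (coeffs[0] * grid[c]
                      + coeffs[1] * grid[c - 1]     # west
                      + coeffs[2] * grid[c + 1]     # east
                      + coeffs[3] * grid[c - nx]    # north
                      + coeffs[4] * grid[c + nx])   # south
    return out


def stencil_indirect(grid, nx, ny, coeffs):
    """Same stencil, with neighbors gathered through a fixed offset table.

    The offset table is a hypothetical software stand-in for an index
    array driving an indirect access stream (base address + index).
    """
    offsets = [0, -1, 1, -nx, nx]  # center, west, east, north, south
    out = list(grid)
    for y in range(1, ny - 1):
        base = y * nx
        for x in range(1, nx - 1):
            c = base + x
            # Operands arrive as a stream; the loop body is pure multiply-add.
            stream = (grid[c + off] for off in offsets)
            out[c] = sum(k * v for k, v in zip(coeffs, stream))
    return out
```

Both functions compute the same result. The second keeps the stencil shape in data (the `offsets` table) rather than in code, so a different stencil only needs a different table. In hardware, the analogous benefit would be that the core no longer issues the address arithmetic and loads itself, though the sketch above does not model that.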
