Employing polyhedral methods to optimize stencils on FPGAs with stencil-specific caches, data reuse, and wide data bursts (2401.13645v1)
Abstract: It is well known that, to accelerate stencil codes on CPUs or GPUs and to exploit hardware caches and their cache lines, optimizers must find spatial and temporal locality in array accesses to harvest data-reuse opportunities. FPGAs carry the burden of having no built-in caches (or only pre-built hardware descriptions of cache blocks that are inefficient for stencil codes). This paper demonstrates that this lack is also an opportunity: polyhedral methods can be used to generate stencil-specific cache structures of the right sizes on the FPGA and to fill and flush them efficiently with wide bursts during stencil execution. The paper shows how to derive the appropriate directives and code restructurings from stencil codes so that the FPGA compiler generates fast stencil hardware. Switching on our optimization improves the runtime of a set of 10 stencils by factors between 43x and 156x.
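The idea of a stencil-specific cache that is filled and flushed with wide bursts can be illustrated with a minimal C sketch. It is not the code the paper's optimizer generates: the 3-point stencil, the names `stencil_naive` and `stencil_cached`, and the sizes `N` and `TILE` are assumptions chosen for illustration, and the two `memcpy` calls merely stand in for the contiguous transfers that an HLS compiler may turn into bursts.

```c
#include <assert.h>
#include <string.h>

#define N 1024   /* hypothetical problem size */
#define TILE 64  /* hypothetical tile size; N - 2 need not be a multiple of it */

/* Baseline: every stencil operand is read directly from the large array. */
void stencil_naive(const float *in, float *out) {
    for (int i = 1; i < N - 1; i++)
        out[i] = (in[i - 1] + in[i] + in[i + 1]) / 3.0f;
}

/* Tiled variant: one contiguous copy fills a small local buffer (tile plus
 * a one-element halo on each side), the stencil reads only that buffer,
 * and one contiguous copy writes the results back. */
void stencil_cached(const float *in, float *out) {
    float cache[TILE + 2];  /* local "cache", sized for this stencil */
    float result[TILE];     /* local output buffer */

    for (int base = 1; base < N - 1; base += TILE) {
        /* The last tile may be shorter than TILE. */
        int len = (N - 1 - base < TILE) ? (N - 1 - base) : TILE;
        assert(len > 0);

        /* Fill: in[base-1 .. base+len] lands in cache[0 .. len+1]. */
        memcpy(cache, &in[base - 1], (size_t)(len + 2) * sizeof(float));

        /* Compute: each input element is fetched once but used three times. */
        for (int j = 0; j < len; j++)
            result[j] = (cache[j] + cache[j + 1] + cache[j + 2]) / 3.0f;

        /* Flush: write the whole tile back in one contiguous transfer. */
        memcpy(&out[base], result, (size_t)len * sizeof(float));
    }
}
```

Both functions compute the same values for indices 1 to N - 2. The tiled one reads each input element from the large array about once per tile instead of three times, which is the data reuse the abstract refers to. In this sketch the buffer size follows directly from the stencil's access pattern (tile length plus halo), whereas the paper derives such sizes with polyhedral methods.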
- Replication Package for “Employing polyhedral methods to optimize stencils on FPGAs with stencil-specific caches, data reuse, and wide data bursts”. https://doi.org/10.5281/zenodo.10396084
Authors:
- Florian Mayer
- Julian Brandner
- Michael Philippsen