
A Statically and Dynamically Scalable Soft GPGPU (2401.04261v1)

Published 8 Jan 2024 in cs.AR

Abstract: Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax makes their widespread use unlikely for commercial designs. In this paper we take an alternative approach, building the soft GPU microarchitecture around the FPGA resource mix available. We demonstrate a statically scalable soft GPGPU processor (where both parameters and feature set can be determined at configuration time) that always closes timing at the peak speed of the slowest embedded component in the FPGA (DSP or hard memory), with a completely unconstrained compile into a current Intel Agilex FPGA. We also show dynamic scalability, where a subset of the thread space can be specified on an instruction-by-instruction basis. For one example core type, we show a logic range -- depending on the configuration -- of 4k to 10k ALMs, along with 24 to 32 DSP Blocks, and 50 to 250 M20K memories. All of these instances close timing at 771 MHz, a performance level limited only by the DSP Blocks. We describe our methodology for reliably achieving this clock rate by matching the processor pipeline structure to the physical structure of the FPGA fabric. We also benchmark several algorithms across a range of data sizes, and compare to a commercial soft RISC processor.


Summary

  • The paper presents eGPU, a reconfigurable soft GPGPU that leverages both static and dynamic scalability for efficient FPGA resource utilization.
  • It reports that every instance of one example core type closes timing at 771 MHz, with resource usage ranging from 4k to 10k ALMs, 24 to 32 DSP blocks, and 50 to 250 M20K memory blocks depending on the configuration.
  • The scalable design mitigates thread divergence with configurable predicate logic, enabling efficient SIMT processing on FPGA platforms.

Evaluation and Design of a Statically and Dynamically Scalable Soft GPGPU for FPGAs

The paper "A Statically and Dynamically Scalable Soft GPGPU" presents an approach to building soft general-purpose graphics processing units (GPGPUs) tailored for Field-Programmable Gate Arrays (FPGAs). The core aim is to exploit the FPGA's massive parallelism and embedded floating-point operators with a soft GPGPU microarchitecture that is both statically and dynamically scalable. Earlier soft GPGPU and SIMT designs reported large areas and modest Fmax, which made their widespread use in commercial designs unlikely. This paper addresses those limitations by building the microarchitecture around the resource mix the FPGA actually provides.

Key Contributions and Results

The paper introduces a GPGPU microarchitecture known as the eGPU, which adapts to varying computational needs through its parameterizable thread space and memory configurations. Static scalability means that both parameters and feature set, including memory sizes and arithmetic capabilities, are fixed at configuration time. Dynamic scalability means that a subset of the thread space can be specified on an instruction-by-instruction basis, so that an instruction is issued only to the threads that need it. Together these help match FPGA resource usage to the workload.
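The distinction between the two kinds of scalability can be illustrated with a minimal Python sketch. It is not the eGPU's actual configuration interface: the class name `SoftGpuConfig`, its fields, and the helper `active_subset` are hypothetical names chosen for illustration. The frozen dataclass stands in for parameters fixed at configuration time, and the helper stands in for a per-instruction choice of thread subset.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SoftGpuConfig:
    """Hypothetical static (configuration-time) parameters of a soft GPGPU."""

    max_threads: int              # size of the full thread space
    shared_memory_words: int      # size of the shared memory
    has_integer_multiplier: bool  # example of an optional feature (assumed)

    def __post_init__(self) -> None:
        if self.max_threads <= 0 or self.shared_memory_words <= 0:
            raise ValueError("sizes must be positive")


def active_subset(config: SoftGpuConfig, requested_threads: int) -> int:
    """Dynamic choice: threads used by one instruction, clamped to the static limit."""
    if requested_threads < 1:
        raise ValueError("at least one thread must be active")
    return min(requested_threads, config.max_threads)


# Static choice, made once per build of the core (values are illustrative).
config = SoftGpuConfig(max_threads=512, shared_memory_words=4096,
                       has_integer_multiplier=True)

# Dynamic choice, made per instruction.
print(active_subset(config, 64))    # 64
print(active_subset(config, 1024))  # 512
```

Freezing the dataclass mirrors the idea that static parameters cannot change after the core is built, while the thread subset remains an ordinary run-time value.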

Numerical results show that all instances of one example core type close timing at 771 MHz on a current Intel Agilex FPGA, a rate limited only by the DSP blocks. Depending on the configuration, resource consumption ranges from 4k to 10k ALMs, 24 to 32 DSP blocks, and 50 to 250 M20K memory blocks. The paper also benchmarks several algorithms across a range of data sizes and compares the results against a commercial soft RISC processor, where the eGPU's parallelism yields substantial performance gains.

Architectural Insights

The design of the eGPU matches the processor pipeline structure to the physical structure of the FPGA fabric, so that timing closes reliably at the peak speed of the slowest embedded component (DSP or hard memory). Soft GPU designs tend to be homogeneous, but because the eGPU's feature set is parameterized, both logic area and performance can be tuned to local requirements. Functional units are mapped directly onto hard resources such as DSP blocks and embedded memories, and the depth of the surrounding soft logic is balanced against those resources so that it does not limit the clock rate.

Moreover, the eGPU mitigates the common performance bottleneck of thread divergence through user-configurable predicate logic, which manages conditional execution across threads and keeps SIMT processing efficient. Dynamic thread scaling can significantly reduce cycle counts for operations in which only a subset of threads is active. Reduction operations, a workload typical of GPGPU applications, are one example: the number of useful threads halves at every step.
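The effect of dynamic thread scaling on a reduction can be estimated with a simple cost model, sketched below in Python. The model is an assumption made for illustration, not the paper's measurement method: each instruction is taken to cost `ceil(issued_threads / lanes)` issue cycles, and the figures of 512 threads and 16 lanes are hypothetical inputs. Pipeline latency and memory traffic are ignored.

```python
import math


def instruction_cycles(issued_threads: int, lanes: int) -> int:
    """Issue cycles for one instruction under the assumed cost model."""
    return math.ceil(issued_threads / lanes)


def reduction_cycles(total_threads: int, lanes: int, dynamic: bool) -> int:
    """Issue cycles for a pairwise tree reduction over total_threads values.

    total_threads is assumed to be a power of two. With dynamic=False every
    step is issued to the full thread space; with dynamic=True each step is
    issued only to the threads that still do useful work.
    """
    cycles = 0
    active = total_threads // 2  # the first step adds pairs of values
    while active >= 1:
        issued = active if dynamic else total_threads
        cycles += instruction_cycles(issued, lanes)
        active //= 2
    return cycles


fixed = reduction_cycles(512, 16, dynamic=False)
scaled = reduction_cycles(512, 16, dynamic=True)
print(fixed, scaled)  # 288 35
```

Under these assumptions the nine reduction steps cost 288 issue cycles when every step covers all 512 threads, and 35 when each step covers only the active subset. The absolute numbers depend entirely on the assumed model; the point is that the saving grows with the number of steps in which most threads are idle.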

Implications and Future Directions

A scalable, high-frequency soft GPGPU on FPGAs creates opportunities for software-hardware co-design. Because the same core can be resized at configuration time and its active thread space adjusted per instruction, the work narrows the gap between flexible soft-core implementations and performance-oriented hard-core designs. In practice, this could lead to more cost-effective solutions for embedded systems that need GPU-like capabilities within custom constraints.

Future developments might refine the eGPU's synthesis and routing flow to further improve performance and reduce power consumption. Compiler development could also enable more intuitive programming models and broaden the eGPU's use across applications. The research may additionally open a path toward integrating AI processing into FPGA solutions, in fields such as autonomous systems and real-time data processing where configurable, high-throughput computation is essential.
