A Statically and Dynamically Scalable Soft GPGPU

Published 8 Jan 2024 in cs.AR (arXiv:2401.04261v1)

Abstract: Current soft processor architectures for FPGAs do not utilize the potential of the massive parallelism available. FPGAs now support many thousands of embedded floating point operators, and have similar computational densities to GPGPUs. Several soft GPGPU or SIMT processors have been published, but the reported large areas and modest Fmax make their widespread use unlikely for commercial designs. In this paper we take an alternative approach, building the soft GPU microarchitecture around the FPGA resource mix available. We demonstrate a statically scalable soft GPGPU processor (where both parameters and feature set can be determined at configuration time) that always closes timing at the peak speed of the slowest embedded component in the FPGA (DSP or hard memory), with a completely unconstrained compile into a current Intel Agilex FPGA. We also show dynamic scalability, where a subset of the thread space can be specified on an instruction-by-instruction basis. For one example core type, we show a logic range -- depending on the configuration -- of 4k to 10k ALMs, along with 24 to 32 DSP Blocks, and 50 to 250 M20K memories. All of these instances close timing at 771 MHz, a performance level limited only by the DSP Blocks. We describe our methodology for reliably achieving this clock rate by matching the processor pipeline structure to the physical structure of the FPGA fabric. We also benchmark several algorithms across a range of data sizes, and compare to a commercial soft RISC processor.

Summary

The paper presents eGPU, a reconfigurable soft GPGPU that leverages both static and dynamic scalability for efficient FPGA resource utilization.
For one example core type, every reported configuration closes timing at 771 MHz, with resource usage ranging from 4k to 10k ALMs, 24 to 32 DSP Blocks, and 50 to 250 M20K memories.
The scalable design mitigates thread divergence with configurable predicate logic, enabling efficient SIMT processing on FPGA platforms.

Evaluation and Design of a Statically and Dynamically Scalable Soft GPGPU for FPGAs

The paper "A Statically and Dynamically Scalable Soft GPGPU" presents an approach to building soft General-Purpose Graphics Processing Units (GPGPUs) tailored for Field-Programmable Gate Arrays (FPGAs). The core aim of this research is to exploit the FPGA's massive parallelism and embedded floating-point operators in a soft GPGPU microarchitecture that is both statically and dynamically scalable. Previously published soft GPGPU and SIMT processors reported large areas and modest Fmax, which made widespread commercial use unlikely. This paper addresses those limitations by building the microarchitecture around the resource mix the FPGA actually provides.

Key Contributions and Results

The paper introduces a GPGPU microarchitecture known as the eGPU, which adapts to varying computational needs through a parameterizable thread space and memory configuration. Static scalability means that both the parameters and the feature set, including memory sizes and arithmetic capabilities, are chosen at configuration time, before the design is compiled into the FPGA. Dynamic scalability means that a subset of the thread space can be specified on an instruction-by-instruction basis at run time, so an instruction that only has work for a few threads does not need to be issued across all of them. Together, these two mechanisms let a designer trade FPGA resources against workload requirements.
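The split between the two kinds of scalability can be sketched as a small data model. This is an illustrative sketch only, not the eGPU's actual parameter set or instruction encoding: every field name, the default values (16 lanes, 512 threads, 4096 shared-memory words), and the rule that an active subset is the lowest-numbered threads are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CoreConfig:
    """Static parameters, fixed when the core is generated (all names hypothetical)."""
    num_lanes: int = 16            # assumed number of parallel lanes
    num_threads: int = 512         # assumed size of the full thread space
    shared_mem_words: int = 4096   # assumed shared-memory depth
    enable_predicates: bool = True # example of an optional feature

    def __post_init__(self) -> None:
        if self.num_lanes <= 0 or self.num_threads <= 0:
            raise ValueError("lane and thread counts must be positive")
        if self.num_threads % self.num_lanes != 0:
            raise ValueError("thread space must be a multiple of the lane count")


@dataclass(frozen=True)
class Instruction:
    """One instruction; active_threads is the dynamic, per-instruction part."""
    opcode: str
    active_threads: Optional[int] = None  # None means the full thread space


def threads_for(config: CoreConfig, instr: Instruction) -> range:
    """Return the thread IDs an instruction applies to in this toy model."""
    if instr.active_threads is None:
        count = config.num_threads
    else:
        count = instr.active_threads
    if not 0 < count <= config.num_threads:
        raise ValueError("active thread count outside the configured thread space")
    return range(count)


if __name__ == "__main__":
    cfg = CoreConfig()
    print(len(threads_for(cfg, Instruction("fadd"))))      # prints 512
    print(len(threads_for(cfg, Instruction("fadd", 32))))  # prints 32
```

In this model `CoreConfig` cannot change once it is created, which mirrors a choice made at configuration time, while each `Instruction` can narrow the thread space independently of the one before it.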

For one example core type, all reported instances of the eGPU close timing at 771 MHz, a rate limited only by the DSP Blocks in the FPGA. Resource consumption varies with the configuration: 4k to 10k ALMs, 24 to 32 DSP Blocks, and 50 to 250 M20K memories. The paper also benchmarks several algorithms across a range of data sizes and compares the results to a commercial soft RISC processor, attributing the eGPU's gains to its parallel architecture.

Architectural Insights

The design of the eGPU targets both area efficiency and clock rate by matching the processor pipeline structure to the physical structure of the FPGA fabric, which is how it reliably closes timing at the speed of the slowest embedded component (DSP or hard memory). Soft GPU designs tend to be homogeneous, but because the eGPU's parameters and feature set are configurable, both logic area and performance can be tuned to local requirements. In practice this means building functional units around the hard DSP Blocks and M20K memories and keeping the soft logic between them shallow enough that it never becomes the critical path.

Moreover, the eGPU mitigates thread divergence, a common performance bottleneck, through user-configurable predicate logic. Predicates let individual threads be enabled or disabled for conditional code, which keeps SIMT execution efficient when threads take different paths. Dynamic thread scaling addresses a related cost: when only a subset of threads has work, the instruction can be issued over that subset alone, which reduces cycle counts. Reduction operations are a typical GPGPU example, since the number of active threads shrinks at every step.
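A toy cycle model makes the effect on a reduction concrete. The sketch below is an assumption-laden illustration, not the paper's benchmark or the eGPU's real timing: it assumes one issue cycle per group of `lanes` threads, a thread space equal to the input length, and it ignores pipeline latency and memory access. The function names and the 16-lane figure are hypothetical.

```python
import math
from typing import List, Tuple


def issue_cycles(active_threads: int, lanes: int) -> int:
    """Toy model: one issue cycle per group of `lanes` threads."""
    return math.ceil(active_threads / lanes)


def tree_reduce(values: List[float], lanes: int, dynamic: bool) -> Tuple[float, int]:
    """Sum `values` with a pairwise tree; return (sum, modeled issue cycles).

    dynamic=False: every step is issued over the full thread space and a
                   predicate masks off the threads that have no work.
    dynamic=True:  every step is issued only over the threads that have work.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    data = list(values)
    cycles = 0
    stride = n // 2
    while stride >= 1:
        issued = stride if dynamic else n
        for tid in range(issued):
            predicate = tid < stride   # per-thread predicate bit
            if predicate:              # masked-off threads do not write
                data[tid] += data[tid + stride]
        cycles += issue_cycles(issued, lanes)
        stride //= 2
    return data[0], cycles


if __name__ == "__main__":
    values = [1.0] * 512
    print(tree_reduce(values, lanes=16, dynamic=False))  # prints (512.0, 288)
    print(tree_reduce(values, lanes=16, dynamic=True))   # prints (512.0, 35)
```

Both modes compute the same sum. Under these assumptions the predicated full-width version pays 32 issue cycles for each of the 9 steps, while the dynamically scaled version pays only for the threads that are still active, which is the kind of saving the text describes.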

Implications and Future Directions

A scalable, high-frequency soft GPGPU on FPGAs opens new options for software-hardware co-design. Because the core closes timing at the same clock rate across its reported configurations, designers can choose parameters and features for area without giving up frequency, which narrows the gap between flexible soft-core implementations and performance-oriented hard-core designs. Practically, this could lead to more cost-effective solutions for embedded systems that need GPU-like capabilities within custom constraints.

Future developments might explore refining the eGPU's synthesis and routing processes to further boost performance and reduce power consumption. Additionally, compiler development could enable more intuitive programming models, further democratizing the GPGPU's potential usage in diverse applications. This research opens pathways for integrating more sophisticated AI processing capabilities into FPGA solutions, potentially influencing fields such as autonomous systems and real-time data processing where configurable, high-throughput computation is essential.