- The paper presents an initial port of the Cooley-Tukey FFT to a single Tensix core and identifies data reordering as a critical performance bottleneck.
- The authors apply optimizations such as chunking and 128-bit memory accesses to reduce runtime, and resolve the overflow issues that arose through linker script modifications.
- Scaling to a 2D FFT across 64 Tensix cores is slower than a full Xeon CPU but draws less power and consumes less energy.
The paper "Exploring Fast Fourier Transforms on the Tenstorrent Wormhole" examines in detail how to port and optimize the Cooley-Tukey Fast Fourier Transform (FFT) algorithm for the Tenstorrent Wormhole, a PCIe RISC-V based accelerator. The Wormhole is designed primarily for AI and ML workloads, but the paper argues that its energy-efficient architecture also makes it useful for high-performance computing (HPC).
Core Contributions
The authors investigate several strategies for optimizing the FFT algorithm on the Tensix architecture. They highlight the decoupling of data movement from compute, with distinct cores dedicated to each task, as the central architectural feature and one that could benefit HPC applications.
- Initial Porting and Analysis: The FFT algorithm was initially ported to a single Tensix core, with performance measured against a single Xeon Platinum CPU core. This rudimentary implementation highlighted a significant bottleneck in data reordering, which posed challenges in optimizing for the Tensix architecture.
- Optimization Techniques: The authors explore several optimization strategies, including chunking to enable concurrent operations, leveraging ThCon for data copying, and employing 128-bit wide memory accesses for contiguous data. These optimizations reduce runtime but add complexity; for example, overflow issues arose and were addressed via linker script modifications.
- Scaling to 2D FFT: Extending the FFT to two dimensions across multiple Tensix cores illustrated the potential benefits of collective transposition operations. Using 64 cores in the n300 configuration, the 2D FFT was slower than execution on a full Xeon CPU but more energy efficient overall.
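The data reordering and 2D scaling described in the list above can be illustrated with a minimal Python sketch. It is not the authors' Tensix code: it assumes a textbook radix-2 Cooley-Tukey FFT in which the reordering is a bit-reversal permutation, and a 2D FFT built from row FFTs, a transpose, and a second pass of row FFTs. All function names are hypothetical.

```python
import cmath


def reverse_bits(index, bits):
    """Reverse the lowest `bits` bits of `index`."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def bit_reverse_permute(data):
    """Data reordering step: copy `data` into bit-reversed index order."""
    n = len(data)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        out[reverse_bits(i, bits)] = complex(data[i])
    return out


def fft_1d(data):
    """Iterative radix-2 Cooley-Tukey FFT (length must be a power of two)."""
    n = len(data)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    x = bit_reverse_permute(data)  # reordering is separate from the compute
    size = 2
    while size <= n:
        half = size // 2
        w_step = cmath.exp(-2j * cmath.pi / size)
        for start in range(0, n, size):
            w = 1 + 0j
            for k in range(half):
                # Butterfly: combine one element from each half of the block.
                a = x[start + k]
                b = x[start + k + half] * w
                x[start + k] = a + b
                x[start + k + half] = a - b
                w *= w_step
        size *= 2
    return x


def transpose(matrix):
    """Swap rows and columns; across many cores this is a collective step."""
    return [list(col) for col in zip(*matrix)]


def fft_2d(matrix):
    """2D FFT as row FFTs, a transpose, row FFTs again, and a transpose back."""
    rows_done = [fft_1d(row) for row in matrix]
    cols_done = [fft_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)
```

In this sketch each call to `fft_1d` is independent of the others, which is what would allow rows to be distributed across cores; the `transpose` is the only step that needs data from every row.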
Numerically, FFT performance on a single Tensix core was approximately 2.8 times lower than on a single Xeon CPU core, but the architecture's energy efficiency was notable. For 2D FFTs of size 1024 by 1024 elements, the Wormhole used 64 Tensix cores and delivered 3.6 times better energy efficiency than its CPU counterpart while drawing eight times less power.
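Because energy is power multiplied by runtime, the power and energy ratios quoted above together imply a runtime ratio. The following Python sketch works through that arithmetic, taking the quoted figures at face value; the function name is hypothetical and the result is a derived estimate, not a measurement reported in the paper.

```python
def implied_slowdown(power_ratio, energy_ratio):
    """Estimate how many times slower the accelerator is than the CPU.

    power_ratio  = CPU power / accelerator power
    energy_ratio = CPU energy / accelerator energy
    Since energy = power * time, the time ratio
    (accelerator time / CPU time) equals power_ratio / energy_ratio.
    """
    if power_ratio <= 0 or energy_ratio <= 0:
        raise ValueError("ratios must be positive")
    return power_ratio / energy_ratio


# Figures as quoted in the text: eight times less power, 3.6 times less energy.
slowdown = implied_slowdown(8.0, 3.6)
print(f"implied runtime ratio: {slowdown:.2f}x")
```

Under these assumptions the 2D FFT on the Wormhole would take a little over twice as long as on the CPU, which is consistent with the statement that it is slower yet more energy efficient.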
Implications and Future Work
The paper shows that RISC-V accelerators can be adapted for energy-efficient HPC applications. The findings point to adjustments at the hardware-software interface that could improve application efficiency further. For instance, tools and APIs for managing data reordering, together with direct register mapping, may significantly improve performance.
Future work suggested by the authors includes supporting larger problem sizes by using external DRAM and refining the data reordering so that it can rely on single 128-bit memory accesses, which could improve performance further.
Another avenue for follow-up research is scaling across multiple Wormhole cards and using the inter-card network to tackle larger multidimensional FFT problems.
Conclusion
This paper provides valuable insight into how RISC-V accelerator technology can serve the computational needs typical of HPC. It shows what is required to bridge AI-centric hardware designs and traditional HPC workloads, and it advocates architectural modifications that could make such accelerators advantageous in a broader range of computational domains. The work lays a foundation for future exploration of large-scale, energy-efficient HPC applications on RISC-V accelerators such as the Tenstorrent Wormhole.