Wafer-Scale Fast Fourier Transforms (2209.15040v1)

Published 29 Sep 2022 in cs.DC and cs.PF

Abstract: We have implemented fast Fourier transforms for one-, two-, and three-dimensional arrays on the Cerebras CS-2, a system whose memory and processing elements reside on a single silicon wafer. The wafer-scale engine (WSE) encompasses a two-dimensional mesh of roughly 850,000 processing elements (PEs) with fast local memory and equally fast nearest-neighbor interconnections. Our wafer-scale FFT (wsFFT) parallelizes an $n^3$ problem with up to $n^2$ PEs. At that level of parallelism, each PE processes only a single vector of the 3D domain (known as a pencil) per superstep, where each of the three supersteps performs FFTs along one of the three axes of the input array. Between supersteps, wsFFT redistributes (transposes) the data to bring all elements of each one-dimensional pencil being transformed into the memory of a single PE. Each redistribution causes an all-to-all communication along one of the mesh dimensions. Given the level of parallelism, the messages transmitted between pairs of PEs can be as small as a single word. In theory, a mesh is not ideal for all-to-all communication because of its limited bisection bandwidth. However, the mesh interconnecting PEs on the WSE lies entirely on-wafer and achieves nearly peak bandwidth even with tiny messages. This high efficiency on fine-grain communication allows wsFFT to achieve unprecedented levels of parallelism and performance. We analyse in detail computation and communication time, as well as weak and strong scaling, using both FP16 and FP32 precision. With 32-bit arithmetic on the CS-2, we achieve 959 microseconds for a 3D FFT of a $512^3$ complex input array using a $512 \times 512$ subgrid of the on-wafer PEs. This is the largest parallelization ever achieved for this problem size and the first implementation to break the millisecond barrier.
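The three-superstep, pencil-based structure described in the abstract can be illustrated with a minimal single-node sketch in Python/NumPy (our assumption; the paper targets the CS-2's own programming model, not NumPy, and the function name `pencil_fft3d` is hypothetical). Each superstep applies 1D FFTs along contiguous pencils, and the transposes between supersteps stand in for the on-wafer data redistributions.

```python
import numpy as np

def pencil_fft3d(x):
    """Illustrative 3D FFT computed as three axis-wise supersteps.

    Each superstep runs 1D FFTs along contiguous "pencils"; the
    transposes between supersteps play the role of the redistributions
    that, on the WSE, become all-to-all exchanges along one mesh
    dimension. This is a sketch, not the wsFFT implementation.
    """
    # Superstep 1: FFT along axis 2 (these pencils are already contiguous).
    x = np.fft.fft(x, axis=-1)
    # Redistribution 1: make original axis 1 the contiguous pencil axis.
    x = np.ascontiguousarray(np.transpose(x, (0, 2, 1)))
    # Superstep 2: FFT along original axis 1.
    x = np.fft.fft(x, axis=-1)
    # Redistribution 2: make original axis 0 the contiguous pencil axis
    # (layout is now original axes (1, 2, 0)).
    x = np.ascontiguousarray(np.transpose(x, (2, 1, 0)))
    # Superstep 3: FFT along original axis 0.
    x = np.fft.fft(x, axis=-1)
    # Undo the permutations so the result is laid out like the input.
    return np.transpose(x, (2, 0, 1))

# Sanity check against NumPy's reference 3D FFT.
x = (np.random.rand(16, 16, 16) + 1j * np.random.rand(16, 16, 16)).astype(np.complex64)
assert np.allclose(pencil_fft3d(x), np.fft.fftn(x), atol=1e-3)
```

In the wafer-scale setting, each pencil resides in a single PE's local memory, so the transpose steps above correspond to fine-grained all-to-all message exchanges along a row or column of the PE mesh rather than a local memory copy.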
