HeFFTe Library: Scalable Distributed FFTs
- The HeFFTe library is a highly efficient distributed multidimensional FFT solution for heterogeneous HPC, supporting arbitrary processor decompositions.
- It offers a templated C++ API for both uniform and non-uniform load balancing, enabling customizable grid mappings and backend selections.
- Its integration in imaging pipelines like RICK 2.0 demonstrates significant performance improvements and reduced communication overhead at scale.
HeFFTe (“Highly Efficient FFT for Exascale”) is a library providing distributed multidimensional Fast Fourier Transforms (FFTs) targeted at high-performance computing (HPC) environments with arbitrary processor decompositions and heterogeneous accelerator support. Designed to address the scalability and portability challenges inherent to modern scientific workflows—such as radio astronomy imaging at exascale—HeFFTe enables efficient execution of large-scale FFTs across diverse computational architectures, including multi-core CPUs and GPUs. Its adoption within complex imaging pipelines, such as the RICK 2.0 system for SKA-scale radio telescopes, exemplifies its role in overcoming communication and scalability bottlenecks while delivering high and portable performance (Lacopo et al., 27 Jan 2026).
1. Core Functionality and Programming Interface
HeFFTe is built around the provision of fully distributed, multidimensional FFTs accommodating a variety of processor grid decompositions—including slab, pencil, and brick layouts. Unlike FFTW-MPI, which restricts users to uniform domain splits, HeFFTe enables both uniform and non-uniform 1D/2D/3D decompositions, facilitating customized load balancing strategies tuned to application-specific data and workflow requirements.
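The difference between slab, pencil, and brick layouts can be made concrete with a small index calculation. The following is a minimal sketch in plain C++ that does not use the HeFFTe headers: `Box3`, `split_1d`, and `rank_box` are hypothetical names, and both the uniform split rule and the rank ordering (axis 0 fastest) are assumptions chosen for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Inclusive index range [low, high] owned by one rank along each axis.
struct Box3 {
    std::array<int, 3> low;
    std::array<int, 3> high;

    // Number of grid points inside the box.
    long long count() const {
        long long c = 1;
        for (int d = 0; d < 3; ++d) c *= (high[d] - low[d] + 1);
        return c;
    }
};

// Uniform split of n points into `parts` pieces; the first (n % parts)
// pieces receive one extra point. Returns {low, high}, both inclusive.
inline std::array<int, 2> split_1d(int n, int parts, int idx) {
    assert(parts > 0 && idx >= 0 && idx < parts);
    const int base = n / parts;
    const int extra = n % parts;
    const int low = idx * base + std::min(idx, extra);
    const int size = base + (idx < extra ? 1 : 0);
    return {low, low + size - 1};
}

// Box owned by `rank` on a process grid {p0, p1, p2}.
// Slab: {1, 1, P}; pencil: {1, p1, p2}; brick: {p0, p1, p2}.
inline Box3 rank_box(const std::array<int, 3>& n,
                     const std::array<int, 3>& grid, int rank) {
    assert(grid[0] > 0 && grid[1] > 0 && grid[2] > 0);
    assert(rank >= 0 && rank < grid[0] * grid[1] * grid[2]);
    // Rank ordering is an assumption: axis 0 varies fastest.
    const std::array<int, 3> coord = {rank % grid[0],
                                      (rank / grid[0]) % grid[1],
                                      rank / (grid[0] * grid[1])};
    Box3 b{};
    for (int d = 0; d < 3; ++d) {
        const std::array<int, 2> r = split_1d(n[d], grid[d], coord[d]);
        b.low[d] = r[0];
        b.high[d] = r[1];
    }
    return b;
}
```

With an 8×8×8 mesh on four ranks, a slab grid `{1, 1, 4}` and a pencil grid `{1, 2, 2}` both give each rank 128 points, but the slab keeps two axes fully local while the pencil keeps only one.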
The primary user interface is a templated C++ API that utilizes explicit “plan” objects. A typical usage pattern involves instantiating a grid mapping (with dimensions and processor-local offsets), constructing an FFT plan tied to a specific backend (e.g., FFTW for CPUs, cuFFT/rocFFT/oneMKL for GPUs), and invoking the transform operation:
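```cpp
heffte::grid_map map(plan_shape, rank, size);     // grid mapping: dimensions and processor-local offsets
heffte::fft3d_r2c<backend> plan(map, direction);  // FFT plan tied to the chosen backend
plan.transform(input_buffer, output_buffer);      // execute the transform
```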
Backends are selectable at compile-time, allowing code written against the HeFFTe API to execute unmodified across CPU-only, GPU-accelerated, and heterogeneous HPC platforms. This design enables users—such as those integrating with RICK 2.0—to maximize portability and performance with a unified codebase (Lacopo et al., 27 Jan 2026).
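Compile-time backend selection of this kind is commonly expressed with tag types and template specialization. The sketch below illustrates the general pattern only; it is not taken from the HeFFTe headers. The `sketch` namespace, the tag names, `backend_traits`, and the `SKETCH_USE_GPU` macro are all hypothetical.

```cpp
#include <cassert>
#include <string>

namespace sketch {

// Hypothetical backend tags; a real library would provide its own.
struct cpu_tag {};
struct gpu_tag {};

// Per-backend properties resolved entirely at compile time.
template <typename Backend> struct backend_traits;

template <> struct backend_traits<cpu_tag> {
    static constexpr bool on_device = false;
    static const char* name() { return "cpu"; }
};

template <> struct backend_traits<gpu_tag> {
    static constexpr bool on_device = true;
    static const char* name() { return "gpu"; }
};

// A plan type parameterized on the backend tag. User code written
// against plan<B> stays the same when B changes.
template <typename Backend>
class plan {
public:
    explicit plan(int n) : n_(n) { assert(n > 0); }

    std::string describe() const {
        return std::string(backend_traits<Backend>::name()) + ":" +
               std::to_string(n_);
    }

    static constexpr bool uses_device() {
        return backend_traits<Backend>::on_device;
    }

private:
    int n_;
};

// One build-time switch picks the backend for the whole application.
#ifdef SKETCH_USE_GPU
using default_backend = gpu_tag;
#else
using default_backend = cpu_tag;
#endif

}  // namespace sketch
```

Application code that only names `sketch::plan<sketch::default_backend>` compiles unchanged for either target, which is the portability property described above.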
2. Distributed FFT Algorithms and Computational Model
HeFFTe implements multidimensional FFTs using pencil or slab decompositions, interleaving distributed all-to-all communication steps with local 1D FFT executions. For a 3D mesh of size $N^3$ distributed over $P$ processes:
- Each process owns a data block of approximately $N^3/P$ grid points.
- Computation per process consists of three local 1D FFTs, with the aggregate cost approximated as $T_{\mathrm{comp}} \propto \frac{N^3}{P}\log N$, where the proportionality constant reflects the FFT kernel efficiency.
Inter-process communication arises from the need to redistribute data between local FFT steps. In slab decomposition, a single redistribution is required, while pencil entails two all-to-all transposes. Communication overhead is characterized by the α–β model, $T_{\mathrm{comm}} \approx \alpha\,n_{\mathrm{msg}} + \beta\,n_{\mathrm{bytes}}$, where $\alpha$ denotes the message startup cost, $\beta$ the per-byte cost, $n_{\mathrm{msg}}$ the number of messages, and $n_{\mathrm{bytes}}$ the data volume sent. Overall, the computational complexity scales as $O(N^3 \log N)$, while communication overhead increases with process count $P$. HeFFTe abstracts these complexities from the user via its plan/transform interface; however, the practical scaling behavior is governed by the above model (Lacopo et al., 27 Jan 2026).
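The scaling model can be turned into a rough estimator. The sketch below makes several assumptions that are not from the source: a cubic mesh with N points per side on P ranks, a kernel-efficiency constant named `c_fft`, and a flat all-to-all in which every rank sends P-1 messages carrying the off-rank fraction (P-1)/P of its local block. Pencil layouts in practice use sub-communicators, so treat the transpose estimate as an upper-level illustration rather than a prediction.

```cpp
#include <cassert>
#include <cmath>

// Model parameters; all names are illustrative assumptions.
struct CostModel {
    double c_fft;  // seconds per (point * log2) unit: kernel efficiency
    double alpha;  // message startup cost in seconds
    double beta;   // transfer cost in seconds per byte
};

// Local compute estimate: c_fft * (N^3 / P) * log2(N^3).
inline double compute_time(const CostModel& m, double n, int p) {
    assert(n > 1.0 && p > 0);
    const double points = n * n * n;
    return m.c_fft * (points / p) * std::log2(points);
}

// Alpha-beta estimate for one flat all-to-all transpose: each rank
// sends (P - 1) messages and moves the off-rank share of its block.
inline double alltoall_time(const CostModel& m, double n, int p,
                            double bytes_per_point) {
    assert(n > 1.0 && p > 0);
    const double local_bytes = (n * n * n / p) * bytes_per_point;
    const double messages = static_cast<double>(p - 1);
    const double off_rank_fraction = messages / p;
    return m.alpha * messages + m.beta * local_bytes * off_rank_fraction;
}

// Total per-rank estimate; transposes is 1 for slab, 2 for pencil.
inline double total_time(const CostModel& m, double n, int p,
                         double bytes_per_point, int transposes) {
    assert(transposes >= 0);
    return compute_time(m, n, p) +
           transposes * alltoall_time(m, n, p, bytes_per_point);
}
```

Calling `total_time` with `transposes` set to 1 and then 2 shows directly how the second transpose of a pencil layout adds to the communication term while the compute term stays fixed.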
3. Integration and Use in RICK 2.0 Imaging Pipeline
In the RICK 2.0 imaging pipeline, HeFFTe underpins the entire distributed FFT workflow across the uvw-grid representation. Notably, RICK 2.0 eliminates the need for a global all-reduce on the full uvw-grid by leveraging a 1D non-uniform slab decomposition along the v-axis, a division supported by HeFFTe but not by FFTW-MPI.
- Each process is assigned a Gaussian-weighted segment of the v-axis, ensuring both computational and gridding load balance.
- Ghost padding is applied so each process can locally convolve with minimal halo exchange.
- Input visibilities are bucket-sorted by their v-coordinate and re-distributed via MPI_Sendrecv to the owning rank, replacing the previous broadcast–allreduce model on the grid.
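The partitioning and bucketing steps above can be sketched as follows. The exact weighting used by RICK 2.0 is not reproduced here, so this example assumes one plausible reading: slab boundaries are chosen so that every rank receives roughly equal mass under a Gaussian weight centered on the middle of the v-axis, with `sigma` measured in grid cells. `gaussian_slab_bounds` and `owner_of` are hypothetical names, and the MPI exchange itself is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Returns nprocs + 1 boundaries with b[0] = 0 and b[nprocs] = nv.
// Rank r owns v-cells in the half-open range [b[r], b[r + 1]).
inline std::vector<int> gaussian_slab_bounds(int nv, int nprocs,
                                             double sigma) {
    assert(nprocs > 0 && nv >= nprocs && sigma > 0.0);

    // Cumulative Gaussian weight per cell (assumed weighting).
    std::vector<double> cum(nv);
    const double center = 0.5 * (nv - 1);
    double total = 0.0;
    for (int j = 0; j < nv; ++j) {
        const double z = (j - center) / sigma;
        total += std::exp(-0.5 * z * z);
        cum[j] = total;
    }

    std::vector<int> bounds(nprocs + 1, 0);
    bounds[nprocs] = nv;
    for (int r = 1; r < nprocs; ++r) {
        // Cut after the first cell whose cumulative weight reaches
        // the r-th equal-mass target.
        const double target = total * r / nprocs;
        int cut = static_cast<int>(
                      std::lower_bound(cum.begin(), cum.end(), target) -
                      cum.begin()) + 1;
        // Keep every slab non-empty, including the remaining ones.
        cut = std::max(cut, bounds[r - 1] + 1);
        cut = std::min(cut, nv - (nprocs - r));
        bounds[r] = cut;
    }
    return bounds;
}

// Rank that owns a given v-cell: the bucket used when sorting
// visibilities before the point-to-point redistribution.
inline int owner_of(const std::vector<int>& bounds, int v_cell) {
    assert(bounds.size() >= 2);
    assert(v_cell >= 0 && v_cell < bounds.back());
    return static_cast<int>(
               std::upper_bound(bounds.begin(), bounds.end(), v_cell) -
               bounds.begin()) - 1;
}
```

Under this weighting the central slabs come out narrower than the outer ones, which matches the intent of giving ranks near the densely sampled center of the grid fewer cells. Each visibility would then be placed in the bucket returned by `owner_of` before being sent to that rank.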
RICK 2.0 invokes HeFFTe FFTs per w-slice (or in batch), for example:
```cpp
auto plan = heffte::plan<backend>(local_shape, global_size, rank, comm);  // plan for this rank's slab
plan.transform(real_grid, complex_image, direction);                      // real grid -> complex image
```
4. Quantitative Performance Characterization
The adoption of HeFFTe within RICK 2.0 yields substantial performance gains, empirically documented on large-scale scientific workloads:
For the MeerKAT dataset (48 GB visibilities, pixels, 32 w-planes, CINECA Leonardo “Booster”):
| Configuration | Total Time (s) | Comm. (s) | FFT (s) | FFT Speedup |
|---|---|---|---|---|
| Pure MPI CPU (32 ranks) | 73.77 | 10.24 | 11.96 | 1.00× |
| MPI+OpenMP CPU (4x8) | 68.25 | 10.99 | 33.10 | 0.36× |
| MPI+OpenMP+GPU (4x8+4A100) | 25.64 | 10.99 | 7.24 | 1.65× |
- On GPUs, FFT runtime is reduced by 1.65× relative to MPI+FFTW and 4.57× relative to MPI+OpenMP+FFTW.
- For a large LOFAR test case ($363$ GB, pixels, $128$ w-planes, $256$ nodes):
  - CPU-only FFT time: $549.42$ s
  - GPU FFT time: $28.90$ s (19× speedup)
System-level gains include:
- End-to-end runtime reduction (CPU to GPU): $73.8$ s $\to$ $25.6$ s (2.9× speedup).
- Communication overhead, reduced relative to RICK 1.x, now accounts for 43% of the GPU run (absolute time $\approx 11$ s, essentially unchanged between the CPU and GPU configurations; impact greatly diminished).
- FFT and gridding no longer dominate runtime, shifting bottlenecks to I/O and filesystem throughput (Lacopo et al., 27 Jan 2026).
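The quoted ratios follow directly from the timings reported above, and a short check makes the arithmetic explicit. The helper names `speedup` and `share` are illustrative; the numbers in the usage note are the ones listed in the table and bullets.

```cpp
#include <cassert>
#include <cmath>

// Ratio of a baseline time to an improved time (both in seconds).
inline double speedup(double baseline_s, double improved_s) {
    assert(baseline_s > 0.0 && improved_s > 0.0);
    return baseline_s / improved_s;
}

// Fraction of a total runtime spent in one phase.
inline double share(double part_s, double total_s) {
    assert(part_s >= 0.0 && total_s > 0.0);
    return part_s / total_s;
}
```

For example, `speedup(11.96, 7.24)` and `speedup(33.10, 7.24)` reproduce the 1.65× and 4.57× FFT figures, `speedup(549.42, 28.90)` the roughly 19× LOFAR figure, `speedup(73.77, 25.64)` the roughly 2.9× end-to-end figure, and `share(10.99, 25.64)` the 43% communication share of the GPU run.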
5. Limitations, Trade-Offs, and Ongoing Development
Portability and scalability are central design goals for HeFFTe. In RICK 2.0, the one-API, multi-backend architecture enables seamless execution on Intel/AMD/NVIDIA GPUs and multi-core CPUs. However, current deployment adopts a slab-based (1D) decomposition. Expansion to 2D pencil/brick layouts is anticipated for future scalability at extreme process counts.
Current trade-offs and limitations:
- Two all-to-all transposes per 3D FFT step (in pencil decomposition) or one (in slab) persist as scaling bottlenecks, though mitigated by reduced slab size in the new RICK distribution.
- Host-mediated GPU-GPU communication within HeFFTe limits efficiency in GPU deployments; absence of GPU-Direct RDMA currently incurs extra latency. Planned integration of GPU-Direct RDMA in future HeFFTe releases is projected to reduce this latency by up to 50%.
- The Gaussian grid-splitting parameter modulates FFT and gridding load balance; empirical studies identify an optimal value for the specified test cases. A dynamic histogram-based splitter is under active development.
- As FFT and gridding accelerate, parallel I/O and filesystem throughput become dominant bottlenecks; integration of ADIOS2 for asynchronous, nonblocking I/O is planned to mitigate these effects (Lacopo et al., 27 Jan 2026).
6. Impact and Future Prospects
The integration of HeFFTe into advanced scientific imaging codes such as RICK 2.0 has eliminated global all-reduce operations on the uvw-grid in favor of a local slab decomposition and point-to-point bucket-sort redistribution. This transformation, supported by HeFFTe’s API and backend flexibility, delivers multi-architecture FFT capability, end-to-end pipeline speedups of up to $3\times$ on small test problems, and $19\times$ speedups for FFT computations on large, $65$k-pixel domains. Communication, previously a critical barrier, now constitutes a minority overhead, clearing the path toward exascale-ready imaging solutions for instruments such as the SKA (Lacopo et al., 27 Jan 2026).
This development also illuminates fertile ground for future research: two-dimensional decompositions to enhance balance at scale, direct GPU-GPU network transfers, and high-throughput, asynchronous I/O subsystems are key focus areas as scientific requirements surpass current exascale boundaries.