HeFFTe Library: Scalable Distributed FFTs

Updated 3 February 2026

The HeFFTe library provides distributed multidimensional FFTs for heterogeneous HPC systems and supports arbitrary processor decompositions.
It offers a templated C++ API for both uniform and non-uniform decompositions, with customizable grid mappings and compile-time backend selection.
Its integration in imaging pipelines such as RICK 2.0 yields a roughly 2.9× end-to-end speedup on a MeerKAT test case and a roughly 19× FFT speedup on a large LOFAR test case when moving from CPUs to GPUs.

HeFFTe (“Highly Efficient FFT for Exascale”) is a library providing distributed multidimensional Fast Fourier Transforms (FFTs) targeted at high-performance computing (HPC) environments with arbitrary processor decompositions and heterogeneous accelerator support. Designed to address the scalability and portability challenges inherent to modern scientific workflows—such as radio astronomy imaging at exascale—HeFFTe enables efficient execution of large-scale FFTs across diverse computational architectures, including multi-core CPUs and GPUs. Its adoption within complex imaging pipelines, such as the RICK 2.0 system for SKA-scale radio telescopes, exemplifies its role in overcoming communication and scalability bottlenecks while delivering high and portable performance (Lacopo et al., 27 Jan 2026).

1. Core Functionality and Programming Interface

HeFFTe is built around the provision of fully distributed, multidimensional FFTs accommodating a variety of processor grid decompositions—including slab, pencil, and brick layouts. Unlike FFTW-MPI, which restricts users to uniform domain splits, HeFFTe enables both uniform and non-uniform 1D/2D/3D decompositions, facilitating customized load balancing strategies tuned to application-specific data and workflow requirements.
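As a rough illustration of what a 1D slab decomposition hands to each rank, the sketch below converts per-rank row counts into inclusive index ranges along the split axis, for both an even split and a caller-supplied non-uniform split. The names `extent`, `uniform_counts`, and `slabs_from_counts` are hypothetical helpers written for this example; they are not part of HeFFTe.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive index range [low, high] owned by one rank along the split axis.
// An empty slab is encoded as high == low - 1.
struct extent {
    long low;
    long high;
};

// Even split of n rows over P ranks; the first (n % P) ranks get one extra row.
std::vector<long> uniform_counts(long n, int P) {
    assert(n >= 0 && P > 0);
    std::vector<long> counts(static_cast<std::size_t>(P), n / P);
    for (long r = 0; r < n % P; ++r) {
        counts[static_cast<std::size_t>(r)] += 1;
    }
    return counts;
}

// Turn arbitrary (possibly non-uniform) per-rank row counts into contiguous extents.
std::vector<extent> slabs_from_counts(const std::vector<long>& counts) {
    std::vector<extent> slabs;
    slabs.reserve(counts.size());
    long offset = 0;
    for (long c : counts) {
        assert(c >= 0);
        slabs.push_back({offset, offset + c - 1});
        offset += c;
    }
    return slabs;
}
```

For example, `slabs_from_counts(uniform_counts(10, 4))` gives the ranges [0,2], [3,5], [6,7], [8,9], while `slabs_from_counts({1, 6, 3})` gives the non-uniform ranges [0,0], [1,6], [7,9].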

The primary user interface is a templated C++ API that utilizes explicit “plan” objects. A typical usage pattern involves instantiating a grid mapping (with dimensions and processor-local offsets), constructing an FFT plan tied to a specific backend (e.g., FFTW for CPUs, cuFFT/rocFFT/oneMKL for GPUs), and invoking the transform operation:

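// Schematic pattern; identifiers are illustrative, not exact HeFFTe signatures.
heffte::grid_map map(plan_shape, rank, size);      // local extents and offsets for this rank
heffte::fft3d_r2c<backend> plan(map, direction);   // plan bound to a compile-time backend
plan.transform(input_buffer, output_buffer);       // distributed real-to-complex transform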

Backends are selectable at compile-time, allowing code written against the HeFFTe API to execute unmodified across CPU-only, GPU-accelerated, and heterogeneous HPC platforms. This design enables users—such as those integrating with RICK 2.0—to maximize portability and performance with a unified codebase (Lacopo et al., 27 Jan 2026).

2. Distributed FFT Algorithms and Computational Model

HeFFTe implements multidimensional FFTs using pencil or slab decompositions, interleaving distributed all-to-all communication steps with local 1D FFT executions. For a 3D mesh of size $N_1 \times N_2 \times N_3$ distributed over $P$ processes:

- Each process owns a data block of approximately $N_{\rm loc} \simeq N/P$ grid points, where $N = N_1 N_2 N_3$.
- Computation per process consists of three passes of local 1D FFTs, one per dimension, with the aggregate cost approximated as

$T_{\rm comp}\approx 3\,\alpha_{\rm flops}\,N_{\rm loc}\log N_{\rm loc}$

where $\alpha_{\rm flops}$ reflects the FFT kernel efficiency.
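The interleaving of local 1D transforms with data redistribution can be seen in a serial 2D analogue: transform every row, transpose, transform every row again, and transpose back. The sketch below is a minimal single-process illustration under that assumption; `dft1d` is a naive $\mathcal{O}(n^2)$ DFT standing in for a tuned FFT kernel, and `transpose` stands in for the all-to-all exchange. None of these names come from HeFFTe.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) forward DFT; a placeholder for an optimized local 1D FFT.
std::vector<cplx> dft1d(const std::vector<cplx>& x) {
    const std::size_t n = x.size();
    const double pi = std::acos(-1.0);
    std::vector<cplx> y(n);
    for (std::size_t k = 0; k < n; ++k) {
        cplx sum(0.0, 0.0);
        for (std::size_t j = 0; j < n; ++j) {
            const double angle = -2.0 * pi * static_cast<double>(k * j) / static_cast<double>(n);
            sum += x[j] * std::polar(1.0, angle);
        }
        y[k] = sum;
    }
    return y;
}

// Apply dft1d to every row of a row-major rows x cols matrix.
void dft_rows(std::vector<cplx>& a, std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    for (std::size_t r = 0; r < rows; ++r) {
        std::vector<cplx> row(a.begin() + r * cols, a.begin() + (r + 1) * cols);
        const std::vector<cplx> out = dft1d(row);
        for (std::size_t c = 0; c < cols; ++c) a[r * cols + c] = out[c];
    }
}

// Out-of-place transpose (rows x cols -> cols x rows).
// In a distributed run this step is the all-to-all redistribution.
std::vector<cplx> transpose(const std::vector<cplx>& a, std::size_t rows, std::size_t cols) {
    assert(a.size() == rows * cols);
    std::vector<cplx> t(a.size());
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            t[c * rows + r] = a[r * cols + c];
    return t;
}

// 2D DFT as: local transforms, redistribute, local transforms, redistribute back.
std::vector<cplx> dft2d_by_transposes(std::vector<cplx> a, std::size_t rows, std::size_t cols) {
    dft_rows(a, rows, cols);          // transform along the contiguous axis
    a = transpose(a, rows, cols);     // make the other axis contiguous
    dft_rows(a, cols, rows);          // transform along the former slow axis
    return transpose(a, cols, rows);  // restore the rows x cols layout
}
```

A 3D transform follows the same pattern with one more axis, which is where the one (slab) or two (pencil) redistributions come from.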

Inter-process communication arises from the need to redistribute data between local FFT steps. A slab decomposition requires a single redistribution, while a pencil decomposition entails two all-to-all transposes. Communication overhead is characterized by the α–β model:

$T_{\rm comm, total}\approx 2\,\alpha_{\rm mpi}\,(P-1) + 2\,\beta_{\rm mpi}\,\frac{N_{\rm loc}}{P}$

where $\alpha_{\rm mpi}$ denotes the message startup cost and $\beta_{\rm mpi}$ the per-byte cost. Overall, the computational cost scales as $\mathcal{O}((N/P)\log N)$, while communication overhead increases with the process count $P$. HeFFTe hides these details behind its plan/transform interface, but practical scaling behavior is still governed by the model above (Lacopo et al., 27 Jan 2026).
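The two cost expressions can be evaluated directly to see how the balance shifts with $P$. The following is a minimal sketch that transcribes the formulas as stated; the function names are hypothetical, the base-2 logarithm is an assumption (the text does not fix the base), and the coefficients must be supplied by the reader from measurements.

```cpp
#include <cassert>
#include <cmath>

// T_comp ~ 3 * alpha_flops * N_loc * log(N_loc); log base 2 is an assumption.
double compute_time(double alpha_flops, double n_loc) {
    assert(n_loc >= 1.0);
    return 3.0 * alpha_flops * n_loc * std::log2(n_loc);
}

// T_comm ~ 2 * alpha_mpi * (P - 1) + 2 * beta_mpi * N_loc / P
double comm_time(double alpha_mpi, double beta_mpi, int P, double n_loc) {
    assert(P >= 1);
    return 2.0 * alpha_mpi * (P - 1) + 2.0 * beta_mpi * n_loc / P;
}

// Modelled total time for n_global grid points spread evenly over P processes.
double model_time(double alpha_flops, double alpha_mpi, double beta_mpi,
                  double n_global, int P) {
    const double n_loc = n_global / P;
    return compute_time(alpha_flops, n_loc) + comm_time(alpha_mpi, beta_mpi, P, n_loc);
}
```

With unit coefficients, `compute_time(1.0, 8.0)` is 72 and `comm_time(1.0, 1.0, 4, 8.0)` is 10. Holding the other inputs fixed, the startup term $2\alpha_{\rm mpi}(P-1)$ grows linearly in $P$, which is the source of the communication growth noted above.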

3. Integration and Use in RICK 2.0 Imaging Pipeline

In the RICK 2.0 imaging pipeline, HeFFTe underpins the entire distributed FFT workflow across the uvw-grid representation. Notably, RICK 2.0 eliminates the need for a global all-reduce on the full uvw-grid by leveraging a 1D non-uniform slab decomposition along the v-axis, a division supported by HeFFTe but not by FFTW-MPI.

Each process is assigned a Gaussian-weighted segment of the v-axis:

$w_v(i) = \exp\left[-\frac{(i-\mu_v)^2}{2\sigma_v^2}\right], \quad \sum w_v = P$

ensuring both computational and gridding load balance.

Ghost padding of $\epsilon = (\text{Kernel} - 1)/(2 N_u)$ is applied so each process can locally convolve with minimal halo exchange.
Input visibilities are bucket-sorted by their v-coordinate and re-distributed via MPI_Sendrecv to the owning rank, replacing the previous broadcast–allreduce model on the grid.
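One way to read the Gaussian weighting and the bucket sort together is sketched below. It assumes that $i$ indexes rows of the v-axis, that the weights are normalized so each rank receives a contiguous slab of roughly unit cumulative weight, and that a visibility is routed to the rank whose slab contains its v row. These are interpretive assumptions, and `gaussian_slab_bounds`, `owner_of`, and `bucket_by_owner` are hypothetical names, not RICK or HeFFTe functions. Ghost padding and the actual MPI exchange are left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Split n_v rows into P contiguous slabs of roughly equal Gaussian weight.
// Returns P + 1 boundaries b; rank r owns rows [b[r], b[r+1]).
std::vector<std::size_t> gaussian_slab_bounds(std::size_t n_v, std::size_t P,
                                              double mu_v, double sigma_v) {
    assert(P > 0 && n_v >= P && sigma_v > 0.0);
    std::vector<double> w(n_v);
    double total = 0.0;
    for (std::size_t i = 0; i < n_v; ++i) {
        const double d = static_cast<double>(i) - mu_v;
        w[i] = std::exp(-d * d / (2.0 * sigma_v * sigma_v));
        total += w[i];
    }
    assert(total > 0.0);
    const double scale = static_cast<double>(P) / total;  // enforce sum(w) == P

    std::vector<std::size_t> bounds{0};
    double cumulative = 0.0;
    for (std::size_t i = 0; i < n_v; ++i) {
        cumulative += w[i] * scale;
        // Close a slab each time the running weight passes the next integer.
        while (bounds.size() < P && cumulative >= static_cast<double>(bounds.size())) {
            bounds.push_back(i + 1);
        }
    }
    while (bounds.size() < P) bounds.push_back(n_v);  // guard against round-off
    bounds.push_back(n_v);
    return bounds;
}

// Rank owning row v_row: the number of interior boundaries <= v_row.
std::size_t owner_of(const std::vector<std::size_t>& bounds, std::size_t v_row) {
    assert(bounds.size() >= 2 && v_row < bounds.back());
    const auto first = bounds.begin() + 1;
    const auto last = bounds.end() - 1;
    return static_cast<std::size_t>(std::upper_bound(first, last, v_row) - first);
}

// Bucket visibility indices by destination rank, given each visibility's v row.
std::vector<std::vector<std::size_t>> bucket_by_owner(
        const std::vector<std::size_t>& bounds, const std::vector<std::size_t>& v_rows) {
    std::vector<std::vector<std::size_t>> buckets(bounds.size() - 1);
    for (std::size_t k = 0; k < v_rows.size(); ++k) {
        buckets[owner_of(bounds, v_rows[k])].push_back(k);
    }
    return buckets;
}
```

With the Gaussian centered on the axis, slabs near the center come out narrower than those at the edges, so ranks covering the densely weighted region hold fewer rows. Each bucket would then be sent to its owning rank with point-to-point messages.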

RICK 2.0 invokes HeFFTe FFTs per w-slice (or in batch), for example:

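// Schematic pattern; identifiers are illustrative, not exact HeFFTe signatures.
plan = heffte::plan<backend>(local_shape, global_size, rank, comm);   // plan for this rank's local slab
plan.transform(real_grid, complex_image, direction);                  // transform one w-slice (or a batch)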

Selecting the GPU backend transparently routes computation to device kernels (cuFFT or rocFFT). Transposes and local transforms are fully managed by HeFFTe, with all grid data remaining resident on the GPU except for internal host–device shuffles. OpenMP is used at the application level for convolution and w-phase correction, while HeFFTe handles device-specific FFT kernels (Lacopo et al., 27 Jan 2026).

4. Quantitative Performance Characterization

The adoption of HeFFTe within RICK 2.0 yields substantial performance gains, empirically documented on large-scale scientific workloads:

For the MeerKAT dataset (48 GB visibilities, $8192^2$ pixels, 32 w-planes, CINECA Leonardo “Booster”):

| Configuration | Total time (s) | Comm. (s) | FFT (s) | FFT speedup |
|---|---|---|---|---|
| Pure MPI CPU (32 ranks) | 73.77 | 10.24 | 11.96 | 1.00× |
| MPI+OpenMP CPU (4×8) | 68.25 | 10.99 | 33.10 | 0.36× |
| MPI+OpenMP+GPU (4×8 + 4 A100) | 25.64 | 10.99 | 7.24 | 1.65× |

On GPUs, FFT runtime is reduced by $\sim$1.65× relative to MPI+FFTW and $\sim$4.57× relative to MPI+OpenMP+FFTW.
For a large LOFAR test case ($363$ GB, $65{,}536^2$ pixels, $128$ w-planes, $256$ nodes):
- CPU-only FFT time: $549.42$ s
- GPU FFT time: $28.90$ s ($\sim$19× speedup)

System-level gains include:

- End-to-end runtime reduction (CPU to GPU): $73.8$ s $\rightarrow$ $25.6$ s ($\sim$2.9× speedup).
- Communication overhead, previously $96\%$ of runtime in RICK 1.x, is reduced to $\sim$43% of the GPU run; the absolute communication time ($\sim 11$ s) is essentially unchanged across configurations.
- FFT and gridding no longer dominate runtime, shifting bottlenecks to I/O and filesystem throughput (Lacopo et al., 27 Jan 2026).
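The derived figures quoted in this section follow from simple ratios of the reported timings. The sketch below recomputes them; `speedup` and `share` are hypothetical helper names, and the inputs are the timings given above.

```cpp
#include <cassert>

// Speedup of a run taking t_new seconds over a baseline taking t_base seconds.
double speedup(double t_base, double t_new) {
    assert(t_base > 0.0 && t_new > 0.0);
    return t_base / t_new;
}

// Fraction of the total runtime spent in one stage.
double share(double t_stage, double t_total) {
    assert(t_stage >= 0.0 && t_total > 0.0);
    return t_stage / t_total;
}
```

Using the table values, `speedup(11.96, 7.24)` ≈ 1.65 and `speedup(33.10, 7.24)` ≈ 4.57 for the FFT stage, `speedup(73.77, 25.64)` ≈ 2.88 end to end, and `share(10.99, 25.64)` ≈ 0.43 for communication in the GPU run. For the LOFAR case, `speedup(549.42, 28.90)` ≈ 19.0.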

5. Limitations, Trade-Offs, and Ongoing Development

Portability and scalability are central design goals for HeFFTe. In RICK 2.0, the one-API, multi-backend architecture enables seamless execution on Intel/AMD/NVIDIA GPUs and multi-core CPUs. However, the current deployment adopts a slab-based (1D) decomposition; expansion to 2D pencil/brick layouts is anticipated to sustain scalability at extreme process counts.

Current trade-offs and limitations:

- Two all-to-all transposes per 3D FFT step (in pencil decomposition) or one (in slab) persist as scaling bottlenecks, though they are mitigated by the reduced slab size in the new RICK distribution.
- Host-mediated GPU-GPU communication within HeFFTe limits efficiency in GPU deployments; the absence of GPU-Direct RDMA currently incurs extra latency. Planned integration of GPU-Direct RDMA in future HeFFTe releases is projected to reduce this latency by up to 50%.
- The Gaussian grid splitting parameter ($\sigma_v$) modulates FFT and gridding load balance; empirical studies indicate $\sigma_v=1000$ as optimal for the reported test cases. A dynamic histogram-based splitter is under active development.
- As FFT and gridding accelerate, parallel I/O and filesystem throughput become dominant bottlenecks; integration of ADIOS2 for asynchronous, nonblocking I/O is planned to mitigate these effects (Lacopo et al., 27 Jan 2026).

6. Impact and Future Prospects

Integrating HeFFTe into scientific imaging codes such as RICK 2.0 has eliminated global all-reduce operations on the uvw-grid in favor of a local slab decomposition with point-to-point bucket-sort redistribution. This change, supported by HeFFTe’s API and backend flexibility, delivers multi-architecture FFT capability, end-to-end pipeline speedups of up to $3\times$ on small test problems, and $19\times$ faster FFT computations on large $65{,}536^2$-pixel domains. Communication, previously a critical barrier at $96\%$ of runtime, now constitutes a minority overhead, clearing the way for exascale-ready imaging solutions for instruments such as the SKA (Lacopo et al., 27 Jan 2026).

This development also illuminates fertile ground for future research: two-dimensional decompositions to enhance balance at scale, direct GPU-GPU network transfers, and high-throughput, asynchronous I/O subsystems are key focus areas as scientific requirements surpass current exascale boundaries.

References (1)

Lacopo et al. (2026). Accelerating radio astronomy imaging with RICK: a step towards SKA-Mid and SKA-Low.
