From 8 Seconds to 370ms: Kernel-Fused SAR Imaging on Apple Silicon via Single-Dispatch FFT Pipelines

Published 4 Apr 2026 in cs.PF | (2604.03585v1)

Abstract: We present the first kernel-fused SAR Range Doppler pipeline on any GPU platform. By fusing FFT, matched-filter multiply, and IFFT into a single Metal compute dispatch, keeping all intermediate data in 32 KiB on-chip memory, we process a 4096×4096 complex SAR scene in 370 ms on an Apple M1 GPU, a 22× speedup over the multi-dispatch baseline (8.16 s). We further report the first FFT to exploit Apple's simdgroup_matrix 8×8 hardware MMA, enabled by an in-place Cooley–Tukey decimation-in-frequency formulation that halves the memory footprint versus Stockham. Radar image quality is preserved: all five point targets show 0.0 dB SNR deviation from the unfused FP32 reference.


Authors (1)

Mohamed Amine Bergach

Summary

The paper introduces a kernel-fused SAR Range Doppler Algorithm (RDA) pipeline that merges FFT, matched filtering, and IFFT into a single dispatch, reducing execution time from 8.16 s to 370 ms on 4096×4096 scenes.
The methodology leverages Apple’s simdgroup_matrix hardware with an in-place Cooley–Tukey DIF kernel, reaching 93% of Stockham’s scalar GFLOPS while minimizing memory transfers.
By working within the 32 KiB of on-chip threadgroup memory and Apple's unified memory, the approach cuts CPU-GPU synchronization overhead while holding the L2 relative error against the unfused reference to 2.44×10⁻⁷.

Kernel-Fused SAR Imaging on Apple Silicon GPUs: Single-Dispatch FFT Pipelines and MMA Hardware Exploitation

Introduction

This work introduces the first kernel-fused Synthetic Aperture Radar (SAR) Range Doppler Algorithm (RDA) pipeline implemented on any GPU, with a focus on Apple Silicon. The approach fuses the forward Fast Fourier Transform (FFT), matched-filter multiplication, and inverse FFT (IFFT) into a single Metal compute dispatch, keeping all intermediate data resident in 32 KiB of on-chip threadgroup memory. The authors further present the first FFT to exploit Apple's simdgroup_matrix 8×8 hardware Matrix Multiply-Accumulate (MMA) instructions, facilitated by an in-place Cooley–Tukey decimation-in-frequency formulation.

Technical Contributions

Kernel Fusion for SAR Pipelines

Traditional GPU SAR pipelines use separate kernel launches for FFT, filtering, and IFFT, so intermediate data makes a round trip through device memory between stages, which costs both bandwidth and latency. This work eliminates those redundant transfers by fusing all three operations into a single dispatch performed entirely within the GPU's 32 KiB threadgroup memory. For 4096-point complex float32 range lines, a typical SAR system size, one line occupies exactly 4096 × 8 bytes = 32 KiB, so the on-chip buffer is filled completely but never exceeded.
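As a rough CPU-side illustration of what the fused dispatch computes for one range line, the sketch below chains a forward FFT, a pointwise matched-filter multiply, and an inverse FFT on a single local buffer. It is plain Python rather than the paper's Metal kernel, and the function names, the 64-sample line, the 16-sample chirp, and the 20-sample delay are all assumptions chosen for the example.

```python
import cmath

def fft(x, inverse=False):
    """Recursive radix-2 FFT, unnormalized; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    sign = 1 if inverse else -1
    even = fft(x[0::2], inverse)
    odd = fft(x[1::2], inverse)
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out

def matched_filter_spectrum(reference):
    """Frequency-domain matched filter: conjugate of the reference spectrum."""
    return [s.conjugate() for s in fft(reference)]

def compress_line_fused(line, filter_spectrum):
    """FFT -> multiply -> IFFT on one range line without leaving local memory."""
    n = len(line)
    spectrum = fft(line)
    spectrum = [s * h for s, h in zip(spectrum, filter_spectrum)]
    return [v / n for v in fft(spectrum, inverse=True)]

# Hypothetical test signal: a short chirp, echoed back with a 20-sample delay.
n, pulse_len, delay = 64, 16, 20
chirp = [cmath.exp(0.5j * cmath.pi * t * t / n) if t < pulse_len else 0j
         for t in range(n)]
echo = [chirp[(t - delay) % n] for t in range(n)]

out = compress_line_fused(echo, matched_filter_spectrum(chirp))
peak = max(range(n), key=lambda i: abs(out[i]))
print(peak)  # index of the compressed point target
```

The compressed line peaks at the echo delay, which is the behaviour the fused kernel has to reproduce for every range line; the gain of fusion lies in where the intermediate spectrum is stored, not in the arithmetic itself.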

MMA-Accelerated FFT Kernels

Leveraging the simdgroup_matrix API unique to Apple Silicon, the authors construct the first MMA-enabled FFT for Metal. An out-of-place Stockham FFT would exceed the threadgroup memory capacity, because it requires double buffering in the split (real/imaginary) layout; the implementation therefore employs an in-place Cooley–Tukey DIF kernel. The MMA-accelerated kernel achieves 93% of Stockham’s scalar GFLOPS, with the residual gap attributed to increased threadgroup-memory transactions inherent in the split layout and the absence of double-buffering. The thread/lane-to-element mapping had to be derived empirically because it is not publicly documented, which reflects significant low-level engineering.
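The memory argument can be made concrete with a minimal sketch of an in-place decimation-in-frequency FFT. This is a scalar radix-2 version in Python, not the paper's MMA-based Metal kernel, and the helper names are assumptions; it only illustrates that a DIF transform can overwrite its single buffer stage by stage, whereas a double-buffered scheme needs twice the space.

```python
import cmath

def fft_dif_inplace(buf):
    """In-place radix-2 DIF FFT on a list of complex values (power-of-two length)."""
    n = len(buf)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    span = n // 2
    while span >= 1:
        for start in range(0, n, 2 * span):
            for j in range(span):
                w = cmath.exp(-2j * cmath.pi * j / (2 * span))
                a, b = buf[start + j], buf[start + j + span]
                buf[start + j] = a + b             # top butterfly output
                buf[start + j + span] = (a - b) * w  # bottom output, twiddled
        span //= 2
    # DIF leaves the spectrum in bit-reversed order; permute back in place.
    j = 0
    for i in range(1, n):
        bit = n >> 1
        while j & bit:
            j ^= bit
            bit >>= 1
        j |= bit
        if i < j:
            buf[i], buf[j] = buf[j], buf[i]

BYTES_PER_COMPLEX64 = 8           # two float32 values per sample
THREADGROUP_BYTES = 32 * 1024     # on-chip budget quoted in the text

def footprint_bytes(points, buffers):
    """Working-set size for a transform that keeps `buffers` full copies."""
    return points * BYTES_PER_COMPLEX64 * buffers

in_place_bytes = footprint_bytes(4096, buffers=1)   # 32768
stockham_bytes = footprint_bytes(4096, buffers=2)   # 65536

impulse = [1 + 0j] + [0j] * 7
fft_dif_inplace(impulse)          # spectrum of an impulse is flat
print(in_place_bytes <= THREADGROUP_BYTES, stockham_bytes <= THREADGROUP_BYTES)
```

With one buffer, a 4096-point line exactly fits the 32 KiB budget; with two, it needs 64 KiB, which is why the double-buffered variant is ruled out at this size.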

Apple Silicon-Specific Optimizations

The entire SAR pipeline is executed in Metal, exploiting Apple’s unified memory and high on-chip bandwidth. Steps involving global memory (e.g., azimuth FFTs requiring matrix transposes) remain as separate dispatches due to threadgroup memory capacity limitations for columnar data. However, both range and azimuth compression steps are optimized using the proposed fusion strategies, minimizing CPU-GPU synchronization and maximizing the utilization of Metal's compute capabilities. The implementation further exploits the native device-wide coherence and zero-copy semantics for device memory, typical of Apple’s SoC architecture.

Experimental Results

Performance

On an Apple M1 GPU (8 cores, 1278 MHz, 68 GB/s DRAM), the kernel-fused pipeline processes a 4096×4096 complex float32 SAR scene in 370 ms, a 22× reduction in total execution time compared to the 8.16 s multi-dispatch baseline. Range compression, fully fused into a single dispatch, completes in just 29 ms (about 7.1 μs per line, against a theoretical I/O bound of ≈4 ms). The overall pipeline efficiency is therefore primarily constrained by the azimuth FFT steps, which are currently limited by global transposes.
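The headline figures can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the one assumption is that the I/O bound counts a single read plus a single write of the whole scene at the stated DRAM bandwidth, which happens to reproduce the ≈4 ms figure.

```python
ROWS = COLS = 4096
BYTES_PER_SAMPLE = 8        # complex float32
DRAM_BANDWIDTH = 68e9       # bytes per second, as quoted for the M1

baseline_s = 8.16           # multi-dispatch pipeline
fused_s = 0.370             # kernel-fused pipeline
range_s = 0.029             # fused range compression only

speedup = baseline_s / fused_s
us_per_line = range_s / ROWS * 1e6

# Assumed I/O model: read the scene once, write it once.
scene_bytes = ROWS * COLS * BYTES_PER_SAMPLE
io_bound_ms = 2 * scene_bytes / DRAM_BANDWIDTH * 1e3

print(f"speedup      ~ {speedup:.1f}x")
print(f"per line     ~ {us_per_line:.2f} us")
print(f"scene size   = {scene_bytes / 2**20:.0f} MiB")
print(f"I/O bound    ~ {io_bound_ms:.2f} ms")
```

Under that model, range compression at 29 ms runs at roughly seven times the bandwidth floor, so it is not yet purely bandwidth-limited.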

Quality Validation

Despite extensive architectural and algorithmic fusion, the radar image quality is strictly preserved. The fused and unfused outputs exhibit an L2 relative error of $2.44\times10^{-7}$, well within FP32 round-off, and no SNR difference is observable on any of the five simulated point targets, demonstrating that kernel fusion does not sacrifice precision or interpretability for performance.
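For readers who want to run the same kind of check on their own outputs, here is a minimal sketch of an L2 relative error between two complex images. The summary does not spell out the exact definition used in the paper, so the formula ‖test − reference‖₂ / ‖reference‖₂ is an assumption, as are the function name and the toy data.

```python
import math

def l2_relative_error(test, reference):
    """||test - reference||_2 / ||reference||_2 over flattened complex samples."""
    diff = math.sqrt(sum(abs(t - r) ** 2 for t, r in zip(test, reference)))
    norm = math.sqrt(sum(abs(r) ** 2 for r in reference))
    return diff / norm

# Toy data: the "fused" output differs from the reference by a 1e-7 relative scale.
reference = [1 + 0j, 2j, -3 + 0j]
fused = [r * (1 + 1e-7) for r in reference]

err = l2_relative_error(fused, reference)
print(f"{err:.3e}")
```

An error of a few parts in 10⁷ is on the order of float32 machine epsilon (about 1.19×10⁻⁷), which is consistent with the text's reading that the two pipelines differ only by round-off.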

Comparison with Prior GPU SAR Work

Relative to embedded SAR implementations on NVIDIA Jetson and discrete RTX 2060 hardware, the Apple M1 result is competitive at a lower power envelope, although the comparison spans different scene sizes and algorithms and is therefore not like-for-like. The authors argue that the unified memory model is decisive for SAR workloads bounded by bandwidth rather than compute throughput.

Implications and Future Directions

The implications of this work are significant for both high-performance embedded radar processing and general GPU-accelerated signal processing on Apple Silicon. SAR imaging, historically dominated by custom multi-dispatch CUDA pipelines on NVIDIA GPUs, now achieves comparable efficiency and quality with a unified Apple Metal approach. By exposing the fusion potential of the threadgroup memory model and the simdgroup_matrix hardware, this pipeline establishes a new efficiency baseline for bandwidth-bound DSP workloads on commodity SoCs.

Future research directions proposed include:

Mixed-precision fusion: Using native FP16 arithmetic (with zero-cycle conversion to FP32) to roughly double throughput while preserving radar accuracy by performing critical accumulations in FP32.
High-core M-series scaling: Extending to next-generation Apple Silicon (e.g., M4 Max, with 40 GPU cores and 546 GB/s bandwidth) for near-real-time processing of 8K×8K scenes.
Tiled transpose for azimuth steps: Alleviating the remaining bottleneck via tiled in-threadgroup memory transpose to minimize device-memory I/O.
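The tiled-transpose idea in the last item can be sketched on the CPU as follows. This is an illustrative Python version under stated assumptions, not the authors' planned kernel: the small `scratch` list stands in for threadgroup memory, and the tile size of 8 and the function name are arbitrary choices.

```python
def transpose_tiled(matrix, rows, cols, tile=8):
    """Transpose a row-major rows x cols matrix one tile at a time."""
    out = [0] * (rows * cols)
    for r0 in range(0, rows, tile):
        for c0 in range(0, cols, tile):
            r1 = min(r0 + tile, rows)
            c1 = min(c0 + tile, cols)
            width = c1 - c0
            # Load the tile with row-contiguous reads (stand-in for threadgroup memory).
            scratch = [matrix[r * cols + c]
                       for r in range(r0, r1) for c in range(c0, c1)]
            # Write it back transposed; each output row segment is contiguous too.
            for c in range(c0, c1):
                for r in range(r0, r1):
                    out[c * rows + r] = scratch[(r - r0) * width + (c - c0)]
    return out

rows, cols = 5, 7
m = list(range(rows * cols))
t = transpose_tiled(m, rows, cols, tile=4)
print(t[:rows])  # first row of the transpose = first column of m
```

The point of staging through a tile is that both the reads from the source and the writes to the destination touch short contiguous runs, instead of one side striding across the whole matrix.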

Conclusion

This paper presents a methodologically rigorous, architecturally aware, and empirically validated SAR RDA implementation for Apple Silicon GPUs. The single-dispatch kernel fusion strategy yields a transformative reduction in overall processing time, setting a new performance and reproducibility bar for SAR and broader FFT-intensive pipelines on modern, unified-memory SoCs. The fusion and MMA approaches are directly transferable to other bandwidth-bound scientific domains on Apple hardware, and the open-source release enables further academic and industrial exploration.
