AMD Versal Implementations of FAM and SSCA Estimators (2506.18003v1)

Published 22 Jun 2025 in cs.AR

Abstract: Cyclostationary analysis is widely used in signal processing, particularly in the analysis of human-made signals, and spectral correlation density (SCD) is often used to characterise cyclostationarity. Unfortunately, for real-time applications, even utilising the fast Fourier transform (FFT), the high computational complexity associated with estimating the SCD limits its applicability. In this work, we present optimised, high-speed field-programmable gate array (FPGA) implementations of two SCD estimation techniques. Specifically, we present an implementation of the FFT accumulation method (FAM) running entirely on the AMD Versal AI engine (AIE) array. We also introduce an efficient implementation of the strip spectral correlation analyser (SSCA) that can be used for window sizes up to $2^{20}$. For both techniques, a generalised methodology is presented to parallelise the computation while respecting memory size and data bandwidth constraints. Compared to an NVIDIA GeForce RTX 3090 graphics processing unit (GPU) which uses a similar 7nm technology to our FPGA, for the same accuracy, our FAM/SSCA implementations achieve speedups of 4.43x/1.90x and a 30.5x/24.5x improvement in energy efficiency.

High-Performance FPGA Implementations of FAM and SSCA Estimators on AMD Versal

This paper presents a comprehensive study of high-speed, energy-efficient FPGA implementations of two widely used spectral correlation density (SCD) estimation techniques, the FFT Accumulation Method (FAM) and the Strip Spectral Correlation Analyzer (SSCA), on the AMD Versal VCK5000 platform. The work addresses the persistent challenge of real-time cyclostationary analysis, which is computationally intensive even with FFT-based algorithms, by leveraging the heterogeneous architecture of the Versal Adaptive SoC, particularly its AI Engine (AIE) array.

Technical Contributions

The authors introduce several notable contributions:

AIE-Only FAM Implementation: The first reported FAM implementation that operates entirely within the AIE array, requiring about 34% of available AIE tiles and no programmable logic (PL) resources. This design choice preserves PL resources for potential integration with software-defined radio (SDR) front-ends or ML back-ends, which is critical for RFML (radio frequency machine learning) pipelines.
Efficient SSCA Implementation: A novel SSCA design that supports window sizes up to $2^{20}$ samples, using a decomposed 2D FFT and PL-based transpose units to manage large intermediate matrices. This is the first reported FPGA-accelerated SSCA implementation.
Generalized Parallelization Methodology: A systematic approach to parallelizing both FAM and SSCA computations, balancing memory and bandwidth constraints inherent to the Versal architecture.

Implementation Details

FAM on AIE

The FAM implementation is optimized for small-window, high-throughput scenarios. The design is partitioned into three pipeline stages: Framing, Demodulate, and FFT2. The entire pipeline is mapped to the AIE array, exploiting its high internal bandwidth and minimizing data movement between AIE and PL.

Resource Allocation: For typical parameters ($N=2048$, $N_P=256$), 137 AIE tiles are used, with careful buffer management (ping-pong buffering) to maximize throughput within the 16 KB per-tile memory constraint.
Parallelism: The architecture supports parallel processing of multiple frequency channels, with up to 128 FFT2 kernels operating concurrently.
Dataflow Optimization: The design minimizes off-chip memory accesses and maximizes on-chip data reuse, which is essential for achieving high energy efficiency.
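
The three stages can be illustrated with a minimal pure-Python sketch. It is not the authors' AIE code: it assumes that the Framing stage includes windowing and the first $N_P$-point transform, uses a Hamming window, and substitutes a naive DFT for the FFT kernels. The names `dft` and `fam`, the hop size, and the tiny parameters are all hypothetical and chosen only to keep the example short.

```python
import cmath
import math

def dft(seq):
    """Naive O(n^2) DFT; stands in for an FFT kernel in this sketch."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fam(x, n_p, hop):
    """Return scd[k1][k2][q] for channel pair (k1, k2) and FFT2 bin q."""
    window = [0.54 - 0.46 * math.cos(2 * math.pi * n / (n_p - 1)) for n in range(n_p)]
    n_frames = (len(x) - n_p) // hop + 1

    channels = []
    for m in range(n_frames):
        # Stage 1 (Framing, assumed to include windowing and the first transform).
        frame = [window[n] * x[m * hop + n] for n in range(n_p)]
        spectrum = dft(frame)
        # Stage 2 (Demodulate): remove the phase advance caused by the frame offset.
        channels.append([spectrum[k] * cmath.exp(-2j * math.pi * k * m * hop / n_p)
                         for k in range(n_p)])

    # Stage 3 (FFT2): transform every channel-pair product along the frame index.
    scd = [[None] * n_p for _ in range(n_p)]
    for k1 in range(n_p):
        for k2 in range(n_p):
            prod = [channels[m][k1] * channels[m][k2].conjugate()
                    for m in range(n_frames)]
            scd[k1][k2] = dft(prod)
    return scd

# Usage: a complex tone centred on channel K0 should peak at (K0, K0, q=0).
N_P, HOP, K0 = 8, 2, 2
x = [cmath.exp(2j * math.pi * K0 * n / N_P) for n in range(32)]
scd = fam(x, N_P, HOP)
peak = max((abs(v), k1, k2, q)
           for k1, row in enumerate(scd)
           for k2, col in enumerate(row)
           for q, v in enumerate(col))
print(peak[1:])
```

The innermost loop over channel pairs is where the work grows quadratically with $N_P$, which is consistent with the summary's emphasis on running many FFT2 kernels concurrently.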

SSCA on Versal

The SSCA implementation targets large-window, high-resolution analysis. The key innovation is the use of a 2D FFT decomposition, which allows the computation to be distributed across the AIE array and PL, with intermediate results stored in off-chip DDR when necessary.

2D FFT Decomposition: The $N$-point FFT is factored as $N = M_1 \times M_2$ and computed as smaller $M_1$-point and $M_2$-point sub-FFTs, enabling efficient mapping to the AIE array and reducing the memory footprint of intermediate matrices.
PL-AIE Collaboration: The PL handles data transposition and manages high-bandwidth data transfers between DDR and AIE, using ping-pong buffers to hide latency and maintain continuous dataflow.
Scalability: The design supports $N$ up to $2^{20}$ and $N_P$ up to 256, with the number of AIE tiles scaling logarithmically with $N_P$ and the FFT sizes.
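
The decomposition can be sketched with the standard Cooley-Tukey factorization, in which the input is viewed as an $M_1 \times M_2$ matrix, one dimension is transformed, twiddle factors are applied, the matrix is transposed, and the other dimension is transformed. The sketch below is a generic textbook version under that assumption, not the paper's kernel code; `dft` and `fft_2d_decomposed` are hypothetical names, and a naive DFT stands in for the sub-FFT kernels.

```python
import cmath
import math

def dft(seq):
    """Naive O(n^2) DFT used as the reference and as the sub-transform."""
    n = len(seq)
    return [sum(seq[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def fft_2d_decomposed(x, m1, m2):
    """Compute an N-point DFT (N = m1 * m2) from m1-point and m2-point DFTs."""
    n = m1 * m2
    assert len(x) == n
    # View x as an m1 x m2 matrix with x[m2 * n1 + n2] at row n1, column n2,
    # and run an m1-point DFT down each column.
    cols = [dft([x[m2 * n1 + n2] for n1 in range(m1)]) for n2 in range(m2)]
    # Twiddle step: scale element (n2, k1) by W_N^(n2 * k1).
    for n2 in range(m2):
        for k1 in range(m1):
            cols[n2][k1] *= cmath.exp(-2j * math.pi * n2 * k1 / n)
    # Transpose so each k1 owns a contiguous run over n2, then m2-point DFTs.
    rows = [dft([cols[n2][k1] for n2 in range(m2)]) for k1 in range(m1)]
    # Output bin k = k1 + m1 * k2.
    out = [0j] * n
    for k1 in range(m1):
        for k2 in range(m2):
            out[k1 + m1 * k2] = rows[k1][k2]
    return out

# Usage: compare against the direct 12-point DFT.
x = [complex(n, -(n % 3)) for n in range(12)]
max_err = max(abs(a - b) for a, b in zip(fft_2d_decomposed(x, 3, 4), dft(x)))
print(max_err < 1e-9)
```

The explicit transpose between the two passes is the step the summary assigns to the PL-based transpose units; here it is just a list comprehension.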

Experimental Results

The implementations were benchmarked against CPU (Intel Xeon Silver 4208) and GPU (NVIDIA RTX 3090) baselines. Key results include:

Speedup: For FAM, the Versal implementation achieves a 4.43x speedup over the RTX 3090 GPU and 308x over the CPU. For SSCA, the speedup is 1.90x over GPU and 99x over CPU.
Energy Efficiency: The FAM and SSCA designs are 30.5x and 24.5x more energy efficient than the GPU, respectively, as measured by power consumption during execution.
Resource Utilization: The FAM design uses 34% of AIE tiles and minimal PL resources, while the SSCA design uses only 3.75% of AIE tiles but more BRAM/URAM for buffering.
Accuracy: Both implementations achieve high numerical accuracy, with average relative errors on the order of $10^{-5}$ (FAM) and $10^{-6}$ (SSCA) compared to double-precision MATLAB and C++ baselines.
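
A short back-of-the-envelope check relates these figures to each other. It rests on two assumptions that are not stated in the summary: that the VCK5000's device exposes 400 AIE tiles, and that the energy-efficiency ratio equals the speedup multiplied by the GPU-to-FPGA power ratio (energy = power × time). The derived power ratios are therefore implied values, not measurements reported by the authors.

```python
# Assumption: 400 AIE tiles in total on the VCK5000's device.
TOTAL_AIE_TILES = 400

# Tile counts and utilisation quoted in the summary.
fam_tiles = 137
fam_utilisation = fam_tiles / TOTAL_AIE_TILES   # 0.3425, i.e. about 34%
ssca_tiles = 0.0375 * TOTAL_AIE_TILES           # about 15 tiles

# Assumption: efficiency_gain = speedup * (gpu_power / fpga_power),
# so the implied power ratio is efficiency_gain / speedup.
fam_power_ratio = 30.5 / 4.43    # about 6.9
ssca_power_ratio = 24.5 / 1.90   # about 12.9

print(f"FAM tile utilisation: {fam_utilisation:.2%}")
print(f"SSCA tiles: {round(ssca_tiles)}")
print(f"Implied GPU/FPGA power ratio, FAM: {fam_power_ratio:.2f}")
print(f"Implied GPU/FPGA power ratio, SSCA: {ssca_power_ratio:.2f}")
```

Under these assumptions, most of the energy-efficiency advantage comes from lower power draw rather than from raw speedup, particularly for SSCA.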

Implications and Future Directions

Practical Implications

Real-Time Cyclostationary Analysis: The demonstrated throughput and energy efficiency make these designs suitable for deployment in real-time RFML systems, where SCD-based feature extraction is a bottleneck.
Integration with SDR/ML Pipelines: The AIE-only FAM implementation leaves PL resources available for SDR front-ends or ML inference engines, facilitating end-to-end RFML solutions on a single device.
Scalability: The generalized parallelization methodology and modular design enable adaptation to larger window sizes, higher channel counts, or future Versal devices with more AIE tiles and memory bandwidth.

Theoretical Implications

Algorithm-Architecture Co-Design: The work exemplifies the importance of co-designing algorithms and hardware architectures, particularly in mapping high-complexity signal processing algorithms to heterogeneous platforms.
2D FFT Decomposition for Large-Scale Analysis: The use of 2D FFTs for SSCA demonstrates a scalable approach to handling large data volumes, which could be extended to other spectral analysis tasks.

Future Developments

Dynamic Resource Allocation: Future work could explore dynamic partitioning of AIE and PL resources based on workload characteristics, enabling adaptive trade-offs between throughput, latency, and energy consumption.
Integration with ML Back-Ends: The designs are well-positioned for integration with on-chip ML accelerators, enabling joint signal detection, classification, and feature extraction in RFML applications.
Support for Sparse or Compressive SCD Estimation: Incorporating sparse or compressive algorithms could further reduce computational and memory requirements, especially for signals with sparse spectral correlation structures.

Conclusion

This paper demonstrates that the AMD Versal platform, with its heterogeneous architecture and high-performance AIE array, is well-suited for real-time, energy-efficient cyclostationary analysis using FAM and SSCA estimators. The presented methodologies and implementation strategies provide a blueprint for deploying advanced signal processing algorithms in next-generation RFML and SDR systems, with clear advantages in speed, scalability, and energy efficiency over conventional CPU and GPU solutions.

Authors

Carol Jingyi Li
Ruilin Wu
Philip H. W. Leong
