High-Performance FPGA Implementations of FAM and SSCA Estimators on AMD Versal
This paper presents a comprehensive study of high-speed, energy-efficient FPGA implementations of two widely used spectral correlation density (SCD) estimation techniques, the FFT Accumulation Method (FAM) and the Strip Spectral Correlation Analyzer (SSCA), on the AMD Versal VCK5000 platform. The work addresses the persistent challenge of real-time cyclostationary analysis, which remains computationally intensive even with FFT-based algorithms, by exploiting the heterogeneous architecture of the Versal Adaptive SoC, particularly its AI Engine (AIE) array.
Technical Contributions
The authors make three main contributions:
- AIE-Only FAM Implementation: The first reported FAM implementation that operates entirely within the AIE array, requiring about 34% of the available AIE tiles (137 tiles) and no programmable logic (PL) resources. This design choice preserves PL resources for potential integration with software-defined radio (SDR) front-ends or ML back-ends, which is critical for radio frequency machine learning (RFML) pipelines.
- Efficient SSCA Implementation: A novel SSCA design that supports window sizes up to samples, using a decomposed 2D FFT and PL-based transpose units to manage large intermediate matrices. This is the first reported FPGA-accelerated SSCA implementation.
- Generalized Parallelization Methodology: A systematic approach to parallelizing both FAM and SSCA computations, balancing memory and bandwidth constraints inherent to the Versal architecture.
Implementation Details
FAM on AIE
The FAM implementation is optimized for small-window, high-throughput scenarios. The design is partitioned into three pipeline stages: Framing, Demodulate, and FFT2. The entire pipeline is mapped to the AIE array, exploiting its high internal bandwidth and minimizing data movement between AIE and PL.
- Resource Allocation: For typical parameters (, ), 137 AIE tiles are used, with careful buffer management (ping-pong buffering) to maximize throughput within the 16 KB per-tile memory constraint.
- Parallelism: The architecture supports parallel processing of multiple frequency channels, with up to 128 FFT2 kernels operating concurrently.
- Dataflow Optimization: The design minimizes off-chip memory accesses and maximizes on-chip data reuse, which is essential for achieving high energy efficiency.
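The three-stage FAM pipeline can be illustrated with a minimal, textbook-style sketch in pure Python. This is not the authors' AIE kernel code. The function names (`fft`, `fam`), the parameter names (`n_chan`, `hop`, `p`), the Hamming window, and the mapping of the stage names Framing, Demodulate, and FFT2 onto the steps below are all assumptions made for illustration. Normalization and the mapping of output indices to (frequency, cycle frequency) pairs are omitted.

```python
import cmath
import math


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def fam(x, n_chan=8, hop=2, p=16):
    """Unnormalized FAM sketch; returns scd[k][l][q] (assumed layout)."""
    if len(x) < (p - 1) * hop + n_chan:
        raise ValueError("input too short for the requested frame count")
    # Framing (assumed): p overlapping Hamming-windowed frames, each
    # passed through a short n_chan-point FFT (the channelizer).
    window = [0.54 - 0.46 * math.cos(2 * math.pi * m / (n_chan - 1))
              for m in range(n_chan)]
    frames = [fft([window[m] * x[i * hop + m] for m in range(n_chan)])
              for i in range(p)]
    # Demodulate: remove the phase advance of channel k at frame i so
    # that every channel is shifted to baseband.
    demod = [[frames[i][k] * cmath.exp(-2j * math.pi * k * i * hop / n_chan)
              for k in range(n_chan)] for i in range(p)]
    # FFT2: for each channel pair (k, l), multiply one channel by the
    # conjugate of the other across frames and take a p-point FFT.
    return [[fft([demod[i][k] * demod[i][l].conjugate() for i in range(p)])
             for l in range(n_chan)] for k in range(n_chan)]


# Usage: a complex tone centred on channel 2 of an 8-channel channelizer.
tone = [cmath.exp(2j * math.pi * 2 * n / 8) for n in range(64)]
scd = fam(tone, n_chan=8, hop=2, p=16)
peak_channel = max(range(8), key=lambda k: abs(scd[k][k][0]))
```

In this sketch the FFT2 stage consists of `n_chan * n_chan` independent p-point FFTs, which is the kind of work that lends itself to many concurrent kernels.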
SSCA on Versal
The SSCA implementation targets large-window, high-resolution analysis. The key innovation is the use of a 2D FFT decomposition, which allows the computation to be distributed across the AIE array and PL, with intermediate results stored in off-chip DDR when necessary.
- 2D FFT Decomposition: The -point FFT is split into sub-FFTs, enabling efficient mapping to the AIE array and reducing the memory footprint of intermediate matrices.
- PL-AIE Collaboration: The PL handles data transposition and manages high-bandwidth data transfers between DDR and AIE, using ping-pong buffers to hide latency and maintain continuous dataflow.
- Scalability: The design supports up to and up to 256, with the number of AIE tiles scaling logarithmically with and the FFT sizes.
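The decomposition of one long FFT into two passes of shorter FFTs separated by a transpose can be sketched with the standard Cooley-Tukey factorization. The summary does not give the exact factorization or index mapping used in the paper, so the layout below is an assumption, as are the names `dft` and `fft_2d_decomposed`. A direct DFT stands in for the sub-FFT kernels to keep the example short.

```python
import cmath
import math


def dft(x):
    """Direct O(n^2) DFT, used as both the sub-transform and the reference."""
    n = len(x)
    return [sum(x[m] * cmath.exp(-2j * math.pi * k * m / n) for m in range(n))
            for k in range(n)]


def fft_2d_decomposed(x, n1, n2):
    """Length n1*n2 DFT computed as two passes of short DFTs."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # View the input as an n1 x n2 matrix: rows[r][c] = x[r + n1 * c].
    rows = [[x[r + n1 * c] for c in range(n2)] for r in range(n1)]
    # First pass: n1 independent length-n2 transforms.
    rows = [dft(row) for row in rows]
    # Twiddle factors couple the two passes.
    rows = [[rows[r][c] * cmath.exp(-2j * math.pi * r * c / n)
             for c in range(n2)] for r in range(n1)]
    # Transpose, so the second pass also runs along contiguous rows.
    cols = [[rows[r][c] for r in range(n1)] for c in range(n2)]
    # Second pass: n2 independent length-n1 transforms.
    cols = [dft(col) for col in cols]
    # Output bin n2 * k1 + c is held at cols[c][k1].
    out = [0j] * n
    for c in range(n2):
        for k1 in range(n1):
            out[n2 * k1 + c] = cols[c][k1]
    return out


# Usage: a 24-point transform split as 4 x 6, checked against the direct DFT.
signal = [complex(n % 5, (3 * n) % 7) for n in range(24)]
max_error = max(abs(a - b)
                for a, b in zip(dft(signal), fft_2d_decomposed(signal, 4, 6)))
```

Each pass only ever touches one short row at a time, which is why the full intermediate matrix can live outside the compute tiles and why the transpose step becomes the natural place for a dedicated data-movement unit.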
Experimental Results
The implementations were benchmarked against CPU (Intel Xeon Silver 4208) and GPU (NVIDIA RTX 3090) baselines. Key results include:
- Speedup: For FAM, the Versal implementation achieves a 4.43x speedup over the RTX 3090 GPU and 308x over the CPU. For SSCA, the speedup is 1.90x over GPU and 99x over CPU.
- Energy Efficiency: The FAM and SSCA designs are 30.5x and 24.5x more energy efficient than the GPU, respectively, as measured by power consumption during execution.
- Resource Utilization: The FAM design uses 34% of AIE tiles and minimal PL resources, while the SSCA design uses only 3.75% of AIE tiles but more BRAM/URAM for buffering.
- Accuracy: Both implementations achieve high numerical accuracy, with average relative errors on the order of (FAM) and (SSCA) compared to double-precision MATLAB and C++ baselines.
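The reported ratios can be cross-checked with simple arithmetic. The sketch below rests on two assumptions that are not stated in the summary: that the device exposes 400 AIE tiles in total, and that energy is average power multiplied by runtime for the same workload, so that an energy-efficiency gain factors into a speedup and a power ratio. The derived values are implied by the numbers above and are not measurements reported by the authors.

```python
AIE_TILES_TOTAL = 400  # assumption: total AIE tile count of the target device


def tile_percentage(tiles_used, total=AIE_TILES_TOTAL):
    """Share of the AIE array occupied, in percent."""
    return 100.0 * tiles_used / total


def implied_power_ratio(energy_gain, speedup):
    """If energy = power * time, then energy_gain = power_ratio * speedup."""
    return energy_gain / speedup


# Tile usage: 137 tiles for FAM, and the tile count implied by 3.75% for SSCA.
fam_tile_pct = tile_percentage(137)
ssca_tiles = 3.75 / 100.0 * AIE_TILES_TOTAL

# Implied GPU-to-Versal average power ratios under the energy model above.
fam_power_ratio = implied_power_ratio(30.5, 4.43)
ssca_power_ratio = implied_power_ratio(24.5, 1.90)

# Implied GPU-over-CPU speedups from the two reported speedup figures.
gpu_over_cpu_fam = 308 / 4.43
gpu_over_cpu_ssca = 99 / 1.90
```

Under these assumptions the FAM design occupies roughly 34% of the array, the SSCA design about 15 tiles, and most of the energy advantage comes from lower power draw rather than from speedup alone.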
Implications and Future Directions
Practical Implications
- Real-Time Cyclostationary Analysis: The demonstrated throughput and energy efficiency make these designs suitable for deployment in real-time RFML systems, where SCD-based feature extraction is a bottleneck.
- Integration with SDR/ML Pipelines: The AIE-only FAM implementation leaves PL resources available for SDR front-ends or ML inference engines, facilitating end-to-end RFML solutions on a single device.
- Scalability: The generalized parallelization methodology and modular design enable adaptation to larger window sizes, higher channel counts, or future Versal devices with more AIE tiles and memory bandwidth.
Theoretical Implications
- Algorithm-Architecture Co-Design: The work exemplifies the importance of co-designing algorithms and hardware architectures, particularly in mapping high-complexity signal processing algorithms to heterogeneous platforms.
- 2D FFT Decomposition for Large-Scale Analysis: The use of 2D FFTs for SSCA demonstrates a scalable approach to handling large data volumes, which could be extended to other spectral analysis tasks.
Future Developments
- Dynamic Resource Allocation: Future work could explore dynamic partitioning of AIE and PL resources based on workload characteristics, enabling adaptive trade-offs between throughput, latency, and energy consumption.
- Integration with ML Back-Ends: The designs are well-positioned for integration with on-chip ML accelerators, enabling joint signal detection, classification, and feature extraction in RFML applications.
- Support for Sparse or Compressive SCD Estimation: Incorporating sparse or compressive algorithms could further reduce computational and memory requirements, especially for signals with sparse spectral correlation structures.
Conclusion
This paper demonstrates that the AMD Versal platform, with its heterogeneous architecture and high-performance AIE array, is well-suited for real-time, energy-efficient cyclostationary analysis using FAM and SSCA estimators. The presented methodologies and implementation strategies provide a blueprint for deploying advanced signal processing algorithms in next-generation RFML and SDR systems, with clear advantages in speed, scalability, and energy efficiency over conventional CPU and GPU solutions.