
A 1024 RV-Cores Shared-L1 Cluster with High Bandwidth Memory Link for Low-Latency 6G-SDR (2408.08882v1)

Published 4 Aug 2024 in cs.DC

Abstract: We introduce an open-source architecture for next-generation Radio-Access Network baseband processing: 1024 latency-tolerant 32-bit RISC-V cores share 4 MiB of L1 memory via an ultra-low latency interconnect (7-11 cycles); a modular Direct Memory Access engine provides an efficient link to a high bandwidth memory, such as HBM2E (98% peak bandwidth at 910 GBps). The system achieves leading-edge energy efficiency at sub-ms latency in key 6G baseband processing kernels: Fast Fourier Transform (93 GOPS/W), Beamforming (125 GOPS/W), Channel Estimation (96 GOPS/W), and Linear System Inversion (61 GOPS/W), with only 9% data movement overhead.

Summary

  • The paper introduces a 1024-core RISC-V design with a shared 4 MiB L1 memory achieving 7-11 cycle latency for 6G-SDR workloads.
  • It incorporates a modular DMA engine with High Bandwidth Memory, attaining 910 GBps peak throughput and reducing data transfer overhead to just 9%.
  • The architecture delivers exceptional energy efficiency, with up to 125 GOPS/W for key baseband kernels, enabling sub-millisecond processing latencies.

The paper presents an architecture for Radio-Access Network (RAN) baseband processing in which 1024 RISC-V cores share a Level-1 (L1) memory, delivering ultra-low latency and high energy efficiency for 6G Software-Defined Radio (SDR) workloads. The design targets the demanding computational and latency requirements of next-generation mobile networks.

System Architecture and Design

The architecture comprises 1024 32-bit RISC-V cores sharing a 4 MiB multi-banked L1 Scratchpad Memory (SPM). A hierarchical crossbar interconnect keeps L1 access latency between 7 and 11 cycles. The design is implemented in GlobalFoundries' 12 nm FinFET technology and reaches operating frequencies of up to 924 MHz under nominal conditions.
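
To illustrate how a multi-banked scratchpad of this kind is typically addressed, the minimal C sketch below maps a byte address to a bank index and a bank-local offset under word-level interleaving. The bank count, interleaving granularity, and helper names are assumptions made for this sketch, not details taken from the paper.

    #include <stdint.h>

    /* Illustrative parameters: 4 MiB of L1 split into 1024 word-interleaved
     * banks (4 KiB each). These figures are assumptions for the sketch, not
     * values specified in the paper summary. */
    #define L1_BYTES   (4u * 1024u * 1024u)
    #define NUM_BANKS  1024u
    #define WORD_BYTES 4u               /* 32-bit words */

    typedef struct {
        uint32_t bank;    /* which SPM bank the word lives in */
        uint32_t offset;  /* word offset inside that bank     */
    } spm_location_t;

    /* Map a byte address in the shared L1 SPM to its bank and bank-local
     * offset, assuming consecutive 32-bit words rotate across banks so that
     * unit-stride accesses from the cores spread evenly over the crossbar. */
    static inline spm_location_t spm_locate(uint32_t byte_addr)
    {
        uint32_t word = byte_addr / WORD_BYTES;
        spm_location_t loc = {
            .bank   = word % NUM_BANKS,
            .offset = word / NUM_BANKS,
        };
        return loc;
    }

Word-level interleaving of this sort is a common choice for shared-L1 clusters because it spreads unit-stride traffic across banks, reducing contention on the crossbar.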

To manage data movement, the system incorporates a modular Direct Memory Access (DMA) engine connected to a High Bandwidth Memory (HBM) main storage, sustaining 98% of peak bandwidth at 910 GBps. The large L1 capacity allows bulk transfers to be overlapped with computation, hiding the HBM's access latency of roughly 130 cycles from the cores and keeping data-movement overhead to just 9%.
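
To make the latency-hiding scheme concrete, the C sketch below shows a generic double-buffering loop: while the cores process one tile resident in L1, the DMA engine streams the next tile from HBM. The dma_start/dma_wait calls, the process_tile kernel, and the buffer layout are placeholders assumed for illustration, not the driver interface described in the paper.

    #include <stddef.h>
    #include <stdint.h>

    /* Placeholder DMA driver API -- names and signatures are assumptions for
     * this sketch, not the interface documented in the paper. */
    typedef int dma_handle_t;
    dma_handle_t dma_start(void *l1_dst, const void *hbm_src, size_t bytes);
    void         dma_wait(dma_handle_t h);

    void process_tile(int32_t *tile, size_t n);  /* compute kernel on one tile */

    /* Stream `num_tiles` tiles of `tile_words` 32-bit words from HBM through
     * two L1 buffers, overlapping each transfer with computation on the
     * previously fetched tile so the ~130-cycle HBM latency stays hidden. */
    void stream_tiles(const int32_t *hbm_data, size_t num_tiles,
                      size_t tile_words, int32_t *l1_buf0, int32_t *l1_buf1)
    {
        int32_t *bufs[2] = { l1_buf0, l1_buf1 };
        size_t bytes = tile_words * sizeof(int32_t);

        /* Prefetch the first tile before entering the steady-state loop. */
        dma_handle_t pending = dma_start(bufs[0], hbm_data, bytes);

        for (size_t t = 0; t < num_tiles; t++) {
            dma_wait(pending);                      /* tile t is now in L1   */
            if (t + 1 < num_tiles) {                /* kick off tile t + 1   */
                pending = dma_start(bufs[(t + 1) % 2],
                                    hbm_data + (t + 1) * tile_words, bytes);
            }
            process_tile(bufs[t % 2], tile_words);  /* compute while DMA runs */
        }
    }

As long as each tile's compute time exceeds its transfer time, the DMA traffic is fully overlapped and only the first prefetch is exposed, which is consistent with the low data-movement overhead reported above.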

Benchmark Performance on 6G-SDR Kernels

In evaluating key 6G baseband processing kernels, the system demonstrates strong performance on Fast Fourier Transform (FFT), Beamforming, Channel Estimation, and Linear System Inversion, achieving energy efficiencies of 93 GOPS/W for FFT, 125 GOPS/W for Beamforming, 96 GOPS/W for Channel Estimation, and 61 GOPS/W for Linear System Inversion.

Simulated results underline the system's effectiveness, achieving less than 1 ms latency in processing massive MIMO configurations. Notably, the architecture maintains sub-70 microsecond latencies for data symbols while keeping the power consumption of the cluster under 8.8 watts, showcasing its suitability for energy-efficient 6G baseband operations.
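
As a rough cross-check relating the reported figures (assuming the beamforming efficiency and the cluster power refer to the same operating point, which the summary does not state explicitly):

    sustained throughput ≈ 125 GOPS/W × 8.8 W ≈ 1100 GOPS ≈ 1.1 TOPS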

Implications and Future Directions

The deployment of such a many-core System-on-Chip (SoC) marks a significant step toward dynamic, flexible, and energy-efficient 6G network infrastructure. The ability to meet the tight latency constraints of modern wireless standards on scalable, programmable hardware underscores the value of the platform's open-source nature.

Potential future work may focus on refining the architecture to further decrease power consumption and increase processing throughput. Furthermore, the integration of this system with more diverse or heterogeneous computational resources could be explored to bolster support for broader AI-driven 6G applications, facilitating advanced processing tasks such as real-time adaptive beamforming and improved mobile edge computing capabilities.

In conclusion, the paper presents a compelling approach to 6G baseband processing, leveraging high core counts and a shared-L1 architecture to meet pressing computational demands. The work provides a solid framework for continued innovation in telecom infrastructure toward the next generation of wireless communication technologies.