A 410GFLOP/s, 64 RISC-V Cores, 204.8GBps Shared-Memory Cluster in 12nm FinFET with Systolic Execution Support for Efficient B5G/6G AI-Enhanced O-RAN (2509.08608v1)

Published 10 Sep 2025 in cs.DC

Abstract: We present HeartStream, a 64-RV-core shared-L1-memory cluster (410 GFLOP/s peak performance and 204.8 GBps L1 bandwidth) for energy-efficient AI-enhanced O-RAN. The cores and cluster architecture are customized for baseband processing, supporting complex (16-bit real&imaginary) instructions: multiply&accumulate, division&square-root, SIMD instructions, and hardware-managed systolic queues, improving the energy efficiency of key baseband kernels by up to 1.89x. At 0.8 V, HeartStream delivers up to 243 GFLOP/s on complex-valued wireless workloads. Furthermore, the cores also support efficient AI processing on received data at up to 72 GOP/s. HeartStream is fully compatible with base station power and processing latency limits: it achieves leading-edge software-defined PUSCH efficiency (49.6 GFLOP/s/W) and consumes just 0.68 W (645 MHz @ 0.65 V), within the 4 ms end-to-end constraint for B5G/6G uplink.

Summary

The paper introduces HeartStream, a fully programmable 64-core RISC-V cluster achieving 410 GFLOP/s peak performance and 204.8GBps L1 bandwidth for B5G/6G O-RAN.
It employs a hierarchical shared-memory design and flexible systolic execution to reduce memory overhead and accelerate matrix operations for baseband and AI tasks.
The processor delivers high energy efficiency (up to 213 GFLOP/s/W) and low-latency (<4 ms) performance, supporting diverse MIMO and deep learning workloads.

HeartStream: A 64-Core RISC-V Shared-Memory Cluster for Energy-Efficient B5G/6G AI-Enhanced O-RAN

Introduction and Motivation

The transition from 5G Cloud RAN to 6G O-RAN necessitates open, programmable hardware/software platforms capable of supporting distributed intelligence and multi-vendor interoperability. The computational and energy demands of base stations are intensifying due to requirements for >20 Gbps uplink throughput and <4 ms end-to-end latency, especially as AI-native processing converges with wireless workloads. HeartStream addresses these challenges by delivering a fully programmable, energy-efficient, and high-performance manycore cluster tailored for B5G/6G O-RAN edge deployments.

Figure 1: Energy-efficient open-HW/SW designs support B5G/6G O-RAN demanding use-case scenarios and a wide range of applications.

Architecture Overview

HeartStream comprises 64 RISC-V cores organized in a hierarchical shared-L1-memory cluster, achieving 410 GFLOP/s peak performance and 204.8 GBps L1 bandwidth in 12nm FinFET. The memory subsystem features 256×1 KiB banks, interleaved across 16 Tiles in 4 Groups, enabling 1–5 cycle low-latency access. Each Tile integrates four Core Complexes (CCs), each with a 32b RISC-V core, Integer Processing Unit (IPU), Floating Point Sub-System (FP-SS), and Systolic Queue-Linked Register (QLR). A Tile-shared FP division/square-root unit accelerates matrix inversion for MIMO detection.

Figure 2: HeartStream's 64 RISC-V cores shared-L1-memory hierarchical design architecture. L1 memory addresses are 32-bit interleaved across banks of 16 Tiles in 4 Groups. Each Tile's cores share an FP division/square-root unit. Core-Complex includes a 32b RISC-V core, IPU, FP-SS, and Systolic QLR.
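The bank geometry above can be made concrete with a small address-mapping sketch. It assumes plain word interleaving, where consecutive 32-bit words land in consecutive banks and banks are numbered Tile by Tile. The function name `locate` and the exact bit assignment are hypothetical; the summary only states the bank count, the bank size, and the Tile/Group hierarchy.

```python
# Geometry taken from the text: 256 banks of 1 KiB, 16 Tiles, 4 Groups.
WORD_BYTES = 4                                # 32-bit interleaving granularity
NUM_BANKS = 256
BANK_BYTES = 1024
NUM_TILES = 16
NUM_GROUPS = 4
BANKS_PER_TILE = NUM_BANKS // NUM_TILES       # 16 banks per Tile
TILES_PER_GROUP = NUM_TILES // NUM_GROUPS     # 4 Tiles per Group
L1_BYTES = NUM_BANKS * BANK_BYTES             # 256 KiB in total


def locate(byte_addr):
    """Map an L1 byte offset to (group, tile, bank, row).

    Assumed mapping (not confirmed by the source): consecutive 32-bit
    words go to consecutive banks, and banks are numbered Tile by Tile.
    """
    if not 0 <= byte_addr < L1_BYTES:
        raise ValueError("address outside the 256 KiB L1 region")
    word = byte_addr // WORD_BYTES
    bank = word % NUM_BANKS          # low word bits select the bank
    row = word // NUM_BANKS          # remaining bits select the row in the bank
    tile = bank // BANKS_PER_TILE
    group = tile // TILES_PER_GROUP
    return group, tile, bank, row


# A unit-stride vector of 256 words touches every bank exactly once.
banks_hit = {locate(4 * i)[2] for i in range(NUM_BANKS)}
```

Under this assumed mapping, unit-stride accesses from different cores spread across all banks, which is the usual reason for choosing word-level interleaving in a shared L1.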

The die micrograph demonstrates a compact 5 mm² implementation with 65% logic cell utilization in the core area.

Figure 3: Die micrograph and design summary. HeartStream was implemented in GlobalFoundries' 12nm FinFET technology on a 5 mm² die. It achieves a high logic cell utilization of 65% in the core area.

Systolic Execution and Data Movement

A key innovation is hardware-supported flexible systolic execution with programmable topology. QLRs enable implicit inter-core register-file communication, eliminating explicit memory and control instructions. This is particularly effective for matrix multiplication (MatMul) and complex FFT (CFFT), where data is streamed between cores via QLRs, reducing synchronization and memory access overhead.

Figure 4: In a systolic MatMul, cores at the edge of the topology fetch operands from L1 and forward them through QLR/memory queues. The pseudocode shows that this implicit inter-core communication removes many memory access and control instructions, boosting performance.
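The data flow described above can be illustrated with a small software model. The sketch below is a sequential simulation of an output-stationary systolic MatMul, with Python `deque` objects standing in for the hardware queues. It is not the paper's kernel or its QLR programming interface; the grid shape, the function name `systolic_matmul`, and the load counter are assumptions chosen for illustration.

```python
from collections import deque


def systolic_matmul(a, b):
    """Output-stationary systolic MatMul on an M x N grid of simulated cores.

    Only west-edge and north-edge cores read operands from "L1"; every
    other core receives them through FIFO queues from its neighbours.
    Returns (product, number_of_l1_loads).
    """
    m, k, n = len(a), len(b), len(b[0])
    # west[i][j] feeds core (i, j) from the left, north[i][j] from above.
    west = [[deque() for _ in range(n)] for _ in range(m)]
    north = [[deque() for _ in range(n)] for _ in range(m)]
    acc = [[0] * n for _ in range(m)]
    l1_loads = 0
    for step in range(k):
        for i in range(m):
            for j in range(n):
                if j == 0:                       # west edge: fetch A from L1
                    a_val = a[i][step]
                    l1_loads += 1
                else:                            # interior: pop from neighbour
                    a_val = west[i][j].popleft()
                if i == 0:                       # north edge: fetch B from L1
                    b_val = b[step][j]
                    l1_loads += 1
                else:
                    b_val = north[i][j].popleft()
                acc[i][j] += a_val * b_val       # the only "useful" instruction
                if j + 1 < n:
                    west[i][j + 1].append(a_val)   # forward A to the east
                if i + 1 < m:
                    north[i + 1][j].append(b_val)  # forward B to the south
    return acc, l1_loads


product, loads = systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
naive_loads = 2 * 2 * 2 * 2   # two operand loads per multiply-accumulate
```

In this model the number of L1 loads is K x (M + N), compared with 2 x M x N x K when every core loads both operands for every multiply-accumulate. On real hardware the cores run concurrently and the queues also provide synchronization, which this sequential loop does not capture.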

Performance and Energy Efficiency

HeartStream achieves up to 243 GFLOP/s on complex-valued baseband workloads at 0.8 V, with IPCs ranging from 0.52 to 0.88. On deep learning integer benchmarks (MatMul, Conv2D, DotP), IPCs reach 0.84 to 0.96 and throughput reaches up to 72 GOP/s. Systolic kernels demonstrate higher compute utilization and fewer overhead instructions.

Figure 5: Absolute runtime and instruction/stall cycle fractions for 16-bit complex (real/imaginary) baseband and 32-bit integer deep learning kernels. Systolic kernels achieve higher compute utilization and performance by reducing overhead instructions.

PUSCH processing, a latency-critical baseband task, is efficiently mapped to HeartStream. For a typical TTI with 14 symbols and 1024 subcarriers, the pipeline includes CFFT, MatMul, DMRS-based channel estimation, and MMSE equalization. Systolic extensions yield 50% and 12% runtime reduction for OFDM and beamforming, respectively.

Figure 6: The baseband PUSCH processing steps for a transmission time interval with 14 symbols and 1024 sub-carriers (SC) at 15 kHz spacing, and the compute runtime breakdown for different $N_{RX}\times N_{TX}$ scenarios.
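The last pipeline stage, MMSE equalization, can be sketched per sub-carrier using the standard linear MMSE estimate x_hat = (H^H H + sigma^2 I)^-1 H^H y. The code below is a minimal pure-Python illustration in double precision. It uses Gauss-Jordan elimination for the linear solve as a simple stand-in; the summary does not say which inversion method HeartStream's kernels use, and the helper names are hypothetical.

```python
def hermitian(mat):
    """Conjugate transpose of a complex matrix given as a list of rows."""
    return [[mat[r][c].conjugate() for r in range(len(mat))]
            for c in range(len(mat[0]))]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + [b[i]] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]   # divisions concentrate here
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] for i in range(n)]


def mmse_equalize(h, y, noise_var):
    """Linear MMSE estimate of the transmitted symbols for one sub-carrier."""
    hh = hermitian(h)
    gram = matmul(hh, h)                       # H^H H
    for i in range(len(gram)):
        gram[i][i] += noise_var                # + sigma^2 I
    rhs = [row[0] for row in matmul(hh, [[v] for v in y])]   # H^H y
    return solve(gram, rhs)


# Noise-free 2x2 example: the estimate should reproduce the sent symbols.
h = [[1 + 0j, 0.5j], [0.2 + 0j, 1 - 0.3j]]
x = [1 + 1j, -1 + 1j]
y = [sum(h[i][j] * x[j] for j in range(2)) for i in range(2)]
x_hat = mmse_equalize(h, y, noise_var=0.0)
```

The solve step is where divisions appear, which is consistent with the text's remark that the Tile-shared division/square-root unit accelerates matrix inversion for MIMO detection.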

Energy efficiency is maximized at low-voltage operation (645 MHz @ 0.65V), achieving 33.2 GFLOP/s/W for OFDM and 213 GFLOP/s/W for beamforming, with up to 1.89× improvement due to systolic execution.

Figure 7: HeartStream's efficiency and performance on key kernels for baseband and deep learning processing. The different core supply voltages target high energy efficiency or high performance. The systolic extension improves energy efficiency by up to $1.89\times$.
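The efficiency figures can be turned into energy per operation with simple unit arithmetic, since 1 GFLOP/s/W equals 1 GFLOP per joule. The sketch below also multiplies the abstract's PUSCH efficiency (49.6 GFLOP/s/W) by its power figure (0.68 W); treating those two numbers as the same operating point is an assumption, and the function names are illustrative.

```python
def picojoules_per_flop(gflops_per_watt):
    """Energy per FLOP: 1 / (G GFLOP/J) = 1/G nJ = 1000/G pJ."""
    return 1000.0 / gflops_per_watt


def sustained_gflops(gflops_per_watt, watts):
    """Throughput implied by an efficiency figure at a given power draw."""
    return gflops_per_watt * watts


ofdm_pj = picojoules_per_flop(33.2)          # OFDM kernel, about 30.1 pJ/FLOP
beamforming_pj = picojoules_per_flop(213.0)  # beamforming, about 4.7 pJ/FLOP
# Assumes 49.6 GFLOP/s/W and 0.68 W describe the same PUSCH operating point.
pusch_gflops = sustained_gflops(49.6, 0.68)  # about 33.7 GFLOP/s
```

Read this way, a beamforming FLOP costs roughly one sixth of the energy of an OFDM FLOP at the low-voltage operating point.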

Runtime and Quality of Service

HeartStream sustains <4 ms end-to-end latency for uplink processing in 8×8 MIMO scenarios, supporting up to 32 antennas, 8 beams, and 8 users. Runtime and energy efficiency breakdowns for PUSCH steps show that low-voltage operation is preferable for low $N_{RX}\times N_{TX}$, while high-voltage operation is optimal for high $N_{RX}\times N_{TX}$.

Figure 8: Runtime and energy efficiency breakdown of PUSCH processing steps: low-voltage operation maximizes energy efficiency for low $N_{RX}\times N_{TX}$, while high-voltage operation delivers the performance needed for high $N_{RX}\times N_{TX}$.

Mixed-precision 16/32-bit floating-point extensions maintain BER/SNR parity with 64b golden models in 16×16 MIMO MMSE (AWGN channel), ensuring high QoS.

Figure 9: BER vs. SNR of a 16x16 MIMO MMSE (AWGN channel), implemented with mixed-precision 16/32-bit floating-point extensions, yields the same results as the 64b golden model. Lower SNR at a given BER indicates higher Quality of Service (QoS).
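Why mixed precision helps can be shown with a generic accumulation experiment. The sketch below emulates 16-bit and 32-bit floating-point rounding with Python's `struct` module and compares a dot product accumulated in 16 bits against one with 16-bit inputs and a 32-bit accumulator. This is an illustration of the general technique only, not HeartStream's datapath or its MMSE kernel; the helper names and the test vector are assumptions.

```python
import struct


def to_f16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_f32(x):
    """Round a Python float to IEEE 754 single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


def dot(xs, ys, round_acc):
    """Dot product whose products and running sum are rounded by round_acc."""
    acc = 0.0
    for x, y in zip(xs, ys):
        acc = round_acc(acc + round_acc(x * y))
    return acc


# 16-bit inputs in both cases; only the accumulator precision differs.
xs = [to_f16(0.1)] * 4096
ys = [to_f16(0.1)] * 4096
reference = sum(x * y for x, y in zip(xs, ys))   # double-precision reference
half_only = dot(xs, ys, to_f16)                  # 16-bit accumulator
mixed = dot(xs, ys, to_f32)                      # 32-bit accumulator

half_error = abs(half_only - reference)
mixed_error = abs(mixed - reference)
```

In this example the 16-bit accumulator stops growing once each addend falls below half a unit in the last place of the running sum, while the 32-bit accumulator stays close to the double-precision reference. That is the general motivation for keeping narrow inputs but widening the accumulator.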

Comparison with State-of-the-Art

HeartStream is the first open-source, fully programmable RISC-V O-RAN processor supporting full B5G/6G uplink and deep learning operators. It delivers the highest peak performance (410 GFLOP/s) and competitive throughput/energy efficiency compared to partially programmable and fixed-function ASIC/ASIP designs. Notably, it supports flexible MIMO sizes and diverse network scenarios, unlike prior solutions with inflexible datapaths.

Implementation and Deployment Considerations

HeartStream's open-source RTL and software stack facilitate integration into O-RAN edge deployments. The hierarchical shared-memory architecture and systolic execution model are well-suited for both baseband and AI workloads, enabling rapid adaptation to evolving standards and heterogeneous workloads. The design is compatible with base station power and latency constraints, consuming only 0.68 W at 645 MHz/0.65V, and is scalable for future increases in antenna and user counts.

Implications and Future Directions

HeartStream demonstrates that fully programmable, energy-efficient manycore clusters can meet the stringent requirements of B5G/6G O-RAN, including AI-native processing. The convergence of wireless and AI workloads is enabled by architectural flexibility, mixed-precision support, and systolic execution. Future work may explore scaling to larger clusters, integration with advanced memory technologies (e.g., HBM), and further specialization for emerging AI/ML baseband algorithms.

Conclusion

HeartStream establishes a new reference for programmable, energy-efficient baseband and AI processing in B5G/6G O-RAN. Its architectural innovations (hierarchical shared memory, systolic execution, and mixed-precision support) yield high performance, flexibility, and energy efficiency, positioning it as a viable platform for next-generation wireless edge deployments and AI-native RAN systems.