Zero-Conflict Memory Subsystem in RISC-V Clusters
- A zero-conflict memory subsystem is an architecture that eliminates bank conflicts by employing hyperbanked memory layouts and double-buffering techniques.
- It separates compute and DMA operations through dedicated hyperbanks and crossbar interconnects, ensuring uninterrupted, high-bandwidth memory access.
- Empirical results show improved FPU utilization (up to 99%), an 11% performance gain, and an 8% energy-efficiency boost compared to a prior state-of-the-art RISC-V cluster design.
A zero-conflict memory subsystem refers to a memory architecture, interconnect, or set of algorithmic and microarchitectural design techniques that guarantee the elimination of bank conflicts (situations in which multiple simultaneous memory requests target the same bank) under defined workload patterns and software conventions. In modern energy-efficient RISC-V clusters for machine learning acceleration, this concept is exemplified by systems that combine a double-buffering-aware memory layout with banked interconnects specifically engineered to decouple concurrent direct memory access (DMA) and compute requests, as described in "Towards Zero-Stall Matrix Multiplication on Energy-Efficient RISC-V Clusters for Machine Learning Acceleration" (arXiv:2506.10921).
1. Motivation: Memory Bank Conflicts in ML Accelerators
In cluster-based ML accelerators, multiple RISC-V cores (processing elements, PEs) share a tightly-coupled data memory (TCDM) or L1 scratchpad organized into numerous banks for concurrent access. Typical ML workloads, especially high-throughput matrix multiplication (GEMM), demand sustained low-latency, high-bandwidth reads and writes to on-chip memory. Even with multiple banks, bank conflicts (multiple requests hitting the same bank in one cycle) penalize throughput, leaving floating-point units (FPUs) idle and lowering the effective utilization of the system. Double buffering, commonly used to overlap computation with DMA data transfer, exacerbates this problem: although DMA and cores operate on different regions of memory, those regions may still map to the same physical banks, so their access patterns can inadvertently create further bank contention.
2. Double-Buffering-Aware Hyperbanked Memory Architecture
The zero-conflict memory subsystem introduced in the paper is based on structuring L1 memory into "hyperbanks," with careful mapping and disciplined operation. For example, in an 8-core cluster, L1 is divided into two hyperbanks, each with 24 banks (total 48 banks). At any time, one hyperbank services all compute requests (cores performing GEMM), while the second is dedicated solely to DMA (loading the next computation block or draining the previous one). The assignment between computation and DMA alternates (swaps) upon buffer exchange in the double buffering scheme.
A crossbar interconnect—one per hyperbank—routes core requests to the banks in their currently assigned hyperbank. DMA accesses route only to the banks in the other, non-compute hyperbank, ensuring that compute and DMA never contend for the same physical bank in a given phase.
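This routing discipline can be summarized with a minimal C sketch; the constants match the 8-core, 2×24-bank configuration described above, while the function and variable names (`target_hyperbank`, `swap_hyperbanks`, `compute_phase`) are illustrative rather than taken from the paper:

```c
#include <stdbool.h>

#define NUM_HYPERBANKS      2   /* double buffering: one hyperbank per buffer  */
#define BANKS_PER_HYPERBANK 24  /* 8 cores x 3 requests/cycle (see below)      */

/* Phase toggled on every buffer swap: 0 -> hyperbank 0 serves compute and
 * hyperbank 1 serves DMA; 1 -> the roles are exchanged.                   */
static int compute_phase = 0;

/* Which hyperbank a request is steered to by the per-hyperbank crossbars.
 * Compute requests always land in the "active" hyperbank, DMA requests in
 * the other one, so the two traffic classes never share a physical bank.  */
static inline int target_hyperbank(bool is_dma)
{
    int compute_hb = compute_phase;        /* hyperbank owned by the cores */
    return is_dma ? (compute_hb ^ 1) : compute_hb;
}

/* Called at each double-buffer exchange: compute and DMA swap hyperbanks. */
static inline void swap_hyperbanks(void)
{
    compute_phase ^= 1;
}
```

Because compute and DMA requests always resolve to opposite hyperbanks within a phase, no physical bank ever sees both traffic classes in the same cycle.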
Address Mapping
Addressing is organized so that each buffer's data resides within a single hyperbank, and memory accesses are interleaved among banks within the hyperbank for maximal single-cycle access efficiency:
$$
\text{hyperbank}(a) \;=\; \left\lfloor \frac{a}{B} \right\rfloor \bmod 2,
\qquad
\text{bank}(a) \;=\; \left\lfloor \frac{a \bmod B}{w} \right\rfloor \bmod N_b
$$

Here, $a$ is the byte address within L1, $B$ is the buffer size (so each double-buffer half falls entirely within one hyperbank), $w$ is the bank word width, and $N_b$ is the number of banks per hyperbank; the intra-hyperbank bank is selected by low address bits (word-level interleaving for locality and bandwidth).
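The sketch below illustrates one plausible realization of this decomposition in C; the bank word width, `BUFFER_BYTES`, and the helper names are assumptions for illustration, not parameters reported in the paper:

```c
#include <stdint.h>

#define BANK_WORD_BYTES      4u             /* assumed 32-bit bank word        */
#define BANKS_PER_HYPERBANK  24u
#define BUFFER_BYTES         (16u * 1024u)  /* assumed buffer size B           */

typedef struct {
    uint32_t hyperbank;  /* which of the two hyperbanks holds the word     */
    uint32_t bank;       /* bank inside that hyperbank (low address bits)  */
    uint32_t offset;     /* word index inside the selected bank            */
} bank_addr_t;

/* Decompose a byte address into (hyperbank, bank, offset): the buffer index
 * picks the hyperbank, the low address bits pick the bank (word-level
 * interleaving), and the remaining bits give the row inside that bank.    */
static inline bank_addr_t decode_l1_addr(uint32_t byte_addr)
{
    uint32_t in_buf = byte_addr % BUFFER_BYTES;
    uint32_t word   = in_buf / BANK_WORD_BYTES;
    bank_addr_t a;
    a.hyperbank = (byte_addr / BUFFER_BYTES) % 2u;
    a.bank      = word % BANKS_PER_HYPERBANK;
    a.offset    = word / BANKS_PER_HYPERBANK;
    return a;
}
```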
Required Bank Count
For $N_c$ compute cores, each issuing up to $r$ concurrent memory requests per cycle (e.g., 2 reads + 1 write per cycle in matmul, so $r = 3$), the hyperbank must have at least $N_c \cdot r$ banks:

$$
N_{\text{banks per hyperbank}} \;\geq\; N_c \cdot r
$$

For the 8-core cluster this gives $8 \times 3 = 24$ banks per hyperbank, i.e., 48 banks in total across the two hyperbanks.
This condition ensures that under the prescribed access patterns, no two requests ever collide in a bank.
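The sizing rule can be captured as a compile-time check; a minimal C sketch with illustrative constant names:

```c
/* Sizing rule for a conflict-free hyperbank: at least one bank per
 * outstanding request issued in a single cycle. Constant names are
 * illustrative, not taken from the paper.                            */
#define N_CORES              8
#define REQS_PER_CORE_CYCLE  3   /* 2 reads + 1 write per cycle in GEMM */
#define BANKS_PER_HYPERBANK  24

_Static_assert(BANKS_PER_HYPERBANK >= N_CORES * REQS_PER_CORE_CYCLE,
               "hyperbank has too few banks to guarantee conflict-free access");
```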
3. Practical Achievements: Utilization, Performance, and Energy
By strict division of bank usage between compute and DMA (enforced by software and hardware protocols), the zero-conflict memory subsystem guarantees that:
- All per-cycle memory requests from different cores are served without bank conflicts.
- DMA and computation operate without interference, maximizing effective memory bandwidth.
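The per-cycle guarantee can be made concrete with a small checker; this is not from the paper, merely a sketch of what "served without bank conflicts" means for one cycle's worth of requests to the compute hyperbank:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BANKS_PER_HYPERBANK 24u

/* Returns true iff every request issued in one cycle targets a distinct
 * bank of the compute hyperbank, i.e., the cycle completes without stalls. */
static bool cycle_is_conflict_free(const uint32_t *bank_of_req, size_t n_reqs)
{
    bool used[BANKS_PER_HYPERBANK] = { false };
    for (size_t i = 0; i < n_reqs; ++i) {
        uint32_t b = bank_of_req[i];
        if (b >= BANKS_PER_HYPERBANK || used[b])
            return false;   /* out-of-range or duplicate bank -> conflict */
        used[b] = true;
    }
    return true;
}
```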
Empirical results on the cluster show that, relative to a prior state-of-the-art (SoA) RISC-V cluster design, median FPU utilization increases from 88% (often bottlenecked by memory stalls) to 96–99%. This yields an 11% performance improvement and an 8% gain in energy efficiency versus the baseline, approaching the efficiency of specialized, fixed-function accelerators while retaining full programmability.
Compared to naïvely scaling the number of banks (e.g., a 64-bank, all-to-all fully-connected crossbar), the double-buffering-aware interconnect keeps area overhead and routing complexity minimal: a 48-bank system with two small crossbars incurs only a 1% area increase, whereas a monolithic 64-bank crossbar would raise area by 14% and energy by 12%.
| Configuration | FPU Utilization | Performance (DP GFLOPS) | Energy Efficiency (DP GFLOPS/W) | Area Overhead |
|---|---|---|---|---|
| Baseline | 88% | 7.63 | 22.4 | 0% |
| Dobu (double-buffering-aware) interconnect, 48 banks | 99% | 7.92 | 23.2 | +1% |
| 64-bank all-to-all crossbar | 98% | 7.94 | lower (routing cost) | +14% |
4. Algorithmic and Hardware-Software Co-Design
The paper underscores that achieving a zero-conflict subsystem is the product of a hardware-software contract:
- Hardware: Organizes memory into hyperbanks, with crossbars and demultiplexing logic ensuring physical access separation.
- Software: Enforces double-buffered operation, ensuring that computation and DMA each map their working data to distinct hyperbanks for the duration of each phase. Control flow adheres strictly to this discipline, avoiding requests outside the prescribed hyperbank.
This arrangement requires no reduction in overall programmability: the approach is compatible with a broad class of double-buffered applications, common in ML workloads, and does not depend on hard-coded access patterns or inflexible fixed-function logic.
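A schematic outline of the software side of this contract in C; `dma_copy_in`, `dma_wait`, `gemm_tile`, and `barrier` are hypothetical stand-ins for whatever DMA and synchronization primitives the target cluster runtime provides, and buffer placement into distinct hyperbanks is assumed to be handled by the linker or allocator:

```c
#include <stddef.h>

/* Two tile buffers, placed so that buffer 0 lives entirely in hyperbank 0
 * and buffer 1 in hyperbank 1 (the software side of the contract).        */
extern float *tile_buf[2];

/* Hypothetical runtime primitives standing in for the cluster's DMA/sync API. */
void dma_copy_in(float *dst, const float *src, size_t bytes);
void dma_wait(void);
void gemm_tile(const float *buf);   /* compute on the tile held in buf */
void barrier(void);

void double_buffered_gemm(const float *src, size_t n_tiles, size_t tile_bytes)
{
    int compute = 0;                              /* hyperbank used by the cores */
    dma_copy_in(tile_buf[compute], src, tile_bytes);
    dma_wait();

    for (size_t t = 0; t < n_tiles; ++t) {
        int fill = compute ^ 1;                   /* hyperbank owned by DMA */

        /* Prefetch the next tile into the idle hyperbank while cores compute. */
        if (t + 1 < n_tiles)
            dma_copy_in(tile_buf[fill],
                        src + (t + 1) * (tile_bytes / sizeof(float)),
                        tile_bytes);

        gemm_tile(tile_buf[compute]);             /* all core requests hit one hyperbank */

        dma_wait();                               /* DMA confined to the other one       */
        barrier();                                /* all cores reach the swap point      */
        compute = fill;                           /* roles swap with the buffers         */
    }
}
```

The essential invariant is that within any iteration, all core traffic targets one buffer (hence one hyperbank) while all DMA traffic targets the other, with the roles exchanged only at the barrier.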
5. Implications and Trade-Offs
The zero-conflict memory subsystem approach:
- Allows a general-purpose, programmable accelerator to approach utilization and performance levels of domain-specific hardware.
- Minimizes the cost and complexity of scaling banked memory (additional banks plus two moderate-area crossbars, rather than the roughly quadratic growth of a monolithic all-to-all crossbar).
- Limits resource contention-based stalls even as parallelism scales with more cores, avoiding the diminishing returns seen in previous architectures.
The approach introduces minimal area and energy overhead and does not restrict code flexibility, provided the software adheres to the double-buffering contract (i.e., statically partitioned compute and DMA buffers per phase). This suggests broad applicability to high-performance embedded ML systems and other bandwidth-bound kernels.
6. Broader Context and Related Developments
In the context of broader memory system research, the zero-conflict memory subsystem in this work targets on-chip/L1 conflict elimination in tightly-coupled shared-memory clusters by bridging hardware design with parallel software patterns. While zero-conflict schemes at the device or system level (see, e.g., bank-conflict-free algorithms for GPUs or Twin-Load for extended DRAM) focus on interface or protocol-level conflicts, this paper demonstrates that architectural partitioning and banked-interconnect microarchitecture are effective for maximizing hardware utilization in energy-constrained, programmable accelerator clusters.
This methodology is particularly relevant as AI/ML tasks continue to drive memory bandwidth requirements beyond classical CPU/GPU scaling, and as agile FPGA/RISC-V-based AI inference and training engines are increasingly deployed in edge-to-cloud environments.