
TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs (2501.14370v1)

Published 24 Jan 2025 in cs.AR and cs.DC

Abstract: As computing demand and memory footprint of deep learning applications accelerate, clusters of cores sharing local (L1) multi-banked memory are widely used as key building blocks in large-scale architectures. When the cluster's core count increases, a flat all-to-all interconnect between cores and L1 memory banks becomes a physical implementation bottleneck, and hierarchical network topologies are required. However, hierarchical, multi-level intra-cluster networks are subject to internal contention which may lead to significant performance degradation, especially for SIMD or vector cores, as their memory access is bursty. We present the TCDM Burst Access architecture, a software-transparent burst transaction support to improve bandwidth utilization in clusters with many vector cores tightly coupled to a multi-banked L1 data memory. In our solution, a Burst Manager dispatches burst requests to L1 memory banks; multiple 32b words from burst responses are retired in parallel on channels with parametric data-width. We validate our design on a RISC-V Vector (RVV) many-core cluster, evaluating the benefits on different core counts. With minimal logic area overhead (less than 8%), we improve the bandwidth of 16-, 256-, and 1024-Floating-Point-Unit (FPU) baseline clusters without Tightly Coupled Data Memory (TCDM) Burst Access by 118%, 226%, and 77%, respectively. Reaching up to 80% of the cores-memory peak bandwidth, our design demonstrates ultra-high bandwidth utilization and enables efficient performance scaling. Implemented in a 12-nm FinFET technology node and compared to the serialized access baseline, our solution achieves up to 1.9x energy efficiency and 2.76x performance in real-world kernel benchmarks.

Summary

  • The paper presents a burst access mechanism that consolidates narrow memory requests into single burst transfers to reduce port conflicts, doubling the reorder buffer depth to track outstanding transactions.
  • It improves bandwidth by 118%, 226%, and 77% over 16-, 256-, and 1024-FPU baseline clusters respectively, with up to 1.9x gains in energy efficiency, validated across scalable MemPool-Spatz designs.
  • The enhanced response data-width minimizes serialization, enabling up to 2.76x performance gains on real-world benchmarks such as DOTP, FFT, and MATMUL.

TCDM Burst Access: Enhancing Bandwidth Utilization in High-Density Compute Clusters

The paper "TCDM Burst Access: Breaking the Bandwidth Barrier in Shared-L1 RVV Clusters Beyond 1000 FPUs" presents a novel approach to improving bandwidth efficiency in high-density compute clusters, characterized by large-scale core integration and shared-level one (L1) memory banks. The proposed TCDM Burst Access architecture effectively addresses the challenges posed by hierarchical network topologies, which manifest significant interconnect contention and performance degradation, particularly in SIMD or vector core environments.

Key Contributions and Methodology

The researchers introduce a software-transparent mechanism designed to optimize burst transaction handling in clusters with numerous vector cores connected to a multi-banked L1 data memory. The solution entails two significant enhancements:

  1. Burst Narrow Requests: This mechanism consolidates multiple narrow memory requests from a vector core into a single burst transfer, reducing port conflicts in the hierarchical interconnect. A Burst Sender issues these transactions, and the reorder buffer (ROB) depth is doubled so that enough outstanding transactions remain in flight.
  2. Enhanced Response Data-Width: Because a burst targets a contiguous group of memory banks, the response channel data width can be widened, minimizing the serialization of response data. The width is set by a parametric Group Factor (GF), which trades routing area against response parallelism across hardware scales (a behavioral sketch of both mechanisms follows this list).
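To make the two mechanisms concrete, here is a minimal behavioral sketch in Python. It is an illustration under assumptions, not the paper's RTL: the names (BurstManager, Request), the word-indexed addressing, and the rule that a burst never crosses a GF-wide bank group are ours; only the 32b word size and the Group Factor concept come from the paper.

```python
from dataclasses import dataclass

WORD_BITS = 32  # L1 banks retire 32b words (per the abstract)

@dataclass
class Request:
    base_word: int   # word-aligned index of the first element
    num_words: int   # number of 32b elements requested by one vector core

class BurstManager:
    """Coalesces a core's narrow word requests into per-group bursts."""

    def __init__(self, group_factor: int):
        # Group Factor (GF): how many 32b words one response beat carries;
        # a wider response channel cuts response serialization.
        self.gf = group_factor

    def split_into_bursts(self, req: Request):
        """Yield (start_word, length) bursts, one per GF-wide bank group."""
        word, remaining = req.base_word, req.num_words
        while remaining > 0:
            # In this toy model, a burst never crosses a GF-word group boundary.
            room = self.gf - (word % self.gf)
            length = min(remaining, room)
            yield (word, length)
            word += length
            remaining -= length

# A 16-element vector load with GF = 4 becomes four 4-word bursts, each
# retired as one wide response instead of four serialized 32b responses.
mgr = BurstManager(group_factor=4)
print(list(mgr.split_into_bursts(Request(base_word=0, num_words=16))))
# -> [(0, 4), (4, 4), (8, 4), (12, 4)]
```

In this model, each yielded burst occupies one request slot instead of GF slots, which is where the reduction in port conflicts and in ROB pressure comes from.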

Implementation and Results

The proposed TCDM Burst Access architecture is evaluated on MemPool-Spatz, a scalable RISC-V Vector cluster, in configurations ranging from 16 to 1024 FPUs. Relative to baselines without Burst Access, bandwidth improves by 118% (16 FPUs), 226% (256 FPUs), and 77% (1024 FPUs). Across configurations, utilization reaches up to 80% of the cores-memory peak bandwidth, overcoming the bandwidth barrier with minimal area overhead (less than 8%).

Furthermore, with suitable GF configurations, performance scales well as the core count increases, yielding marked gains in energy efficiency. The proposed architecture achieves up to 1.9x better energy efficiency and up to 2.76x higher performance on real-world kernel benchmarks such as DOTP, FFT, and MATMUL.
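As a rough illustration of what "80% of cores-memory peak bandwidth" means, the back-of-the-envelope arithmetic below assumes one 32b core-side port per FPU and a 1 GHz clock; both figures are placeholders for the sake of the example, not numbers reported in the paper.

```python
# Back-of-the-envelope peak-bandwidth arithmetic. The port count mirrors the
# 1024-FPU configuration; the 1 GHz clock is an assumed placeholder.
NUM_PORTS = 1024        # one 32b core-side port per FPU (assumed)
BYTES_PER_WORD = 4      # 32b words
FREQ_GHZ = 1.0          # assumed clock frequency

peak_gb_s = NUM_PORTS * BYTES_PER_WORD * FREQ_GHZ  # bytes/cycle * GHz = GB/s
print(f"peak bandwidth : {peak_gb_s:,.0f} GB/s")          # 4,096 GB/s
print(f"at 80% of peak : {0.80 * peak_gb_s:,.0f} GB/s")   # ~3,277 GB/s
```

Under these assumptions, sustaining 80% of peak means the interconnect delivers roughly 3.3 TB/s to the cores, which is why serialization in the response path dominates at this scale.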

Practical and Theoretical Implications

The TCDM Burst Access architecture has significant implications for processor architectures in high-performance computing. Theoretically, it suggests that burst mechanisms and widened response channels can effectively mitigate hierarchical interconnect contention, paving the way for more scalable many-core designs. Practically, the measured improvements in energy efficiency and performance benefit domains demanding substantial computational capacity, such as AI and machine learning, particularly for vectorized floating-point workloads.

Speculation on Future Developments

Given these results, future work could explore adaptive mechanisms that dynamically adjust burst configurations to workload and traffic patterns. This could yield even higher bandwidth utilization and broaden the applicability of similar techniques to other architectures and domains, including real-time data processing and large-scale parallel computation.

In conclusion, the paper provides a carefully evaluated and validated solution to an increasingly relevant constraint in high-performance computing architectures, contributing meaningfully to the design of future massively parallel computing systems.