A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

Published 27 Mar 2026 in cs.AR and cs.DC | (2603.26438v1)

Abstract: The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions with a small 16.5% router area overhead. Through in-network hardware acceleration, we achieve 2.9x and 2.5x geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 3.8x and 2.4x estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture, and up to 1.17x estimated energy savings.


Authors (6)

Summary

The paper demonstrates that router-level in-network compute via DCA enables high-throughput collective operations with minimal area overhead.
The methodology extends the AXI protocol with multi-address encoding and adds flit-forking logic to the routers, achieving 2–4× speedups on collective primitives.
The design integrates with Snitch clusters on the open-source FlooNoC platform, offering scalable and energy-efficient performance in ML accelerators.


Overview and Motivation

"A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators" (2603.26438) addresses the architectural and communication challenges faced by contemporary and emerging ML accelerators as they scale to thousands of processing elements (PEs) on a single die. The exponential increase in model size and the demands of transformer-based architectures have resulted in a significant computational-to-communication performance gap. While FLOPS throughput has increased by orders of magnitude, corresponding advances in memory and on-chip communication bandwidth have lagged, leading to pronounced bottlenecks for parallel workloads, especially those that rely on collective communication primitives such as multicast, reduction, and barrier synchronization.

The paper presents a design and evaluation of a Network-on-Chip (NoC) that natively supports high-performance collective operations, with strong emphasis on hardware-software codesign and minimal resource overheads. The proposed architecture introduces Direct Compute Access (DCA), enabling the interconnect fabric to directly access and utilize cluster arithmetic units for in-network reduction, thus achieving high throughput for collective operations and improving the utilization of manycore ML systems. The design and evaluation are demonstrated on the open-source FlooNoC platform and integrated with Snitch clusters.

Architectural Contributions

Collective-Capable NoC and Protocol Extensions

The design extends FlooNoC, an open-source, AXI-compliant NoC, with protocol-level and microarchitectural support for scalable collectives. Key architectural features include:

Multi-address encoding: Extension of the AWUSER field in AXI to carry destination masks and opcodes for collective operations. This enables succinct encoding of multicast and reduction destinations, scaling logarithmically with network size.
Router microarchitecture: The base router is augmented with flit-forking logic for multicast, parallel reduction support with arbitration and deadlock-avoidance mechanisms, and support for partial and wide reductions.
Network Interface (NI): Collectives-aware NI translates between AXI transactions and NoC-internal representations of multi-destination/source address spaces. Hardware-level barriers and reductions are represented via additional logic and buffer tracking.
DCA integration: Each cluster exposes compute resources to the NoC. The router may offload reduction operations to local arithmetic units, realized as 8×64b SIMD FPUs per Snitch cluster, enabling high-throughput reductions with negligible area cost. The DCA interface is carefully integrated into the arbitration logic to minimize conflict with normal computation.
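The multi-address encoding described above can be illustrated with a minimal sketch. It assumes a destination set is carried as an (address, mask) pair in which mask bits mark "don't care" positions of the tile ID; the bit layout, the row-major tile numbering, and the function names `expand_destinations` and `encoding_bits` are hypothetical and not taken from the paper.

```python
from itertools import product


def expand_destinations(addr: int, mask: int, id_bits: int) -> list[int]:
    """Return every tile ID matched by an (addr, mask) pair.

    Assumption: bits set in `mask` are don't-care; all other bits must equal `addr`.
    """
    free_bits = [b for b in range(id_bits) if (mask >> b) & 1]
    base = addr & ~mask & ((1 << id_bits) - 1)  # fixed part of the tile ID
    dests = []
    for combo in product((0, 1), repeat=len(free_bits)):
        tile = base
        for bit, value in zip(free_bits, combo):
            tile |= value << bit
        dests.append(tile)
    return sorted(dests)


def encoding_bits(num_tiles: int) -> int:
    """Bits for address plus mask: 2 * ceil(log2(num_tiles))."""
    id_bits = max(1, (num_tiles - 1).bit_length())
    return 2 * id_bits


# 4x4 mesh, tile ID = y * 4 + x (hypothetical numbering).
row_1 = expand_destinations(0b0100, 0b0011, id_bits=4)     # [4, 5, 6, 7]
column_2 = expand_destinations(0b0010, 0b1100, id_bits=4)  # [2, 6, 10, 14]
```

Under these assumptions, a 256×256 mesh (65,536 tiles) needs `encoding_bits(65536) == 32` bits, whereas a one-bit-per-tile destination bitmap would need 65,536 bits. This is the sense in which such an encoding scales logarithmically with network size, at the price of only expressing destination sets that align with the mask structure.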

Scalability and Overhead

All collective extensions remain fully AXI4-compliant and are made general for mesh-based manycore ML accelerators.
The area overhead of adding all collective communication support (multicast, parallel reduction, and wide reduction) to the router is modest: only 16.5%, corresponding to <1% of the total compute tile area.
Timing closure, frequency, and physical design analyses demonstrate no critical path degradation at a 1 GHz target in TSMC 7 nm.

System-Level Realization

A physically realized system with multiple compute and memory tiles connected in a 2D mesh is implemented. Each tile contains a Snitch cluster with local L1 SPM and SIMD FPUs, with system-level DMA and memory-mapped address partitioning optimized for scalable collectives. Address encoding schemes impose minor topology constraints (alignment to powers-of-two), which are handled via padding.
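One possible reading of the power-of-two constraint is that each mesh dimension is padded so that a coordinate occupies a whole bit field of the tile ID. The sketch below illustrates that reading only; the helper names `next_pow2`, `padded_mesh`, and `tile_id` and the row-major ID layout are assumptions, not the paper's scheme.

```python
def next_pow2(n: int) -> int:
    """Smallest power of two greater than or equal to n (n >= 1)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 << (n - 1).bit_length()


def padded_mesh(cols: int, rows: int) -> tuple[int, int]:
    """Pad both mesh dimensions up to powers of two."""
    return next_pow2(cols), next_pow2(rows)


def tile_id(x: int, y: int, cols: int) -> int:
    """Concatenate y and x bit fields; x uses log2(padded cols) bits."""
    x_bits = (next_pow2(cols) - 1).bit_length()
    return (y << x_bits) | x
```

For example, a 6×3 mesh would be padded to an 8×4 ID space under this scheme, and the tile at (x=5, y=2) would receive ID 21. The IDs that fall in the padding simply correspond to no physical tile.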

Performance and Experimental Results

Primitive Collective Operations

Cycle-accurate RTL simulations and analytical modeling demonstrate large runtime improvements on core collective primitives:

Multicast: Hardware-accelerated multicast achieves 2.9× geomean speedup on 1–32 KiB transfers compared to hand-optimized software baselines (tree and pipelined algorithms) in a 4×4 mesh.
Reduction: In-network reductions via DCA-enabled routers achieve 2.5× geomean speedup and support aggregate throughput scaling up to 256×256 meshes.
Barrier synchronization: Hardware barriers, built by coupling the in-network reduction and multicast primitives, reduce the latency scaling from ~3.3 cycles per cluster (software) to 1.3 cycles per cluster (hardware).
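The barrier scaling figures above can be turned into a quick back-of-the-envelope estimate. The sketch below models only the per-cluster slope; any fixed latency offset is not given in this summary and is deliberately left out, and the function name `barrier_scaling_cycles` is hypothetical.

```python
# Per-cluster latency slopes quoted in the text (cycles per cluster).
SW_CYCLES_PER_CLUSTER = 3.3
HW_CYCLES_PER_CLUSTER = 1.3


def barrier_scaling_cycles(num_clusters: int, cycles_per_cluster: float) -> float:
    """Slope-only estimate of barrier latency; fixed offsets are ignored."""
    return num_clusters * cycles_per_cluster


def slope_ratio() -> float:
    """How much faster the hardware barrier scales than the software one."""
    return SW_CYCLES_PER_CLUSTER / HW_CYCLES_PER_CLUSTER
```

For a 16-cluster (4×4) mesh, this gives roughly 52.8 cycles of scaling cost in software against 20.8 cycles in hardware, a slope ratio of about 2.5×.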

GEMM/ML Kernel Evaluation

The architectural impact is quantified via two key dataflow variants for distributed GEMM, SUMMA and FusedConcatLinear, characteristic of modern ML workloads:

GEMM SUMMA: Hardware multicast enables communication to scale efficiently up to 256×256 meshes, maintaining compute-bound operation and yielding up to 3.8× kernel speedup versus software collectives.
FusedConcatLinear (Multi-Head Attention): Hardware reduction support via DCA avoids expensive off-chip data movement and achieves up to 2.4× kernel speedup compared to software baselines.
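To make the role of multicast in SUMMA concrete, the following is a minimal functional sketch of the SUMMA dataflow on a P×P grid in which each processing element holds a single matrix element. It is not the paper's implementation: the function name `summa` is hypothetical, and the message counts compare hardware multicast against naive point-to-point unicast only, which is simpler than the tree and pipelined software baselines mentioned earlier.

```python
def summa(a: list[list[float]], b: list[list[float]]):
    """Compute C = A @ B with the SUMMA dataflow on a P x P grid.

    PE(i, j) owns A[i][j], B[i][j], and C[i][j]. Returns (C, multicasts, unicasts),
    where the counts are messages needed with multicast vs. naive unicast.
    """
    p = len(a)
    c = [[0] * p for _ in range(p)]
    multicasts = 0
    unicasts = 0
    for k in range(p):
        # PE(i, k) sends A[i][k] to every PE in row i.
        row_values = [a[i][k] for i in range(p)]
        # PE(k, j) sends B[k][j] to every PE in column j.
        col_values = [b[k][j] for j in range(p)]
        multicasts += 2 * p            # one multicast per row and per column
        unicasts += 2 * p * (p - 1)    # each source sends to its P - 1 peers
        # Every PE accumulates its local partial product.
        for i in range(p):
            for j in range(p):
                c[i][j] += row_values[i] * col_values[j]
    return c, multicasts, unicasts
```

Over all P steps this amounts to 2P² multicast transactions against 2P²(P − 1) naive unicast transfers, so the gap in message count grows linearly with the mesh side length. Message counts are not cycle counts, so this only hints at why multicast support helps the kernel stay compute-bound on large meshes.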

Energy Efficiency

Gate-level simulation and physical power analysis on post-layout netlists confirm that the cost of NoC collective support is minimal. For both GEMM kernels, energy models project up to 1.17× system-level energy savings on large meshes, primarily owing to communication offloading and the ability to keep Snitch cores in a low-power state during DCA aggregations.

Implications and Theoretical Impact

The research demonstrates:

Viability of router-level in-network compute: By leveraging lightly utilized arithmetic resources at the cluster level, high-throughput reductions become feasible without centralized on-chip aggregators or excessive router complexity.
Protocol-compliant collectives: AXI-level extensions for collectives can be achieved with local modifications, preserving broad hardware IP reusability.
Generalizability: The approach relies only on mesh topology, per-tile compute, and programmable communication—features present in current industrial/academic ML accelerators (e.g., WSE-3, Blackhole, XDNA, MTIA), thus readily transferable.
Redefinition of on-chip network design: Hardware-accelerated collectives, previously considered hard to justify because of their area and complexity, become practical at a sub-1% tile area cost. This challenges traditional cache-coherence-inspired NoC designs and makes collective communication a first-order concern in accelerator architecture design.

Speculation and Future Directions

As ML workloads continue to emphasize model/data scale, the criticality of efficient collective operations will further increase. Hardware infrastructure for scalable in-network collectives will likely become standard in high-performance ML ASICs and SoCs.
Dynamic resource sharing mechanisms at the NoC-router/cluster boundary (such as generalized DCA) may be extended to other forms of computation (e.g., arbitrary in-network compute, programmable on-path data transformations, irregular collectives).
With open-source, physically realized implementations, there is potential for rapid ecosystem adoption and deeper exploration into multi-chiplet or 3D-stacked NoCs using similar collective hardware protocols.

Conclusion

This work establishes that high-throughput, hardware-accelerated collective communication can be co-designed with ML accelerator NoCs at negligible cost. Together, the DCA paradigm, protocol-level multicast/reduction support, and a practical physical realization enable 2–4× speedups and tangible energy benefits on realistic, communication-bottlenecked ML workloads at scale. These outcomes strongly motivate moving manycore ML architecture design towards "collective-capable" NoCs, with in-network computation as a first-class component (2603.26438).
