FlooNoC: A 645 Gbps/link 0.15 pJ/B/hop Open-Source NoC with Wide Physical Links and End-to-End AXI4 Parallel Multi-Stream Support (2409.17606v2)

Published 26 Sep 2024 in cs.AR

Abstract: The new generation of domain-specific AI accelerators is characterized by rapidly increasing demands for bulk data transfers, as opposed to small, latency-critical cache line transfers typical of traditional cache-coherent systems. In this paper, we address this critical need by introducing the FlooNoC Network-on-Chip (NoC), featuring very wide, fully Advanced eXtensible Interface (AXI4) compliant links designed to meet the massive bandwidth needs at high energy efficiency. At the transport level, non-blocking transactions are supported for latency tolerance. Additionally, a novel end-to-end ordering approach for AXI4, enabled by a multi-stream capable Direct Memory Access (DMA) engine, simplifies network interfaces and eliminates inter-stream dependencies. Furthermore, dedicated physical links are instantiated for short, latency-critical messages. A complete end-to-end reference implementation in 12nm FinFET technology demonstrates the physical feasibility and power performance area (PPA) benefits of our approach. Utilizing wide links on high levels of metal, we achieve a bandwidth of 645 Gbps per link and a total aggregate bandwidth of 103 Tbps for an 8x4 mesh of processor cluster tiles, with a total of 288 RISC-V cores. The NoC imposes a minimal area overhead of only 3.5% per compute tile and achieves a leading-edge energy efficiency of 0.15 pJ/B/hop at 0.8 V. Compared to state-of-the-art NoCs, our system offers three times the energy efficiency and more than double the link bandwidth. Furthermore, compared to a traditional AXI4-based multi-layer interconnect, our NoC achieves a 30% reduction in area, corresponding to a 47% increase in GFLOPS_DP (double-precision GFLOPS) within the same floorplan.

Summary

  • The paper presents FlooNoC, a novel NoC achieving 645 Gbps per link and 0.15 pJ/byte/hop energy efficiency.
  • The design leverages wide physical links and decouples network and transport layers using a custom AXI4-compliant interface.
  • The solution minimizes area overhead while providing scalable, high-performance connectivity for modern AI accelerator architectures.

Overview of FlooNoC: An Open-Source Network-on-Chip Solution

The paper presents FlooNoC, a Network-on-Chip (NoC) designed to meet the growing bandwidth and energy efficiency demands of modern AI accelerators. It features very wide physical links and introduces end-to-end AXI4-compliant support tailored to the bulk data transfers typical of contemporary high-performance computing environments. This work addresses a critical need of domain-specific AI accelerators, whose data transfer requirements are increasing rapidly.

Key Contributions and Findings

  • Wide Physical Links: FlooNoC employs wide physical links, achieving 645 Gbps per link and a total aggregate bandwidth of 103 Tbps across an 8x4 mesh of processor cluster tiles (see the arithmetic check after this list). This choice exploits the abundant wiring resources of modern VLSI technology to deliver high throughput without raising the operating frequency, thereby preserving energy efficiency.
  • End-to-End AXI Support: To manage the AXI protocol's inherent complexity, in particular the requirement that transactions sharing a Transaction ID (TxnID) complete in order, the FlooNoC design decouples the network and transport layers. This separation is achieved by a novel network interface (NI), paired with a multi-stream capable DMA engine, that carries fully AXI4-compliant transactions while eliminating inter-stream dependencies and offering substantial area savings over reorder buffers (ROBs); a behavioral sketch of this ordering scheme follows the list.
  • Energy Efficiency and Area Implications: At 0.15 pJ/B/hop, FlooNoC offers three times the energy efficiency and more than double the link bandwidth of state-of-the-art NoCs. Integrating the NoC into a compute tile incurs an area overhead of only 3.5%, underscoring the design's physical viability and suitability for large-scale integration.
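
To make the headline figures concrete, the short Python check below reproduces the arithmetic. The 512-bit width and ~1.26 GHz clock are illustrative assumptions chosen to be consistent with the stated 645 Gbps per link; the paper's exact flit width and frequency may differ. The aggregate-bandwidth and energy figures are taken directly from the paper.

```python
# Back-of-the-envelope check of FlooNoC's headline numbers.
# ASSUMPTION: a 512-bit payload at ~1.26 GHz is one plausible way to
# reach ~645 Gbps/link; the paper's exact width/clock may differ.

LINK_WIDTH_BITS = 512     # assumed physical payload width per link
CLOCK_HZ = 1.26e9         # assumed link clock, one flit per cycle

per_link_bps = LINK_WIDTH_BITS * CLOCK_HZ
print(f"per-link bandwidth ~ {per_link_bps / 1e9:.0f} Gbps")  # ~645 Gbps

# The reported 103 Tbps aggregate implies roughly this many
# concurrently active unidirectional links in the 8x4 mesh:
implied_links = 103e12 / 645e9
print(f"implied active links ~ {implied_links:.0f}")          # ~160

# At 0.15 pJ/B/hop, moving a 1 KiB payload across 5 hops costs:
energy_pj = 1024 * 5 * 0.15
print(f"1 KiB over 5 hops ~ {energy_pj:.0f} pJ")              # 768 pJ
```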

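The end-to-end ordering idea from the second bullet can be illustrated with a small behavioral model: each independent DMA stream carries its own ID, ordering is enforced only within a stream at the endpoints, and streams never block one another, so the network interface needs no reorder buffer. This is a minimal sketch with hypothetical names, not the paper's RTL or interface.

```python
from collections import defaultdict, deque

class MultiStreamEndpoint:
    """Toy model of end-to-end ordering: responses must stay in order
    only *within* a stream, so no reorder buffer is needed and one
    slow stream never blocks another (hypothetical names)."""

    def __init__(self):
        self.pending = defaultdict(deque)  # stream_id -> in-flight tags

    def issue(self, stream_id, tag):
        # Requests of a given stream are issued and tracked in order.
        self.pending[stream_id].append(tag)

    def complete(self, stream_id, tag):
        # A response retires only if it is the oldest in its own
        # stream; responses of other streams are fully independent.
        queue = self.pending[stream_id]
        if queue and queue[0] == tag:
            queue.popleft()
            return True
        return False  # out of order within its stream: protocol error

ep = MultiStreamEndpoint()
ep.issue("dma0", 1); ep.issue("dma0", 2)
ep.issue("dma1", 7)
# dma1 may finish before dma0: no inter-stream dependency exists.
assert ep.complete("dma1", 7)
assert ep.complete("dma0", 1) and ep.complete("dma0", 2)
```
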
Implications and Future Directions

The introduction of FlooNoC has several implications for the design of scalable, efficient on-chip networks in next-generation systems-on-chip (SoCs). The demonstrated gains in bandwidth and energy efficiency make a compelling case for shifting toward wide-link NoCs in AI applications. The separation of network and transport layers could inform future designs that must accommodate complex protocols such as AXI4 without sacrificing performance or area.

Future work may focus on expanding FlooNoC's applicability to other domains, such as general-purpose computing, where similar demands for high bandwidth and energy efficiency are prevalent. Additionally, exploring adaptive routing techniques and enhancing scalability beyond the current 8x4 tile mesh could further extend FlooNoC's benefits.

Comparative Analysis

When evaluated against existing solutions like Piton and Celerity, FlooNoC exhibits superior performance characteristics, particularly in energy and bandwidth metrics. Its ability to offer significant area reductions compared to hierarchically designed interconnects like those found in Occamy exemplifies its practical benefits.

Conclusion

FlooNoC represents a significant advancement in NoC design, providing a highly efficient, scalable solution for addressing the bandwidth-intensive nature of modern AI workloads. Its open-source availability ensures broader access for further research and development, potentially spurring innovations across various computational fields.
