CompAir-NoC: Hybrid Memory-Compute NoC
- CompAir-NoC is an advanced network-on-chip architecture that integrates DRAM-PIM and SRAM-PIM for efficient, low-latency memory-compute coupling in LLM acceleration.
- It embeds Currying ALUs in each router to perform in-transit non-linear operations, reducing data movement and pipeline delays.
- The hierarchical ISA and hybrid bonding techniques yield significant improvements in inference latency and energy efficiency, as validated by experimental benchmarks.
CompAir-NoC refers to the advanced network-on-chip subsystem designed as part of the CompAir processing-in-memory (PIM) architecture for LLM acceleration (Li et al., 17 Sep 2025). It is engineered to efficiently couple high-throughput, low-latency memory-compute fabrics (hybrid DRAM-PIM and SRAM-PIM) with embedded in-transit computation. The architecture addresses the persistent "memory wall" that limits conventional GPU and PIM systems, enabling both linear and non-linear operations to occur as data moves through the NoC. This synergistic design achieves significant improvements in inference latency and energy efficiency for LLM workloads, introducing several notable mechanisms in interconnect, compute fusion, control, and instruction set hierarchies.
1. Hybrid Integration of DRAM-PIM and SRAM-PIM
CompAir-NoC is constructed atop a "hybrid bonded" memory-compute stack, pairing DRAM-PIM for scalable capacity and vector parallelism with SRAM-PIM for sub-10 ns latency matrix operations. Hybrid bonding is achieved via 3D inter-die interconnects with very high bond density and per-bond energy costs as low as $0.05$–$0.88$ pJ, orders of magnitude more efficient than off-chip HBM links.
DRAM-PIM and SRAM-PIM modules are intra-channel hybridized, sharing high-bandwidth interfaces. Modified bank column decoders (e.g., partitioned 8:1, 4:1 formats) facilitate simultaneous broad data streaming to SRAM macros for highly reusable matrix–vector tasks. Linear throughput is optimized by exploiting the relation
$$T_{\text{linear}} \;\propto\; BW_{\text{DRAM}\rightarrow\text{SRAM}} \times R,$$
where $R$ denotes data reuse within the SRAM fabric. This close integration is necessary to accommodate the rapidly shifting memory–compute access patterns typical of transformer-based LLMs.
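As a rough illustration of this bandwidth-times-reuse relation (the bandwidth figure and reuse factors below are hypothetical, not taken from the paper):

```python
def effective_linear_throughput(dram_to_sram_bw_gbps: float, reuse_factor: float) -> float:
    """Roofline-style estimate: each byte streamed from DRAM-PIM into an
    SRAM-PIM macro is consumed reuse_factor times by matrix-vector work,
    so effective compute-side operand throughput scales with bandwidth x reuse."""
    return dram_to_sram_bw_gbps * reuse_factor

# Hypothetical numbers for illustration only (not from the paper):
bw = 512.0  # GB/s streamed across the hybrid-bonded interface
for r in (1, 8, 64):
    print(f"reuse={r:>3}: {effective_linear_throughput(bw, r):,.0f} GB/s of operand traffic served")
```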
2. Embedded Arithmetic Logic Units for In-Transit Computation
Central to CompAir-NoC is the "Currying ALU," a lightweight arithmetic logic unit embedded in every NoC router. Rather than restricting the network to basic data movement, the Currying ALU executes non-linear and element-wise operations (e.g. Softmax, exponential, reductions) as flits traverse the router pipeline.
The architectural principle is the decomposition of multi-operand functions into chained unary operations, i.e., currying. For example, a two-operand function $f(a, b)$ is re-expressed as the lambda-calculus sequence
$$f(a, b) \;\rightarrow\; (\lambda a.\,(\lambda b.\, f(a, b))),$$
applied one operand at a time,
effectively pipelining computation across successive flits, with operand registers (ArgReg) updated at each stage. Flit-level compute occurs simultaneously with switch traversal, incurring modest pipeline delay (one to two cycles per router), and the circuitry adds only ~2.94% area to each router.
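A minimal Python sketch of this currying principle, assuming an illustrative softmax-style update; the stage functions and latched constants below are examples, not the router's published micro-operations:

```python
import math
from typing import Callable, List

def make_stage(op: Callable[[float, float], float], bound: float) -> Callable[[float], float]:
    """Curry a binary operation into a unary router stage: x -> op(x, bound).
    The bound operand plays the role of an ArgReg value latched at setup."""
    return lambda x: op(x, bound)

# Illustrative decomposition of a softmax-style update exp(x - m) / s into
# three unary stages, one executed per router hop (m and s are assumed to
# have been produced by earlier reduction passes).
m, s = 4.0, 10.0
stages: List[Callable[[float], float]] = [
    make_stage(lambda x, a: x - a, m),          # hop 1: subtract running max
    make_stage(lambda x, _: math.exp(x), 0.0),  # hop 2: exponentiate
    make_stage(lambda x, a: x / a, s),          # hop 3: divide by running sum
]

x = 5.5
for stage in stages:   # the flit payload is transformed at each hop
    x = stage(x)
print(round(x, 4))     # ~0.4482
```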
Additionally, NoC routers implement efficient tree-based reduction and broadcast schemes by using packet "Path" fields that specify hierarchical routing for collective operations. This design ensures that reduction (e.g., summation or normalization) occurs without excess data shuffling or centralized bottlenecks.
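A simplified sketch of such a Path-steered tree reduction; the mesh layout, coordinates, and data structures here are assumed for illustration, since the source only states that Path fields encode hierarchical routing for collectives:

```python
from typing import Dict, List, Tuple

Coord = Tuple[int, int]

def tree_reduce(leaf_values: Dict[Coord, float],
                paths: Dict[Coord, List[Coord]],
                root: Coord) -> float:
    """Each source injects a packet whose Path field lists the routers it
    visits on the way to `root`; every visited router folds the value into
    its running partial sum, so the root receives one reduced result
    instead of all raw operands."""
    partials: Dict[Coord, float] = {}
    for leaf, value in leaf_values.items():
        for hop in paths[leaf]:
            partials[hop] = partials.get(hop, 0.0) + value
    return partials[root]

# Illustrative 2x2 mesh reducing toward router (0, 0):
leaves = {(0, 0): 4.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 3.0}
paths = {
    (0, 0): [(0, 0)],
    (0, 1): [(0, 1), (0, 0)],
    (1, 0): [(1, 0), (0, 0)],
    (1, 1): [(1, 1), (1, 0), (0, 0)],
}
print(tree_reduce(leaves, paths, root=(0, 0)))   # 10.0
```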
3. Hierarchical Instruction Set Architecture
Programmability is managed via a two-level hierarchical ISA, tailored to the heterogeneity and granularity of the hybrid memory-compute substrate:
- Row-Level ISA: Exposed at the DRAM bank scale and operating in SIMD mode. Instructions such as SRAM_Write, SRAM_Compute, or NoC_Reduce are broadcast across all relevant banks, facilitating bulk linear computation (matrix multiplications, attention heads).
- Packet-Level ISA: Operates within the NoC routers, managing the Currying ALU, reduction trees, micro-instruction fusion, and explicit path routing. Packet fields include instruction type, payload, iteration count, and routed sequence (Path[0]–Path[N]).
Automated translation from row-level to packet-level instructions enables seamless mapping of LLM operations. For instance, row-level instructions for distributed reduction or exponentiation are decomposed into specific packet-level commands that orchestrate pipelined compute and data movement.
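A hedged sketch of this lowering step, keeping the row-level mnemonic NoC_Reduce from above but inventing the packet-level opcode and field names purely for illustration:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Packet:
    opcode: str        # packet-level instruction type
    payload: int       # operand / immediate
    iterations: int    # iteration count field
    path: List[int]    # Path[0]..Path[N]: router IDs to traverse

def lower_row_instruction(row_instr: str, banks: List[int], root: int) -> List[Packet]:
    """Decompose a broadcast row-level instruction into per-bank
    packet-level commands that orchestrate in-transit compute.
    The packet opcode PKT_EXP_ADD is an illustrative placeholder."""
    if row_instr == "NoC_Reduce":
        # Each bank emits a packet that exponentiates its operand in transit
        # and is summed into the reduction tree rooted at `root`.
        return [Packet(opcode="PKT_EXP_ADD", payload=b, iterations=1,
                       path=[b, root]) for b in banks]
    raise NotImplementedError(row_instr)

for pkt in lower_row_instruction("NoC_Reduce", banks=[1, 2, 3], root=0):
    print(pkt)
```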
4. Performance Metrics and Experimental Outcomes
Experimental analysis demonstrates substantial improvements over baseline PIM and GPU+HBM systems in LLM prefill (matrix-intensive) and decode (autoregressive) stages:
- Prefill Latency: 1.83–7.98× reduction relative to state-of-the-art DRAM-PIM systems.
- Decode Throughput: 1.95–6.28× improvement over comparable platforms.
- Energy Consumption: 3.52× reduction compared to hybrid A100+HBM-PIM systems, with comparable throughput.
Benefits become more pronounced for long-context scenarios (e.g., 128K token inference), where the NoC's in-transit compute minimizes data movement and synchronizes non-linear operations efficiently.
5. Architectural Implications for LLM Acceleration
By co-locating DRAM-PIM and SRAM-PIM and embedding in-transit computation, CompAir-NoC mitigates the memory wall and delivers energy- and latency-optimized inference for large-scale LLMs. Redundant data transfers—particularly for non-linear token-wise functions—are eliminated, and resource utilization is improved.
The hierarchical ISA and packet-level ALU fusion allow programmers to exploit both high-level SIMD control for matrix kernels and fine-grained MIMD orchestration for elementwise and collective reductions. This dual capability aligns with the demands of transformer-style inference, where linear and non-linear workloads are both critical.
Potential extensions include adopting these principles in emerging memory technologies (e.g., NVM-PIM) or further refining routing/banking strategies for ultra-scalable data-centric computing.
6. Future Directions and Broader Context
The systematic integration of hybrid memory PIM and NoC-embedded non-linear compute in CompAir-NoC constitutes the first exploration of such co-designed architectures for LLM acceleration (Li et al., 17 Sep 2025). Future research may investigate:
- Enhanced router designs supporting richer function fusion and more aggressive parallel reduction.
- Extended ISAs for broader dataflow programmability and support for new operator classes.
- Application to other domains where memory-bound data movement and in-situ computation are limiting, including large-scale graph processing, scientific simulation, and AI inference beyond transformers.
This suggests that CompAir-NoC offers a high-efficiency framework for overcoming traditional memory-compute bottlenecks in many-core inference systems, and is a strong candidate for next-generation scalable AI accelerators.