Inter-core Connected Neural Processing Units (NPUs)

Updated 24 June 2025

Inter-core Connected Neural Processing Units (NPUs) are a class of AI accelerators designed around arrays of homogeneous or near-homogeneous processing cores, featuring explicit on-chip interconnects, typically realized as networks-on-chip (NoC). They depart from traditional monolithic NPU designs by exposing the core-level topology, enabling scalable, parallel execution for deep learning workloads ranging from computer vision to LLMs. This architecture brings both opportunities and complexities in mapping neural network computation, maximizing hardware utilization, and supporting cloud-scale multi-tenancy.

1. Architecture and Topology of Inter-core Connected NPUs

Inter-core connected NPUs, exemplified by systems such as Graphcore IPU, Tenstorrent, and many academic designs, consist of tens to thousands of processing cores linked via a topology-aware NoC, most commonly a 2D mesh. Each core typically integrates a set of multiply-accumulate (MAC) units, dedicated SRAM scratchpad memory, local DMA engines, and a router for network traffic.

The physical interconnect topology directly impacts how instructions, data, and gradients move between cores. On-chip networks implement dimension-order routing (DOR) or more advanced schemes to ensure packets follow efficient, non-conflicting paths that respect virtual and physical partitioning.
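
As a concrete illustration, the sketch below computes the hop sequence of dimension-order (XY) routing on a 2D mesh. The coordinate scheme and function are illustrative only and do not model any particular NPU's router logic.

    # Minimal sketch of dimension-order (XY) routing on a 2D mesh NoC.
    # Core coordinates are illustrative assumptions, not taken from any
    # specific NPU's routing hardware.

    def xy_route(src, dst):
        """Return the list of (x, y) hops from src to dst, traversing
        the X dimension fully before the Y dimension."""
        x, y = src
        dx, dy = dst
        path = [(x, y)]
        while x != dx:                      # route along X first
            x += 1 if dx > x else -1
            path.append((x, y))
        while y != dy:                      # then along Y
            y += 1 if dy > y else -1
            path.append((x, y))
        return path

    # Example: a packet from core (0, 0) to core (2, 3) takes 5 hops.
    print(xy_route((0, 0), (2, 3)))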

Route Virtualization

A central challenge is to preserve mapping flexibility without sacrificing isolation or performance. The vNPU framework introduces route virtualization (vRouter), in which a per-VM routing table translates virtual core and topology IDs into physical addresses. Each VM or workload tenant thus operates as if it owned an isolated NPU with its own virtual core grid, while the hardware dynamically redirects data and instruction packets (both at the NPU controller and at the NoC level) to the physical resources actually allocated. Routing tables are optimized for both regular and irregular topologies, with compact representations (e.g., a start address and grid shape) for mesh arrays (Feng et al., 13 Jun 2025).
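
A minimal sketch of how such a per-VM table could translate virtual core IDs into physical mesh positions is shown below, assuming a rectangular sub-mesh described compactly by a start coordinate and grid shape. The class and field names are hypothetical, not vNPU's actual table layout.

    # Hedged sketch of per-VM route virtualization for a rectangular
    # sub-mesh, using the compact (start, shape) entry format described
    # above. Names and layout are illustrative assumptions.

    class VRoutingTable:
        def __init__(self, phys_start, grid_shape, mesh_width):
            self.phys_start = phys_start    # physical (x, y) of the VM's origin core
            self.grid_shape = grid_shape    # (cols, rows) of the virtual core grid
            self.mesh_width = mesh_width    # width of the physical mesh

        def translate(self, vcore_id):
            """Map a virtual core ID to a physical core ID."""
            cols, rows = self.grid_shape
            if not (0 <= vcore_id < cols * rows):
                raise ValueError("virtual core ID outside the VM's grid")
            vx, vy = vcore_id % cols, vcore_id // cols
            px = self.phys_start[0] + vx
            py = self.phys_start[1] + vy
            return py * self.mesh_width + px

    # Example: a 2x2 virtual grid placed at physical (4, 1) on an 8-wide mesh.
    table = VRoutingTable(phys_start=(4, 1), grid_shape=(2, 2), mesh_width=8)
    print([table.translate(v) for v in range(4)])   # -> [12, 13, 20, 21]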

2. Memory Virtualization and Bandwidth Optimization

Because on-core SRAM capacity and off-chip HBM/DRAM bandwidth are frequent bottlenecks in deep learning inference and training, memory must be virtualized and managed efficiently. Inter-core NPUs typically load and store entire tensors or weight matrices via burst DMA operations, which makes page-level memory virtualization (as used in CPUs and GPUs) unsuitable because of frequent translation stalls.

vNPU addresses this with chunk-granularity memory virtualization (vChunk), introducing a Range Translation Table (RTT) in which each entry records a (virtual base, physical base, size) tuple. Workloads typically access weights and activations in a monotonic, bulk fashion, which permits low-overhead translation. By keeping metadata (the RTTs and pointers such as RTT_CUR/last_v) in dedicated on-chip SRAM, the system minimizes latency and stalls, especially for iterative and regular ML workloads (Feng et al., 13 Jun 2025).
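
The following sketch illustrates chunk-granularity translation with an RTT and a cursor in the spirit of RTT_CUR, under the assumption of mostly monotonic accesses; the entry layout and names are illustrative rather than vNPU's exact design.

    # Minimal sketch of chunk-granularity address translation with a
    # Range Translation Table (RTT). The cursor exploits the mostly
    # monotonic, bulk access pattern described above.

    class RangeTranslationTable:
        def __init__(self, entries):
            # Each entry: (virtual_base, physical_base, size), sorted by virtual_base.
            self.entries = sorted(entries)
            self.cur = 0                    # analogous to an RTT_CUR pointer

        def translate(self, vaddr):
            # Fast path: the last-used entry usually still covers the access.
            vbase, pbase, size = self.entries[self.cur]
            if vbase <= vaddr < vbase + size:
                return pbase + (vaddr - vbase)
            # Slow path: linear scan (a real design could binary-search).
            for i, (vbase, pbase, size) in enumerate(self.entries):
                if vbase <= vaddr < vbase + size:
                    self.cur = i
                    return pbase + (vaddr - vbase)
            raise KeyError("address not mapped to any chunk")

    # Example: two 1 MiB chunks accessed in a streaming pattern.
    rtt = RangeTranslationTable([(0x0, 0x8000_0000, 1 << 20),
                                 (1 << 20, 0x9000_0000, 1 << 20)])
    print(hex(rtt.translate(0x10)), hex(rtt.translate((1 << 20) + 0x10)))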

This approach maintains high memory bandwidth by:

  • Avoiding small, frequent TLB misses.
  • Prefetching and loading only required tensor segments.
  • Supporting large, contiguous weight transfers matching burst DMA windows.

3. Topology-Aware Mapping and Resource Utilization

Inter-core NPUs face a topological lock-in problem: mismatches between the requested virtual subtopology and the subtopologies actually available can fragment the physical resources, leading to underutilization. The problem is acute in cloud-scale AI serving, where diverse tenants have varying requirements.

vNPU introduces a best-effort topology mapping algorithm that searches over all subtopologies using minimum edit distance (measured in node/edge insertions, deletions, and substitutions) to find the closest physical match to each VM's request. Even on fragmented or partially occupied hardware, new virtual NPUs can therefore be mapped with minimal performance penalty and high resource utilization (Feng et al., 13 Jun 2025). Unlike rigid partitioning schemes such as NVIDIA's Multi-Instance GPU (MIG), this method minimizes idle hardware, allows flexible VM sizes, and reduces internal fragmentation.
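
The sketch below gives a deliberately simplified flavor of best-effort placement: it scores candidate positions of a requested rectangular grid by the number of requested cores that are already occupied, a crude stand-in for the full node/edge edit distance described above. The free-core set and cost function are assumptions for illustration.

    # Simplified sketch of best-effort topology mapping on a 2D mesh.
    from itertools import product

    def best_effort_place(free, mesh_w, mesh_h, req_cols, req_rows):
        """Return (origin, cost) of the placement covering the most free cores."""
        best = (None, req_cols * req_rows + 1)
        for ox, oy in product(range(mesh_w - req_cols + 1),
                              range(mesh_h - req_rows + 1)):
            # Cost = requested cores that land on occupied cores (node substitutions).
            cost = sum(1 for dx, dy in product(range(req_cols), range(req_rows))
                       if (ox + dx, oy + dy) not in free)
            if cost < best[1]:
                best = ((ox, oy), cost)
        return best

    # Example: a 4x4 mesh with one column already allocated; request a 2x2 grid.
    free_cores = {(x, y) for x, y in product(range(4), range(4)) if x != 0}
    print(best_effort_place(free_cores, 4, 4, 2, 2))   # -> ((1, 0), 0)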

Mapping Strategy     | Resource Utilization | Flexibility | Avoids Fragmentation
Static/naive         | Low–Medium           | Poor        | No
MIG-style partition  | Medium               | Moderate    | Partially
vNPU best-effort     | High                 | High        | Yes

4. System Prototype and Large-Scale Simulation

Prototype implementations of inter-core NPU virtualization have been realized both in reconfigurable logic and detailed simulation:

  • A Chipyard+FireSim-based FPGA prototype integrates the vRouter/vChunk logic with Gemmini NPU clusters and RISC-V CPUs. Meta-tables and translation logic are implemented in on-chip SRAM, and meta-table updates are reserved to the hypervisor. Although small-scale, this validates correctness and low overhead (<2%) in real RTL.
  • Large-scale simulation using DCRA extends the scope to arrays of 36–48 cores with hierarchical SRAM pools, enabling system-level evaluation of models such as GPT-2 and ResNet. These simulations demonstrate vNPU's scaling properties and support resource-provisioning studies beyond the limits of hardware emulation (Feng et al., 13 Jun 2025).

Empirically, vNPU achieves:

  • Up to 2.29x speedup over a UVM-style baseline on transformer workloads.
  • 1.92x speedup versus MIG-style partitioning on large GPT-2 workloads.
  • Average end-to-end performance overheads from virtualization remain under 1%.

5. Impact on Multi-tenant AI Serving and Model Scalability

Topology-aware virtualization enables:

  • Fine-grained, secure multi-tenancy for cloud and edge AI platforms, ensuring strict isolation and fairness.
  • High utilization in dynamic, heterogeneous AI serving environments (e.g., LLMs, CNNs, GNNs) where workload shapes and sizes vary over time.
  • Optimized placement of virtual NPU partitions, directly impacting tail latency, throughput, and system-wide TCO.

For transformer workloads (GPT-2/BERT), which stress both compute and inter-core bandwidth, vNPU's mapping and routing minimize path lengths, reducing communication delays across attention and MLP layers. For CNNs and classical vision models, the same mechanisms avoid hardware idling as problem sizes scale.
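
As a toy illustration of why compact placements help, the snippet below compares the total Manhattan hop count of an all-to-all exchange (a rough proxy for attention/MLP traffic) between a compact 2x2 placement and a scattered one; both placements are hypothetical.

    # Total Manhattan hop count for an all-to-all exchange between cores.
    from itertools import combinations

    def total_hops(cores):
        """Sum of XY-routing hop counts over all core pairs."""
        return sum(abs(ax - bx) + abs(ay - by)
                   for (ax, ay), (bx, by) in combinations(cores, 2))

    compact   = [(x, y) for x in range(2) for y in range(2)]   # 2x2 block
    scattered = [(0, 0), (3, 0), (0, 3), (3, 3)]               # fragmented corners
    print(total_hops(compact), total_hops(scattered))          # 8 vs. 24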

6. Future Directions and Generalization

Several avenues remain for advancing topology-aware, inter-core connected NPU design:

  • Temporal/spatial multiplexing: Combining time-sharing and spatial allocation to support overprovisioned and bursty workloads.
  • Hybrid core types: Allocating matrix- or vector-optimized virtual cores as model characteristics demand.
  • KV-cache and activation state management: For LLMs, supporting dynamic offload and migration of key-value caches between virtual instances.
  • Graph and GNN optimization: Tailoring memory virtualization and routing for irregular access patterns and dynamic computation graphs.
  • Hardware–software co-design: Exposing APIs for user-driven virtual NPU shape requests through modern ML frameworks.

As AI hardware moves rapidly toward ever-denser implementations of inter-core connected NPUs, the principles of vNPU (route virtualization, chunk-grained memory mapping, and topology-aware allocation) establish a foundation for scalable, efficient, and flexible AI accelerator deployment in both research and production clouds.


Summary Table: Key Features and Outcomes of Topology-aware Virtualization

Technique                       | Functionality                            | Performance Outcome
Route virtualization (vRouter)  | Instruction/data redirection, isolation  | 1–2% overhead; full VM isolation
Memory virtualization (vChunk)  | Burst-optimized chunk mapping            | <4.3% overhead; high BW
Topology mapping                | Optimal subgraph allocation              | Up to 2x speedup vs. baselines
FPGA & Sim Proto.               | Real-world validation                    | Matches analytical predictions

These mechanisms form an architecture-agnostic foundation for modern and future inter-core connected NPU systems.