NE16 NeCTAr Accelerator SoC
- NE16 is a heterogeneous RISC-V SoC featuring general-purpose cores alongside near-memory compute engines (NMCEs) and sparse-matrix accelerators (SpAccel units) designed to accelerate dense and sparse machine learning inference.
- It employs a unified, cache-coherent memory hierarchy and optimized dataflow strategies, achieving up to 132 GOP/s/W and roughly 100x speedup over pure software implementations.
- Key design trade-offs include limited off-chip bandwidth, fixed INT8→INT16 precision, and constrained L2 cache size, with future updates aiming at mixed-precision support and enhanced scalability.
The NE16 hardware accelerator refers to the NeCTAr (Near-Cache Transformer Accelerator) system-on-chip (SoC), a heterogeneous multicore RISC-V platform fabricated in Intel’s 16 nm process node and architected for efficient dense and sparse machine learning inference. The NE16 system co-locates general-purpose RISC-V cores, tightly coupled sparse-matrix accelerators, and near-memory compute engines capable of high-throughput matrix-vector multiplications, all within a unified, cache-coherent memory and on-chip network hierarchy. The platform achieves measured silicon performance of up to 132 GOP/s/W and is demonstrated executing transformer-based models such as a ReLU-sparsified Llama (“ReLU-Llama”) with significant throughput and efficiency advances relative to software baselines (Schmulbach et al., 18 Mar 2025).
1. SoC Architecture
NE16 integrates four in-order, 5-stage Rocket RV64GC RISC-V cores, each connected through the RoCC custom-instruction interface to a dedicated Sparse Matrix Accelerator. Each core is provisioned with a 16 KB private scratchpad, 16 KB 4-way set-associative L1 instruction and data caches (~2-cycle latency), and full virtual-memory and system-coherency support. Four banks of a shared 256 KB L2 cache sit physically adjacent to four Near-Memory Compute Engines (NMCEs), optimizing spatial locality and aggregate bandwidth.
The on-chip interconnect architecture consists of a unidirectional torus Network-on-Chip (NoC) for cache-coherence operations, supplemented by peripheral and memory crossbars. Off-chip DRAM access is realized via a serialized TileLink link (peak ≈ 100 MB/s) and QSPI PSRAM. Data movement is tightly controlled: near-core sparse engines load indices and values through L2 into their scratchpad under software or RoCC control, while NMCEs operate directly adjacent to L2 cache, pulling data atomically without intermediate transfers.
| Memory Level | Size | Latency | Policy/Notes |
|---|---|---|---|
| L1 (D/I cache) | 16 KB | ~2 cycles | Private per core, write-through |
| Scratchpad | 16 KB | 1 cycle | SW-managed by core/RoCC |
| L2 Shared Cache | 256 KB | 12–15 cyc | 4-bank interleaved, prefetch per bank |
| Off-chip DRAM | — | ~100+ cyc | TileLink serial @ 100 MB/s, QSPI |
2. Accelerator Microarchitecture
NE16 features two principal custom acceleration blocks: the NMCE and the Sparse Matrix Accelerator ("SpAccel").
NMCE: Each of the four NMCEs implements a 64-way INT8×INT8 parallel multiply-accumulate datapath, using an internal reduction tree to produce 64 saturating INT16 outputs per cycle. The pipeline proceeds as: (1) a 64-byte cache-line fetch from L2, (2) 64 MACs in parallel with accumulation, and (3) write-back to memory-mapped output registers. Programming entails writing a 64-element vector to the "v1Reg" register, configuring a base address and stride for the v2 operand stream, and launching up to 32 dot-products per invocation.
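The NMCE invocation described above can be captured as a functional model, a behavioral sketch rather than the RTL; the function name `nmce_invoke` and its argument names are illustrative, not the actual driver API:

```python
import numpy as np

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def nmce_invoke(v1, mem, base, stride, count):
    """Functional model (not the RTL) of one NMCE launch: v1 is the
    64-element INT8 vector held in "v1Reg"; each of the `count` (<= 32)
    dot-products reads a 64-byte line of INT8 operands starting at
    mem[base + i*stride] and produces one saturating INT16 result."""
    assert len(v1) == 64 and count <= 32
    out = []
    for i in range(count):
        line = mem[base + i * stride : base + i * stride + 64]
        acc = int(np.dot(v1.astype(np.int32), line.astype(np.int32)))
        out.append(max(INT16_MIN, min(INT16_MAX, acc)))  # saturate to INT16
    return np.array(out, dtype=np.int16)
```

The saturating clamp models the fixed INT8→INT16 accumulation path noted later as a design trade-off: a full-magnitude 64-element dot-product (64 × 127 × 127) overflows INT16 and is clipped.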
Sparse Matrix Accelerator: Each SpAccel unit accesses L2 directly, natively supports INT8/INT16 sparse weights and INT8 activations, and accepts both compressed sparse row (CSR) and simple index-value pair streams. Microarchitectural modules include a front-end decoder with a FIFO for index buffering, an out-of-order memory request generator for nonzero fetches, the MAC unit, and, in two of the four units, a Reservation Station supporting out-of-order memory responses. Decompression overhead is approximately 3 cycles per nonzero. The index FIFO (8–16 entries) smooths gap-driven stalls in sparse code.
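The CSR consumption pattern a SpAccel implements can be sketched functionally; this models only the arithmetic semantics (one index fetch, one value fetch, one MAC per nonzero), not the FIFO or Reservation Station timing, and the function name is illustrative:

```python
import numpy as np

def spaccel_csr_matvec(indptr, indices, values, activations):
    """Behavioral model of a SpAccel sparse matrix-vector product:
    weights stored in CSR form (INT8 values plus column indices),
    dense INT8 activations. Only nonzero weights generate memory
    requests and MAC operations."""
    n_rows = len(indptr) - 1
    out = np.zeros(n_rows, dtype=np.int32)
    for r in range(n_rows):
        for k in range(indptr[r], indptr[r + 1]):
            # one index fetch + one value fetch + one MAC per nonzero
            out[r] += int(values[k]) * int(activations[indices[k]])
    return out
```

With ≈50% activation sparsity (as in the ReLU-Llama case study below), roughly half the MACs and their associated fetches are skipped relative to a dense product.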
Dataflow and tiling strategies include partitioning dense matmuls along the N dimension (sub-tile per NMCE), unrolling to match the 64-wide NMCE datapath depth, and mapping sparse activations via SpAccel with explicit control over compressed-load, MAC-accumulate, and output staging.
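The N-dimension partitioning can be sketched as a simple work-assignment routine; the round-robin engine assignment here is an assumption for illustration, not the documented scheduler:

```python
def tile_columns(N, n_engines=4, width=64):
    """Partition the N (output-column) dimension of a dense matmul into
    64-wide sub-tiles matching the NMCE datapath, assigning tiles to
    the four NMCEs round-robin. Returns (engine, col_start, col_end)
    work items; a ragged final tile is narrower than `width`."""
    tiles = []
    for t, start in enumerate(range(0, N, width)):
        tiles.append((t % n_engines, start, min(start + width, N)))
    return tiles
```

Each work item then maps onto one NMCE launch (up to 32 dot-products), with the cores staging operands and collecting the INT16 partial results.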
3. Measured Performance and Efficiency
NE16 achieves the following in measured silicon at 0.85 V, 400 MHz:
- Peak NMCE throughput: 102.4 GOP/s (400 MHz × 4 × 64 MACs)
- Measured MatMul rate: 6.02 GOP/s, limited by the 32-dot-product-per-launch cap and per-launch overhead
- Energy efficiency: 132 GOP/s/W
Relative to a 4-core Rocket software baseline (56.6 MOP/s at 1.24 GOP/s/W), the accelerators deliver ≈100× higher throughput and ≈100× better energy efficiency (132 GOP/s/W).
| Design | Tech | Area | Voltage | fmax | Peak Eff. |
|---|---|---|---|---|---|
| NeCTAr | 16 nm | 4 mm² | 0.55–0.85 V | 400 MHz | 132 GOP/s/W |
| Chen CNC | Intel 4 | 1.92 mm² | 0.6–0.82 V | 1.15 GHz | 285 GOP/s/W |
| Rovinski | 16 nm | 15.25 mm² | 0.6–0.98 V | 1.4 GHz | 93 GOP/s/W |
| Thestral | GF 22FDX | 1 mm² | 0.6–0.9 V | 910 MHz | N/A |
Throughput is governed by T = f_clk × N_NMCE × N_MAC (400 MHz × 4 × 64 = 102.4 GOP/s peak); energy per operation is E_op = P / T, the reciprocal of the measured efficiency, 1/(132 GOP/s/W) ≈ 7.6 pJ/op.
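A quick worked check of the peak-throughput and energy-per-operation figures quoted in this section (counting one MAC as one operation, as the 102.4 GOP/s figure implies):

```python
# Peak throughput: clock rate x engines x MACs per engine.
f_clk = 400e6          # Hz
n_nmce, n_mac = 4, 64  # NMCE count, MACs per NMCE
peak_gops = f_clk * n_nmce * n_mac / 1e9   # GOP/s

# Energy per operation is the reciprocal of the measured efficiency.
energy_pj = 1e12 / 132e9                   # pJ/op at 132 GOP/s/W
```

This reproduces the 102.4 GOP/s peak and ≈7.6 pJ/op implied by the measured 132 GOP/s/W.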
4. Case Study: ReLU-Llama Sparse Transformer Inference
NE16 directly executes inference on a 1.7M-parameter, ReLU-sparsified Llama Transformer (≈50% activation sparsity) by allocating dense product stages to NMCEs and unstructured sparse activation × weight stages to SpAccel. The four Rocket cores orchestrate dataflow: loading compressed activation rows, launching SpAccel MACs, partial sum collection, and invoking NMCE blocks for dense feed-forward layers.
Measured end-to-end performance:
| Mode | infs/s | infs/s/W | Comments |
|---|---|---|---|
| Single-core (SW) | 1.19 | 39.0 | Pure software |
| Quad-core (SW) | 1.25 | 40.0 | Pipelined across 4 cores |
| NE16 (NMCE + SpAccel HW) | 1.28 | 45.4 | HW offload, exploits activation sparsity |
Off-chip DRAM bandwidth (TileLink ~100 MB/s peak, ~60 MB/s utilized) bottlenecks further scaling. There is no measurable loss in inference accuracy versus the dense floating-point baseline. Effective inference latency is ≈0.78 s per example.
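A rough back-of-envelope, under the assumption of INT8 weights streamed once per inference with no on-chip reuse, bounds the off-chip transfer time for the 1.7M-parameter model:

```python
def weight_stream_time(params, bytes_per_param=1, bw_bytes_per_s=60e6):
    """Lower-bound seconds to stream the model's weights once over the
    off-chip link (assumes INT8 weights, utilized ~60 MB/s bandwidth,
    no reuse from the 256 KB L2). Simplified model, not a measurement."""
    return params * bytes_per_param / bw_bytes_per_s

t = weight_stream_time(1.7e6)   # ~0.028 s for the 1.7M-parameter model
```

One such pass takes ≈28 ms, well below the measured 0.78 s per inference, so the link's effect shows up through repeated traffic (activations, re-fetches of layers that exceed L2 capacity) and launch overheads rather than a single weight stream; this is a hedged estimate, not a figure from the source.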
5. Process Technology and Design Trade-Offs
NE16 is realized in Intel’s 16 nm process, offering mature yield, an expedited schedule (15 weeks from tape-out to bring-up), and low nonrecurring engineering (NRE) cost. Peak fmax is capped at 400 MHz, below what state-of-the-art 7 nm or Intel 4 designs achieve, owing to the node's larger device geometry and its correspondingly higher capacitance and leakage.
Scaling dense throughput would require further parallel NMCE/L2 slices and memory fabric widening or the deployment of high-bandwidth memory (HBM). For sparse workloads, enhanced SpAccel count, deeper Reservation Stations, and block-sparse support are strategies identified for future scale-out. The unified software stack can dynamically route compute to either NMCE or SpAccel according to operand density.
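The density-based routing in the software stack can be sketched as follows; the 0.5 threshold and function name are hypothetical, chosen only to illustrate the dispatch decision:

```python
import numpy as np

SPARSITY_THRESHOLD = 0.5  # hypothetical cutoff, tuned per workload

def route_matmul(activations):
    """Route a layer to SpAccel when activation density is low enough
    to amortize its ~3-cycle-per-nonzero decompression overhead,
    otherwise to the dense 64-wide NMCE datapath."""
    density = np.count_nonzero(activations) / activations.size
    return "spaccel" if density < SPARSITY_THRESHOLD else "nmce"
```

In the ReLU-Llama case study, ReLU activations land near 50% sparsity, so the sparse stages fall on the SpAccel side of such a cutoff while dense feed-forward products go to the NMCEs.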
System-level throughput is upper-bounded by external DRAM bandwidth, and the 256 KB L2 limits on-chip residency of large model layers. The NMCE currently supports only INT8→INT16 accumulation; mixed-precision support (e.g., FP16) would be required for broader model coverage. SpAccel’s decompression pipeline (~3 cycles per nonzero) could be further optimized by on-chip index caching and alternative compression formats (e.g., run-length encoding).
6. Limitations and Future Directions
NE16 is fundamentally constrained by (a) off-chip bandwidth, (b) L2 cache size, and (c) mixed-precision arithmetic limitations. Future hardware iterations are expected to address the high-latency, narrow-bandwidth bottleneck of the off-chip TileLink interface with DRAM or HBM integration, extend the NMCE datapath to encompass FP16 or mixed-precision, enlarge the last-level cache, and exploit advanced activation/data compression for sparse models. There remains open space for tighter integration of data movement orchestration, accelerator-side programmability, and transparency of memory hierarchy utilization.
In summary, NE16/NeCTAr demonstrates that a heterogeneous RISC-V SoC, implemented in a mature process within 15 weeks, can efficiently accelerate both dense and sparse transformer inference workloads, scaling to ≈100× higher throughput and efficiency over pure software baselines at >100 GOP/s/W measured energy efficiency in 16 nm (Schmulbach et al., 18 Mar 2025).