NE16 NeCTAr Accelerator SoC
- NE16 is a heterogeneous RISC-V SoC featuring general-purpose cores alongside near-memory compute engines (NMCEs) and sparse-matrix accelerators (SpAccel units) designed to accelerate dense and sparse machine learning inference.
- It employs a unified, cache-coherent memory hierarchy and optimized dataflow strategies, achieving up to 132 GOP/s/W and roughly 100x speedup over pure software implementations.
- Key design trade-offs include limited off-chip bandwidth, fixed INT8→INT16 precision, and constrained L2 cache size, with future updates aiming at mixed-precision support and enhanced scalability.
The NE16 hardware accelerator refers to the NeCTAr (Near-Cache Transformer Accelerator) system-on-chip (SoC), a heterogeneous multicore RISC-V platform fabricated in Intel’s 16 nm process node and architected for efficient dense and sparse machine learning inference. The NE16 system co-locates general-purpose RISC-V cores, tightly coupled sparse-matrix accelerators, and near-memory compute engines capable of high-throughput matrix-vector multiplications, all within a unified, cache-coherent memory and on-chip network hierarchy. The platform achieves measured silicon performance of up to 132 GOP/s/W and is demonstrated executing transformer-based models such as a ReLU-sparsified Llama (“ReLU-Llama”) with significant throughput and efficiency advances relative to software baselines (Schmulbach et al., 18 Mar 2025).
1. SoC Architecture
NE16 integrates four in-order, 5-stage Rocket RV64GC RISC-V cores, each connected through the RoCC custom-instruction interface to a dedicated Sparse Matrix Accelerator. Each core is provisioned with a 16 KB private scratchpad, 16 KB 4-way set-associative L1 instruction and data caches (~2-cycle latency), and full virtual-memory and system-coherency support. Four banks of a shared 256 KB L2 cache sit physically adjacent to four Near-Memory Compute Engines (NMCEs), optimizing spatial locality and aggregate bandwidth.
The on-chip interconnect architecture consists of a unidirectional torus Network-on-Chip (NoC) for cache-coherence operations, supplemented by peripheral and memory crossbars. Off-chip DRAM access is realized via a serialized TileLink link (peak ≈ 100 MB/s) and QSPI PSRAM. Data movement is tightly controlled: near-core sparse engines load indices and values through L2 into their scratchpad under software or RoCC control, while NMCEs operate directly adjacent to L2 cache, pulling data atomically without intermediate transfers.
| Memory Level | Size | Latency | Policy/Notes |
|---|---|---|---|
| L1 (D/I cache) | 16 KB | ~2 cycles | Private per core, write-through |
| Scratchpad | 16 KB | 1 cycle | SW-managed by core/RoCC |
| L2 Shared Cache | 256 KB | 12–15 cyc | 4-bank interleaved, prefetch per bank |
| Off-chip DRAM | — | ~100+ cyc | TileLink serial @ 100 MB/s, QSPI |
2. Accelerator Microarchitecture
NE16 features two principal custom acceleration blocks: the NMCE and the Sparse Matrix Accelerator ("SpAccel").
NMCE: Each of the four NMCEs implements a 64-way INT8×INT8 parallel multiply-accumulate datapath, using an internal reduction tree to produce 64 saturating INT16 outputs per cycle. The pipeline proceeds as: (1) a 64-byte cache-line fetch from L2, (2) 64 MACs in parallel with accumulation, and (3) write-back to memory-mapped output registers. Programming entails writing a 64-element vector to the "v1Reg" register, configuring a base address and stride for the v2 operand stream, and launching up to 32 dot-products per invocation.
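The NMCE invocation described above can be captured as a functional model, a behavioral sketch rather than the RTL; the function name `nmce_invoke` and its argument names are illustrative, not the actual driver API:

```python
import numpy as np

INT16_MIN, INT16_MAX = -(1 << 15), (1 << 15) - 1

def nmce_invoke(v1, mem, base, stride, count):
    """Functional model (not the RTL) of one NMCE launch: v1 is the
    64-element INT8 vector held in "v1Reg"; each of the `count` (<= 32)
    dot-products reads a 64-byte line of INT8 operands starting at
    mem[base + i*stride] and produces one saturating INT16 result."""
    assert len(v1) == 64 and count <= 32
    out = []
    for i in range(count):
        line = mem[base + i * stride : base + i * stride + 64]
        acc = int(np.dot(v1.astype(np.int32), line.astype(np.int32)))
        out.append(max(INT16_MIN, min(INT16_MAX, acc)))  # saturate to INT16
    return np.array(out, dtype=np.int16)
```

The saturating clamp models the fixed INT8→INT16 accumulation path noted later as a design trade-off: a full-magnitude 64-element dot-product (64 × 127 × 127) overflows INT16 and is clipped.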
Sparse Matrix Accelerator: Each SpAccel unit accesses L2 directly, natively supports INT8/INT16 sparse weights and INT8 activations, and accepts both compressed sparse row (CSR) and simple index-value pair streams. Microarchitectural modules include a front-end decoder with a FIFO for index buffering, an out-of-order memory request generator for nonzero fetches, the MAC unit, and, in two of the four units, a Reservation Station supporting out-of-order memory responses. Decompression overhead is approximately 3 cycles per nonzero. The index FIFO (8–16 entries) smooths gap-driven stalls in sparse code.
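The CSR consumption pattern a SpAccel implements can be sketched functionally; this models only the arithmetic semantics (one index fetch, one value fetch, one MAC per nonzero), not the FIFO or Reservation Station timing, and the function name is illustrative:

```python
import numpy as np

def spaccel_csr_matvec(indptr, indices, values, activations):
    """Behavioral model of a SpAccel sparse matrix-vector product:
    weights stored in CSR form (INT8 values plus column indices),
    dense INT8 activations. Only nonzero weights generate memory
    requests and MAC operations."""
    n_rows = len(indptr) - 1
    out = np.zeros(n_rows, dtype=np.int32)
    for r in range(n_rows):
        for k in range(indptr[r], indptr[r + 1]):
            # one index fetch + one value fetch + one MAC per nonzero
            out[r] += int(values[k]) * int(activations[indices[k]])
    return out
```

With ≈50% activation sparsity (as in the ReLU-Llama case study below), roughly half the MACs and their associated fetches are skipped relative to a dense product.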
Dataflow and tiling strategies include partitioning dense matmuls along the N dimension (sub-tile per NMCE), unrolling to match the 64-wide NMCE datapath depth, and mapping sparse activations via SpAccel with explicit control over compressed-load, MAC-accumulate, and output staging.
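The N-dimension partitioning can be sketched as a simple work-assignment routine; the round-robin engine assignment here is an assumption for illustration, not the documented scheduler:

```python
def tile_columns(N, n_engines=4, width=64):
    """Partition the N (output-column) dimension of a dense matmul into
    64-wide sub-tiles matching the NMCE datapath, assigning tiles to
    the four NMCEs round-robin. Returns (engine, col_start, col_end)
    work items; a ragged final tile is narrower than `width`."""
    tiles = []
    for t, start in enumerate(range(0, N, width)):
        tiles.append((t % n_engines, start, min(start + width, N)))
    return tiles
```

Each work item then maps onto one NMCE launch (up to 32 dot-products), with the cores staging operands and collecting the INT16 partial results.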
3. Measured Performance and Efficiency
NE16 achieves the following in measured silicon at 0.85 V, 400 MHz:
- Peak NMCE throughput: 102.4 GOP/s (400 MHz × 4 × 64 MACs)
- Measured MatMul rate: 6.02 GOP/s, limited by the 32-dot-product-per-launch cap and per-launch overhead
- Energy efficiency: 132 GOP/s/W
Relative to a 4-core Rocket software baseline (56.6 MOP/s at 1.24 GOP/s/W), the accelerators deliver ≈100× higher throughput and ≈100× better energy efficiency (132 GOP/s/W).
| Design | Tech | Area | Voltage | fmax | Peak Eff. |
|---|---|---|---|---|---|
| NeCTAr | 16 nm | 4 mm² | 0.55–0.85 V | 400 MHz | 132 GOP/s/W |
| Chen CNC | Intel 4 | 1.92 mm² | 0.6–0.82 V | 1.15 GHz | 285 GOP/s/W |
| Rovinski | 16 nm | 15.25 mm² | 0.6–0.98 V | 1.4 GHz | 93 GOP/s/W |
| Thestral | GF 22FDX | 1 mm² | 0.6–0.9 V | 910 MHz | N/A |
Throughput is governed by T = f_clk × N_NMCE × N_MAC (400 MHz × 4 × 64 = 102.4 GOP/s peak); energy per operation is E_op = P / T, the reciprocal of the measured efficiency, 1/(132 GOP/s/W) ≈ 7.6 pJ/op.
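A quick worked check of the peak-throughput and energy-per-operation figures quoted in this section (counting one MAC as one operation, as the 102.4 GOP/s figure implies):

```python
# Peak throughput: clock rate x engines x MACs per engine.
f_clk = 400e6          # Hz
n_nmce, n_mac = 4, 64  # NMCE count, MACs per NMCE
peak_gops = f_clk * n_nmce * n_mac / 1e9   # GOP/s

# Energy per operation is the reciprocal of the measured efficiency.
energy_pj = 1e12 / 132e9                   # pJ/op at 132 GOP/s/W
```

This reproduces the 102.4 GOP/s peak and ≈7.6 pJ/op implied by the measured 132 GOP/s/W.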
4. Case Study: ReLU-Llama Sparse Transformer Inference
NE16 directly executes inference on a 1.7M-parameter, ReLU-sparsified Llama Transformer (≈50% activation sparsity) by allocating dense product stages to NMCEs and unstructured sparse activation × weight stages to SpAccel. The four Rocket cores orchestrate dataflow: loading compressed activation rows, launching SpAccel MACs, partial sum collection, and invoking NMCE blocks for dense feed-forward layers.
Measured end-to-end performance:
| Mode | infs/s | infs/s/W | Comments |
|---|---|---|---|
| Single-core (SW) | 1.19 | 39.0 | Pure software |
| Quad-core (SW) | 1.25 | 40.0 | Pipelined across 4 cores |
| NE16 (NMCE + SpAccel HW) | 1.28 | 45.4 | HW offload, exploits activation sparsity |
Off-chip DRAM bandwidth (TileLink ~100 MB/s peak, ~60 MB/s utilized) bottlenecks further scaling. There is no measurable loss in inference accuracy versus the dense floating-point baseline. Effective inference latency is ≈0.78 s per example.
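A rough back-of-envelope, under the assumption of INT8 weights streamed once per inference with no on-chip reuse, bounds the off-chip transfer time for the 1.7M-parameter model:

```python
def weight_stream_time(params, bytes_per_param=1, bw_bytes_per_s=60e6):
    """Lower-bound seconds to stream the model's weights once over the
    off-chip link (assumes INT8 weights, utilized ~60 MB/s bandwidth,
    no reuse from the 256 KB L2). Simplified model, not a measurement."""
    return params * bytes_per_param / bw_bytes_per_s

t = weight_stream_time(1.7e6)   # ~0.028 s for the 1.7M-parameter model
```

One such pass takes ≈28 ms, well below the measured 0.78 s per inference, so the link's effect shows up through repeated traffic (activations, re-fetches of layers that exceed L2 capacity) and launch overheads rather than a single weight stream; this is a hedged estimate, not a figure from the source.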
5. Process Technology and Design Trade-Offs
NE16 is realized in Intel’s 16 nm process, offering mature yield, an expedited schedule (15 weeks from tape-out to bring-up), and low nonrecurring engineering (NRE) cost. Peak fmax is capped at 400 MHz, below what state-of-the-art 7 nm or Intel 4 designs achieve, owing to the node's larger device geometry and its correspondingly higher capacitance and leakage.
Scaling dense throughput would require further parallel NMCE/L2 slices and memory fabric widening or the deployment of high-bandwidth memory (HBM). For sparse workloads, enhanced SpAccel count, deeper Reservation Stations, and block-sparse support are strategies identified for future scale-out. The unified software stack can dynamically route compute to either NMCE or SpAccel according to operand density.
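The density-based routing in the software stack can be sketched as follows; the 0.5 threshold and function name are hypothetical, chosen only to illustrate the dispatch decision:

```python
import numpy as np

SPARSITY_THRESHOLD = 0.5  # hypothetical cutoff, tuned per workload

def route_matmul(activations):
    """Route a layer to SpAccel when activation density is low enough
    to amortize its ~3-cycle-per-nonzero decompression overhead,
    otherwise to the dense 64-wide NMCE datapath."""
    density = np.count_nonzero(activations) / activations.size
    return "spaccel" if density < SPARSITY_THRESHOLD else "nmce"
```

In the ReLU-Llama case study, ReLU activations land near 50% sparsity, so the sparse stages fall on the SpAccel side of such a cutoff while dense feed-forward products go to the NMCEs.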
System-level throughput is upper-bounded by external DRAM bandwidth, and the 256 KB L2 limits on-chip residency of large model layers. The NMCE currently supports only INT8→INT16 accumulation; mixed-precision support (e.g., FP16) would be required for broader model coverage. SpAccel’s decompression pipeline (~3 cycles per nonzero) could be further optimized by on-chip index caching and alternative compression formats (e.g., run-length encoding).
6. Limitations and Future Directions
NE16 is fundamentally constrained by (a) off-chip bandwidth, (b) L2 cache size, and (c) mixed-precision arithmetic limitations. Future hardware iterations are expected to address the high-latency, narrow-bandwidth bottleneck of the off-chip TileLink interface with DRAM or HBM integration, extend the NMCE datapath to encompass FP16 or mixed-precision, enlarge the last-level cache, and exploit advanced activation/data compression for sparse models. There remains open space for tighter integration of data movement orchestration, accelerator-side programmability, and transparency of memory hierarchy utilization.
In summary, NE16/NeCTAr demonstrates that a heterogeneous RISC-V SoC, implemented in a mature process within 15 weeks, can efficiently accelerate both dense and sparse transformer inference workloads, scaling to ≈100× higher throughput and efficiency over pure software baselines at >100 GOP/s/W measured energy efficiency in 16 nm (Schmulbach et al., 18 Mar 2025).