NullHop Architecture
- NullHop is a CNN accelerator architecture that leverages activation sparsity and a custom compressed format to minimize data movement and power consumption.
- It features a flexible pipeline with stages for input decoding, pixel allocation, MAC operations, and on-the-fly pooling/ReLU to sustain high compute efficiency.
- The design supports kernel sizes from 1×1 to 7×7 and up to 128 output feature maps per pass, delivering throughput and power efficiency that outperform traditional GPU solutions.
NullHop is a convolutional neural network (CNN) accelerator architecture that leverages activation sparsity to optimize the compute and memory demands of state-of-the-art visual processing networks. Designed to address the inefficiencies of traditional hardware such as GPUs in terms of power consumption and compute utilization, NullHop implements a flexible, high-throughput pipeline that minimizes off-chip data movement, sustains high compute efficiency across a wide range of kernel sizes, and exploits sparsity via a custom-encoded compressed format. The system supports up to 128 input and 128 output feature maps per layer in a single pass, is compatible with kernel sizes from 1×1 to 7×7, and achieves power efficiencies far exceeding typical GPU or embedded solutions (Aimar et al., 2017).
1. Architectural Overview and Dataflow
NullHop accepts compressed feature maps from off-chip DRAM via DMA, processes them through a pipeline comprising input decoding, pixel allocation, sparse matrix controllers, a MAC array, pooling/ReLU/encoding, and outputs the result via DMA. The principal components are:
- Input Decoder (IDP): Stores up to 512 KB of compressed inputs in SRAM and uses "Input Tracker" row pointers and "IDP Manager" FSMs to read only nonzero activations per vertical stripe, implementing cycle-level zero-skipping.
- Pixel Allocator: Assigns each nonzero pixel to one of the eight sparse matrix controllers based on its feature-map index.
- Sparse Matrix Controllers: Feed clusters of MAC units using associated kernel memory banks.
- MAC Array: Consists of 128 MAC units partitioned among eight controllers, supporting full parallelism for up to 128 convolutions.
- Pooling/ReLU/Encoder (PRE): Holds double-row partial sums, performs on-the-fly 2×2 max-pooling and ReLU, and applies output compression.
- Output Encoder: Re-compresses feature maps and streams them to external DRAM.
The dataflow facilitates sustained, near-peak utilization of compute resources, utilizing compressed input and output formats throughout all on-chip stages (Aimar et al., 2017).
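The allocation step described above can be illustrated with a minimal sketch. It assumes that "based on feature-map index" means a simple modulo over the eight controllers; the function name `allocate`, the tuple layout, and the modulo rule are hypothetical and chosen only for illustration, not taken from the NullHop design.

```python
from collections import defaultdict

NUM_MACS = 128          # MAC units, as stated in the text
NUM_CONTROLLERS = 8     # sparse matrix controllers, as stated in the text
MACS_PER_CONTROLLER = NUM_MACS // NUM_CONTROLLERS  # 16 MACs per cluster


def allocate(pixels):
    """Route nonzero pixels to controller queues.

    pixels: iterable of (feature_map_index, row, col, value).
    The modulo mapping is an assumption made for this sketch.
    """
    queues = defaultdict(list)
    for fm, row, col, value in pixels:
        if value == 0:
            continue  # zero activations never reach a controller
        queues[fm % NUM_CONTROLLERS].append((fm, row, col, value))
    return queues


decoded = [(0, 0, 0, 5), (9, 0, 1, 3), (8, 1, 1, 0), (16, 2, 2, 1)]
queues = allocate(decoded)
for controller in sorted(queues):
    print(controller, queues[controller])
```

With this mapping, feature maps 0 and 16 share controller 0, while the zero-valued pixel from feature map 8 is dropped before allocation.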
2. Sparse Feature-Map Encoding and Compression
NullHop employs a two-stage encoding: the Sparsity Map (SM) and Non-Zero Value List (NZVL):
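- Sparsity Map (SM): A binary mask (1 bit per activation) encodes the location of nonzero values. For the $i$-th activation $a_i$:
$$\mathrm{SM}_i = \begin{cases} 1 & \text{if } a_i \neq 0 \\ 0 & \text{otherwise} \end{cases}$$
Segments of 16 bits are packed and streamed row-wise.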
- NZVL: Sequential list of nonzero activation values, matching SM order. Each SM word is followed by as many NZVL entries as bits set to 1 in that word.
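Compression is characterized by the sparsity ratio $s$ (the fraction of zero-valued activations) and, for activation width $b$ bits and $N$ activations, by the dense and sparse sizes:
$$S_{\text{dense}} = N b, \qquad S_{\text{sparse}} = N + (1 - s)\, N b$$
The compression ratio is
$$\mathrm{CR} = \frac{S_{\text{dense}}}{S_{\text{sparse}}} = \frac{b}{1 + (1 - s)\, b}$$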
SM+NZVL surpasses run-length and Huffman schemes in decode simplicity and practical compression (Aimar et al., 2017).
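The encoding described in this section can be sketched in a few lines. The example below is a software illustration only, under stated assumptions: bit `i` of each 16-bit SM word marks position `i` of the segment (LSB-first ordering is assumed, not taken from the source), the default value width of 16 bits is a placeholder parameter, and the function names `encode`, `decode`, and `compressed_bits` are hypothetical.

```python
WORD_BITS = 16  # SM segment width stated in the text


def encode(activations):
    """Return a list of (sm_word, nonzero_values) pairs, one per 16-activation segment."""
    stream = []
    for start in range(0, len(activations), WORD_BITS):
        segment = activations[start:start + WORD_BITS]
        sm_word = 0
        nonzero = []
        for i, value in enumerate(segment):
            if value != 0:
                sm_word |= 1 << i      # mark position i as nonzero
                nonzero.append(value)  # NZVL keeps values in SM order
        stream.append((sm_word, nonzero))
    return stream


def decode(stream, length):
    """Rebuild the dense activation list from (sm_word, nonzero_values) pairs."""
    dense = []
    for sm_word, nonzero in stream:
        values = iter(nonzero)
        for i in range(WORD_BITS):
            dense.append(next(values) if (sm_word >> i) & 1 else 0)
    return dense[:length]  # drop padding from a partial final segment


def compressed_bits(stream, value_bits=16):
    """Total size: one SM word per segment plus value_bits per nonzero value."""
    return sum(WORD_BITS + len(nonzero) * value_bits for _, nonzero in stream)


acts = [0, 3, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 1, 0]
stream = encode(acts)
dense_bits = len(acts) * 16
print(stream)
print(dense_bits / compressed_bits(stream))
```

Decoding needs only a population count per SM word to know how many NZVL entries follow, which is the decode simplicity the text refers to. Note that a partial final segment still costs a full SM word in this sketch.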
3. Mapping Convolution Operations and Parallelization
NullHop supports convolution kernels of any size from 1×1 to 7×7 and up to 128 output maps per pass. Mapping is as follows:
- Single-Pass (at most 128 output maps): Each incoming nonzero input pixel is broadcast in parallel to the kernel memory banks, where weights are fetched and convolved via each controller's assigned MAC cluster. Row-wise accumulation enables on-the-fly pooling and column-wise output shifting.
- Multi-Pass (more than 128 output maps): Output maps are divided into passes, each computing up to 128 outputs.
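- Operation Rate: With $N_{\text{MAC}}$ total MAC units and average activation sparsity $s$ (the fraction of zero-valued activations), the effective dense-equivalent operation rate per clock cycle is
$$R_{\text{eff}} = \frac{N_{\text{MAC}}}{1 - s}$$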
Measured MAC utilization exceeds 98%, with the IDP and controllers dynamically balancing load by allocating decoded pixels round-robin or by hash, and providing back-pressure to prevent buffer overflow in case of I/O stalls.
This pipeline ensures that, beyond kernel-loading phases, the compute substrate remains fully utilized regardless of kernel geometry or activation density (Aimar et al., 2017).
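The input-driven mapping above can be approximated in software. The following is a minimal sketch, not the hardware algorithm: it processes a single input map with a "valid" (no padding) cross-correlation, skips zero pixels, and counts the multiply-accumulates actually performed. The function names `passes_needed` and `sparse_conv2d` are hypothetical.

```python
import math


def passes_needed(num_output_maps, maps_per_pass=128):
    """Number of passes when each pass computes up to 128 output maps."""
    return math.ceil(num_output_maps / maps_per_pass)


def sparse_conv2d(inp, kernels):
    """Scatter each nonzero input pixel into every output it contributes to.

    inp: H x W list of lists (one input feature map).
    kernels: list of K kernels, each kh x kw.
    Returns (outputs, mac_count), with outputs shaped K x (H-kh+1) x (W-kw+1).
    """
    height, width = len(inp), len(inp[0])
    kh, kw = len(kernels[0]), len(kernels[0][0])
    oh, ow = height - kh + 1, width - kw + 1
    out = [[[0] * ow for _ in range(oh)] for _ in kernels]
    macs = 0
    for y in range(height):
        for x in range(width):
            pixel = inp[y][x]
            if pixel == 0:
                continue  # zero-skipping: no work is issued for this pixel
            for k, kernel in enumerate(kernels):  # "broadcast" to all kernels
                for dy in range(kh):
                    oy = y - dy
                    if not 0 <= oy < oh:
                        continue
                    for dx in range(kw):
                        ox = x - dx
                        if 0 <= ox < ow:
                            out[k][oy][ox] += pixel * kernel[dy][dx]
                            macs += 1
    return out, macs


inp = [[0, 2, 0],
       [0, 0, 0],
       [1, 0, 0]]
kernels = [[[1, 2],
            [3, 4]]]
out, macs = sparse_conv2d(inp, kernels)
dense_macs = len(kernels) * 2 * 2 * 2 * 2  # K * oh * ow * kh * kw
print(out, macs, dense_macs)
```

In this toy case only 3 of the 16 dense multiply-accumulates are executed, which is the effect the operation-rate discussion describes: work scales with the number of nonzero activations rather than with the dense input size.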
4. Memory Hierarchy and Bandwidth Efficiency
NullHop's on-chip memory hierarchy is tailored for compressed data:
- SRAM (On-Chip):
- IDP SRAM: 512 KB for compressed feature maps and row pointers.
- Kernel Memory: 576 KB (one bank per controller), sufficient for up to 128 kernels × 3 weights.
- PRE buffer: 4 entries for double-row pooling/max-reduction.
- DRAM (Off-Chip):
- FPGA: DDR3, accessed via AXI4-Stream DMA.
- ASIC: Projected for LPDDR3/4.
Bandwidth reduction is substantial: for VGG16, NullHop achieves a measured per-frame I/O of 42 MB compared to 113 MB for the Eyeriss accelerator operating in batch-3 mode. Overall bandwidth savings scale with the input/output compression ratios, since only nonzero activations and their positions (encoded in the sparsity map) must be transferred (Aimar et al., 2017).
5. Hardware Implementation and Performance Metrics
NullHop was implemented on a Xilinx Zynq 7100 FPGA, with an ASIC implementation synthesized in 28 nm:
- FPGA Implementation:
- Clock: 60 MHz (AXI-limited).
- Resource utilization: 229k LUTs (83%), 107k FFs (19%), 386 BRAMs (51%), 128 DSP48s (6%).
- Core area (ASIC equiv.): 6.3 mm² (8.1 mm² including pads).
- Dynamic power: 0.78 W (FPGA), projected 155 mW (ASIC at 500 MHz/1 V).
- Power Efficiency (Core Only):
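$$\frac{471\ \text{GOp/s}}{0.155\ \text{W}} \approx 3\ \text{TOp/s/W} \quad \text{(ASIC projection)}$$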
- Throughput:
- Theoretical peak (128 MACs at 60 MHz): 128 × 60 MHz = 7.68 GOp/s.
- Observed (VGG16, sparse skipping): 17.2 GOp/s.
- Efficiency: ratio of observed throughput to the theoretical peak.
These figures reflect the impact of zero-skipping, sparse decoding, and resource allocation, resulting in observed throughput exceeding the nominal peak for purely dense workloads (Aimar et al., 2017).
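The FPGA throughput arithmetic can be checked directly. The sketch below uses only figures quoted in this section (128 MACs, 60 MHz, 17.2 GOp/s observed on VGG16). The `ops_per_mac` parameter and the "implied sparsity" estimate are assumptions for illustration: the estimate treats the entire gain over the dense peak as coming from skipped zeros, which is an idealization.

```python
NUM_MACS = 128
FPGA_CLOCK_HZ = 60e6
OBSERVED_VGG16_GOPS = 17.2


def peak_gops(num_macs, clock_hz, ops_per_mac=1):
    """Dense peak in GOp/s; ops_per_mac=1 reproduces the 7.68 GOp/s figure."""
    return num_macs * clock_hz * ops_per_mac / 1e9


peak = peak_gops(NUM_MACS, FPGA_CLOCK_HZ)
speedup = OBSERVED_VGG16_GOPS / peak

# Idealized assumption: speedup = 1 / (1 - s), so s = 1 - 1 / speedup.
implied_sparsity = 1.0 - 1.0 / speedup

print(f"peak = {peak:.2f} GOp/s")
print(f"observed / peak = {speedup:.2f}")
print(f"implied average sparsity = {implied_sparsity:.2f}")
```

Under these assumptions the observed rate is a little over twice the dense peak, which would correspond to somewhat more than half of the activations being zero on average.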
6. Benchmarking on Standard and Custom CNNs
NullHop's performance was evaluated across large-scale (VGG16, VGG19), custom (Giga1Net), and small (RoshamboNet, Face Detector) CNNs:
| Network | Device/Mode | Throughput | FPS | Efficiency |
|---|---|---|---|---|
| VGG19 | ASIC RTL (500 MHz) | 471 GOp/s | - | 368% |
| VGG16 | ASIC RTL (500 MHz) | 421 GOp/s | - | 329% |
| VGG19 | FPGA (60 MHz) | 16.1 GOp/s | - | 143% |
| Giga1Net | ASIC RTL (500 MHz) | 250 GOp/s | - | 195% |
| RoshamboNet | FPGA (60 MHz) | 3.28 GOp/s | 182 | - |
| Face Detector | FPGA (60 MHz) | 0.61 GOp/s | 304 | - |
NullHop demonstrates speedups and efficiency improvements over classical GPU baselines: the ASIC implementation achieves ~471 GOp/s at 0.155 W (about 3 TOp/s/W), over 500× the power efficiency of the NVIDIA Tegra X1 (6 GOp/s/W), while the FPGA achieves ~17 GOp/s at 0.78 W (22 GOp/s/W), about 3 to 4× higher than typical embedded GPUs (Aimar et al., 2017).
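The ratios quoted above follow from the reported throughput and power figures, as the short check below shows. One point is an inference rather than something the text states: the ASIC efficiency column in the table is reproduced if the reference peak is 128 GOp/s, which would correspond to counting two operations per MAC at 500 MHz. The constant and function names are hypothetical.

```python
# Assumed reference peak: 2 ops per MAC x 128 MACs x 500 MHz = 128 GOp/s.
ASIC_PEAK_GOPS = 2 * 128 * 500e6 / 1e9

asic_throughput_gops = {"VGG19": 471.0, "VGG16": 421.0, "Giga1Net": 250.0}
efficiency_pct = {
    name: round(100 * gops / ASIC_PEAK_GOPS)
    for name, gops in asic_throughput_gops.items()
}


def gops_per_watt(gops, watts):
    """Power efficiency in GOp/s/W."""
    return gops / watts


TEGRA_X1_GOPS_PER_WATT = 6.0  # baseline figure quoted in the text

asic_eff = gops_per_watt(471.0, 0.155)  # ASIC projection
fpga_eff = gops_per_watt(17.0, 0.78)    # FPGA measurement

print(efficiency_pct)
print(f"ASIC: {asic_eff:.0f} GOp/s/W, {asic_eff / TEGRA_X1_GOPS_PER_WATT:.0f}x Tegra X1")
print(f"FPGA: {fpga_eff:.1f} GOp/s/W, {fpga_eff / TEGRA_X1_GOPS_PER_WATT:.1f}x Tegra X1")
```

The computed values land on the rounded figures in the text: roughly 3 TOp/s/W and just over 500× for the ASIC, and about 22 GOp/s/W and between 3× and 4× for the FPGA.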
7. Significance, Flexibility, and Use Cases
By implementing a zero-skipping pipeline, custom sparse feature-map encoding, and dynamic pixel-to-MAC scheduling, NullHop achieves over 98% MAC utilization across diverse kernel sizes and topologies. Compression factors up to 6× are observed, with throughput up to 471 GOp/s on 128 MACs. The core's flexibility, supporting kernel sizes of 1×1 to 7×7 and up to 1024 feature maps, enables its use across small to large CNNs, permitting real-time embedded applications and energy-constrained platforms. Demonstrated integration with a neuromorphic event camera further highlights adaptability for non-frame-based and streaming inference contexts (Aimar et al., 2017).