
NullHop Architecture

Updated 24 May 2026
  • NullHop is a CNN accelerator architecture that leverages activation sparsity and a custom compressed format to minimize data movement and power consumption.
  • It features a flexible pipeline with stages for input decoding, pixel allocation, MAC operations, and on-the-fly pooling/ReLU to sustain high compute efficiency.
  • The design supports kernel sizes from 1×1 to 7×7 and up to 128 output feature maps per pass, delivering throughput and power efficiency that exceed those of traditional GPU solutions.

NullHop is a convolutional neural network (CNN) accelerator architecture that leverages activation sparsity to optimize the compute and memory demands of state-of-the-art visual processing networks. Designed to address the inefficiencies of traditional hardware such as GPUs in terms of power consumption and compute utilization, NullHop implements a flexible, high-throughput pipeline that minimizes off-chip data movement, sustains high compute efficiency across a wide range of kernel sizes, and exploits sparsity via a custom-encoded compressed format. The system supports up to 128 input and 128 output feature maps per layer in a single pass, is compatible with kernel sizes from 1×1 to 7×7, and achieves power efficiencies far exceeding typical GPU or embedded solutions (Aimar et al., 2017).

1. Architectural Overview and Dataflow

NullHop receives compressed feature maps from off-chip DRAM via DMA, processes them through a pipeline of input decoding, pixel allocation, sparse matrix controllers, a MAC array, and pooling/ReLU/encoding, and writes the results back via DMA. The principal components are:

  • Input Decoder (IDP): Stores up to 512 KB of compressed inputs in SRAM and uses “Input Tracker” row pointers and $k_h+1$ “IDP Manager” FSMs to read only nonzero activations per vertical stripe, implementing cycle-level zero-skipping.
  • Pixel Allocator: Assigns nonzero pixels to one of $C$ sparse matrix controllers based on feature-map index.
  • Sparse Matrix Controllers: Feed clusters of $M_c$ MAC units using associated kernel memory banks.
  • MAC Array: Consists of 128 MAC units partitioned among eight controllers, supporting full parallelism for up to 128 convolutions.
  • Pooling/ReLU/Encoder (PRE): Holds double-row partial sums, performs on-the-fly 2×2 max-pooling and ReLU, and applies output compression.
  • Output Encoder: Re-compresses feature maps and streams them to external DRAM.

The dataflow facilitates sustained, near-peak utilization of compute resources, utilizing compressed input and output formats throughout all on-chip stages (Aimar et al., 2017).

2. Sparse Feature-Map Encoding and Compression

NullHop employs a two-stage encoding: the Sparsity Map (SM) and Non-Zero Value List (NZVL):

  • Sparsity Map (SM): A binary mask (1 bit per activation) encodes the location of nonzero values:

$$\mathrm{SM}(i,x,y)= \begin{cases} 1 & \text{if } F^{in}_i(x,y)\neq 0,\\ 0 & \text{otherwise.} \end{cases}$$

Segments of 16 bits are packed and streamed row-wise.

  • NZVL: Sequential list of nonzero activation values, matching SM order. Each SM word is followed by as many NZVL entries as bits set to 1 in that word.
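The SM/NZVL layout described above can be illustrated with a small software model. This is a minimal sketch, not the hardware encoder: the function names are hypothetical, and the bit order inside each 16-bit SM word (bit 0 for the first activation) is an assumption, since the text does not specify it.

```python
from typing import List, Tuple

WORD_BITS = 16  # SM segment width stated in the text


def encode_sm_nzvl(activations: List[int]) -> List[Tuple[int, List[int]]]:
    """Encode a flat, row-major activation list as (SM word, NZVL values) pairs."""
    stream = []
    for start in range(0, len(activations), WORD_BITS):
        chunk = activations[start:start + WORD_BITS]
        mask = 0
        values = []
        for bit, value in enumerate(chunk):
            if value != 0:
                mask |= 1 << bit  # assumed bit order: bit 0 = first activation
                values.append(value)
        # Each SM word is followed by as many values as it has bits set.
        stream.append((mask, values))
    return stream


def decode_sm_nzvl(stream: List[Tuple[int, List[int]]], length: int) -> List[int]:
    """Rebuild the dense list; `length` trims the padding of the last SM word."""
    dense = []
    for mask, values in stream:
        nonzero = iter(values)
        for bit in range(WORD_BITS):
            dense.append(next(nonzero) if (mask >> bit) & 1 else 0)
    return dense[:length]


row = [0, 5, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 9, 0]
encoded = encode_sm_nzvl(row)
restored = decode_sm_nzvl(encoded, len(row))
```

Decoding needs only a bit test and a sequential read of the value list, which is the decode simplicity the format is designed for.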

Compression is characterized by the fraction of nonzero activations $\rho=\frac{N_{nz}}{N}$ (so the sparsity is $1-\rho$) and, for an activation width of $P$ bits, by the dense and sparse sizes:

$$S_{dense}=N\times P,\qquad S_{sparse}=N+N_{nz}\times P.$$

The compression ratio follows by dividing the two sizes:

$$\mathrm{CR}=\frac{S_{dense}}{S_{sparse}}=\frac{P}{1+\rho P}.$$
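As a worked example, the compression ratio can be computed directly from the dense and sparse sizes defined above. The sketch below uses hypothetical numbers (1000 activations, 250 of them nonzero, 16-bit values); none of these values comes from the paper.

```python
def dense_bits(n_total: int, precision_bits: int) -> int:
    """S_dense: every activation stored at full precision."""
    return n_total * precision_bits


def sparse_bits(n_total: int, n_nonzero: int, precision_bits: int) -> int:
    """S_sparse: one sparsity-map bit per activation plus the nonzero values."""
    return n_total + n_nonzero * precision_bits


def compression_ratio(n_total: int, n_nonzero: int, precision_bits: int) -> float:
    """Ratio of dense size to SM+NZVL size."""
    return dense_bits(n_total, precision_bits) / sparse_bits(
        n_total, n_nonzero, precision_bits
    )


# Hypothetical layer: 1000 activations, 250 nonzero, 16-bit precision.
cr = compression_ratio(1000, 250, 16)  # 16000 bits / 5000 bits = 3.2

# A fully dense map pays for the mask without saving anything: 16/17 < 1.
cr_dense = compression_ratio(1000, 1000, 16)
```

The second case shows the format's overhead: with no zeros, the sparsity map adds one bit per activation and the encoded stream is slightly larger than the dense one.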

SM+NZVL surpasses run-length and Huffman schemes in decode simplicity and practical compression (Aimar et al., 2017).

3. Mapping Convolution Operations and Parallelization

NullHop supports convolution kernels of any size from 1×1 to 7×7 and up to 128 output maps per pass. Mapping is as follows:

  • Single-Pass ($N_{out}\le 128$): Each incoming nonzero input pixel is broadcast in parallel to the kernel memory banks, where the matching weights are fetched and multiply-accumulated by each controller's assigned MAC cluster. Row-wise accumulation enables on-the-fly pooling and column-wise output shifting.
  • Multi-Pass ($N_{out}>128$): Output maps are divided into passes, each computing up to 128 outputs.
  • Operation Rate: The effective operation rate is determined by the total number of MAC units and the average activation sparsity. [The symbols and the closed-form expression were lost in extraction.]

Measured MAC utilization exceeds 98%, with the IDP and controllers dynamically balancing load by allocating decoded pixels round-robin or by hash, and providing back-pressure to prevent buffer overflow in case of I/O stalls.

This pipeline ensures that, beyond kernel-loading phases, the compute substrate remains fully utilized regardless of kernel geometry or activation density (Aimar et al., 2017).
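The mapping above can be sketched in software as a "scatter" convolution: only nonzero input pixels are visited, and each one is applied to every output map's kernel. This is an illustrative model under stated assumptions (stride 1, no padding, nested Python lists, hypothetical function names), not a description of the hardware scheduling.

```python
import math

MAX_OUT_PER_PASS = 128  # output maps per pass, as stated in the text


def num_passes(n_out: int) -> int:
    """Number of passes needed when output maps are split into groups of 128."""
    return math.ceil(n_out / MAX_OUT_PER_PASS)


def zero_skipping_conv(inputs, kernels):
    """Valid, stride-1 convolution (cross-correlation) that skips zero inputs.

    inputs:  [c_in][h][w] activations
    kernels: [n_out][c_in][k][k] weights
    Returns (outputs [n_out][h-k+1][w-k+1], number of MACs performed).
    """
    c_in, h, w = len(inputs), len(inputs[0]), len(inputs[0][0])
    n_out, k = len(kernels), len(kernels[0][0])
    oh, ow = h - k + 1, w - k + 1
    out = [[[0] * ow for _ in range(oh)] for _ in range(n_out)]
    macs = 0
    for c in range(c_in):
        for y in range(h):
            for x in range(w):
                v = inputs[c][y][x]
                if v == 0:
                    continue  # zero-skipping: no work for zero activations
                for o in range(n_out):  # pixel reused by all output maps
                    for ky in range(k):
                        oy = y - ky
                        if not 0 <= oy < oh:
                            continue
                        for kx in range(k):
                            ox = x - kx
                            if 0 <= ox < ow:
                                out[o][oy][ox] += v * kernels[o][c][ky][kx]
                                macs += 1
    return out, macs


# One 3x3 input map with two nonzero pixels, one 2x2 kernel.
inputs = [[[0, 2, 0], [0, 0, 0], [1, 0, 0]]]
kernels = [[[[1, 2], [3, 4]]]]
outputs, macs_done = zero_skipping_conv(inputs, kernels)
```

In this toy case a dense implementation would perform 16 MACs (4 outputs × 4 weights), while the zero-skipping loop performs 3.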

4. Memory Hierarchy and Bandwidth Efficiency

NullHop’s on-chip memory hierarchy is tailored for compressed data:

  • SRAM (On-Chip):
    • IDP SRAM: 512 KB for compressed feature maps and row pointers.
    • Kernel Memory: 576 KB (one bank per controller), sufficient for up to 128 kernels [per-kernel weight count lost in extraction].
    • PRE buffer: [entry count lost in extraction] entries for double-row pooling/max-reduction.
  • DRAM (Off-Chip):
    • FPGA: DDR3, accessed via AXI4-Stream DMA.
    • ASIC: Projected for LPDDR3/4.

Bandwidth reduction is substantial: for VGG16, NullHop achieves a measured per-frame I/O of 42 MB, compared to 113 MB for the Eyeriss accelerator operating in batch-3 mode. Overall bandwidth savings scale with the input/output compression ratios, since only the sparsity maps and the nonzero activation values must be transferred (Aimar et al., 2017).
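The size of that reduction follows from the two per-frame figures quoted above. The helper below is a hypothetical convenience function; only the 42 MB and 113 MB values come from the text.

```python
def io_reduction(baseline_mb: float, accelerator_mb: float) -> float:
    """How many times less data is moved per frame than the baseline."""
    return baseline_mb / accelerator_mb


eyeriss_mb = 113.0  # Eyeriss, batch-3 mode (figure quoted in the text)
nullhop_mb = 42.0   # NullHop, VGG16 (figure quoted in the text)

factor = io_reduction(eyeriss_mb, nullhop_mb)   # about 2.69x
saved_fraction = 1.0 - nullhop_mb / eyeriss_mb  # about 0.63 of the traffic
```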

5. Hardware Implementation and Performance Metrics

NullHop was implemented on a Xilinx Zynq 7100 FPGA, with an ASIC implementation synthesized in 28 nm:

  • FPGA Implementation:
    • Clock: 60 MHz (AXI-limited).
    • Resource utilization: 229k LUTs (83%), 107k FFs (19%), 386 BRAMs (51%), 128 DSP48s (6%).
    • Core area (ASIC equiv.): 6.3 mm² (8.1 mm² including pads).
    • Dynamic power: 0.78 W (FPGA), projected 155 mW (ASIC at 500 MHz/1 V).
  • Power Efficiency (Core Only):

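$$\frac{471\ \text{GOp/s}}{0.155\ \text{W}}\approx 3\ \text{TOp/s/W (ASIC)},\qquad \frac{17\ \text{GOp/s}}{0.78\ \text{W}}\approx 22\ \text{GOp/s/W (FPGA)}$$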

  • Throughput:
    • Theoretical peak (128 MACs at 60 MHz): $128 \times 60$ MHz = 7.68 GOp/s.
    • Observed (VGG16, sparse skipping): 17.2 GOp/s.
    • Efficiency: [value lost in extraction].

These figures reflect the impact of zero-skipping, sparse decoding, and resource allocation, resulting in observed throughput exceeding the nominal peak for purely dense workloads (Aimar et al., 2017).
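The arithmetic behind these figures can be checked with a short calculation. This sketch assumes one operation per MAC per cycle, which is what the 7.68 GOp/s peak implies for 128 MACs at 60 MHz; the function name is hypothetical.

```python
def peak_gops(n_macs: int, clock_mhz: float, ops_per_mac_per_cycle: int = 1) -> float:
    """Dense peak throughput in GOp/s (no zero-skipping benefit)."""
    return n_macs * clock_mhz * ops_per_mac_per_cycle / 1000.0


fpga_peak = peak_gops(128, 60)        # 7.68 GOp/s
observed_vgg16 = 17.2                 # GOp/s, figure quoted in the text

# Ratio of observed to dense-peak throughput; values above 1 are possible
# because skipped zero activations still count as completed operations.
speedup_over_peak = observed_vgg16 / fpga_peak  # about 2.24
```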

6. Benchmarking on Standard and Custom CNNs

NullHop's performance was evaluated across large-scale (VGG16, VGG19), custom (Giga1Net), and small (RoshamboNet, Face Detector) CNNs:

Network         Device/Mode           Throughput    FPS    Efficiency
VGG19           ASIC RTL (500 MHz)    471 GOp/s     –      368%
VGG16           ASIC RTL (500 MHz)    421 GOp/s     –      329%
VGG19           FPGA (60 MHz)         16.1 GOp/s    –      143%
Giga1Net        ASIC RTL (500 MHz)    250 GOp/s     –      195%
RoshamboNet     FPGA (60 MHz)         3.28 GOp/s    182    –
Face Detector   FPGA (60 MHz)         0.61 GOp/s    304    –

NullHop demonstrates speedups and efficiency improvements over classical GPU baselines. The ASIC implementation achieves ~471 GOp/s at 0.155 W (about 3 TOp/s/W), over 500× the power efficiency of the NVIDIA Tegra X1 (6 GOp/s/W). The FPGA implementation achieves ~17 GOp/s at 0.78 W (22 GOp/s/W), about 3–4× higher than typical embedded GPUs (Aimar et al., 2017).
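The power-efficiency comparison can be reproduced from the throughput and power figures quoted above. The function name is hypothetical; all numeric inputs are taken from the text.

```python
def gops_per_watt(throughput_gops: float, power_w: float) -> float:
    """Power efficiency in GOp/s per watt."""
    return throughput_gops / power_w


asic_eff = gops_per_watt(471.0, 0.155)  # about 3039 GOp/s/W, i.e. ~3 TOp/s/W
fpga_eff = gops_per_watt(17.0, 0.78)    # about 21.8 GOp/s/W
tegra_x1_eff = 6.0                      # GOp/s/W, figure quoted in the text

asic_vs_gpu = asic_eff / tegra_x1_eff   # about 506x
fpga_vs_gpu = fpga_eff / tegra_x1_eff   # about 3.6x
```

Both ratios agree with the rounded claims in the text: a little over 500× for the ASIC and between 3× and 4× for the FPGA.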

7. Significance, Flexibility, and Use Cases

By implementing a zero-skipping pipeline, custom sparse feature-map encoding, and dynamic pixel-to-MAC scheduling, NullHop achieves over 98% MAC utilization across diverse kernel sizes and topologies. Compression factors up to 6× are observed, with throughput up to 471 GOp/s on 128 MACs. The core's flexibility, with support for kernel sizes of 1×1 to 7×7 and up to 1024 feature maps, enables its use across small to large CNNs, in real-time embedded applications and on energy-constrained platforms. Demonstrated integration with a neuromorphic event camera further highlights adaptability for non-frame-based and streaming inference contexts (Aimar et al., 2017).

References (1)
