NullHop Architecture

Updated 24 May 2026

NullHop is a CNN accelerator architecture that leverages activation sparsity and a custom compressed format to minimize data movement and power consumption.
It features a flexible pipeline with stages for input decoding, pixel allocation, MAC operations, and on-the-fly pooling/ReLU to sustain high compute efficiency.
The design supports kernel sizes from 1×1 to 7×7 and up to 128 output feature maps per pass, delivering throughput and power efficiency that exceed those of traditional GPU solutions.

NullHop is a convolutional neural network (CNN) accelerator architecture that leverages activation sparsity to optimize the compute and memory demands of state-of-the-art visual processing networks. Designed to address the inefficiencies of traditional hardware such as GPUs in terms of power consumption and compute utilization, NullHop implements a flexible, high-throughput pipeline that minimizes off-chip data movement, sustains high compute efficiency across a wide range of kernel sizes, and exploits sparsity via a custom-encoded compressed format. The system supports up to 128 input and 128 output feature maps per layer in a single pass, is compatible with kernel sizes from 1×1 to 7×7, and achieves power efficiencies far exceeding typical GPU or embedded solutions (Aimar et al., 2017).

1. Architectural Overview and Dataflow

NullHop accepts compressed feature maps from off-chip DRAM via DMA, processes them through a pipeline comprising input decoding, pixel allocation, sparse matrix controllers, a MAC array, pooling/ReLU/encoding, and outputs the result via DMA. The principal components are:

Input Decoder (IDP): Stores up to 512 KB of compressed inputs in SRAM and uses “Input Tracker” row pointers together with $k_h{+}1$ “IDP Manager” FSMs to read only nonzero activations per vertical stripe, implementing cycle-level zero-skipping.
Pixel Allocator: Assigns nonzero pixels to one of $C$ sparse matrix controllers based on feature-map index.
Sparse Matrix Controllers: Feed clusters of $M_c$ MAC units using associated kernel memory banks.
MAC Array: Consists of 128 MAC units partitioned among eight controllers, supporting full parallelism for up to 128 convolutions.
Pooling/ReLU/Encoder (PRE): Holds double-row partial sums, performs on-the-fly 2×2 max-pooling and ReLU, and applies output compression.
Output Encoder: Re-compresses feature maps and streams them to external DRAM.

The dataflow facilitates sustained, near-peak utilization of compute resources, utilizing compressed input and output formats throughout all on-chip stages (Aimar et al., 2017).
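
As a rough illustration of the PRE stage listed above, the following Python sketch applies ReLU and non-overlapping 2×2 max-pooling while consuming two rows of partial sums at a time. The function name `relu_maxpool_2x2` and the list-of-rows layout are assumptions made for this example; the hardware works on streamed partial sums, not Python lists.

```python
def relu_maxpool_2x2(rows):
    """ReLU followed by non-overlapping 2x2 max-pooling on one feature map.

    `rows` is a list of equal-length rows of partial sums (hypothetical layout).
    """
    if len(rows) % 2 or any(len(r) % 2 for r in rows):
        raise ValueError("this sketch assumes even height and width")
    pooled = []
    # Consume two rows at a time, mirroring a double-row buffer.
    for y in range(0, len(rows), 2):
        top, bottom = rows[y], rows[y + 1]
        out_row = []
        for x in range(0, len(top), 2):
            window_max = max(top[x], top[x + 1], bottom[x], bottom[x + 1])
            # max(0, .) is ReLU; applying it after the window max gives the
            # same result as applying it to each value first.
            out_row.append(max(0, window_max))
        pooled.append(out_row)
    return pooled


partial_sums = [
    [1, -2, 0, 4],
    [-3, 5, -1, -6],
    [0, 0, 2, 1],
    [-7, -8, 3, 0],
]
pooled = relu_maxpool_2x2(partial_sums)
print(pooled)  # [[5, 4], [0, 3]]
```

Because ReLU and max commute, a single comparison against zero per output pixel is enough in this sketch.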

2. Sparse Feature-Map Encoding and Compression

NullHop employs a two-stage encoding: the Sparsity Map (SM) and Non-Zero Value List (NZVL):

Sparsity Map (SM): A binary mask (1 bit per activation) encodes the location of nonzero values:

$\mathrm{SM}(i,x,y)= \begin{cases} 1 & \text{if }F^{in}_i(x,y)\neq 0,\\ 0 & \text{otherwise.} \end{cases}$

Segments of 16 bits are packed and streamed row-wise.

NZVL: Sequential list of nonzero activation values, matching SM order. Each SM word is followed by as many NZVL entries as bits set to 1 in that word.
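
The SM and NZVL layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the hardware bitstream format: the names `encode` and `decode`, the LSB-first bit order inside each 16-bit SM word, and the use of (word, list) pairs instead of one interleaved stream are all assumptions of the example.

```python
WORD_BITS = 16  # SM segment width stated in the text


def encode(values):
    """Encode a flat, row-major list of activations as (sm_word, nzvl) pairs."""
    stream = []
    for start in range(0, len(values), WORD_BITS):
        chunk = values[start:start + WORD_BITS]
        sm_word = 0
        nzvl = []
        for bit, value in enumerate(chunk):
            if value != 0:
                sm_word |= 1 << bit  # assumption: LSB-first bit order
                nzvl.append(value)
        stream.append((sm_word, nzvl))
    return stream


def decode(stream, length):
    """Rebuild the dense activation list from (sm_word, nzvl) pairs."""
    values = []
    for sm_word, nzvl in stream:
        nonzeros = iter(nzvl)
        for bit in range(WORD_BITS):
            if len(values) == length:
                break  # last word may be only partially used
            values.append(next(nonzeros) if (sm_word >> bit) & 1 else 0)
    return values


activations = [0, 3, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 2]
stream = encode(activations)
restored = decode(stream, len(activations))
```

Each SM word is followed by exactly as many nonzero values as it has bits set, so the decoder never needs an explicit length field per word.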

Compression is characterized by the nonzero fraction $\rho=\frac{N_{nz}}{N}$ (so the sparsity is $1-\rho$) and, for an activation width of $P$ bits, by the dense and sparse sizes:

$S_{dense}=N\times P,\qquad S_{sparse}=N+N_{nz}\times P.$

The compression ratio is

$\mathrm{CR}=\frac{S_{dense}}{S_{sparse}}=\frac{P}{1+\rho P}.$

SM+NZVL surpasses run-length and Huffman schemes in decode simplicity and practical compression (Aimar et al., 2017).
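
To make the size formulas concrete, the sketch below computes the compression ratio directly from $S_{dense}$ and $S_{sparse}$. The function name `compression_ratio` and the default width of 16 bits per activation are assumptions for illustration; the text does not fix $P$.

```python
def compression_ratio(n_total, n_nonzero, bits_per_value=16):
    """Ratio of dense size to SM+NZVL size, both in bits.

    bits_per_value (P) defaults to 16, which is an assumption of this sketch.
    """
    dense_bits = n_total * bits_per_value               # S_dense = N * P
    sparse_bits = n_total + n_nonzero * bits_per_value  # S_sparse = N + N_nz * P
    return dense_bits / sparse_bits


quarter_dense = compression_ratio(1000, 250)   # 25% nonzero activations
fully_dense = compression_ratio(1000, 1000)    # no zeros at all
all_zero = compression_ratio(1000, 0)          # only the sparsity map remains
```

With these inputs a map that is 25% nonzero compresses by a factor of 3.2, an all-zero map by a factor of $P$, and a fully dense map grows slightly because the 1-bit-per-activation mask is pure overhead.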

3. Mapping Convolution Operations and Parallelization

NullHop supports convolution kernels of arbitrary size from 1×1 to 7×7 and up to 128 output maps per pass. Mapping is as follows:

Single-Pass ( $N_{out}{\le}128$ ): Each incoming nonzero input pixel is broadcast in parallel to the kernel memory banks, where the weights are fetched and the convolution is computed by each controller's assigned MAC cluster. Row-wise accumulation enables on-the-fly pooling and column-wise output shifting.
Multi-Pass ( $N_{out}>128$ ): Output maps are divided into passes, each computing up to 128 outputs.
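Operation Rate: With $N_{MAC}$ total MAC units clocked at $f_{clk}$ and an average activation sparsity $s$ (the fraction of zero activations), the effective operation rate under ideal zero-skipping is approximately

$R_{eff}\approx\frac{N_{MAC}\,f_{clk}}{1-s}.$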

Measured MAC utilization exceeds 98%, with the IDP and controllers dynamically balancing load by allocating decoded pixels round-robin or by hash, and providing back-pressure to prevent buffer overflow in case of I/O stalls.
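
One simple way to picture the allocation step is a modulo mapping from feature-map index to controller, sketched below. This is a hypothetical policy chosen for illustration: the text says only that pixels are assigned by feature-map index, round-robin, or by hash. The name `allocate_pixels` and the `(fmap_index, x, y, value)` tuple layout are assumptions.

```python
NUM_CONTROLLERS = 8  # eight controllers, as stated in the overview


def allocate_pixels(pixels, num_controllers=NUM_CONTROLLERS):
    """Distribute nonzero pixels to per-controller queues.

    `pixels` is an iterable of (fmap_index, x, y, value) tuples; the modulo
    mapping is an illustrative assumption, not the documented policy.
    """
    queues = [[] for _ in range(num_controllers)]
    for fmap_index, x, y, value in pixels:
        if value == 0:
            continue  # zero-skipping: zeros never reach a controller
        queues[fmap_index % num_controllers].append((fmap_index, x, y, value))
    return queues


pixels = [(0, 0, 0, 5), (1, 0, 0, 0), (8, 1, 0, 2), (9, 1, 0, 7), (3, 2, 2, 1)]
queues = allocate_pixels(pixels)
```

In this example feature maps 0 and 8 share controller 0, and the zero-valued pixel from feature map 1 is dropped before allocation.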

This pipeline ensures that, beyond kernel-loading phases, the compute substrate remains fully utilized regardless of kernel geometry or activation density (Aimar et al., 2017).
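
The multi-pass mapping for $N_{out}>128$ amounts to slicing the output maps into groups of at most 128. A minimal sketch, with the hypothetical helper name `output_passes` and half-open index ranges chosen for the example:

```python
def output_passes(n_out, maps_per_pass=128):
    """Return half-open (start, end) ranges of output maps, one per pass."""
    return [
        (start, min(start + maps_per_pass, n_out))
        for start in range(0, n_out, maps_per_pass)
    ]


single = output_passes(64)    # fits in one pass
layer_512 = output_passes(512)
ragged = output_passes(300)   # last pass is only partially filled
```

A layer with 512 output maps therefore needs four passes over the same input, while 300 output maps need three, the last using only 44 of the 128 available outputs.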

4. Memory Hierarchy and Bandwidth Efficiency

NullHop’s on-chip memory hierarchy is tailored for compressed data:

SRAM (On-Chip):
- IDP SRAM: 512 KB for compressed feature maps and row pointers.
- Kernel Memory: 576 KB (one bank per controller), sufficient for up to 128 kernels × $C$ 3 weights.
- PRE buffer: $C$ 4 entries for double-row pooling/max-reduction.
DRAM (Off-Chip):
- FPGA: DDR3, accessed via AXI4-Stream DMA.
- ASIC: Projected for LPDDR3/4.

Bandwidth reduction is substantial: for VGG16, NullHop achieves a measured per-frame I/O of 42 MB, compared to 113 MB for the Eyeriss accelerator operating in batch-3 mode. Overall bandwidth savings scale with the input/output compression ratios, since only the nonzero activations and the sparsity map that locates them must be transferred (Aimar et al., 2017).
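
The two per-frame I/O figures quoted above can be turned into a reduction factor with a few lines of arithmetic. The variable names are illustrative; only the 42 MB and 113 MB values come from the text.

```python
nullhop_mb_per_frame = 42    # measured per-frame I/O for VGG16 (from the text)
eyeriss_mb_per_frame = 113   # Eyeriss, batch-3 mode (from the text)

reduction_factor = eyeriss_mb_per_frame / nullhop_mb_per_frame
fraction_saved = 1 - nullhop_mb_per_frame / eyeriss_mb_per_frame

print(f"{reduction_factor:.2f}x less traffic, {fraction_saved:.0%} saved")
```

This works out to roughly 2.7 times less off-chip traffic per frame, or about 63% of the transfers avoided.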

5. Hardware Implementation and Performance Metrics

NullHop was implemented on a Xilinx Zynq 7100 FPGA, with an ASIC implementation synthesized in 28 nm:

FPGA Implementation:
- Clock: 60 MHz (AXI-limited).
- Resource utilization: 229k LUTs (83%), 107k FFs (19%), 386 BRAMs (51%), 128 DSP48s (6%).
- Core area (ASIC equiv.): 6.3 mm² (8.1 mm² including pads).
- Dynamic power: 0.78 W (FPGA), projected 155 mW (ASIC at 500 MHz/1 V).
Power Efficiency (Core Only):

$\frac{471\ \text{GOp/s}}{0.155\ \text{W}}\approx 3\ \text{TOp/s/W}\quad\text{(ASIC, 500 MHz)}$

Throughput:
- Theoretical peak (128 MACs): $128\times 60$ MHz = 7.68 GOp/s.
- Observed (VGG16, sparse skipping): 17.2 GOp/s.
- Efficiency: $17.2/7.68\approx 224\%$.

These figures reflect the impact of zero-skipping, sparse decoding, and resource allocation, resulting in observed throughput exceeding the nominal peak for purely dense workloads (Aimar et al., 2017).
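
The relation between nominal peak, observed throughput, and efficiency can be checked numerically. The sketch below assumes one operation per MAC per cycle, which is what reproduces the 7.68 GOp/s peak quoted above, and an idealized model in which every zero activation is skipped at no cost. The helper name `peak_gops` is hypothetical.

```python
def peak_gops(num_macs, clock_mhz, ops_per_mac_cycle=1):
    """Nominal dense peak in GOp/s (ops_per_mac_cycle=1 is an assumption)."""
    return num_macs * clock_mhz * ops_per_mac_cycle / 1000.0


fpga_peak = peak_gops(128, 60)        # 128 MACs at 60 MHz
observed = 17.2                       # VGG16 on FPGA, from the text
efficiency = observed / fpga_peak     # > 1 because skipped zeros still count

# Under ideal zero-skipping, efficiency = 1 / (1 - sparsity).
implied_sparsity = 1 - fpga_peak / observed

print(f"peak={fpga_peak:.2f} GOp/s, efficiency={efficiency:.0%}, "
      f"implied sparsity={implied_sparsity:.0%}")
```

Under these assumptions an efficiency of about 224% corresponds to skipping roughly 55% of the activations; real overheads such as kernel loading mean the true sparsity would need to be somewhat higher.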

6. Benchmarking on Standard and Custom CNNs

NullHop's performance was evaluated across large-scale (VGG16, VGG19), custom (Giga1Net), and small (RoshamboNet, Face Detector) CNNs:

Network	Device/Mode	Throughput	FPS	Efficiency
VGG19	ASIC RTL (500 MHz)	471 GOp/s	–	368%
VGG16	ASIC RTL (500 MHz)	421 GOp/s	–	329%
VGG19	FPGA (60 MHz)	16.1 GOp/s	–	143%
Giga1Net	ASIC RTL (500 MHz)	250 GOp/s	–	195%
RoshamboNet	FPGA (60 MHz)	3.28 GOp/s	182	–
Face Detector	FPGA (60 MHz)	0.61 GOp/s	304	–

NullHop improves on classical GPU baselines in both speed and efficiency. The ASIC implementation reaches ~471 GOp/s at 0.155 W (about 3 TOp/s/W), roughly 500× the power efficiency of the NVIDIA Tegra X1 (6 GOp/s/W). The FPGA implementation reaches ~17 GOp/s at 0.78 W (22 GOp/s/W), about 3–4× higher than typical embedded GPUs (Aimar et al., 2017).
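
The power-efficiency comparison follows directly from dividing throughput by power. The short sketch below redoes that arithmetic with the figures quoted in this section; the helper name `gops_per_watt` is an assumption, and the Tegra X1 value is the 6 GOp/s/W baseline given in the text.

```python
def gops_per_watt(throughput_gops, power_w):
    """Power efficiency in GOp/s per watt."""
    return throughput_gops / power_w


asic = gops_per_watt(471, 0.155)   # ASIC RTL at 500 MHz
fpga = gops_per_watt(17, 0.78)     # FPGA at 60 MHz
tegra_x1 = 6.0                     # baseline quoted in the text

asic_gain = asic / tegra_x1
fpga_gain = fpga / tegra_x1

print(f"ASIC: {asic:.0f} GOp/s/W ({asic_gain:.0f}x), "
      f"FPGA: {fpga:.1f} GOp/s/W ({fpga_gain:.1f}x)")
```

The results, about 3039 GOp/s/W for the ASIC and 21.8 GOp/s/W for the FPGA, agree with the rounded 3 TOp/s/W and 22 GOp/s/W figures and with the stated gains of roughly 500× and 3–4×.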

7. Significance, Flexibility, and Use Cases

By implementing a zero-skipping pipeline, custom sparse feature-map encoding, and dynamic pixel-to-MAC scheduling, NullHop achieves over 98% MAC utilization across diverse kernel sizes and topologies. Compression factors up to 6× are observed, with throughput up to 471 GOp/s on 128 MACs. The core’s flexibility—supporting kernel sizes of 1×1 to 7×7 and up to 1024 feature maps—enables its use across small to large CNNs, permitting real-time embedded applications and energy-constrained platforms. Demonstrated integration with a neuromorphic event camera further highlights adaptability for non-frame-based and streaming inference contexts (Aimar et al., 2017).

References

Aimar et al. (2017). NullHop: A Flexible Convolutional Neural Network Accelerator Based on Sparse Representations of Feature Maps.