- The paper demonstrates a novel CNN accelerator that leverages sparsity to skip zero-value computations, enhancing overall power efficiency.
- It employs a flexible processing pipeline capable of handling various kernel sizes and up to 128 feature maps per layer.
- Prototyped on a Xilinx Zynq FPGA and synthesized in a 28 nm process, NullHop reaches over 450 GOp/s at over 3 TOp/s/W in the ASIC variant, a significant step forward in energy-efficient CNN processing.
Overview of NullHop: A Flexible Convolutional Neural Network Accelerator
The paper presents NullHop, a convolutional neural network (CNN) accelerator architecture that computes CNN layers efficiently by exploiting sparse representations of feature maps. NullHop targets the power-efficiency limitations of conventional CNN implementations on graphics processing units (GPUs), particularly in low-power, low-latency applications.
NullHop's architecture exploits the inherent sparsity of neuron activations in CNNs to raise computational efficiency and reduce data movement, addressing inefficiencies present in conventional hardware accelerators. The design is flexible enough to process a range of kernel sizes while keeping its computational resources highly utilized and its power efficiency high.
Architecture and Implementation
NullHop implements a processing pipeline built around zero-skipping: zero-valued pixels in the input feature maps are never issued to the multiply-accumulate (MAC) units, so no computation cycles are wasted on them. Because skipped zeros free cycles for useful work, the effective throughput on sparse data can exceed the nominal peak of the MAC array. The architecture supports a wide range of CNN configurations, scaling across kernel sizes from 1x1 to 7x7 and processing up to 128 input and 128 output feature maps per layer in a single pass.
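The following Python sketch illustrates the zero-skipping principle on a toy convolution. It is not the NullHop pipeline itself; the function name, shapes, and loop structure are purely illustrative. The point is that only non-zero activations generate multiply-accumulate work, so the number of issued MACs shrinks with the sparsity of the input.

```python
import numpy as np

def zero_skip_conv2d(activations, kernels):
    """Toy zero-skipping convolution (stride 1, 'valid' output).

    activations: (C_in, H, W) feature maps, sparse after ReLU.
    kernels:     (C_out, C_in, K, K) filter weights.
    Zero pixels never reach the MAC loop, mimicking the accelerator's
    zero-skipping behaviour at the algorithmic level.
    """
    c_in, h, w = activations.shape
    c_out, _, k, _ = kernels.shape
    out = np.zeros((c_out, h - k + 1, w - k + 1))
    macs_issued = 0

    # Visit only non-zero input pixels and scatter each one into every
    # output position whose receptive field contains it.
    for c, y, x in zip(*np.nonzero(activations)):
        a = activations[c, y, x]
        for ky in range(k):
            for kx in range(k):
                oy, ox = y - ky, x - kx
                if 0 <= oy < out.shape[1] and 0 <= ox < out.shape[2]:
                    out[:, oy, ox] += a * kernels[:, c, ky, kx]
                    macs_issued += c_out
    return out, macs_issued
```

On a feature map where, say, 60% of the pixels are zero, roughly 60% of the dense layer's MACs are simply never issued.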
The accelerator was prototyped on a Xilinx Zynq FPGA platform; synthesized in a 28 nm process and simulated at 500 MHz, the core delivers over 450 GOp/s on complex networks such as VGG19. It reports over 3 TOp/s/W within a core area of 6.3 mm², underscoring the architectural advantages of NullHop's design for energy-constrained environments.
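As a back-of-envelope consistency check, derived only from the figures quoted above rather than an independently reported measurement, the throughput and efficiency numbers together imply a core power on the order of

$$P_{\text{core}} \approx \frac{450\ \text{GOp/s}}{3\ \text{TOp/s/W}} = 150\ \text{mW}.$$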
Performance Insights and Efficiency
NullHop's ability to skip computation over zero-valued activations yields an effective throughput that can substantially exceed what its raw MAC count would suggest. This is particularly beneficial for state-of-the-art networks, which are naturally sparse because of activation functions such as ReLU. In addition, operating on compressed feature maps reduces external memory transfers, a common bottleneck in CNN implementations, and with them a significant share of the power budget.
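The benefit of zero-skipping can be summarized with a simple relation: if a fraction $s$ of activations are zero and skipped, the dense-equivalent throughput is

$$\text{throughput}_{\text{effective}} \approx \frac{\text{issued MAC rate}}{1 - s},$$

so, purely as an illustration, $s = 0.6$ makes the same MAC array behave like a dense engine 2.5 times faster. This is the sense in which the effective rate can exceed the nominal peak.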
The sparsity-aware pipeline operates directly on compressed input representations using a novel compression scheme that the authors report to be more effective than existing run-length encoding techniques at typical activation sparsities. By processing the compressed representation directly, NullHop sustains high throughput per watt, a property that matters increasingly as AI applications grow in complexity and scale. A sketch of a sparsity-map style encoding is given after this paragraph.
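To illustrate why a bitmask-plus-values format can beat run-length coding at moderate sparsity, here is a hedged sketch of a sparsity-map style encoder alongside a crude cost model for a zero-run-length code. The bit widths, the run-length model, and the function names are assumptions made for this example, not the exact formats used in the paper.

```python
import numpy as np

def sparsity_map_encode(feature_map, value_bits=16):
    """Encode a feature map as (bitmask, non-zero values).

    One mask bit per pixel marks whether it is non-zero, and only the
    non-zero values themselves are stored. value_bits is an assumption.
    """
    flat = feature_map.ravel()
    mask = flat != 0                       # 1 bit per pixel
    values = flat[mask]                    # non-zero values only
    size_bits = flat.size + values.size * value_bits
    return mask, values, size_bits

def run_length_size(feature_map, run_bits=4, value_bits=16):
    """Rough cost of a simple zero-run-length code, for comparison.

    Each non-zero value is preceded by a counter giving the number of
    zeros since the previous non-zero value (run_bits is an assumption;
    counter overflows would need extra tokens and are ignored here).
    """
    flat = feature_map.ravel()
    nonzeros = np.count_nonzero(flat)
    return nonzeros * (run_bits + value_bits)

# Example: a roughly 50%-sparse 28x28 feature map
fm = np.random.randn(28, 28) * (np.random.rand(28, 28) > 0.5)
_, _, sm_bits = sparsity_map_encode(fm)
rle_bits = run_length_size(fm)
print(f"sparsity map: {sm_bits} bits, run-length: {rle_bits} bits")
```

At around 50% sparsity the bitmask costs one bit per pixel while the run-length code pays a counter per surviving value, which is why the sparsity-map approach tends to come out ahead unless the data is extremely sparse.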
Implications and Future Prospects
Given the growing demand for real-time, power-efficient AI, the NullHop architecture offers a compelling approach to these challenges. Its design principles could influence the future development of hardware accelerators, particularly for applications that require edge processing, such as neuromorphic systems and Internet of Things (IoT) devices.
While the paper addresses the architectural and efficiency aspects admirably, future research could explore additional design optimizations, such as integrating mixed-precision computations to further reduce power consumption or extending the architecture to efficiently execute other neural network paradigms.
In summary, NullHop provides a viable solution for efficient CNN computation in power-constrained environments, paving the way for future advancements in hardware-accelerator technologies that exploit sparsity for enhanced performance and reduced resource usage.