Accelerating PoT Quantization on Edge Devices (2409.20403v2)

Published 30 Sep 2024 in cs.AR and cs.LG

Abstract: Non-uniform quantization, such as power-of-two (PoT) quantization, matches data distributions better than uniform quantization, which reduces the quantization error of Deep Neural Networks (DNNs). PoT quantization also allows bit-shift operations to replace multiplications, but there are limited studies on the efficiency of shift-based accelerators for PoT quantization. Furthermore, existing pipelines for accelerating PoT-quantized DNNs on edge devices are not open-source. In this paper, we first design shift-based processing elements (shift-PE) for different PoT quantization methods and evaluate their efficiency using synthetic benchmarks. Then we design a shift-based accelerator using our most efficient shift-PE and propose PoTAcc, an open-source pipeline for end-to-end acceleration of PoT-quantized DNNs on resource-constrained edge devices. Using PoTAcc, we evaluate the performance of our shift-based accelerator across three DNNs. On average, it achieves a 1.23x speedup and 1.24x energy reduction compared to a multiplier-based accelerator, and a 2.46x speedup and 1.83x energy reduction compared to CPU-only execution. Our code is available at https://github.com/gicLAB/PoTAcc

Authors (3)
  1. Rappy Saha (2 papers)
  2. Jude Haris (5 papers)
  3. José Cano (33 papers)

Summary

Accelerating PoT Quantization on Edge Devices

This paper investigates optimizing power-of-two (PoT) quantization for Deep Neural Networks (DNNs) on resource-constrained edge devices. The authors design, implement, and evaluate a novel shift-based accelerator and introduce PoTAcc, an open-source pipeline for the end-to-end deployment of PoT-quantized DNNs.

Key Contributions

The paper's contributions are threefold:

  1. Designing Shift-Based Processing Elements (PEs): The authors create shift-PEs for three distinct PoT quantization methods and assess their hardware efficiency. The methods vary in complexity: QKeras uses a single PoT term, while APoT and MSQ use multiple PoT terms.
  2. Development of PoTAcc Pipeline: This open-source pipeline facilitates the entire workflow from PoT quantization of DNNs to their deployment and evaluation on edge devices.
  3. Performance Evaluation: The accelerator's performance is validated using three well-known DNN models on the ImageNet dataset, showcasing significant improvements in speed and energy efficiency.

PoT Quantization Methods and Shift-PE Design

PoT quantization leverages non-uniform quantization levels, replacing multiplications with bit-shift operations, which are computationally cheaper. The authors detail the design of shift-PEs tailored to three PoT quantization methods:

  • QKeras: Utilizes a single PoT term, simplifying the hardware design.
  • APoT and MSQ: Employ multiple PoT terms, requiring additional multiplexer logic.

Hardware synthesis results reveal that the QKeras method, due to its simplicity, achieves the best trade-off between resource utilization and execution speed. This finding underscores the importance of minimizing the number of bit-shift operations and the associated control logic in resource-constrained environments.
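To make the trade-off concrete, the following Python sketch models how a shift-PE replaces a multiplication. It is illustrative only: the encodings (a sign bit plus a shift exponent for the single-term case, a list of such terms for the multi-term case) are simplified assumptions, not the exact bit formats used in the paper.

```python
def pot_mul_single(x: int, exp: int, sign: bool) -> int:
    """Single-term PoT (QKeras-style): w = +/-2^exp, so x*w is one shift.
    In hardware this maps to a barrel shifter plus an optional negate."""
    p = x << exp if exp >= 0 else x >> -exp
    return -p if sign else p

def pot_mul_multi(x: int, terms) -> int:
    """Multi-term PoT (APoT/MSQ-style): w = sum of +/-2^e terms.
    Each extra term costs another shift path plus mux/adder logic,
    which is why these PEs are larger than the single-term design."""
    return sum(pot_mul_single(x, e, s) for e, s in terms)

# x = 12, w = 2^3 = 8        -> 96 via a single left shift
assert pot_mul_single(12, 3, False) == 96
# x = 12, w = 2^3 + 2^1 = 10 -> 120 via two shifts and an add
assert pot_mul_multi(12, [(3, False), (1, False)]) == 120
```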

Accelerator Design and PoTAcc Pipeline

The accelerator replaces traditional multiply-accumulate (MAC) units with shift-PEs. Integrated within the SECDA-TFLite framework, the shift-based accelerator executes the convolutional (conv) layers of a DNN, while the remaining layers run on the CPU.
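As a rough sketch of where the shift-PE sits in a conv layer, the snippet below accumulates a dot product (the inner loop of a convolution) in which every multiply is a shift. The (sign, exponent) weight encoding is hypothetical, chosen for readability rather than taken from the paper.

```python
def shift_mac(activations, pot_weights):
    """Shift-based multiply-accumulate: each weight is +/-2^e,
    so every MAC becomes a shift followed by a signed add."""
    acc = 0
    for x, (sign, e) in zip(activations, pot_weights):
        p = x << e                 # multiplication replaced by a left shift
        acc += -p if sign else p   # accumulate with the weight's sign
    return acc

# [5, 3, 2] . [4, -2, 1], with 4 = +2^2, -2 = -2^1, 1 = +2^0
print(shift_mac([5, 3, 2], [(False, 2), (True, 1), (False, 0)]))  # 16
```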

The PoTAcc pipeline supports model training using TensorFlow, quantization with QKeras, and deployment leveraging TFLite. This workflow ensures efficient transformation and optimization of DNN models for edge inference, supporting the broader adoption of PoT quantization techniques.
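A minimal sketch of what the quantization step could look like is shown below. It uses QKeras's quantized_po2 quantizer to constrain conv weights to power-of-two values; the layer shapes and bit-widths are assumptions for illustration, not the paper's exact configuration.

```python
# Illustrative QKeras model with power-of-two (PoT) weight quantization.
# Shapes and bit-widths are placeholders, not the paper's settings.
import tensorflow as tf
from qkeras import QConv2D, QActivation, quantized_po2

model = tf.keras.Sequential([
    tf.keras.layers.InputLayer(input_shape=(32, 32, 3)),
    QConv2D(16, kernel_size=3,
            kernel_quantizer=quantized_po2(4),   # weights restricted to +/-2^e
            bias_quantizer=quantized_po2(4)),
    QActivation("quantized_relu(8)"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# After training, the model would be converted with the TFLite converter and
# handed to PoTAcc for deployment on the shift-based accelerator.
```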

Performance and Efficiency Analysis

The paper offers an in-depth evaluation of the designed shift-PEs using synthetic benchmarks and real-world DNNs:

  1. Synthetic Benchmarks: Tests with varying matrix sizes confirm that the QKeras-based shift-PE outperforms the other designs, achieving a 1.60x speedup and a 1.55x energy reduction compared to an 8-bit integer multiplier-based PE.
  2. End-to-End DNN Evaluation: Results on MobileNetV2, ResNet18, and InceptionV1 demonstrate substantial improvements: the shift-based accelerator attains an average 2.46x speedup and 1.83x energy reduction over CPU-only execution, and a 1.23x speedup and 1.24x energy reduction over a traditional multiplier-based accelerator.

Implications and Future Work

The implications of this research are twofold:

  • Practical Implications: The proposed shift-PE design and PoTAcc pipeline enable efficient deployment of PoT-quantized DNNs on edge devices, helping reconcile model accuracy with tight resource budgets.
  • Theoretical Implications: The detailed analysis and open-source contributions provide a foundation for further research in hardware-efficient non-uniform quantization methods.

Future developments may include evaluation with other DNN architectures, integration of additional PoT quantization schemes, and comprehensive accuracy assessments. This research paves the way for more extensive exploration and optimization of quantization techniques in AI, aiming to further reduce power consumption and enhance inference speed on edge devices.

In conclusion, the shift-based accelerator and the comprehensive PoTAcc pipeline represent significant progress in deploying efficient DNN inference on edge devices. The well-documented, open-source nature of this work is likely to inspire further investigation and innovation in hardware-aware DNN quantization.