A Survey of Methods for Low-Power Deep Learning and Computer Vision (2003.11066v1)

Published 24 Mar 2020 in cs.CV

Abstract: Deep neural networks (DNNs) are successful in many computer vision tasks. However, the most accurate DNNs require millions of parameters and operations, making them energy, computation and memory intensive. This impedes the deployment of large DNNs in low-power devices with limited compute resources. Recent research improves DNN models by reducing the memory requirement, energy consumption, and number of operations without significantly decreasing the accuracy. This paper surveys the progress of low-power deep learning and computer vision, specifically in regards to inference, and discusses the methods for compacting and accelerating DNN models. The techniques can be divided into four major categories: (1) parameter quantization and pruning, (2) compressed convolutional filters and matrix factorization, (3) network architecture search, and (4) knowledge distillation. We analyze the accuracy, advantages, disadvantages, and potential solutions to the problems with the techniques in each category. We also discuss new evaluation metrics as a guideline for future research.

This paper, "A Survey of Methods for Low-Power Deep Learning and Computer Vision" (Goel et al., 2020 ), addresses the critical challenge of deploying deep neural networks (DNNs) on resource-constrained devices like mobile phones and embedded systems. While large DNNs excel in computer vision tasks, their high computation, memory, and energy requirements make them unsuitable for such low-power environments. The survey focuses specifically on software-based techniques for efficient DNN inference, categorizing them into four main areas.

The first category is Parameter Quantization and Pruning. Quantization reduces the precision of DNN parameters, lowering memory and computation costs. Techniques range from reduced-precision fixed-point formats to extreme binarization (1-bit parameters). While quantization significantly decreases energy consumption, it can increase error, especially at very low bit-widths. Pruning removes redundant parameters and connections based on importance measures, reducing model size and complexity, and can be applied to both fully-connected and convolutional layers. When combined, pruning, quantization, and encoding can achieve drastic model size reductions (e.g., compressing VGG-16 to about 2% of its original size). However, a major practical disadvantage is the significant training cost of iterating between pruning and retraining. Pruning also tends to produce sparse matrices, which are difficult to execute efficiently on standard hardware (CPUs/GPUs) without specialized data structures or hardware support. Channel-level pruning is suggested as a way to avoid this unstructured sparsity.
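
To make these two ideas concrete, here is a minimal NumPy sketch (not code from the survey) of per-tensor linear quantization and magnitude-based pruning; the bit-width, single per-tensor scale, and sparsity fraction are illustrative assumptions.

```python
import numpy as np

def quantize_linear(weights, num_bits=8):
    """Uniform (linear) quantization of a float tensor to num_bits signed integers."""
    qmax = 2 ** (num_bits - 1) - 1           # e.g. 127 for 8-bit signed
    scale = np.abs(weights).max() / qmax      # one scale per tensor (illustrative)
    q = np.round(weights / scale).astype(np.int8)
    return q, scale                           # dequantize later with q * scale

def prune_by_magnitude(weights, sparsity=0.5):
    """Zero out the smallest-magnitude weights; sparsity is the fraction removed."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

if __name__ == "__main__":
    w = np.random.randn(256, 256).astype(np.float32)
    q, scale = quantize_linear(w)
    w_pruned, mask = prune_by_magnitude(w, sparsity=0.9)
    recon_err = np.abs(w - q.astype(np.float32) * scale).max()
    print(f"max quantization error: {recon_err:.4f}, kept weights: {mask.mean():.1%}")
```

In practice the quantized weights would be stored and dequantized (or used directly with integer kernels) at inference time, and pruning would alternate with retraining to recover accuracy, as the survey notes.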

The second category covers Compressed Convolutional Filters and Matrix Factorization. Convolutional layers are computationally intensive. Techniques like SqueezeNet [SQN] and MobileNets [Mob] reduce computation and parameter counts by replacing large convolution filters with smaller ones (such as 1x1 convolutions) or by using depthwise separable convolutions. MobileNets combine bottleneck layers with depthwise separable convolutions to achieve high accuracy with far fewer parameters and operations. Advantages include significant memory and latency reductions and compatibility with other optimization techniques. Practical challenges include the computational expense of 1x1 convolutions, which dominate the cost in small networks, and the low arithmetic intensity of depthwise separable convolutions, which can be inefficient on hardware unless memory access is optimized. Matrix factorization methods decompose large weight matrices or tensors into products of smaller factors to eliminate redundant operations. Techniques like Canonical Polyadic Decomposition (CPD) and Batch Normalization Decomposition (BMD) show significant performance gains with small accuracy loss by producing dense, factorized matrices. However, understanding why certain factorizations work better than others remains an open problem, and the factorization process itself can be computationally expensive, especially for large DNNs, often requiring complex hyperparameter searches. Learning these hyperparameters during training is suggested as a mitigation.
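
As an illustration of the depthwise separable idea, the PyTorch sketch below (not the MobileNets reference implementation) factorizes a standard 3x3 convolution into a per-channel depthwise convolution followed by a 1x1 pointwise convolution; the channel sizes and input resolution are arbitrary.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise conv (one filter per input channel) followed by a 1x1 pointwise conv."""
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, stride=stride,
                                   padding=1, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)

    def forward(self, x):
        return self.pointwise(self.depthwise(x))

def count_params(module):
    return sum(p.numel() for p in module.parameters())

if __name__ == "__main__":
    standard = nn.Conv2d(64, 128, kernel_size=3, padding=1, bias=False)
    separable = DepthwiseSeparableConv(64, 128)
    x = torch.randn(1, 64, 32, 32)
    assert standard(x).shape == separable(x).shape   # same output shape
    print(f"standard: {count_params(standard)} params, "
          f"separable: {count_params(separable)} params")
```

For these sizes the separable block uses roughly an order of magnitude fewer parameters (and proportionally fewer multiply-accumulates), which is the saving the survey attributes to this family of techniques.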

The third category is Network Architecture Search (NAS). NAS automates the process of finding DNN architectures optimized for specific tasks and devices, balancing accuracy and performance metrics like latency or energy. Methods typically use a controller (often an RNN) and reinforcement learning to explore potential architectures, evaluating them based on validation accuracy on a target device. Techniques like MNasNet [MNas] demonstrate finding architectures that are significantly faster and smaller than manually designed ones while maintaining or improving accuracy. The primary drawback is the extremely high computational cost, requiring thousands or tens of thousands of GPU hours to search for an optimal architecture, although proxy-based methods and gradient-based approaches like Proxyless-NAS [proxyless] aim to reduce this. Parallel training with adaptive learning rates is suggested as a potential improvement for reducing search time.
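
The kind of objective such searches optimize can be sketched as a latency-aware reward, loosely following the weighted-product form described for MNasNet; the candidate accuracies, latencies, target latency, and exponent below are hypothetical.

```python
def nas_reward(accuracy, latency_ms, target_latency_ms=80.0, w=-0.07):
    """Multi-objective reward trading accuracy against measured on-device latency.

    Weighted-product form ACC * (LAT / TARGET)^w; the negative exponent
    penalizes architectures slower than the latency target.
    """
    return accuracy * (latency_ms / target_latency_ms) ** w

# Rank a few hypothetical candidate architectures proposed by a controller.
candidates = [
    {"name": "arch_a", "accuracy": 0.752, "latency_ms": 78.0},
    {"name": "arch_b", "accuracy": 0.760, "latency_ms": 110.0},
    {"name": "arch_c", "accuracy": 0.741, "latency_ms": 55.0},
]
for c in candidates:
    c["reward"] = nas_reward(c["accuracy"], c["latency_ms"])

best = max(candidates, key=lambda c: c["reward"])
print(f"best candidate under the latency target: {best['name']} "
      f"(reward {best['reward']:.3f})")
```

In a full search, a controller would propose many such candidates and be updated (e.g., by reinforcement learning) to maximize this reward, which is where the large GPU-hour cost comes from.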

The final category discussed is Knowledge Distillation (KD). This involves training a smaller "student" DNN to mimic the output, features, or activations of a larger, more accurate "teacher" DNN. The idea is that the teacher's outputs (e.g., softened softmax probabilities) contain valuable information beyond just hard labels, helping the student learn complex functions more effectively than training on ground truth labels alone. KD can significantly reduce the computation cost of large pre-trained models. Early methods trained the student on data labeled by the teacher [ba]. Later methods like those proposed by Hinton et al. [hinton] use softened outputs, while others match feature vectors [Li] or layer-wise activations [fitnet]. Challenges include strict assumptions on the structural similarity between the student and teacher networks and reliance on softmax output layers. Training the student to mimic neuron activation sequences rather than just outputs or features is suggested to improve generalizability.
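
A minimal sketch of a softened-output distillation loss in the style of Hinton et al. is shown below (PyTorch); the temperature and weighting coefficient are illustrative assumptions, not values taken from the survey.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Combine a KL term on temperature-softened outputs with the usual
    cross-entropy on ground-truth labels."""
    # Softened teacher probabilities carry information about class
    # similarities that hard labels do not.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    log_soft_student = F.log_softmax(student_logits / temperature, dim=1)
    kd_term = F.kl_div(log_soft_student, soft_teacher,
                       reduction="batchmean") * temperature ** 2
    ce_term = F.cross_entropy(student_logits, labels)
    return alpha * kd_term + (1.0 - alpha) * ce_term

if __name__ == "__main__":
    student_logits = torch.randn(8, 10)   # small student, 10 classes
    teacher_logits = torch.randn(8, 10)   # outputs of a large pre-trained teacher
    labels = torch.randint(0, 10, (8,))
    print(distillation_loss(student_logits, teacher_logits, labels))
```

Feature-matching and activation-matching variants mentioned in the survey replace or supplement the KL term with losses on intermediate representations.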

The paper concludes by emphasizing that there is no single best technique; the approaches are often complementary and can be combined. Practical guidelines suggest:

  • Using quantization via libraries like NVIDIA's TensorRT for efficient implementation (see the sketch after this list).
  • Applying pruning and compression when optimizing large pre-trained models.
  • Using compressed filters and matrix factorizations when training new models from scratch.
  • Considering NAS for device-specific optimization, but being aware of its computational cost and potential inefficiency on standard hardware for architectures with many branches.
  • Applying knowledge distillation for small to medium-sized datasets where fewer structural assumptions between teacher and student are needed.
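
On the first guideline, the survey points to vendor libraries such as NVIDIA's TensorRT; as a library-level illustration using PyTorch's post-training dynamic quantization instead (with a hypothetical two-layer model), the snippet below stores Linear weights as int8 without retraining.

```python
import torch
import torch.nn as nn

# A hypothetical small classifier head standing in for a pre-trained model.
model = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 10))
model.eval()

# Post-training dynamic quantization: weights of the listed module types are
# stored as int8 and dequantized on the fly during inference.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(4, 512)
with torch.no_grad():
    print(quantized(x).shape)  # same interface, smaller weight storage
```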

Finally, the survey stresses the importance of evaluating low-power DNNs with comprehensive metrics beyond just test accuracy on large datasets. This includes measuring the number of parameters (for memory), operations (for computation), and, crucially, actual energy consumption by deploying and testing on the target device, as operations and parameters are not always directly proportional to energy usage.
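
For instance, parameter and approximate multiply-accumulate (MAC) counts can be reported with a short hook-based script like the sketch below (PyTorch, counting only Conv2d and Linear layers); measured energy, as the survey stresses, still requires running the model on the target device.

```python
import torch
import torch.nn as nn

def count_params(model):
    """Total number of trainable parameters (a proxy for memory footprint)."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def count_macs(model, input_size=(1, 3, 224, 224)):
    """Rough multiply-accumulate count for Conv2d/Linear layers via forward hooks."""
    macs = 0

    def hook(module, inputs, output):
        nonlocal macs
        if isinstance(module, nn.Conv2d):
            k = module.kernel_size[0] * module.kernel_size[1]
            cin = module.in_channels // module.groups
            # MACs per output element times number of output elements (batch of 1).
            macs += k * cin * output.shape[1] * output.shape[2] * output.shape[3]
        elif isinstance(module, nn.Linear):
            macs += module.in_features * module.out_features

    handles = [m.register_forward_hook(hook)
               for m in model.modules() if isinstance(m, (nn.Conv2d, nn.Linear))]
    with torch.no_grad():
        model(torch.randn(*input_size))
    for h in handles:
        h.remove()
    return macs

if __name__ == "__main__":
    model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                          nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, 10))
    print(f"params: {count_params(model)}, approx MACs: {count_macs(model):,}")
```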

Authors (4)
  1. Abhinav Goel (12 papers)
  2. Caleb Tung (11 papers)
  3. Yung-Hsiang Lu (27 papers)
  4. George K. Thiruvathukal (48 papers)
Citations (85)