Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search (1812.00090v1)

Published 30 Nov 2018 in cs.CV

Abstract: Recent work in network quantization has substantially reduced the time and space complexity of neural network inference, enabling their deployment on embedded and mobile devices with limited computational and memory resources. However, existing quantization methods often represent all weights and activations with the same precision (bit-width). In this paper, we explore a new dimension of the design space: quantizing different layers with different bit-widths. We formulate this problem as a neural architecture search problem and propose a novel differentiable neural architecture search (DNAS) framework to efficiently explore its exponential search space with gradient-based optimization. Experiments show we surpass the state-of-the-art compression of ResNet on CIFAR-10 and ImageNet. Our quantized models with 21.1x smaller model size or 103.9x lower computational cost can still outperform baseline quantized or even full precision models.

Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search

The paper "Mixed Precision Quantization of ConvNets via Differentiable Neural Architecture Search" presents a sophisticated approach to compress convolutional neural networks (ConvNets) by assigning mixed precision levels to different layers. This is achieved using a novel and efficient framework known as Differentiable Neural Architecture Search (DNAS). This approach addresses the need for reducing computational cost and model size, which is especially critical for deploying models on resource-constrained devices like mobile phones and embedded systems.

Key Contributions

The primary contribution of the paper is the introduction of DNAS, a differentiable architecture search method that optimizes layer-wise precision assignments with gradient-based techniques rather than exhaustive search. This makes exploring the exponentially large space of per-layer bit-width assignments computationally feasible even on large datasets. The main innovations include:

  1. Mixed Precision Quantization Model: Unlike traditional quantization methods that use uniform bit-widths for all layers, this paper proposes assigning different bit-widths to various layers, depending on their impact on network performance and size. This mixed precision model accommodates the varying sensitivity of network layers to quantization.
  2. Differentiable Neural Architecture Search (DNAS): The DNAS framework uses a stochastic super network to represent all possible architectures within a predefined search space. The Gumbel Softmax relaxation makes sampling from this super network differentiable, allowing the architecture parameters to be optimized by gradient descent to pinpoint the best configuration (a minimal sketch of this mechanism follows this list).
  3. Fast Search Process: DNAS is computationally efficient, completing a full search on ResNet18 for ImageNet in less than five hours on 8 V100 GPUs, compared to the days required by reinforcement-learning-based NAS approaches.
  4. Extensive Experiments: The paper presents experiments showing that the quantized ResNet models on CIFAR-10 and ImageNet match or exceed the accuracy of full precision and previously quantized baselines while substantially reducing model size and computational cost.
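
Below is a minimal sketch of the Gumbel-Softmax selection mechanism, written under simplifying assumptions rather than from the authors' code: a convolutional "super layer" keeps one set of full-precision weights plus a vector of architecture logits, one per candidate bit-width, and soft selection weights make the discrete bit-width choice differentiable so the logits can be trained with SGD alongside the weights.

```python
# Sketch only: candidate bit-widths, the quantizer, and the cost proxy are
# illustrative choices, not the exact ones used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F


def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    """Uniform symmetric fake-quantization with a straight-through estimator."""
    if bits >= 32:          # treat 32 bits as full precision
        return w
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max().clamp(min=1e-8) / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()   # gradients flow to the full-precision weights


class MixedPrecisionConv2d(nn.Module):
    """Conv layer whose weight bit-width is selected via Gumbel-Softmax."""

    def __init__(self, in_ch, out_ch, k, candidate_bits=(2, 4, 8, 32)):
        super().__init__()
        self.conv = nn.Conv2d(in_ch, out_ch, k, padding=k // 2)
        self.candidate_bits = candidate_bits
        # Architecture parameters: one logit per candidate bit-width.
        self.theta = nn.Parameter(torch.zeros(len(candidate_bits)))

    def forward(self, x, tau: float = 1.0):
        # Soft, differentiable selection weights over the candidate bit-widths.
        probs = F.gumbel_softmax(self.theta, tau=tau, hard=False)
        w = sum(p * fake_quantize(self.conv.weight, b)
                for p, b in zip(probs, self.candidate_bits))
        return F.conv2d(x, w, self.conv.bias, padding=self.conv.padding)

    def expected_weight_bits(self) -> torch.Tensor:
        """Differentiable proxy for this layer's weight storage cost."""
        probs = F.softmax(self.theta, dim=0)
        bits = torch.tensor(self.candidate_bits, dtype=probs.dtype, device=probs.device)
        return (probs * bits).sum() * self.conv.weight.numel()
```

During the search, a cost term built from `expected_weight_bits()` summed over all layers can be combined with the task loss so that gradient descent trades accuracy against model size; after the search, each layer keeps the bit-width with the largest logit. The exact loss form and quantization functions in the paper differ from this toy version.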

Results and Implications

The paper reports compelling quantitative results: mixed precision quantized ResNet models reach up to 21.1x compression in model size or a 103.9x reduction in computational cost while maintaining accuracy comparable to or better than full precision models. For instance, ResNet18 quantized with DNAS achieved higher accuracy than its full precision counterpart on ImageNet while reducing model size by a factor of 11.2.
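
As a rough, purely illustrative calculation (the layer sizes and bit-width assignments below are hypothetical, not taken from the paper), a headline weight-compression ratio is just the 32-bit baseline storage divided by the mixed-precision storage; large ratios arise when most parameters sit in layers assigned very low bit-widths.

```python
# Hypothetical per-layer parameter counts and bit-width assignments,
# used only to illustrate how a weight-compression ratio is computed.
layer_params = {"conv1": 9_408, "stage1": 147_456, "stage2": 1_180_160, "fc": 513_000}
assigned_bits = {"conv1": 8, "stage1": 2, "stage2": 1, "fc": 8}

baseline_bits = 32 * sum(layer_params.values())                  # full-precision storage
mixed_bits = sum(assigned_bits[name] * n for name, n in layer_params.items())
print(f"compression ratio: {baseline_bits / mixed_bits:.1f}x")   # ~10.5x for these numbers
```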

These results have substantial implications both practically and theoretically. From a practical standpoint, the ability to drastically reduce compute and memory demands without sacrificing accuracy enables real-world deployment of state-of-the-art neural networks on devices with limited processing capability and memory. Theoretically, this work opens new possibilities in architecture search, challenging conventional paradigms dominated by exhaustive search methods.

Future Directions

The DNAS framework itself extends beyond mixed precision quantization to other neural architecture search problems, suggesting its use for efficient ConvNet structure discovery. Future research could search for optimal architectures of other network types, such as recurrent or transformer models, and could integrate hardware-specific optimizations so that the resulting quantized models align better with real-world deployment on diverse hardware.

Overall, this paper significantly advances ConvNet quantization by combining neural architecture search with precision optimization through differentiable methods, setting a precedent for future research on neural network efficiency.

Authors (6)
  1. Bichen Wu (52 papers)
  2. Yanghan Wang (4 papers)
  3. Peizhao Zhang (40 papers)
  4. Yuandong Tian (128 papers)
  5. Peter Vajda (52 papers)
  6. Kurt Keutzer (199 papers)
Citations (262)