
Neural Network Compression Framework for fast model inference (2002.08679v4)

Published 20 Feb 2020 in cs.CV and eess.IV

Abstract: In this work we present a new framework for neural networks compression with fine-tuning, which we called Neural Network Compression Framework (NNCF). It leverages recent advances of various network compression methods and implements some of them, such as sparsity, quantization, and binarization. These methods allow getting more hardware-friendly models which can be efficiently run on general-purpose hardware computation units (CPU, GPU) or special Deep Learning accelerators. We show that the developed methods can be successfully applied to a wide range of models to accelerate the inference time while keeping the original accuracy. The framework can be used within the training samples, which are supplied with it, or as a standalone package that can be seamlessly integrated into the existing training code with minimal adaptations. Currently, a PyTorch version of NNCF is available as a part of OpenVINO Training Extensions at https://github.com/openvinotoolkit/nncf.

Neural Network Compression Framework for Fast Model Inference

This paper introduces the Neural Network Compression Framework (NNCF), a PyTorch-based tool engineered to enhance neural network efficiency through various compression techniques. Given the escalating computational requirements of deep neural networks (DNNs), the framework seeks to accelerate model inference, particularly on resource-constrained hardware, by implementing quantization, sparsity, filter pruning, and binarization.

Framework Features

The authors highlight several key features of NNCF:

  • Quantization: Both symmetric and asymmetric quantization schemes are supported, with optional mixed-precision strategies. The framework enables automatic fake quantization insertion into the model graph, aiding the preservation of accuracy while optimizing model performance.
  • Binarization: Binarization of weights and activations is supported, leveraging techniques like XNOR and DoReFa, achieving a significant reduction in model complexity albeit with some accuracy trade-offs.
  • Sparsity and Pruning: Methods for both magnitude-based and regularization-based sparsification are implemented, capable of preserving accuracy while reducing network complexity. Filter pruning is also integrated, allowing the removal of less salient filters to streamline model execution.
  • Model Transformation and Stacking: NNCF performs automatic model transformation by inserting compression layers and supports stacking of multiple compression methods to achieve compounded benefits. A minimal integration sketch follows this list.
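To make the integration pattern concrete, below is a minimal sketch of how INT8 quantization might be wired into an existing PyTorch training script. The entry points shown (`NNCFConfig`, `register_default_init_args`, `create_compressed_model`, `compression_ctrl.loss()`) follow NNCF's documented PyTorch API, but names, signatures, and config keys have shifted between releases, so treat the details as assumptions to be checked against the installed version.

```python
import torch
import torchvision
from nncf import NNCFConfig
from nncf.torch import create_compressed_model, register_default_init_args

# An ordinary PyTorch model and a small dummy loader stand in for the
# user's existing training setup.
model = torchvision.models.resnet50(weights=None)
dataset = torch.utils.data.TensorDataset(
    torch.randn(32, 3, 224, 224), torch.randint(0, 1000, (32,)))
train_loader = torch.utils.data.DataLoader(dataset, batch_size=8)

# The compression method and the model input shape are declared in a config.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": {"algorithm": "quantization"},  # INT8 fake quantization
})
# Quantizer initialization (range calibration) draws a few batches of data.
nncf_config = register_default_init_args(nncf_config, train_loader)

# NNCF transforms the model graph, inserting fake-quantization operations.
compression_ctrl, compressed_model = create_compressed_model(model, nncf_config)

# Standard fine-tuning loop; the only NNCF-specific addition is the extra
# compression loss term (effectively zero for plain quantization, but needed
# for methods such as regularization-based sparsity).
optimizer = torch.optim.SGD(compressed_model.parameters(), lr=1e-3, momentum=0.9)
criterion = torch.nn.CrossEntropyLoss()
for images, targets in train_loader:
    optimizer.zero_grad()
    loss = criterion(compressed_model(images), targets) + compression_ctrl.loss()
    loss.backward()
    optimizer.step()
```

The same entry point serves the other algorithms; only the configuration changes, which is what makes the minimal-adaptation claim plausible in practice.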

Numerical Results and Claims

Strong numerical results were presented across diverse model types and use cases, including image classification, object detection, and natural language processing:

  • INT8 quantization achieved up to 3.11x speed improvements with negligible accuracy drops across well-known models like ResNet50 and MobileNet variations.
  • Mixed-precision quantization showed promise in preserving accuracy within 1% of full precision, indicating a viable pathway for applications demanding extreme inference efficiency.
  • When combining sparsity and quantization, the framework consistently produced models with competitive accuracy while enhancing runtime efficiency; a configuration sketch for this kind of stacking follows the list.
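As an illustration of how such stacking could be expressed, the sketch below lists two algorithms in a single NNCF configuration. The list form of the `compression` section follows NNCF's documented schema; the algorithm names and any per-algorithm parameters are assumptions that should be verified against the version in use.

```python
from nncf import NNCFConfig

# Stacking magnitude-based sparsity with INT8 quantization: the
# "compression" section accepts a list of algorithm entries. Per-algorithm
# parameters (target sparsity level, schedule, bit width) are set via
# additional keys described in the NNCF documentation for the installed
# release; they are omitted here to keep the sketch minimal.
nncf_config = NNCFConfig.from_dict({
    "input_info": {"sample_size": [1, 3, 224, 224]},
    "compression": [
        {"algorithm": "magnitude_sparsity"},
        {"algorithm": "quantization"},
    ],
})
# This config is passed to create_compressed_model() exactly as in the
# single-algorithm sketch above.
```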

Practical Implications

The practical implications of this work are significant in domains where reduced model latency and size directly contribute to better system performance, such as in mobile or embedded devices. By integrating seamlessly with existing PyTorch codebases and supporting export to ONNX for subsequent inference via OpenVINO, NNCF provides a comprehensive solution for deploying compressed models in real-world applications.
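That deployment path can be sketched as follows. The `export_model` call follows NNCF's compression-controller API, and the Model Optimizer invocation is the standard OpenVINO conversion step; both are shown as illustrative assumptions rather than commands taken from the paper.

```python
# After fine-tuning, export the compressed model to ONNX for OpenVINO.
# export_model() is part of NNCF's compression-controller API; verify the
# exact name and arguments against the installed release.
compression_ctrl.export_model("resnet50_int8.onnx")

# The ONNX file can then be converted to OpenVINO IR with the Model
# Optimizer, e.g. (illustrative command line):
#   mo --input_model resnet50_int8.onnx
```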

Theoretical Implications

On a theoretical level, the amalgamation of several compression techniques and the capacity to stack these within a single framework may inspire new research directions investigating the interplay and optimal configuration of different compression strategies. Additionally, the alignment of compression methodologies with hardware-specific capabilities (e.g., fixed-point arithmetic) raises important considerations for architecture design and optimization.

Future Directions

Future expansions of NNCF might include refining algorithms for ultra-low-precision quantization, extending model compatibility, or incorporating real-time learning schemes that adapt to dynamically changing hardware conditions. Furthermore, as AI models become more pervasive, exploring automated or AI-driven selection of compression strategies could further enhance usability and deployment effectiveness.

In conclusion, NNCF provides a robust toolset for neural network compression, effectively balancing performance gains with accuracy retention, making it a valuable resource for researchers and practitioners aiming to optimize DNN inference.

Authors (5)
  1. Alexander Kozlov (17 papers)
  2. Ivan Lazarevich (8 papers)
  3. Vasily Shamporov (1 paper)
  4. Nikolay Lyalyushkin (2 papers)
  5. Yury Gorbachev (2 papers)
Citations (33)