Model compression via distillation and quantization (1802.05668v1)

Published 15 Feb 2018 in cs.NE and cs.LG

Abstract: Deep neural networks (DNNs) continue to make significant advances, solving tasks from image classification to translation or reinforcement learning. One aspect of the field receiving considerable attention is efficiently executing deep models in resource-constrained environments, such as mobile or embedded devices. This paper focuses on this problem, and proposes two new compression methods, which jointly leverage weight quantization and distillation of larger teacher networks into smaller student networks. The first method we propose is called quantized distillation and leverages distillation during the training process, by incorporating distillation loss, expressed with respect to the teacher, into the training of a student network whose weights are quantized to a limited set of levels. The second method, differentiable quantization, optimizes the location of quantization points through stochastic gradient descent, to better fit the behavior of the teacher model. We validate both methods through experiments on convolutional and recurrent architectures. We show that quantized shallow students can reach similar accuracy levels to full-precision teacher models, while providing order of magnitude compression, and inference speedup that is linear in the depth reduction. In sum, our results enable DNNs for resource-constrained environments to leverage architecture and accuracy advances developed on more powerful devices.

Model Compression via Distillation and Quantization

In the paper "Model Compression via Distillation and Quantization," the authors address a critical issue in deploying deep neural networks (DNNs) in resource-constrained environments such as mobile devices. The paper introduces two innovative methods for model compression: quantized distillation and differentiable quantization. These techniques aim to effectively reduce model size and computation while maintaining competitive accuracy levels.

Overview of Contributions

The authors develop methods that jointly leverage knowledge distillation and weight quantization. Distillation involves transferring the learned knowledge from a larger "teacher" model to a smaller "student" model. Quantization, on the other hand, reduces model size by constraining weights to a limited set of values.

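As a concrete illustration of these two ingredients, the sketch below shows a standard temperature-softened distillation loss and uniform weight quantization. This is a minimal PyTorch sketch, not the authors' implementation; the function names and the default bit width are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL divergence between temperature-softened teacher and student outputs."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # The temperature**2 factor keeps gradient magnitudes comparable across temperatures.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2

def uniform_quantize(w, bits=4):
    """Map weights onto 2**bits evenly spaced levels spanning their observed range."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    return torch.round((w - w_min) / scale) * scale + w_min
```

A student trained with `distillation_loss` alone recovers plain distillation; the two proposed methods combine it with quantization in different ways.
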
  1. Quantized Distillation: This method incorporates the distillation loss into the training of a student whose weights are quantized to a limited set of levels, integrating quantization and distillation into a single training process. It allows the student to closely track the teacher's performance despite its much smaller size (see the first sketch after this list).
  2. Differentiable Quantization: This method optimizes the locations of the quantization points themselves via stochastic gradient descent, so that the quantized model's behavior better matches that of the teacher (see the second sketch after this list).
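
A minimal sketch of quantized distillation follows, again assuming PyTorch. Here the student keeps full-precision weights, the forward pass uses their uniformly quantized version, and a straight-through estimator lets the gradient update the full-precision copy; `QuantLinear`, the bit width, and the loss weighting `alpha` are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def quantize_ste(w, bits=4):
    """Uniform quantization whose backward pass is the identity (straight-through)."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min().detach(), w.max().detach()
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    q = torch.round((w - w_min) / scale) * scale + w_min
    return w + (q - w).detach()  # forward: q; backward: gradient flows to w unchanged

class QuantLinear(torch.nn.Linear):
    """Linear layer that quantizes its weights on the fly in the forward pass."""
    def __init__(self, in_features, out_features, bits=4):
        super().__init__(in_features, out_features)
        self.bits = bits

    def forward(self, x):
        return F.linear(x, quantize_ste(self.weight, self.bits), self.bias)

def quantized_distillation_step(student, teacher, optimizer, x, y, T=4.0, alpha=0.7):
    """One optimization step: distillation plus hard-label loss on the quantized student."""
    with torch.no_grad():
        teacher_logits = teacher(x)
    student_logits = student(x)  # forward pass uses quantized weights via QuantLinear
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * T ** 2
    hard = F.cross_entropy(student_logits, y)
    loss = alpha * soft + (1 - alpha) * hard
    optimizer.zero_grad()
    loss.backward()  # straight-through gradients reach the full-precision weights
    optimizer.step()
    return loss.item()
```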

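For differentiable quantization, a plausible sketch (same assumptions: PyTorch, illustrative names) treats the quantization point locations as trainable parameters: weights are rescaled to [0, 1], snapped to their nearest point, and only the points receive gradients, which are then optimized by SGD against the distillation loss.

```python
import torch

class LearnedQuantizer(torch.nn.Module):
    """Quantizer whose point locations are learned by gradient descent."""
    def __init__(self, num_points=16):
        super().__init__()
        # Start from uniformly spaced points in [0, 1]; training moves them.
        self.points = torch.nn.Parameter(torch.linspace(0.0, 1.0, num_points))

    def forward(self, w):
        # Rescale weights to [0, 1], assign each to its nearest point, rescale back.
        w_min, w_max = w.min().detach(), w.max().detach()
        scale = (w_max - w_min).clamp(min=1e-8)
        unit = (w - w_min) / scale
        idx = torch.argmin((unit.unsqueeze(-1) - self.points).abs(), dim=-1)
        q_unit = self.points[idx]  # gradients flow only into self.points
        return q_unit * scale + w_min
```

Minimizing the distillation loss of a student whose weights pass through `LearnedQuantizer` then adjusts the points so the quantized model better reproduces the teacher's outputs.
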
Empirical Validation and Results

The methods are validated across convolutional and recurrent architectures. Experimental results show that quantized students reach accuracy levels comparable to full-precision teacher models while being compressed by up to an order of magnitude in size. Notably, on vision tasks such as CIFAR-10 and CIFAR-100, quantized models retain nearly the same accuracy as their full-precision counterparts.

Critical Insights and Implications

  • Automatic Inference Speedup: Because shallower students replace deeper teachers, the methods deliver an inference speedup that scales linearly with the depth reduction, making DNNs more practical for real-time use on embedded systems.
  • Generalization to Large Datasets: The approaches maintain accuracy when scaled to larger tasks, as shown in experiments on ImageNet and WMT13.
  • Quantization as Noise Addition: The paper provides a theoretical analysis showing that quantization acts as adding Gaussian noise to the model weights, which can be interpreted as a form of regularization (a small numerical illustration follows this list).

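The snippet below is a small numerical illustration of this view, not code from the paper: with stochastic (unbiased) rounding, the error that a quantized weight vector introduces into a dot product is zero-mean and roughly bell-shaped, which is what allows quantization to be read as weight noise.

```python
import torch

torch.manual_seed(0)

def stochastic_quantize(w, bits=4):
    """Round each weight up or down with probability proportional to its distance,
    so the expected quantized value equals the original (zero-mean error)."""
    levels = 2 ** bits - 1
    w_min, w_max = w.min(), w.max()
    scale = (w_max - w_min).clamp(min=1e-8) / levels
    unit = (w - w_min) / scale
    floor = unit.floor()
    q = floor + (torch.rand_like(unit) < (unit - floor)).float()
    return q * scale + w_min

w, x = torch.randn(4096), torch.randn(4096)
errors = torch.stack([stochastic_quantize(w) @ x - w @ x for _ in range(2000)])
print(f"mean error: {errors.mean():+.4f}, std: {errors.std():.4f}")  # mean is close to 0
```
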
Future Directions

The research opens several paths for future exploration:

  • Automated Architecture Search: Employing reinforcement learning to explore optimal student network architectures could provide further efficiency improvements.
  • Integration with Hardware Optimizations: Evaluating these compression methods on specialized hardware such as FPGAs, or with inference frameworks such as NVIDIA TensorRT, could yield further performance benefits.

In conclusion, this paper contributes valuable insights and methods for compressing DNNs, enabling the execution of complex models in constrained environments without sacrificing performance. These advances could significantly impact real-world applications by making AI more accessible and efficient on edge devices.

Authors (3)
  1. Antonio Polino (1 paper)
  2. Razvan Pascanu (138 papers)
  3. Dan Alistarh (133 papers)
Citations (685)