IDKM: Memory Efficient Neural Network Quantization via Implicit, Differentiable k-Means (2312.07759v2)

Published 12 Dec 2023 in cs.LG

Abstract: Compressing large neural networks with minimal performance loss is crucial to enabling their deployment on edge devices. (Cho et al., 2022) proposed a weight quantization method that uses an attention-based clustering algorithm called differentiable $k$-means (DKM). Despite achieving state-of-the-art results, DKM's performance is constrained by its heavy memory dependency. We propose an implicit, differentiable $k$-means algorithm (IDKM), which eliminates the major memory restriction of DKM. Let $t$ be the number of $k$-means iterations, $m$ be the number of weight-vectors, and $b$ be the number of bits per cluster address. IDKM reduces the overall memory complexity of a single $k$-means layer from $\mathcal{O}(t \cdot m \cdot 2^b)$ to $\mathcal{O}(m \cdot 2^b)$. We also introduce a variant, IDKM with Jacobian-Free-Backpropagation (IDKM-JFB), for which the time complexity of the gradient calculation is independent of $t$ as well. We provide a proof of concept of our methods by showing that, under the same settings, IDKM achieves comparable performance to DKM with less compute time and less memory. We also use IDKM and IDKM-JFB to quantize a large neural network, ResNet18, on hardware where DKM cannot train at all.
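
The sketch below illustrates the idea behind the IDKM-JFB variant; it is not the authors' implementation. The soft ("attention-based") $k$-means iterations are run without building an autodiff graph, and gradients are taken only through a single final iteration, so a layer's memory cost stays near $\mathcal{O}(m \cdot 2^b)$ rather than growing with the iteration count $t$. The function names, the temperature `tau`, and the center initialization are assumptions made for illustration.

```python
# Minimal sketch (assumptions noted above) of Jacobian-free backprop through
# soft k-means quantization: iterate to (near) convergence without gradient
# tracking, then backpropagate through one final iteration only.
import torch

def soft_kmeans_step(W, C, tau=1e-2):
    """One soft k-means iteration: attention of weight-vectors W (m, d) over centers C (k, d)."""
    d2 = torch.cdist(W, C) ** 2              # (m, k) squared distances
    A = torch.softmax(-d2 / tau, dim=1)      # soft assignments ("attention")
    C_new = (A.t() @ W) / (A.sum(dim=0, keepdim=True).t() + 1e-12)
    return C_new, A

def idkm_jfb_quantize(W, C0, t=50, tau=1e-2):
    """Run t iterations without tracking gradients, then one tracked step (JFB-style)."""
    C = C0
    with torch.no_grad():                     # memory independent of t
        for _ in range(t):
            C, _ = soft_kmeans_step(W, C, tau)
    C, A = soft_kmeans_step(W, C.detach(), tau)   # single differentiable step
    return A @ C                              # soft-quantized weights (m, d)

# Usage: quantize m weight-vectors of dimension d into 2^b = 16 clusters.
W = torch.randn(1024, 4, requires_grad=True)
C0 = W.detach()[torch.randperm(1024)[:16]]    # illustrative initialization
W_q = idkm_jfb_quantize(W, C0)
loss = W_q.pow(2).mean()                      # stand-in for the task loss
loss.backward()                               # gradients reach W through one step only
```

Full IDKM instead differentiates through the clustering fixed point via implicit differentiation; the one-step backward pass above corresponds to the Jacobian-free approximation whose gradient cost is independent of $t$.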

References (33)
  1. Neural machine translation by jointly learning to align and translate, 2015. URL https://arxiv.org/abs/1409.0473.
  2. Deep equilibrium models. Advances in Neural Information Processing Systems, 32, 2019.
  3. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in Neural Information Processing Systems, 32, 2019.
  4. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013.
  5. Neural ordinary differential equations. Advances in Neural Information Processing Systems, 31, 2018.
  6. Metaquant: Learning to quantize by learning to penetrate non-differentiable quantization. Advances in Neural Information Processing Systems, 32, 2019. URL https://proceedings.neurips.cc/paper/2019/file/f8e59f4b2fe7c5705bf878bbd494ccdf-Paper.pdf.
  7. DKM: Differentiable k-means clustering layer for neural network compression. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=J_F_qqCE3Z5.
  8. Recurrent stacking of layers for compact neural machine translation models. Proceedings of the AAAI Conference on Artificial Intelligence, 33:6292–6299, 2019. URL https://ojs.aaai.org/index.php/AAAI/article/view/4590.
  9. Hawq: Hessian aware quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), October 2019.
  10. Implicit deep learning. SIAM Journal on Mathematics of Data Science, 3(3):930–958, 2021.
  11. Post-training piecewise linear quantization for deep neural networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pp.  69–86. Springer, 2020.
  12. Jfb: Jacobian-free backpropagation for implicit networks, 2021. URL https://arxiv.org/abs/2103.12803.
  13. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2015a. URL https://arxiv.org/abs/1510.00149.
  14. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding, 2015b. URL https://arxiv.org/abs/1510.00149.
  15. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  770–778, 2016.
  16. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.
  17. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
  18. Squeezenet: Alexnet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016.
  19. Robust implicit networks via non-Euclidean contractions. Advances in Neural Information Processing Systems, 34:9857–9868, 2021.
  20. Learning multiple layers of features from tiny images. 2009.
  21. Mnist handwritten digit database. ATT Labs [Online]. Available: http://yann.lecun.com/exdb/mnist, 2, 2010.
  22. Towards accurate binary convolutional neural network. CoRR, abs/1711.11294, 2017.
  23. O’Neill, J. An overview of neural network compression. CoRR, abs/2006.03669, 2020.
  24. PROFIT: A novel training method for sub-4-bit mobilenet models. CoRR, abs/2008.04693, 2020.
  25. A comprehensive survey on model quantization for deep neural networks. arXiv preprint arXiv:2205.07877, 2022.
  26. And the bit goes down: Revisiting the quantization of neural networks. arXiv preprint arXiv:1907.05686, 2019.
  27. Efficientnet: Rethinking model scaling for convolutional neural networks. In International Conference on Machine Learning, pp. 6105–6114, 2019.
  28. Soft weight-sharing for neural network compression. arXiv preprint arXiv:1702.04008, 2017.
  29. Methods for pruning deep neural networks. IEEE Access, 10:63280–63300, 2022.
  30. HAQ: hardware-aware automated quantization. CoRR, abs/1811.08886, 2018.
  31. Monotone operator equilibrium networks. Advances in Neural Information Processing Systems, 33, 2020.
  32. Deep k-means: Re-training and parameter sharing with harder cluster assignments for compressing deep convolutions. CoRR, abs/1806.09228, 2018.
  33. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.  6848–6856, 2018.
Authors (3)
  1. Sean Jaffe (4 papers)
  2. Ambuj K. Singh (25 papers)
  3. Francesco Bullo (141 papers)