Network Memory Footprint Compression Through Jointly Learnable Codebooks and Mappings (2309.17361v1)
Abstract: The massive interest in deep neural networks (DNNs) for both computer vision and natural language processing has been sparked by the growth in computational power. However, this growth has also led to larger memory footprints, to the point where simply loading a model can be challenging on commodity devices such as mobile phones. To address this limitation, quantization is a favored solution, as it maps high-precision tensors to a low-precision, memory-efficient format. In terms of memory footprint reduction, its most effective variants are based on codebooks. These methods, however, suffer from two limitations. First, they either define a single codebook for each tensor or rely on a memory-expensive mapping to multiple codebooks. Second, gradient-descent optimization of the mapping favors jumps toward extreme values, and hence does not define a proximal search. In this work, we propose to address these two limitations. First, we group similarly distributed neurons and leverage the re-ordered structure to either apply different scale factors to the different groups or map the weights falling within these groups to several codebooks, without any mapping overhead. Second, stemming from this initialization, we propose a joint learning of the codebook and weight mappings that bears similarities with recent gradient-based post-training quantization techniques. Third, drawing inspiration from straight-through estimation techniques, we introduce a novel gradient update definition to enable a proximal search of the codebooks and their mappings. The proposed jointly learnable codebooks and mappings (JLCM) method allows a very efficient approximation of any DNN: as a result, a Llama 7B model can be compressed down to 2 GB and loaded on a 5-year-old smartphone.
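The abstract describes codebook-based weight compression in which both the codebook entries and the weight-to-codeword mapping receive gradients through a straight-through style estimator. Below is a minimal PyTorch sketch of that general mechanism, not the authors' JLCM implementation: the class name `CodebookLinear`, the quantile-based codebook initialization, and the dense per-weight assignment logits are illustrative assumptions, and the paper's neuron grouping and proximal gradient update are not reproduced here.

```python
# A minimal sketch, assuming a single per-tensor codebook and per-weight
# assignment logits; NOT the authors' JLCM implementation (the paper's
# neuron grouping and proximal gradient definition are not reproduced).
import torch
import torch.nn as nn


class CodebookLinear(nn.Module):
    """Linear layer whose weights are reconstructed from a learnable codebook."""

    def __init__(self, weight: torch.Tensor, n_codewords: int = 16):
        super().__init__()
        flat = weight.detach().flatten()
        # Codebook initialized on the weight quantiles (a simple stand-in for
        # the grouping/clustering-based initialization described in the paper).
        init = torch.quantile(flat, torch.linspace(0.0, 1.0, n_codewords))
        self.codebook = nn.Parameter(init)
        # Per-weight assignment logits: the nearest codeword gets the largest
        # logit. Dense logits are kept only for illustration; at inference time
        # one would store hard integer indices to actually save memory.
        self.logits = nn.Parameter(-(flat[:, None] - init[None, :]).abs())
        self.shape = weight.shape

    def quantized_weight(self) -> torch.Tensor:
        # Hard assignment in the forward pass, soft gradient in the backward
        # pass (straight-through style), so both the codebook values and the
        # mapping receive gradients during joint learning.
        soft = torch.softmax(self.logits, dim=-1)
        hard = torch.zeros_like(soft).scatter_(
            -1, soft.argmax(dim=-1, keepdim=True), 1.0
        )
        assign = hard + soft - soft.detach()
        return (assign @ self.codebook).reshape(self.shape)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x @ self.quantized_weight().t()


# Usage: wrap an existing weight matrix and fine-tune codebook + mapping.
layer = CodebookLinear(torch.randn(64, 128), n_codewords=16)
out = layer(torch.randn(4, 128))
out.sum().backward()  # gradients flow to layer.codebook and layer.logits
```

Note that this plain straight-through update is exactly the kind of mapping optimization the abstract criticizes for favoring jumps toward extreme values; the paper's contribution is a modified gradient definition that instead performs a proximal search, which the sketch does not attempt to replicate.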