Knowledge Translation: A New Pathway for Model Compression
Abstract: Deep learning has achieved remarkable progress in recent years, at the cost of ever-increasing training, inference, and model storage overhead. Existing model compression methods strive to reduce the number of model parameters while maintaining high accuracy, but they inevitably require re-training the compressed model or impose architectural constraints. To overcome these limitations, this paper presents a novel framework, termed \textbf{K}nowledge \textbf{T}ranslation (KT), in which a ``translation'' model is trained to receive the parameters of a larger model and generate compressed parameters. The concept of KT draws inspiration from language translation, where neural networks convert text between languages while preserving its meaning. Analogously, we explore the potential of neural networks to convert models of disparate sizes while preserving their functionality. We propose a comprehensive framework for KT, introduce data augmentation strategies that enhance performance despite limited training data, and demonstrate the feasibility of KT on the MNIST dataset. Code is available at \url{https://github.com/zju-SWJ/KT}.
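To make the core idea concrete, below is a minimal sketch of what a KT ``translation'' model could look like: a network whose input is the flattened parameters of a large block and whose output is the parameters of a smaller block. The MLP architecture, dimensions, loss, and training loop here are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of the Knowledge Translation (KT) idea: a neural network that
# "translates" large-model parameters into small-model parameters.
# All names, shapes, and the MLP architecture are illustrative assumptions.
import torch
import torch.nn as nn

class KnowledgeTranslator(nn.Module):
    """Maps flattened parameters of a large block to those of a small block."""
    def __init__(self, large_dim: int, small_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(large_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, small_dim),
        )

    def forward(self, large_params: torch.Tensor) -> torch.Tensor:
        return self.net(large_params)

# Each training example is a pair (theta_large, theta_small) of flattened
# weights from models trained to perform the same task; the translator
# learns the mapping between them (here supervised with a simple MSE loss).
translator = KnowledgeTranslator(large_dim=4096, small_dim=512)
optimizer = torch.optim.Adam(translator.parameters(), lr=1e-4)
criterion = nn.MSELoss()

theta_large = torch.randn(32, 4096)  # batch of flattened large-model weights
theta_small = torch.randn(32, 512)   # corresponding small-model weights

optimizer.zero_grad()
loss = criterion(translator(theta_large), theta_small)
loss.backward()
optimizer.step()
```

Note that in this framing the training set consists of model weights rather than images or text, and collecting many trained (large, small) weight pairs is expensive; this is presumably why the data augmentation strategies mentioned in the abstract matter for performance under limited training data.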