EPSD: Early Pruning with Self-Distillation for Efficient Model Compression (2402.00084v1)

Published 31 Jan 2024 in cs.LG, cs.AI, and cs.CV

Abstract: Neural network compression techniques, such as knowledge distillation (KD) and network pruning, have received increasing attention. Recent work `Prune, then Distill' reveals that a pruned, student-friendly teacher network can benefit the performance of KD. However, the conventional teacher-student pipeline, which entails cumbersome pre-training of the teacher and complicated compression steps, makes pruning with KD less efficient. Beyond compressing models, recent compression techniques also emphasize efficiency. Early pruning demands significantly less computational cost than conventional pruning methods, as it does not require a large pre-trained model. Likewise, self-distillation (SD), a special case of KD, is more efficient since it requires no pre-training or teacher-student pair selection. This inspires us to combine early pruning with SD for efficient model compression. In this work, we propose Early Pruning with Self-Distillation (EPSD), a framework that identifies and preserves distillable weights during early pruning for a given SD task. EPSD combines early pruning and self-distillation in an efficient two-step process while maintaining the trainability of the pruned network. Rather than naively stacking pruning and SD, EPSD keeps more distillable weights before training so that the pruned network favors SD and distills more effectively. We demonstrate that EPSD improves the training of pruned networks, supported by visual and quantitative analyses. Our evaluation covers diverse benchmarks (CIFAR-10/100, Tiny-ImageNet, full ImageNet, CUB-200-2011, and Pascal VOC), with EPSD outperforming advanced pruning and SD techniques.
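
The abstract describes a two-step pipeline: first identify and keep "distillable" weights before training, then train the pruned network with self-distillation. Below is a minimal PyTorch sketch of how such a pipeline might look, assuming a SNIP-style saliency score computed on an SD objective rather than plain cross-entropy; the particular SD loss, the flipped-image "second view", and all hyperparameters (temperature, alpha, sparsity) are illustrative assumptions, not the paper's exact formulation.

import torch
import torch.nn.functional as F

def sd_loss(logits_a, logits_b, labels, temperature=4.0, alpha=0.5):
    """Illustrative self-distillation objective: cross-entropy plus a KL term
    matching the network's own softened predictions from a second view."""
    ce = F.cross_entropy(logits_a, labels)
    kd = F.kl_div(
        F.log_softmax(logits_a / temperature, dim=1),
        F.softmax(logits_b.detach() / temperature, dim=1),
        reduction="batchmean",
    ) * temperature ** 2
    return alpha * ce + (1.0 - alpha) * kd

def epsd_style_prune(model, images, labels, sparsity=0.9):
    """Step 1 (sketch): prune at initialization with a SNIP-like saliency
    |w * dL/dw| computed on the SD objective, so that the surviving weights
    are the ones most useful for distillation."""
    loss = sd_loss(model(images), model(torch.flip(images, dims=[-1])), labels)
    loss.backward()

    prunable = [p for p in model.parameters() if p.dim() > 1 and p.grad is not None]
    saliency = torch.cat([(p * p.grad).abs().flatten() for p in prunable])
    keep = max(1, int((1.0 - sparsity) * saliency.numel()))
    threshold = torch.topk(saliency, keep).values.min()

    masks = [((p * p.grad).abs() >= threshold).float() for p in prunable]
    with torch.no_grad():
        for p, m in zip(prunable, masks):
            p.mul_(m)          # zero out the less distillable weights before training
    model.zero_grad()
    return masks               # re-apply these masks after each update in step 2

In the second step (not shown), the masked network would be trained with the same SD loss, with the masks re-applied after every optimizer step to keep the pruned weights at zero.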

References (72)
  1. Combining weight pruning and knowledge distillation for CNN compression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3191–3198.
  2. Prospect Pruning: Finding Trainable Weights at Initialization using Meta-Gradients. International Conference on Learning Representations.
  3. Stronger generalization bounds for deep nets via a compression approach. In International Conference on Machine Learning, 254–263. PMLR.
  4. Deep rewiring: Training very sparse deep networks. International Conference on Learning Representations.
  5. Distilling knowledge via knowledge review. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5008–5017.
  6. Progressive skeletonization: Trimming more fat from a network at initialization. International Conference on Learning Representations.
  7. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 248–255.
  8. Sharp minima can generalize for deep nets. In International Conference on Machine Learning, 1019–1028. PMLR.
  9. Rigging the lottery: Making all tickets winners. In International Conference on Machine Learning, 2943–2952. PMLR.
  10. The Pascal visual object classes challenge: A retrospective. International Journal of Computer Vision, 111: 98–136.
  11. The lottery ticket hypothesis: Finding sparse, trainable neural networks. International Conference on Learning Representations.
  12. Linear mode connectivity and the lottery ticket hypothesis. In International Conference on Machine Learning, 3259–3269. PMLR.
  13. Pruning neural networks at initialization: Why are we missing the mark? International Conference on Learning Representations.
  14. Discrete model compression with resource constraint for deep neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 1899–1908.
  15. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206.
  16. Differentiable soft quantization: Bridging full-precision and low-bit neural networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4852–4861.
  17. Channel pruning guided by classification loss and feature importance. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 10885–10892.
  18. Multidimensional Pruning and Its Extension: A Unified Framework for Model Compression. IEEE Transactions on Neural Networks and Learning Systems.
  19. Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. International Conference on Learning Representations.
  20. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28.
  21. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 770–778.
  22. Distilling the knowledge in a neural network. Advances in Neural Information Processing Systems Workshop.
  23. CP3: Channel Pruning Plug-In for Point-Based Networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5302–5312.
  24. Quantized neural networks: Training neural networks with low precision weights and activations. The Journal of Machine Learning Research, 18(1): 6869–6898.
  25. Training your sparse neural network better with any mask. In International Conference on Machine Learning, 9833–9844. PMLR.
  26. Self-knowledge distillation with progressive refinement of targets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 6567–6576.
  27. Learning multiple layers of features from tiny images. (Technical Report).
  28. Optimal brain damage. Advances in Neural Information Processing Systems, 2.
  29. Self-supervised label augmentation via input transformations. In International Conference on Machine Learning, 5714–5724. PMLR.
  30. A signal propagation perspective for pruning neural networks at initialization. International Conference on Learning Representations.
  31. Snip: Single-shot network pruning based on connection sensitivity. International Conference on Learning Representations.
  32. Visualizing the loss landscape of neural nets. Advances in Neural Information Processing Systems, 31.
  33. AutoCompress: An automatic DNN structured pruning framework for ultra-high compression rates. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 4876–4883.
  34. Lottery Ticket Preserves Weight Correlation: Is It Desirable or Not? In International Conference on Machine Learning, 7011–7020. PMLR.
  35. Do we actually need dense over-parameterization? in-time over-parameterization in sparse training. In International Conference on Machine Learning, 6989–7000. PMLR.
  36. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 3431–3440.
  37. MobileViT: Light-weight, general-purpose, and mobile-friendly vision transformer. International Conference on Learning Representations.
  38. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, 5191–5198.
  39. Self-distillation amplifies regularization in Hilbert space. Advances in Neural Information Processing Systems, 33: 3351–3361.
  40. Scalable training of artificial neural networks with adaptive sparse connectivity inspired by network science. Nature Communications, 9(1): 1–12.
  41. Pruning convolutional neural networks for resource efficient inference. International Conference on Learning Representations.
  42. Parameter efficient training of deep convolutional neural networks by dynamic sparse reparameterization. In International Conference on Machine Learning, 4646–4655. PMLR.
  43. Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence.
  44. Unveiling the potential of structure preserving for weakly supervised object localization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11642–11651.
  45. Prune your model before distill it. In European Conference on Computer Vision, 120–136. Springer.
  46. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32.
  47. Reed, R. 1993. Pruning algorithms-a survey. IEEE Transactions on Neural Networks, 4(5): 740–747.
  48. Comparing rewinding and fine-tuning in neural network pruning. arXiv preprint arXiv:2003.02389.
  49. FitNets: Hints for thin deep nets. International Conference on Learning Representations.
  50. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 4510–4520.
  51. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. International Conference on Learning Representations.
  52. Self-Distillation from the Last Mini-Batch for Consistency Regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 11943–11952.
  53. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
  54. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2818–2826.
  55. Pruning neural networks without any data by iteratively conserving synaptic flow. Advances in Neural Information Processing Systems, 33: 6377–6389.
  56. Single shot structured pruning before training. arXiv preprint arXiv:2007.00389.
  57. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology.
  58. Picking winning tickets before training by preserving gradient flow. International Conference on Learning Representations.
  59. Trainability Preserving Neural Pruning. International Conference on Learning Representations.
  60. Dynamical isometry: The missing ingredient for neural network pruning. arXiv preprint arXiv:2105.05916.
  61. Recent advances on neural network pruning at initialization. In Proceedings of the International Joint Conference on Artificial Intelligence, 23–29.
  62. Knowledge distillation and student-teacher learning for visual intelligence: A review and new outlooks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(6): 3048–3068.
  63. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35: 17402–17414.
  64. Wightman, R. 2019. PyTorch Image Models. https://github.com/rwightman/pytorch-image-models.
  65. Data-distortion guided self-distillation for deep neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, 5565–5572.
  66. Snapshot distillation: Teacher-student optimization in one generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2859–2868.
  67. Revisiting knowledge distillation via label smoothing regularization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 3903–3911.
  68. Regularizing class-wise predictions via self-knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 13876–13885.
  69. Wide residual networks. arXiv preprint arXiv:1605.07146.
  70. Self-distillation: Towards efficient and compact neural networks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(8): 4388–4403.
  71. Be your own teacher: Improve the performance of convolutional neural networks via self distillation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 3713–3722.
  72. Accelerating very deep convolutional networks for classification and detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 38(10): 1943–1955.
Authors (9)
  1. Dong Chen (218 papers)
  2. Ning Liu (199 papers)
  3. Yichen Zhu (51 papers)
  4. Zhengping Che (41 papers)
  5. Rui Ma (112 papers)
  6. Fachao Zhang (2 papers)
  7. Xiaofeng Mou (7 papers)
  8. Yi Chang (150 papers)
  9. Jian Tang (326 papers)
Citations (1)