Boosting Residual Networks with Group Knowledge (2308.13772v2)
Abstract: Recent research understands residual networks from a new perspective, that of an implicit ensemble model. From this view, previous methods such as stochastic depth and stimulative training have further improved the performance of residual networks by sampling and training their subnets. However, both use the same supervision for all subnets of different capacities and neglect the valuable knowledge generated by subnets during training. In this manuscript, we mitigate the significant knowledge distillation gap caused by using the same kind of supervision and advocate leveraging the subnets to provide diverse knowledge. Based on this motivation, we propose a group knowledge based training framework for boosting the performance of residual networks. Specifically, we implicitly divide all subnets into hierarchical groups by subnet-in-subnet sampling, aggregate the knowledge of different subnets in each group during training, and exploit upper-level group knowledge to supervise lower-level subnet groups. Meanwhile, we also develop a subnet sampling strategy that naturally samples larger subnets, which are found to be more helpful than smaller subnets in boosting the performance of hierarchical groups. Compared with typical subnet training and other methods, our method achieves the best efficiency and performance trade-offs on multiple datasets and network structures. The code is at https://github.com/tsj-001/AAAI24-GKT.
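The abstract describes the training scheme only at a high level. Below is a minimal, self-contained sketch of how subnet-in-subnet sampling, per-group knowledge aggregation, and upper-to-lower group supervision could fit together. The toy residual network, the helper names (`sample_hierarchy`, `group_knowledge_step`), and all hyper-parameters (temperature `T`, weight `alpha`, `keep_ratio`) are illustrative assumptions, not the authors' implementation; consult the linked repository for the actual GKT code.

```python
# Illustrative sketch only: nested (subnet-in-subnet) sampling of residual blocks,
# aggregation of each group's softened predictions, and distillation from the
# upper-level group's aggregate to the subnets of the level below it.

import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyResidualNet(nn.Module):
    """A small residual network whose blocks can be skipped per forward pass."""
    def __init__(self, dim=64, num_blocks=8, num_classes=10):
        super().__init__()
        self.stem = nn.Linear(32, dim)
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for _ in range(num_blocks)
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x, active_blocks=None):
        h = torch.relu(self.stem(x))
        active = set(range(len(self.blocks))) if active_blocks is None else set(active_blocks)
        for i, block in enumerate(self.blocks):
            if i in active:                      # a skipped block reduces to the identity
                h = h + block(h)
        return self.head(h)

def sample_hierarchy(num_blocks, num_levels=3, subnets_per_level=2, keep_ratio=0.85):
    """Subnet-in-subnet sampling: each level's subnets are drawn from the block set
    of the level above, so groups are nested and biased toward larger subnets."""
    parent = list(range(num_blocks))
    hierarchy = [[parent]]                       # level 0: the full network
    for _ in range(1, num_levels):
        keep = max(1, int(len(parent) * keep_ratio))
        group = [sorted(random.sample(parent, keep)) for _ in range(subnets_per_level)]
        hierarchy.append(group)
        parent = group[0]                        # nest the next level inside this one
    return hierarchy

def group_knowledge_step(model, x, y, T=4.0, alpha=0.5):
    """One training step: aggregate each group's softened predictions and use the
    upper level's aggregate as the soft target for the level below it."""
    hierarchy = sample_hierarchy(len(model.blocks))
    prev_knowledge, total_loss = None, 0.0
    for group in hierarchy:
        probs = []
        for subnet in group:
            logits = model(x, active_blocks=subnet)
            loss = F.cross_entropy(logits, y)
            if prev_knowledge is not None:       # distill from the upper-level group
                kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                              prev_knowledge, reduction="batchmean") * T * T
                loss = (1 - alpha) * loss + alpha * kd
            total_loss = total_loss + loss
            probs.append(F.softmax(logits / T, dim=1).detach())
        prev_knowledge = torch.stack(probs).mean(dim=0)   # aggregated group knowledge
    return total_loss

# Usage (x: [batch, 32] features, y: integer class labels):
#   model = ToyResidualNet()
#   loss = group_knowledge_step(model, x, y)
#   loss.backward()
```

In this sketch the averaged soft predictions of each group act as the "group knowledge" that supervises the group one level below, while the high `keep_ratio` biases sampling toward larger subnets, mirroring the sampling strategy described in the abstract.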