Adaptive Depth Networks with Skippable Sub-Paths (2312.16392v3)
Abstract: Predictable adaptation of network depth can be an effective way to control inference latency and meet the resource constraints of various devices. However, previous adaptive depth networks do not provide general principles or a formal explanation of why and which layers can be skipped; hence, their approaches are hard to generalize and require long, complex training procedures. In this paper, we present a practical approach to adaptive depth networks that is applicable to various architectures with minimal training effort. In our approach, every hierarchical residual stage is divided into two sub-paths, which are trained to acquire different properties through a simple self-distillation strategy. While the first sub-path is essential for hierarchical feature learning, the second is trained to refine the learned features so that skipping it causes minimal performance degradation. Unlike prior adaptive networks, our approach does not iteratively train every target sub-network. At test time, however, these sub-paths can be combined in a combinatorial manner to select, from a single network, sub-networks with various accuracy-efficiency trade-offs. We provide a formal rationale for why the proposed training method reduces overall prediction errors while minimizing the impact of skipping sub-paths. We demonstrate the generality and effectiveness of our approach on both convolutional neural networks and transformers.
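The following is a minimal PyTorch sketch, not the authors' implementation, of the structure the abstract describes: each residual stage is split into a mandatory first sub-path and a skippable second (refinement) sub-path, and the sub-network that skips the refinement sub-paths is supervised by the full network through self-distillation within a single training step. All identifiers (`AdaptiveStage`, `TinyAdaptiveNet`, `skip_refine`, the specific loss weighting, etc.) are illustrative assumptions rather than names from the paper.

```python
# Minimal sketch (assumed names, not the paper's code) of an adaptive-depth
# network whose stages each contain a mandatory and a skippable sub-path.
import torch
import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """A ResNet-style residual block with an identity shortcut."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)


class AdaptiveStage(nn.Module):
    """One hierarchical stage divided into two sub-paths.

    The first sub-path always runs (hierarchical feature learning); the
    second refines its output and can be skipped at test time.
    """

    def __init__(self, channels: int, blocks_per_path: int = 2):
        super().__init__()
        self.base = nn.Sequential(*[BasicBlock(channels) for _ in range(blocks_per_path)])
        self.refine = nn.Sequential(*[BasicBlock(channels) for _ in range(blocks_per_path)])

    def forward(self, x, skip_refine: bool = False):
        x = self.base(x)
        return x if skip_refine else self.refine(x)


class TinyAdaptiveNet(nn.Module):
    """A toy classifier with two adaptive stages and a linear head."""

    def __init__(self, channels: int = 16, num_classes: int = 10):
        super().__init__()
        self.stem = nn.Conv2d(3, channels, 3, padding=1)
        self.stages = nn.ModuleList([AdaptiveStage(channels), AdaptiveStage(channels)])
        self.head = nn.Linear(channels, num_classes)

    def forward(self, x, skip_refine: bool = False):
        x = self.stem(x)
        for stage in self.stages:
            x = stage(x, skip_refine=skip_refine)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)


if __name__ == "__main__":
    model = TinyAdaptiveNet()
    images = torch.randn(4, 3, 32, 32)
    labels = torch.randint(0, 10, (4,))

    # Self-distillation sketch: the full network (all sub-paths) supervises
    # the sub-network that skips every refinement sub-path.
    full_logits = model(images, skip_refine=False)
    base_logits = model(images, skip_refine=True)
    loss = F.cross_entropy(full_logits, labels) + F.kl_div(
        F.log_softmax(base_logits, dim=1),
        F.softmax(full_logits.detach(), dim=1),
        reduction="batchmean",
    )
    loss.backward()
    print(f"combined loss: {loss.item():.4f}")
```

Because every stage exposes the same skip switch independently, a single trained network of this form yields a combinatorial family of sub-networks at test time (skip none, some, or all refinement sub-paths), each with a different accuracy-latency trade-off; how the paper selects among them and weights the distillation terms is not reflected in this sketch.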