Are Large Kernels Better Teachers than Transformers for ConvNets? (2305.19412v1)

Published 30 May 2023 in cs.CV and cs.AI

Abstract: This paper reveals a new appeal of the recently emerged large-kernel Convolutional Neural Networks (ConvNets): as the teacher in Knowledge Distillation (KD) for small-kernel ConvNets. While Transformers have led state-of-the-art (SOTA) performance in various fields with ever-larger models and labeled data, small-kernel ConvNets are considered more suitable for resource-limited applications due to the efficient convolution operation and compact weight sharing. KD is widely used to boost the performance of small-kernel ConvNets. However, previous research shows that it is not quite effective to distill knowledge (e.g., global information) from Transformers to small-kernel ConvNets, presumably due to their disparate architectures. We hereby carry out a first-of-its-kind study unveiling that modern large-kernel ConvNets, a compelling competitor to Vision Transformers, are remarkably more effective teachers for small-kernel ConvNets, due to more similar architectures. Our findings are backed up by extensive experiments on both logit-level and feature-level KD "out of the box", with no dedicated architectural nor training recipe modifications. Notably, we obtain the best-ever pure ConvNet under 30M parameters with 83.1% top-1 accuracy on ImageNet, outperforming current SOTA methods including ConvNeXt V2 and Swin V2. We also find that beneficial characteristics of large-kernel ConvNets, e.g., larger effective receptive fields, can be seamlessly transferred to students through this large-to-small kernel distillation. Code is available at: https://github.com/VITA-Group/SLaK.

Authors (8)
  1. Tianjin Huang (29 papers)
  2. Lu Yin (86 papers)
  3. Zhenyu Zhang (250 papers)
  4. Li Shen (363 papers)
  5. Meng Fang (100 papers)
  6. Mykola Pechenizkiy (118 papers)
  7. Zhangyang Wang (375 papers)
  8. Shiwei Liu (76 papers)
Citations (12)

Summary

  • The paper demonstrates that large-kernel ConvNets serve as effective teachers for small-kernel models in knowledge distillation.
  • It leverages architectural compatibility and effective receptive fields to enable seamless knowledge transfer.
  • Experimental results show that a distilled small-kernel ConvNet with under 30M parameters reaches 83.1% top-1 accuracy on ImageNet, making it attractive for resource-constrained settings.

Analyzing the Role of Large-Kernel ConvNets as Teachers in Knowledge Distillation for Small-Kernel ConvNets

This paper investigates a novel approach within the domain of Knowledge Distillation (KD), emphasizing the potential of large-kernel Convolutional Neural Networks (ConvNets) as superior teachers compared to state-of-the-art Vision Transformers. The focus is on enhancing the performance of small-kernel ConvNets in resource-constrained settings.

Key Findings and Methodology

  • Large-Kernel ConvNets as Effective Teachers: Through systematic experiments, the paper provides evidence that large-kernel ConvNets, such as SLaK and ConvNeXt, outperform Vision Transformers as teachers in KD for small-kernel ConvNet students.
  • Architectural Compatibility: One core reason for the effectiveness of large-kernel ConvNets as teachers is their architectural similarity to small-kernel ConvNets, which allows seamless knowledge transfer, particularly in terms of effective receptive fields (ERF) and feature representation.
  • Experimental Validation: Experiments use a 120-epoch KD schedule and extend it to 300 epochs to confirm the findings. Both logit-level and feature-level KD are tested "out of the box", showing consistent improvements in student top-1 accuracy; a minimal sketch of a logit-level distillation loss follows this list.
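
To make the logit-level setting concrete, below is a minimal PyTorch sketch of a standard temperature-scaled distillation loss. The temperature T, weight alpha, and the names teacher, student, images, and labels are illustrative assumptions, not the paper's exact training recipe.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Combine hard-label cross-entropy with a temperature-scaled KL term on soft targets."""
    # Hard-label supervision on the student's raw logits.
    ce = F.cross_entropy(student_logits, labels)
    # Soft-target term: KL divergence between temperature-softened teacher and student distributions.
    kl = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # standard T^2 scaling keeps gradient magnitudes comparable
    return alpha * ce + (1.0 - alpha) * kl

# Illustrative training step: the large-kernel teacher (e.g., SLaK) is frozen,
# only the small-kernel student is updated.
# with torch.no_grad():
#     teacher_logits = teacher(images)
# student_logits = student(images)
# loss = kd_loss(student_logits, teacher_logits, labels)
# loss.backward()
```
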

Numerical Results and Implications

  • The paper achieved a milestone by obtaining a pure ConvNet model under 30M parameters that reached 83.1% top-1 accuracy on the ImageNet dataset, surpassing benchmarks set by models such as ConvNeXt V2 and Swin V2.
  • Gains were quantified with Direct Gain and Effective Gain metrics, showing that students distilled on the shorter schedule matched the performance of counterparts trained for substantially more epochs.

Implications for AI Development

  • Efficient Model Deployment: The findings have direct implications for deploying efficient neural network architectures in environments with limited resources, endorsing small-kernel ConvNets trained through KD as viable alternatives to larger, resource-intensive models.
  • Enhanced Robustness: Beyond accuracy, models distilled from large-kernel ConvNets showed enhanced robustness on diverse ImageNet benchmarks, a critical factor for real-world applications where generalization remains a pivotal challenge.

Future Prospects

  • Refinement of Distillation Techniques: Given the promising results, future work could explore refining KD methodologies with more sophisticated feature alignment mechanisms, leveraging the architectural compatibility between kernel-based models; a simple feature-alignment sketch follows this list.
  • Transferability of Findings: Extending these principles to other domains such as audio processing or non-visual sensory data could open new avenues for research and application, potentially improving resource-constrained AI systems across sectors.
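
As a starting point for such feature alignment, here is a hedged PyTorch sketch in which a 1x1 convolution projects the student's feature map onto the teacher's channel dimension before an MSE penalty. The class name FeatureAlign, the projection choice, and the interpolation step are assumptions for illustration rather than the paper's feature-KD formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureAlign(nn.Module):
    """Align a student feature map to a (frozen) teacher feature map."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # 1x1 convolution maps student channels onto the teacher's channel count.
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, feat_s: torch.Tensor, feat_t: torch.Tensor) -> torch.Tensor:
        feat_s = self.proj(feat_s)
        # Match spatial resolution if the two backbones downsample differently.
        if feat_s.shape[-2:] != feat_t.shape[-2:]:
            feat_s = F.interpolate(feat_s, size=feat_t.shape[-2:],
                                   mode="bilinear", align_corners=False)
        # Penalize distance to the detached (non-trainable) teacher features.
        return F.mse_loss(feat_s, feat_t.detach())
```

In practice this term would be added to the logit-level loss above with its own weight, which is likewise a tunable assumption.
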

Conclusion

This paper rigorously investigates and validates the potential of large-kernel ConvNets as superior teachers for small-kernel counterparts in the KD paradigm. The research provides a solid foundation for strategic teacher selection in constrained environments, maintaining high performance without expanding student model size. Such insights pave the way for further exploration of efficient distillation and deployment of neural networks, enhancing the efficacy of AI solutions in diverse practical scenarios.