- The paper presents feature distillation, a post-processing step that transforms model representations to have optimization-friendly properties, letting earlier pre-training methods rival masked image modeling (MIM) in fine-tuning.
- It reports significant improvements, including 89.0% top-1 accuracy on ImageNet-1K with a ViT-L model and gains in ADE20K segmentation and COCO detection metrics.
- The approach decouples optimization friendliness from representation generality and scalability, paving the way for scalable and adaptable AI systems across diverse architectures.
Evaluating the Role of Feature Distillation in Enhancing Vision Representations
In a significant contribution to computer vision and representation learning, the paper "Contrastive Learning Rivals Masked Image Modeling in Fine-tuning via Feature Distillation" challenges the recent dominance of masked image modeling (MIM) in fine-tuning performance. The authors present an elegant solution termed feature distillation (FD): a post-processing step that substantially enhances the fine-tuning capabilities of previously established pre-training methods, such as contrastive learning and visual-text alignment, bringing their performance on par with or beyond that of MIM.
Core Contributions and Numerical Findings
This work improves the fine-tuning performance of models trained with diverse pre-training approaches by transforming their representations to exhibit optimization-friendly properties akin to those produced by MIM. These properties are diagnosed systematically with attention- and optimization-related tools. The transformed representations deliver substantial gains on fine-tuning benchmarks such as ImageNet-1K classification and ADE20K semantic segmentation.
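To make the mechanics concrete, here is a minimal PyTorch sketch of the distillation objective, assuming teacher and student backbones that return per-token features of matching dimension. The function names and training-step structure are illustrative; the paper additionally equips the student with designs such as shared relative position bias, which are omitted here.

```python
import torch
import torch.nn.functional as F

def whiten(feats: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Per-token whitening of teacher features (layer normalization
    without learnable affine), standardizing the distillation targets."""
    mean = feats.mean(dim=-1, keepdim=True)
    std = feats.std(dim=-1, keepdim=True)
    return (feats - mean) / (std + eps)

def feature_distillation_loss(student_feats: torch.Tensor,
                              teacher_feats: torch.Tensor) -> torch.Tensor:
    """Smooth-L1 loss between student features and whitened teacher
    features; the teacher is frozen and only the student is updated."""
    target = whiten(teacher_feats).detach()
    return F.smooth_l1_loss(student_feats, target)

def fd_step(images, teacher, student, optimizer):
    """One distillation step. `teacher` and `student` are assumed to be
    ViT-style backbones mapping images to token features of shape
    [batch, tokens, channels]."""
    with torch.no_grad():
        t_feats = teacher(images)   # fixed distillation targets
    s_feats = student(images)       # trainable student predictions
    loss = feature_distillation_loss(s_feats, t_feats)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

After distillation, it is the student, not the original teacher, that is fine-tuned on downstream tasks, which is where the gains below are measured.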
Key numerical results include:
- The fine-tuning performance of CLIP vision models saw dramatic gains, with a ViT-L model reaching an impressive 89.0% top-1 accuracy on ImageNet-1K.
- Similarly, the 3-billion-parameter SwinV2-G model gained +1.5 mIoU on ADE20K semantic segmentation and +1.1 mAP on COCO object detection.
Implications for Representation Learning
The research suggests that feature distillation can serve as a practical tool for converting heterogeneous learned representations into forms with greater adaptability and fine-tuning potential. Diagnostics based on attention patterns and optimization behavior indicate that the distilled features fine-tune more effectively, allowing significant gains without altering the core pre-training paradigms.
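As an illustration of the kind of attention diagnostic involved, the sketch below computes the average pairwise cosine similarity between the attention maps of different heads in a layer; lower similarity indicates more diverse heads, one of the properties the paper associates with MIM-like, optimization-friendly representations. The exact formulation here is an assumption for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def head_similarity(attn: torch.Tensor) -> torch.Tensor:
    """Mean pairwise cosine similarity between the attention maps of
    different heads in one layer; lower values = more diverse heads.
    attn: softmax attention weights of shape [batch, heads, N, N]."""
    B, H, N, _ = attn.shape
    flat = F.normalize(attn.reshape(B, H, N * N), dim=-1)
    sim = flat @ flat.transpose(1, 2)                  # [B, H, H]
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(-1)
    return off_diag / (H * (H - 1))                    # mean over head pairs
```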
The implications are notable: the approach decouples optimization friendliness from generality and scalability, opening a pathway to focusing effort on broadening and scaling learned representations without the traditional emphasis on making them optimization-friendly during pre-training.
Future Trajectories and Scalable AI
Looking ahead, this work paves the way for exploring similar distillation techniques across a broader range of neural architectures and tasks. One intriguing avenue is applying feature distillation to scale large models without compromising their capacity to generalize from vast datasets, a cornerstone of scalable AI systems. As larger architectures and more diverse data modalities are integrated, feature distillation could yield further efficiency gains, enabling quicker adaptation and stronger performance across downstream tasks.
In summary, the paper offers a methodologically sound intervention in the competitive landscape of representation learning, underscoring the practicality and effectiveness of feature distillation as a transformative post-processing step for computer vision models. It is likely to spur further research and adoption, shaping how learned representations are harnessed and optimized across applications.