Overview of Masked Self-Distillation Framework for Vision-Language Pretraining
The paper presents MaskCLIP, a framework that incorporates masked self-distillation into contrastive language-image pretraining. The goal is to improve the transfer capability of vision-language (VL) models by injecting local semantic representations learned through masked self-distillation, a relatively new idea in image-text representation learning.
Core Contributions
- Introduction of Masked Self-Distillation: Unlike traditional VL learning strategies that focus on global representations, MaskCLIP adds a masked self-distillation objective in which a student encoder, given a masked image, predicts the representation that an Exponential Moving Average (EMA) teacher produces from the full image. This aligns local patch representations with the complete image representation and captures fine-grained image detail that complements the high-level semantics supplied by the text (a minimal sketch follows this list).
- Local Semantic Supervision in Text Branch: MaskCLIP extends the idea of masked self-distillation to the text branch, drawing on traditional masked language modeling methods such as BERT. This additional semantic supervision strengthens the model's text encoder and improves its zero-shot performance across multiple benchmarks (a corresponding text-branch sketch also follows the list).
- Comprehensive Evaluation: The framework is tested across a spectrum of tasks including ImageNet-1K classification, COCO object detection, ADE20K semantic segmentation, and other classification benchmarks. The results corroborate the efficacy of incorporating local patch learning alongside the traditional VL contrastive approach, yielding notable improvements in linear probing, finetuning, and zero-shot transfer capabilities.
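To make the image-branch objective concrete, the following is a minimal PyTorch-style sketch of masked self-distillation as described in the first bullet: an EMA teacher encodes the full image, the student encodes a masked view, and the two are aligned on the masked patches. The encoder interface (the `patch_mask` argument and per-patch outputs), the smooth-L1 alignment, and the masking scheme are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def masked_self_distillation_loss(student, teacher, images, mask_ratio=0.75):
    """Align student predictions from a masked image with EMA-teacher
    features from the full image (illustrative, not the paper's exact loss)."""
    with torch.no_grad():
        target = teacher(images)                # (B, N, D) patch features of the full image

    B, N, _ = target.shape
    # True = patch is hidden from the student.
    mask = torch.rand(B, N, device=images.device) < mask_ratio

    # Assumed interface: the student accepts a boolean patch mask and
    # predicts features for every patch position.
    pred = student(images, patch_mask=mask)     # (B, N, D)

    # Supervise only the masked positions.
    return F.smooth_l1_loss(pred[mask], target[mask])


@torch.no_grad()
def ema_update(teacher, student, momentum=0.999):
    """Teacher weights follow the student as an exponential moving average."""
    for t, s in zip(teacher.parameters(), student.parameters()):
        t.mul_(momentum).add_(s.detach(), alpha=1 - momentum)
```

In a training loop, `ema_update` would be called after each student optimizer step, and this distillation loss would be added to the standard image-text contrastive loss.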
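The text-branch supervision referenced in the second bullet can likewise be sketched as a BERT-style masked language modeling loss. The `text_encoder`, `mlm_head`, and masking scheme below are simplified placeholders, not MaskCLIP's exact recipe.

```python
import torch
import torch.nn.functional as F

def masked_language_modeling_loss(text_encoder, mlm_head, token_ids,
                                  mask_token_id, mask_prob=0.15):
    """BERT-style MLM sketch for the text branch: mask a fraction of tokens
    and train the encoder (via `mlm_head`) to recover the original ids."""
    labels = token_ids.clone()
    # Choose positions to corrupt.
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    labels[~mask] = -100                        # ignore unmasked tokens in the loss

    corrupted = token_ids.clone()
    corrupted[mask] = mask_token_id             # replace chosen tokens with [MASK]

    hidden = text_encoder(corrupted)            # (B, L, D) contextual token features
    logits = mlm_head(hidden)                   # (B, L, vocab_size)
    return F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)
```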
Performance and Implications
MaskCLIP delivers substantial gains across downstream applications. The method improves zero-shot classification accuracy on ImageNet-1K by 6.9% over the CLIP baseline and improves finetuning accuracy by 1.3%. For zero-shot segmentation on ADE20K, it raises mIoU by 2.7 points over CLIP. These results underline the framework's strength in learning comprehensive image-text semantics.
Beyond these numbers, the work establishes masked self-distillation as a valuable component of vision-language pretraining, providing a mechanism that captures both the high-level semantics associated with text and the fine-grained details of image regions. Its implications could be far-reaching, supporting the development of more general and contextually aware vision-language models applicable to diverse domains.
Future Directions
MaskCLIP opens several directions for multi-modal learning, particularly as the demand for fine-grained, adaptive modeling of image-text dependencies grows. Possible extensions include refining the masked token prediction strategy and scaling the framework to larger and more diverse datasets. The paper marks a step toward a more nuanced understanding of cross-modal learning and invites further research into the interplay between visual encoders and textual supervision.
Ultimately, MaskCLIP stands as a substantial contribution to vision-language pretraining, notable both for its methodological innovations and for its potential applications in AI-driven communication, automation, and contextual understanding.