
SILC: Improving Vision Language Pretraining with Self-Distillation (2310.13355v2)

Published 20 Oct 2023 in cs.CV

Abstract: Image-Text pretraining on web-scale image caption datasets has become the default recipe for open vocabulary classification and retrieval models thanks to the success of CLIP and its variants. Several works have also used CLIP features for dense prediction tasks and have shown the emergence of open-set abilities. However, the contrastive objective used by these models only focuses on image-text alignment and does not incentivise image feature learning for dense prediction tasks. In this work, we introduce SILC, a novel framework for vision language pretraining. SILC improves image-text contrastive learning with the simple addition of local-to-global correspondence learning by self-distillation. We show that distilling local image features from an exponential moving average (EMA) teacher model significantly improves model performance on dense prediction tasks like detection and segmentation, while also providing improvements on image-level tasks such as classification and retrieval. SILC models set a new state of the art for zero-shot classification, few-shot classification, image and text retrieval, zero-shot segmentation, and open vocabulary segmentation. We further show that SILC features greatly benefit open vocabulary detection, captioning and visual question answering.

Insights into Vision Language Pretraining with Self-Distillation: An Overview of the SILC Framework

The paper introduces SILC, a framework that enhances vision-language pretraining by incorporating self-distillation into the contrastive learning paradigm. Vision-language models (VLMs) have predominantly relied on image-text pretraining on large-scale web datasets for open vocabulary classification and retrieval. Contrastive models such as CLIP and its variants excel at aligning images and texts in a shared latent space, but lack a direct mechanism to encourage the image feature learning needed for dense prediction tasks. SILC addresses this limitation by adding local-to-global correspondence learning through self-distillation, thereby improving both image-text alignment and local image feature learning.

Core Methodology and Contributions

SILC stands on two foundational training objectives:

  1. Image-Text Contrastive Learning: Following established contrastive pretraining, SILC aligns image-text pairs in a shared embedding space on large web datasets. The alignment uses an InfoNCE-style objective that pulls matching pairs together and pushes mismatched pairs apart.
  2. Self-Distillation with Local Features: SILC adds a self-distillation objective on images that enriches local feature representations by enforcing consistency between local crops seen by the student model and global crops processed by an exponential moving average (EMA) teacher. Temperature sharpening and centering of the teacher outputs prevent representation collapse, yielding local features that benefit dense vision tasks (see the sketch below).
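
To make the two objectives concrete, here is a minimal PyTorch-style sketch, assuming pooled image/text embeddings for the contrastive term and projected features for a local (student) crop and a global (EMA teacher) crop for the distillation term. The function names, temperatures, and the loss weighting lambda_distill are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of SILC's two training objectives (illustrative, not the authors' code).
import torch
import torch.nn.functional as F

def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def local_to_global_distillation_loss(student_local, teacher_global, center,
                                      t_student=0.1, t_teacher=0.04):
    """Student sees a local crop, the EMA teacher a global crop of the same image.
    The teacher distribution is sharpened (low temperature) and centered to
    prevent collapse, in the style of DINO self-distillation."""
    teacher_probs = F.softmax((teacher_global - center) / t_teacher, dim=-1).detach()
    student_logp = F.log_softmax(student_local / t_student, dim=-1)
    return -(teacher_probs * student_logp).sum(dim=-1).mean()

@torch.no_grad()
def ema_update(teacher, student, momentum=0.996):
    """Teacher parameters track an exponential moving average of the student."""
    for t_p, s_p in zip(teacher.parameters(), student.parameters()):
        t_p.mul_(momentum).add_(s_p, alpha=1.0 - momentum)

# Per training step (lambda_distill is an assumed weighting):
#   loss = image_text_contrastive_loss(img_emb, txt_emb) \
#        + lambda_distill * local_to_global_distillation_loss(student_local,
#                                                             teacher_global, center)
#   loss.backward(); optimizer.step(); ema_update(teacher, student)
```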

The paper articulates that this dual-objective approach permits simultaneous improvement in tasks demanding both global image comprehension and local pixel-level understanding. SILC thus showcases notable advancements in various areas:

  • State-of-the-art results in zero-shot and few-shot classification and in image and text retrieval.
  • Enhanced zero-shot segmentation, surpassing existing benchmarks by labeling semantic classes without explicit supervision.
  • Significant improvements in open vocabulary segmentation and detection tasks achieved by integrating SILC's pretrained features into established frameworks.

Experimental Validation and Results

The efficacy of SILC is validated through extensive experimentation against baseline CLIP and SigLIP architectures. It demonstrates consistent gains across a variety of datasets and tasks, including improvements of several mIoU points in zero-shot segmentation on challenging datasets such as ADE-150 and Pascal Context. Pretraining comparisons use ViT-B/16 models under matched settings, confirming that SILC's performance boost holds in fair, like-for-like evaluations.

In the open vocabulary setting, the results are equally convincing. Integrated into the CAT-Seg and OWLv2 frameworks, SILC not only improves segmentation accuracy on challenging test sets but also outperforms larger backbones such as CLIP-G/14 in semantic understanding. The gains extend to visual question answering and captioning, demonstrating the approach's utility for grounding language in local visual features.

Implications for Vision-Language Research

SILC's framework advances the integration of self-supervised objectives within vision-language models, capitalizing on unannotated data to learn rich semantic representations. This has broad implications:

  • Facilitating a shift in pretraining strategies toward dual objectives that improve both image-level and pixel-level tasks.
  • Providing a scalable methodology applicable to newer, diverse datasets beyond those examined, allowing continual improvement in emerging object recognition challenges.
  • Enabling advancements in language-grounded understanding of images, thereby setting the stage for future explorations into more intricate multi-modal AI systems.

The introduction of SILC suggests a promising trajectory for integrating self-supervised signals within VLMs. By engaging local image features more directly, the research not only improves immediate model performance but also extends how such models connect visual detail with contextual language. The results presented in this paper encourage further refinement of vision-language pretraining techniques.

Authors (6)
  1. Muhammad Ferjad Naeem (21 papers)
  2. Yongqin Xian (33 papers)
  3. Xiaohua Zhai (51 papers)
  4. Lukas Hoyer (21 papers)
  5. Luc Van Gool (569 papers)
  6. Federico Tombari (214 papers)
Citations (20)