Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification (2412.20110v3)

Published 28 Dec 2024 in cs.CV

Abstract: Few-shot image classification remains a critical challenge in the field of computer vision, particularly in data-scarce environments. Existing methods typically rely on pre-trained visual-LLMs, such as CLIP. However, due to the modality gap, which is the inconsistent distribution of image and text features in the joint embedding space, directly using these features as class prototypes often leads to suboptimal performance. To address this issue, we propose a novel Cross-Modal Mapping (CMM) method. This method globally aligns image features with the text feature space through linear transformation and optimizes their local spatial relationships using triplet loss, thereby significantly enhancing cross-modal consistency. Experimental results show that compared to other methods, CMM simplifies the training process and demonstrates higher efficiency. Furthermore, CMM improves the average Top-1 accuracy by 1.06% on 11 benchmark datasets compared to methods that partially fine-tune the backbone, and it performs excellently on 4 distribution shift datasets. Notably, CMM effectively mitigates the modality gap in pre-trained models, enabling text features to serve as effective class prototypes for image features, thus providing an efficient and highly generalizable solution for few-shot learning.

Collections

Summary

We haven't generated a summary for this paper yet.

Summarize Now

Follow-up Questions

We haven't generated follow-up questions for this paper yet.

Generate Now

Cross-Modal Mapping: Mitigating the Modality Gap for Few-Shot Image Classification (2412.20110v3)

Collections

Summary

Follow-up Questions

Related Papers

Authors (5)