- The paper introduces VAMP, which efficiently tunes only a small set of prompt parameters to achieve effective multi-source few-shot domain adaptation.

- It combines loss functions (CSA, DDA, TCC, and TSD) to align semantic and statistical distributions across diverse domains.

- Experimental results show average improvements of 3.2% and 1.6% on the OfficeHome and DomainNet datasets, respectively, outperforming zero-shot CLIP and baseline methods.
 
      Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation
Introduction
The paper introduces "Vision-aware Multimodal Prompt Tuning for Uploadable Multi-source Few-shot Domain Adaptation," emphasizing the need for efficient domain adaptation techniques on decentralized edge devices. Traditional Multi-Source Few-Shot Domain Adaptation (MFDA) techniques often demand substantial resources, limiting their deployment on low-resource devices. To address these constraints, the paper proposes an Uploadable Multi-source Few-shot Domain Adaptation (UMFDA) schema together with a vision-aware multimodal prompt tuning framework (VAMP).
Figure 1: The illustration of uploadable multi-source few-shot domain adaptation (UMFDA) schema for decentralized edge learning.
Methodology
Vision-aware Multimodal Prompt Tuning Framework
The VAMP framework leverages vision-aware prompts so that existing vision-language models (VLMs) such as CLIP can adapt efficiently to new domains with minimal resources. Rather than fine-tuning the entire model, VAMP tunes only a small number of prompt parameters, a plug-and-play strategy well suited to scenarios with scarce labeled data. Prompts are injected into both the vision and language encoders to capture domain-specific information and align the two modalities in a shared semantic space, as sketched below.
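The parameter-efficiency idea is easiest to see in code. The following is a minimal sketch, not the paper's released implementation: a CLIP-like backbone is frozen and only small learnable prompt tensors on the text and vision sides receive gradients (`PromptedVLM`, `n_ctx`, and the toy backbone are all assumptions made for illustration).

```python
import torch
import torch.nn as nn

class PromptedVLM(nn.Module):
    """Prompt-tuning sketch: the backbone is frozen; only prompts train."""

    def __init__(self, backbone: nn.Module, n_ctx: int = 16, dim: int = 512):
        super().__init__()
        self.backbone = backbone
        for p in self.backbone.parameters():   # freeze every backbone weight
            p.requires_grad = False
        # learnable text-side context vectors (prepended to token embeddings)
        self.text_prompts = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)
        # learnable vision-side prompts (prepended to patch embeddings)
        self.visual_prompts = nn.Parameter(torch.randn(n_ctx, dim) * 0.02)

# toy stand-in for a CLIP-like encoder; a real setup would load CLIP here
backbone = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
model = PromptedVLM(backbone)
trainable = [p for p in model.parameters() if p.requires_grad]
print(sum(p.numel() for p in trainable))  # only the 2 * n_ctx * dim prompt params
optimizer = torch.optim.AdamW(trainable, lr=2e-3)  # optimizer never sees the backbone
```

Because the optimizer only ever sees the prompt tensors, the memory and compute cost of adaptation is a tiny fraction of full fine-tuning, which is what makes the approach practical on edge devices.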
Figure 2: Summary of various prompt tuning technologies (best viewed in color).
Optimization Strategy
VAMP's training combines several loss functions: Cross-modal Semantic Alignment (CSA), Domain Distribution Alignment (DDA), Text Classifier Consistency (TCC), and Text Semantic Diversity (TSD). CSA enforces semantic alignment between image and text features across the annotated and unannotated domains, DDA aligns the statistical distributions between domains, TCC keeps the domain-specific text classifiers mutually consistent, and TSD preserves semantic diversity across domain-specific prompts. Together, these objectives enable effective domain adaptation with minimal computational overhead; a sketch of such a composite objective follows.
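As a concrete illustration of how such a composite objective can be assembled, the sketch below gives one common instantiation of each term. The exact forms and weights used by VAMP are not reproduced here; these are plausible stand-ins (contrastive cross-entropy for CSA, a moment-matching surrogate for DDA, a KL term for TCC, and a pairwise-similarity penalty for TSD), all assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def csa_loss(img_feats, txt_feats, labels, temp=0.07):
    """Cross-modal Semantic Alignment: contrastive image-to-text
    classification, one common instantiation of this idea."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)        # one text feature per class
    logits = img @ txt.t() / temp
    return F.cross_entropy(logits, labels)

def dda_loss(src_feats, tgt_feats):
    """Domain Distribution Alignment: match first and second feature
    moments across domains (an MMD-style surrogate)."""
    mean_gap = (src_feats.mean(0) - tgt_feats.mean(0)).pow(2).sum()
    var_gap = (src_feats.var(0) - tgt_feats.var(0)).pow(2).sum()
    return mean_gap + var_gap

def tcc_loss(logits_a, logits_b):
    """Text Classifier Consistency: penalize disagreement between two
    domain-specific text classifiers via a KL divergence."""
    return F.kl_div(F.log_softmax(logits_a, dim=-1),
                    F.softmax(logits_b, dim=-1), reduction="batchmean")

def tsd_loss(domain_txt_feats):
    """Text Semantic Diversity: discourage prompt collapse by penalizing
    pairwise cosine similarity between domain-specific text features."""
    f = F.normalize(domain_txt_feats, dim=-1)   # (n_domains, dim)
    sim = f @ f.t() - torch.eye(f.size(0), device=f.device)
    return sim.pow(2).mean()

# toy usage: 8 images, 4 classes, 3 domains, 32-dim features
img = torch.randn(8, 32, requires_grad=True)
txt, y = torch.randn(4, 32), torch.randint(0, 4, (8,))
loss = (csa_loss(img, txt, y)
        + 0.1 * dda_loss(torch.randn(8, 32), torch.randn(8, 32))
        + 0.1 * tcc_loss(torch.randn(8, 4), torch.randn(8, 4))
        + 0.1 * tsd_loss(torch.randn(3, 32)))
loss.backward()   # in VAMP-style training, gradients reach only the prompts
```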
Experiments
VAMP was evaluated extensively on the OfficeHome and DomainNet datasets. Compared with baseline and existing state-of-the-art prompt tuning methods, VAMP demonstrated significant improvements in adapting to target domains; zero-shot CLIP and conventional domain-agnostic prompt tuning methods served as the comparative baselines.
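For reference, the zero-shot CLIP baseline against which the gains are reported can be reproduced with OpenAI's `clip` package; the class names, prompt template, and model variant below are illustrative choices, not necessarily those used in the paper.

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# illustrative class names (e.g., a few OfficeHome categories)
classnames = ["alarm clock", "backpack", "bike"]
tokens = clip.tokenize([f"a photo of a {c}." for c in classnames]).to(device)

with torch.no_grad():
    txt = model.encode_text(tokens)
    txt = txt / txt.norm(dim=-1, keepdim=True)  # unit-norm class embeddings

def zero_shot_predict(images: torch.Tensor) -> torch.Tensor:
    """Classify a batch of images already transformed by `preprocess`."""
    with torch.no_grad():
        img = model.encode_image(images.to(device))
        img = img / img.norm(dim=-1, keepdim=True)
        return (100.0 * img @ txt.t()).argmax(dim=-1)  # predicted class ids
```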
Figure 3: t-SNE visualization of the image and text features of the target domain extracted by the "Clipart-Real World" models of VAMP and DAPL.
Results
VAMP consistently outperformed traditional and prompt-based adaptation methods across multiple domain adaptation scenarios, showcasing an enhanced capacity to maintain semantic discriminability while aligning domain distributions. VAMP achieved an average improvement of 3.2% and 1.6% over zero-shot CLIP inference on the OfficeHome and DomainNet datasets, respectively. These results affirm VAMP's effectiveness, especially in decentralized settings where computing resources are limited. Attention map visualizations further demonstrated VAMP's superior focus on relevant image regions, emphasizing its strong feature discrimination and alignment capabilities.
Conclusion
The proposed VAMP framework provides a robust solution for deploying advanced machine learning models on resource-constrained devices, demonstrating substantial improvements over existing methods. By harnessing the potential of vision-aware prompts, VAMP establishes itself as a viable approach for achieving efficient domain adaptation in the field of decentralized edge computing. This work encourages further exploration into extending the VAMP framework's application across various domain adaptation tasks in constrained environments.