MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning

Published 15 May 2025 in cs.CV | (2505.10289v1)

Abstract: Compositional Zero-Shot Learning (CZSL) aims to recognize unseen state-object combinations by leveraging known combinations. Existing studies largely rely on the cross-modal alignment capabilities of CLIP but tend to overlook its limitations in capturing fine-grained local features, which arise from its architecture and training paradigm. To address this issue, we propose a Multi-Stage Cross-modal Interaction (MSCI) model that effectively explores and utilizes intermediate-layer information from CLIP's visual encoder. Specifically, we design two self-adaptive aggregators to extract local information from low-level visual features and integrate global information from high-level visual features, respectively. This key information is progressively incorporated into textual representations through a stage-by-stage interaction mechanism, significantly enhancing the model's perception capability for fine-grained local visual information. Additionally, MSCI dynamically adjusts the attention weights between global and local visual information based on different combinations, as well as different elements within the same combination, allowing it to flexibly adapt to diverse scenarios. Experiments on three widely used datasets fully validate the effectiveness and superiority of the proposed model. Data and code are available at https://github.com/ltpwy/MSCI.

Authors (4)

Summary

The paper "MSCI: Addressing CLIP's Inherent Limitations for Compositional Zero-Shot Learning" presents an approach that improves performance on Compositional Zero-Shot Learning (CZSL) by addressing limitations of CLIP's visual encoder.

In CZSL, the objective is to predict unseen state-object combinations using knowledge learned from seen combinations. Current state-of-the-art models typically rely on CLIP's strong cross-modal alignment capabilities. However, these models often fail to capture fine-grained local features, a consequence of the architecture of CLIP's visual encoder and its contrastive learning paradigm. The proposed solution is the Multi-Stage Cross-modal Interaction (MSCI) model, which draws on features from multiple layers of the visual encoder rather than only its final output.

Multi-Stage Cross-modal Interaction Model

The MSCI model introduces a two-stage interaction process that explores intermediate-layer information from CLIP's visual encoder. This approach distinguishes itself by integrating both local and global information from various layers of CLIP's visual features, enhancing the model's ability to recognize fine-grained local details that are pivotal in CZSL tasks.

Core Components of MSCI

Multi-Layer Feature Aggregation: MSCI combines information from multiple layers of CLIP's visual encoder. The lower layers offer rich local details, while the higher layers integrate global abstract features. Two trainable, self-adaptive aggregators mediate this extraction, one for low-level and one for high-level features, so that visual detail normally compressed into a single global summary in standard CLIP usage remains available.
Stage-wise Cross-modal Interaction: Using a two-stage mechanism, MSCI first incorporates local low-level details into the prompt embeddings and then merges in high-level global information. Cross-attention combined with a dual-residual connection fuses the multi-layer visual information with the text embeddings, enhancing fine-grained perception.
Dynamic Feature Fusion: Through learnable parameters, MSCI dynamically adjusts the balance between local and global information depending on the combination, and on the individual elements within a combination. This adaptability helps the model understand and predict unseen combinations.
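The three components above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the scalar gates `gate_local` and `gate_global`, and the softmax layer weights are all assumptions made for illustration. Learned query/key/value projections are omitted, and a single gated residual stands in for the dual-residual connection, whose exact form the summary does not specify.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate(layers, logits):
    """Blend per-layer patch features with softmax(logits) as layer weights.

    layers: list of [n_patches][dim] feature maps, one per encoder layer.
    logits: one (hypothetical) trainable scalar per layer.
    """
    weights = softmax(logits)
    n_patches, dim = len(layers[0]), len(layers[0][0])
    return [[sum(w * layer[p][d] for w, layer in zip(weights, layers))
             for d in range(dim)] for p in range(n_patches)]

def cross_attend(text, visual):
    """Each text token (query) attends over visual patches (keys and values)."""
    scale = math.sqrt(len(text[0]))
    attended = []
    for q in text:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in visual]
        attn = softmax(scores)
        attended.append([sum(a * v[d] for a, v in zip(attn, visual))
                         for d in range(len(q))])
    return attended

def interact(text, visual, gate):
    """Gated residual update: text + gate * cross_attention(text, visual)."""
    attended = cross_attend(text, visual)
    return [[t + gate * a for t, a in zip(t_row, a_row)]
            for t_row, a_row in zip(text, attended)]

def msci_like_forward(text, low_layers, high_layers,
                      low_logits, high_logits, gate_local, gate_global):
    """Stage 1 injects local detail; stage 2 injects global context."""
    local_feats = aggregate(low_layers, low_logits)
    global_feats = aggregate(high_layers, high_logits)
    stage1 = interact(text, local_feats, gate_local)
    return interact(stage1, global_feats, gate_global)

# Toy inputs: two prompt tokens (say, state and object), feature dimension 2.
text = [[1.0, 0.0], [0.0, 1.0]]
low_layers = [[[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5], [0.5, 0.5]]]
high_layers = [[[1.0, 1.0], [0.0, 0.0]]]
out = msci_like_forward(text, low_layers, high_layers,
                        [0.0, 0.0], [0.0], 0.5, 0.5)
```

Setting both gates to zero returns the text embeddings unchanged, which makes the role of the gates visible: they control how much local and global visual information each stage writes into the prompt.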

Experimental Validation

The paper presents extensive experiments on three widely used datasets, MIT-States, UT-Zappos, and C-GQA, under both closed-world and open-world settings. MSCI achieves state-of-the-art results across these benchmarks, with the largest gains in the area under the curve (AUC) metric.

In the closed-world setting, MSCI improves on existing models by up to 14.5% in AUC on C-GQA. These results indicate that integrating detailed features across layers benefits tasks that demand fine distinctions between visual components.
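For readers new to the two evaluation settings, the sketch below shows how the candidate label space differs between them under the usual CZSL convention, which this summary does not spell out: closed-world evaluation searches only the seen pairs plus the unseen pairs known to occur at test time, while open-world evaluation searches every state-object pair. The vocabulary and pair sets are made up for illustration.

```python
from itertools import product

# Hypothetical toy vocabulary and splits, not taken from any dataset.
states = ["wet", "dry"]
objects = ["dog", "road"]
seen_pairs = {("wet", "dog"), ("dry", "road")}
unseen_test_pairs = {("wet", "road")}

def closed_world_candidates(seen, unseen):
    """Seen pairs plus the unseen pairs known to appear at test time."""
    return sorted(seen | unseen)

def open_world_candidates(all_states, all_objects):
    """Every possible state-object pair, feasible or not."""
    return sorted(product(all_states, all_objects))

closed = closed_world_candidates(seen_pairs, unseen_test_pairs)
opened = open_world_candidates(states, objects)
```

The open-world space grows with the product of the vocabulary sizes, so a model must also rule out pairs that never occur, which is why the setting is generally harder.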

Implications and Future Directions

This research has implications for both practical applications and the theoretical understanding of CZSL. MSCI advances zero-shot recognition by compensating for limitations in CLIP's architecture. Practically, these improvements could support applications in fields like autonomous driving, where recognizing familiar objects in unseen states and scenarios is crucial.

From a theoretical standpoint, this work underscores the importance of considering fine-grained feature interactions in model architectures that are typically designed for broad cross-modal tasks. MSCI's success may prompt further exploration into modifying existing models like CLIP for more specialized tasks through adaptive interaction frameworks.

Moving forward, MSCI's approach could inspire research into even more granular control over feature aggregation and interaction strategies in neural architectures. This could advance not only CZSL but also other complex vision-language tasks that require nuanced comprehension of both visual and textual modalities.