
CLIP-SPM: Multimodal Subset Selection

Updated 29 December 2025
  • CLIP-SPM is a framework that utilizes a frozen CLIP backbone with lightweight adapters to bridge domain gaps and perform robust data subset selection.
  • It employs dual scoring strategies—semantic alignment measuring cosine similarity and diversity assessing intra-class variance—to filter noise and redundancy.
  • The framework optimizes a multi-objective loss via SGD, achieving superior performance over traditional single-modality and data-centric selection baselines.

CLIP-SPM (CLIP-powered Sample Selection via Multi-objective optimization) is a framework designed to leverage pre-trained multimodal vision-language models—specifically, CLIP—for robust, generalizable, and efficient data selection in large-scale labeled datasets. Using a frozen CLIP backbone augmented with lightweight adapters and a differentiable multi-objective optimization protocol, CLIP-SPM identifies the most semantically representative and diverse data samples under a strict cardinality constraint. This approach addresses the redundancy, noise, and domain gaps prevalent in real-world supervised learning scenarios, and consistently outperforms prior single-modality and data-centric selection baselines on standard benchmarks (Yang et al., 15 Oct 2024).

1. Conceptual Overview and Motivation

CLIP-SPM addresses the challenge of curating reduced yet powerful training subsets from large, noisy, and redundant collections. Standard approaches to data selection often rely on single-modality features, which are susceptible to data corruption, mislabelings, and class imbalance, thereby yielding suboptimal or unstable selections. CLIP-SPM exploits CLIP’s joint vision-language embedding space, augmented with dataset-specific alignment through adapters, enabling multimodal reasoning over sample fidelity and diversity.

At a high level, the system sequentially performs (1) dataset adaptation to reduce domain mismatch, (2) per-sample scoring for semantic class fit and diversity, and (3) a global, continuous relaxation of the combinatorial subset selection problem, solved using stochastic gradient descent (SGD) under explicit multi-objective loss constraints.

2. Dataset Adaptation Module

Although the CLIP backbone is pretrained on extensive web-scale data, domain gaps between the pretraining distribution and real-world datasets (e.g., CIFAR's low-resolution images or bespoke industrial imagery) substantially degrade direct applicability. CLIP-SPM remedies this by inserting dimension-preserving, two-layer MLP adapters $A_I: \mathbb{R}^d \to \mathbb{R}^d$ (image) and $A_T: \mathbb{R}^d \to \mathbb{R}^d$ (text), where $d = 512$, on top of CLIP's frozen encoders. These adapters are fine-tuned for $E = 25$ epochs using an InfoNCE-style contrastive loss:

$$\mathcal{L}_{\rm adapt} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp\left(\cos(A_I(E_I(I_i)), A_T(E_T(T_i))) / \tau\right)}{\sum_{j=1}^N \exp\left(\cos(A_I(E_I(I_i)), A_T(E_T(T_j))) / \tau\right)}$$

with typical hyperparameters: batch size $B = 512$, learning rate $1 \times 10^{-4}$, temperature $\tau = 0.07$. This module ensures that target dataset images and their class-text prompts are well-aligned in the embedding space, without altering the original CLIP backbone (Yang et al., 15 Oct 2024).
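
The adapter module can be prototyped in a few lines of PyTorch. The sketch below is illustrative rather than the authors' reference implementation: the hidden width, ReLU activation, and Adam optimizer are assumptions, while the embedding dimension, batch size, learning rate, and temperature follow the values quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Dimension-preserving two-layer MLP placed on top of a frozen CLIP encoder."""
    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def adaptation_loss(img_emb, txt_emb, image_adapter, text_adapter, tau: float = 0.07):
    """InfoNCE-style loss aligning adapted image embeddings with their paired class-text embeddings."""
    zi = F.normalize(image_adapter(img_emb), dim=-1)    # A_I(E_I(I_i)), unit-normalized
    zt = F.normalize(text_adapter(txt_emb), dim=-1)     # A_T(E_T(T_i)), unit-normalized
    logits = zi @ zt.t() / tau                          # pairwise cosine similarities / temperature
    targets = torch.arange(len(zi), device=zi.device)   # the i-th image matches the i-th prompt
    return F.cross_entropy(logits, targets)

# Usage sketch: precompute frozen CLIP features once, then fine-tune only the adapters.
img_feats = torch.randn(512, 512)   # stand-in for E_I(I) over a batch of B = 512
txt_feats = torch.randn(512, 512)   # stand-in for E_T(T) for the paired prompts
image_adapter, text_adapter = Adapter(), Adapter()
opt = torch.optim.Adam(list(image_adapter.parameters()) +
                       list(text_adapter.parameters()), lr=1e-4)
loss = adaptation_loss(img_feats, txt_feats, image_adapter, text_adapter)
loss.backward()
opt.step()
```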

3. Sample Scoring Strategies

After adaptation, each sample $(I_i, y_i)$ receives two complementary scores:

  • Semantic Alignment Score (SAS), $S_{Ai}$: Quantifies the cosine similarity between the adapted embeddings of each image and its class-conditioned text prompt $T_i =$ “A photo of $[y_i]$.” A high $S_{Ai}$ signifies a canonical, clean class exemplar; noisy or mislabeled samples yield low SAS.

$$S_{Ai} = \cos\left( A_I(E_I(I_i)),\ A_T(E_T(T_i)) \right)$$

  • Sample Diversity Score (SDS), $S_{Di}$: Measures the average $\ell_2$ distance of the image embedding to its $k \approx 0.1\, n_c$ nearest neighbors within its class, where $n_c$ is the class size. This captures local intra-class dispersion, so that highly redundant or non-diverse samples are downweighted.

$$S_{Di} = \frac{1}{k} \sum_{j \in \mathrm{KNN}(I_i)} \left\| A_I(E_I(I_i)) - A_I(E_I(I_j)) \right\|_2$$

Taken together, $(S_{Ai}, S_{Di})$ balance representativeness and diversity, robustly filtering out both noise and redundancy.
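
Both scores are simple to compute from the adapted embeddings. The following is a minimal sketch, assuming precomputed adapted image and prompt embeddings and an exact within-class nearest-neighbor search via torch.cdist; the neighbor-search implementation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_scores(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """S_Ai: cosine similarity between each adapted image embedding and its class-prompt embedding."""
    return F.cosine_similarity(img_emb, txt_emb, dim=-1)

def sample_diversity_scores(img_emb: torch.Tensor, labels: torch.Tensor,
                            neighbor_frac: float = 0.1) -> torch.Tensor:
    """S_Di: mean L2 distance to the k ~ 0.1 * n_c nearest same-class neighbors."""
    sds = torch.zeros(len(img_emb))
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n_c = len(idx)
        if n_c < 2:
            continue                                     # singleton classes have no within-class neighbors
        k = min(max(1, int(neighbor_frac * n_c)), n_c - 1)
        feats = img_emb[idx]
        dists = torch.cdist(feats, feats)                # pairwise L2 distances within the class
        knn_dists, _ = dists.topk(k + 1, largest=False)  # k nearest neighbors plus the sample itself
        sds[idx] = knn_dists[:, 1:].mean(dim=1)          # drop the zero self-distance, average the rest
    return sds

# Usage sketch: random stand-ins for adapted embeddings of images and of their
# "A photo of [class]" prompts, with 10 classes.
adapted_img = torch.randn(1000, 512)
adapted_txt = torch.randn(1000, 512)
labels = torch.randint(0, 10, (1000,))
sas = semantic_alignment_scores(adapted_img, adapted_txt)
sds = sample_diversity_scores(adapted_img, labels)
```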

4. Differentiable Subset Selection Optimization

The goal is to select exactly $k = s_r N$ samples out of $N$ total, where $s_r$ is the selection ratio. CLIP-SPM introduces a continuous “selection” vector $\boldsymbol{d} \in \mathbb{R}^N$ and leverages a sigmoid relaxation $\sigma(d_i)$, which is binarized after optimization into discrete selections $\hat{d}_i$. Optimization proceeds via SGD with a three-term loss:

  • Semantic loss (representativeness):

$$\mathcal{L}_{sa} = -\frac{1}{N} \sum_{i=1}^N \sigma(d_i)\, S_{Ai}$$

  • Diversity loss:

$$\mathcal{L}_{sd} = -\frac{1}{N} \sum_{i=1}^N \sigma(d_i)\, S_{Di}$$

  • Selection budget (exact size constraint):

$$\mathcal{L}_s = \sqrt{ \left( \frac{1}{N} \sum_{i=1}^N \mathrm{STE}(\hat{d}_i) - s_r \right)^2 }$$

where $\mathrm{STE}$ denotes the straight-through estimator, which enables gradient flow through the discrete threshold.

The joint loss is $\mathcal{L} = \mathcal{L}_{sa} + \alpha\,\mathcal{L}_{sd} + \beta\,\mathcal{L}_s$, with $\alpha \approx s_r$ and $\beta = 2$. This relaxation permits efficient, group-aware (i.e., non-myopic), end-to-end selection via standard gradient methods on CPUs/GPUs (Yang et al., 15 Oct 2024).
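
The relaxation and the STE-based budget penalty are straightforward to implement. The sketch below follows the loss definitions above but fixes illustrative choices the source does not specify (zero initialization of the selection vector, step count, learning rate, and a small epsilon inside the square root for numerical stability).

```python
import torch

class _STEThreshold(torch.autograd.Function):
    """Hard 0/1 threshold at 0.5 in the forward pass; identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, probs):
        return (probs > 0.5).float()
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def select_subset(sas: torch.Tensor, sds: torch.Tensor, selection_ratio: float,
                  beta: float = 2.0, steps: int = 1000, lr: float = 0.1) -> torch.Tensor:
    """Optimize the relaxed selection vector d and return a binary keep-mask."""
    n = len(sas)
    alpha = selection_ratio                       # alpha ~ s_r, as suggested above
    d = torch.zeros(n, requires_grad=True)        # continuous selection scores d_i
    opt = torch.optim.SGD([d], lr=lr)
    for _ in range(steps):
        p = torch.sigmoid(d)                      # sigma(d_i)
        hard = _STEThreshold.apply(p)             # STE(d_hat_i)
        loss_sa = -(p * sas).mean()               # semantic loss: reward alignment of kept samples
        loss_sd = -(p * sds).mean()               # diversity loss: reward dispersion of kept samples
        loss_s = torch.sqrt((hard.mean() - selection_ratio) ** 2 + 1e-12)  # budget penalty |fraction - s_r|
        loss = loss_sa + alpha * loss_sd + beta * loss_s
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(d) > 0.5                 # final binarization to d_hat

# Usage sketch: keep roughly 30% of 1,000 samples given precomputed scores.
sas, sds = torch.rand(1000), torch.rand(1000)
mask = select_subset(sas, sds, selection_ratio=0.3)
print(int(mask.sum()), "samples selected")
```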

5. Empirical Evaluation and Ablation Analysis

CLIP-SPM demonstrates consistent improvements over ten strong baselines—including EL2N, GraNd, MoSo, Glister, Herding, CG-Score, SSP, Moderate-DS, Forgetting, and random selection—across diverse image classification datasets.

Quantitative evidence includes:

  • Tiny-ImageNet (90% selection): 8.13% absolute accuracy lift over all prior methods.
  • ImageNet-1k (90% selection): Maintains near-lossless top-5 accuracy and shows a 4.4% improvement under 20% label noise.
  • CIFAR-100 (20% noisy labels): Achieves 45.6% accuracy compared to 35.4% for the next best, successfully filtering 99.8% of noise.
  • Corrupted images (20% severity): Outperforms noise-agnostic baselines by over 5%.
  • Cross-architecture transfer: Subsets selected with ResNet-50 generalize seamlessly to ViT-B, Swin-T, VGG-16, DenseNet-121 without loss.
  • Robustness benchmarks: On ImageNet-A/H/R, models trained on CLIP-SPM-selected subsets surpassed full-data training on robustness metrics.

Ablation results confirm that the adapters are essential for aligning domain shifts (–2.1% accuracy without them), that both SAS and SDS are necessary for optimal balance (–0.5% and –0.2% accuracy drops when omitted, respectively), and that the explicit budget penalty ($\mathcal{L}_s$) yields more effective sample group interactions than greedy thresholding (–1.9% drop without it).

6. Comparison with Prior Selection Approaches

CLIP-SPM is distinct from prior data selection and subset optimization methodologies that rely on unimodal representations or simple importance heuristics. In contrast to Selective Vision–Language Subspace Projection (SSP) (Zhu et al., 24 Jul 2024), which aligns features via SVD-based subspace projections but does not perform selection under hard sample budget constraints or explicit diversity/representativeness tradeoffs, CLIP-SPM is explicitly formulated as a multi-objective selection optimization problem. Its joint consideration of semantic fidelity and intra-class diversity directly penalizes both over-represented and noisy clusters, which has not been shown in SSP or earlier works.

SSP can serve as a robust feature alignment strategy for few-shot scenarios, whereas CLIP-SPM is designed to operate in the static subset selection regime for large-scale, potentially label-noisy datasets (Yang et al., 15 Oct 2024, Zhu et al., 24 Jul 2024).

7. Limitations, Extensions, and Significance

CLIP-SPM depends critically on (1) the efficacy of the adapters in bridging the domain gap, and (2) the reliability of the CLIP backbone's joint embedding space post-adaptation. While extensive experiments demonstrate robust elimination of label and feature noise, a plausible implication is that highly non-visual tasks or those with extreme class imbalance may require further prompt engineering or structural modifications.

The paradigm of grounding data selection in multimodal, contrastively pre-trained models marks a notable shift towards leveraging large-scale, generalist feature spaces for core data-centric decisions. Future extensions may include integration with continuous data curation pipelines, active learning couplings, or porting to richer multimodal backbones.

In conclusion, CLIP-SPM establishes a principled, scalable, and robust protocol for static supervised data subset selection, uniting the strengths of multimodal supervision, lightweight domain adaptation, and differentiable combinatorial optimization. It enables practitioners to increase model robustness, reduce computational overhead, and ultimately improve the quality and reliability of downstream supervised learning systems (Yang et al., 15 Oct 2024).
