
CLIP-SPM: Multimodal Subset Selection

Updated 29 December 2025
  • CLIP-SPM is a framework that utilizes a frozen CLIP backbone with lightweight adapters to bridge domain gaps and perform robust data subset selection.
  • It employs dual scoring strategies—semantic alignment measuring cosine similarity and diversity assessing intra-class variance—to filter noise and redundancy.
  • The framework optimizes a multi-objective loss via SGD, achieving superior performance over traditional single-modality and data-centric selection baselines.

CLIP-SPM (CLIP-powered Sample Selection via Multi-objective optimization) is a framework designed to leverage pre-trained multimodal vision-language models—specifically, CLIP—for robust, generalizable, and efficient data selection in large-scale labeled datasets. Using a frozen CLIP backbone augmented with lightweight adapters and a differentiable multi-objective optimization protocol, CLIP-SPM identifies the most semantically representative and diverse data samples under a strict cardinality constraint. This approach addresses the redundancy, noise, and domain gaps prevalent in real-world supervised learning scenarios, and consistently outperforms prior single-modality and data-centric selection baselines on standard benchmarks (Yang et al., 15 Oct 2024).

1. Conceptual Overview and Motivation

CLIP-SPM addresses the challenge of curating reduced yet powerful training subsets from large, noisy, and redundant collections. Standard approaches to data selection often rely on single-modality features, which are susceptible to data corruption, mislabelings, and class imbalance, thereby yielding suboptimal or unstable selections. CLIP-SPM exploits CLIP’s joint vision-language embedding space, augmented with dataset-specific alignment through adapters, enabling multimodal reasoning over sample fidelity and diversity.

At a high level, the system sequentially performs (1) dataset adaptation to reduce domain mismatch, (2) per-sample scoring for semantic class fit and diversity, and (3) a global, continuous relaxation of the combinatorial subset selection problem, solved using stochastic gradient descent (SGD) under explicit multi-objective loss constraints.

2. Dataset Adaptation Module

Although the CLIP backbone is pretrained on extensive web-scale data, domain gaps between the pretraining distribution and real-world datasets (e.g., CIFAR's low-resolution images or bespoke industrial imagery) substantially degrade direct applicability. CLIP-SPM remedies this by inserting dimension-preserving, two-layer MLP adapters $A_I: \mathbb{R}^d \to \mathbb{R}^d$ (image) and $A_T: \mathbb{R}^d \to \mathbb{R}^d$ (text), where $d = 512$, on top of CLIP's frozen encoders. These adapters are fine-tuned for $E = 25$ epochs using an InfoNCE-style contrastive loss:

$$\mathcal{L}_{\rm adapt} = -\frac{1}{N} \sum_{i=1}^N \log \frac{\exp\left(\cos(A_I(E_I(I_i)), A_T(E_T(T_i))) / \tau\right)}{\sum_{j=1}^N \exp\left(\cos(A_I(E_I(I_i)), A_T(E_T(T_j))) / \tau\right)}$$

with typical hyperparameters: batch size $B = 512$, learning rate $1 \times 10^{-4}$, temperature $\tau = 0.07$. This module ensures that target dataset images and their class-text prompts are well-aligned in the embedding space, without altering the original CLIP backbone (Yang et al., 15 Oct 2024).
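
The adapter module can be prototyped in a few lines of PyTorch. The sketch below is illustrative rather than the authors' reference implementation: the hidden width, ReLU activation, and Adam optimizer are assumptions, while the embedding dimension, batch size, learning rate, and temperature follow the values quoted above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Dimension-preserving two-layer MLP placed on top of a frozen CLIP encoder."""
    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def adaptation_loss(img_emb, txt_emb, image_adapter, text_adapter, tau: float = 0.07):
    """InfoNCE-style loss aligning adapted image embeddings with their paired class-text embeddings."""
    zi = F.normalize(image_adapter(img_emb), dim=-1)    # A_I(E_I(I_i)), unit-normalized
    zt = F.normalize(text_adapter(txt_emb), dim=-1)     # A_T(E_T(T_i)), unit-normalized
    logits = zi @ zt.t() / tau                          # pairwise cosine similarities / temperature
    targets = torch.arange(len(zi), device=zi.device)   # the i-th image matches the i-th prompt
    return F.cross_entropy(logits, targets)

# Usage sketch: precompute frozen CLIP features once, then fine-tune only the adapters.
img_feats = torch.randn(512, 512)   # stand-in for E_I(I) over a batch of B = 512
txt_feats = torch.randn(512, 512)   # stand-in for E_T(T) for the paired prompts
image_adapter, text_adapter = Adapter(), Adapter()
opt = torch.optim.Adam(list(image_adapter.parameters()) +
                       list(text_adapter.parameters()), lr=1e-4)
loss = adaptation_loss(img_feats, txt_feats, image_adapter, text_adapter)
loss.backward()
opt.step()
```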

3. Sample Scoring Strategies

After adaptation, each sample $(I_i, y_i)$ receives two complementary scores:

  • Semantic Alignment Score (SAS), $S_{Ai}$: Quantifies the cosine similarity between the adapted embeddings of each image and its class-conditioned text prompt $T_i =$ “A photo of $[y_i]$.” A high $S_{Ai}$ signifies a canonical, clean class exemplar; noisy or mislabeled samples yield low SAS.

$$S_{Ai} = \cos\left( A_I(E_I(I_i)),\ A_T(E_T(T_i)) \right)$$

  • Sample Diversity Score (SDS), $S_{Di}$: Measures the average $\ell_2$ distance of the image embedding to its $k \approx 0.1\, n_c$ nearest neighbors within its class, where $n_c$ is the class size. This captures local intra-class dispersion, so that highly redundant or non-diverse samples are downweighted.

$$S_{Di} = \frac{1}{k} \sum_{j \in \mathrm{KNN}(I_i)} \left\| A_I(E_I(I_i)) - A_I(E_I(I_j)) \right\|_2$$

Taken together, $(S_{Ai}, S_{Di})$ balance representativeness and diversity, robustly filtering out both noise and redundancy.
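
Both scores are simple to compute from the adapted embeddings. The following is a minimal sketch, assuming precomputed adapted image and prompt embeddings and an exact within-class nearest-neighbor search via torch.cdist; the neighbor-search implementation used in the paper may differ.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_scores(img_emb: torch.Tensor, txt_emb: torch.Tensor) -> torch.Tensor:
    """S_Ai: cosine similarity between each adapted image embedding and its class-prompt embedding."""
    return F.cosine_similarity(img_emb, txt_emb, dim=-1)

def sample_diversity_scores(img_emb: torch.Tensor, labels: torch.Tensor,
                            neighbor_frac: float = 0.1) -> torch.Tensor:
    """S_Di: mean L2 distance to the k ~ 0.1 * n_c nearest same-class neighbors."""
    sds = torch.zeros(len(img_emb))
    for c in labels.unique():
        idx = (labels == c).nonzero(as_tuple=True)[0]
        n_c = len(idx)
        if n_c < 2:
            continue                                     # singleton classes have no within-class neighbors
        k = min(max(1, int(neighbor_frac * n_c)), n_c - 1)
        feats = img_emb[idx]
        dists = torch.cdist(feats, feats)                # pairwise L2 distances within the class
        knn_dists, _ = dists.topk(k + 1, largest=False)  # k nearest neighbors plus the sample itself
        sds[idx] = knn_dists[:, 1:].mean(dim=1)          # drop the zero self-distance, average the rest
    return sds

# Usage sketch: random stand-ins for adapted embeddings of images and of their
# "A photo of [class]" prompts, with 10 classes.
adapted_img = torch.randn(1000, 512)
adapted_txt = torch.randn(1000, 512)
labels = torch.randint(0, 10, (1000,))
sas = semantic_alignment_scores(adapted_img, adapted_txt)
sds = sample_diversity_scores(adapted_img, labels)
```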

4. Differentiable Subset Selection Optimization

The goal is to select exactly $k = s_r N$ samples out of $N$ total, where $s_r$ is the selection ratio. CLIP-SPM introduces a continuous “selection” vector $\boldsymbol{d} \in \mathbb{R}^N$ and leverages a sigmoid relaxation $\sigma(d_i)$, which is binarized after optimization into discrete selections $\hat{d}_i$. Optimization proceeds via SGD with a three-term loss:

  • Semantic loss (representativeness):

$$\mathcal{L}_{sa} = -\frac{1}{N} \sum_{i=1}^N \sigma(d_i)\, S_{Ai}$$

  • Diversity loss:

$$\mathcal{L}_{sd} = -\frac{1}{N} \sum_{i=1}^N \sigma(d_i)\, S_{Di}$$

  • Selection budget (exact size constraint):

$$\mathcal{L}_s = \sqrt{ \left( \frac{1}{N} \sum_{i=1}^N \mathrm{STE}(\hat{d}_i) - s_r \right)^2 }$$

where $\mathrm{STE}$ denotes the straight-through estimator, which enables gradient flow through the discrete threshold.

The joint loss is $\mathcal{L} = \mathcal{L}_{sa} + \alpha\,\mathcal{L}_{sd} + \beta\,\mathcal{L}_s$, with $\alpha \approx s_r$ and $\beta = 2$. This relaxation permits efficient, group-aware (i.e., non-myopic), end-to-end selection via standard gradient methods on CPUs/GPUs (Yang et al., 15 Oct 2024).
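
The relaxation and the STE-based budget penalty are straightforward to implement. The sketch below follows the loss definitions above but fixes illustrative choices the source does not specify (zero initialization of the selection vector, step count, learning rate, and a small epsilon inside the square root for numerical stability).

```python
import torch

class _STEThreshold(torch.autograd.Function):
    """Hard 0/1 threshold at 0.5 in the forward pass; identity gradient in the backward pass."""
    @staticmethod
    def forward(ctx, probs):
        return (probs > 0.5).float()
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output

def select_subset(sas: torch.Tensor, sds: torch.Tensor, selection_ratio: float,
                  beta: float = 2.0, steps: int = 1000, lr: float = 0.1) -> torch.Tensor:
    """Optimize the relaxed selection vector d and return a binary keep-mask."""
    n = len(sas)
    alpha = selection_ratio                       # alpha ~ s_r, as suggested above
    d = torch.zeros(n, requires_grad=True)        # continuous selection scores d_i
    opt = torch.optim.SGD([d], lr=lr)
    for _ in range(steps):
        p = torch.sigmoid(d)                      # sigma(d_i)
        hard = _STEThreshold.apply(p)             # STE(d_hat_i)
        loss_sa = -(p * sas).mean()               # semantic loss: reward alignment of kept samples
        loss_sd = -(p * sds).mean()               # diversity loss: reward dispersion of kept samples
        loss_s = torch.sqrt((hard.mean() - selection_ratio) ** 2 + 1e-12)  # budget penalty |fraction - s_r|
        loss = loss_sa + alpha * loss_sd + beta * loss_s
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(d) > 0.5                 # final binarization to d_hat

# Usage sketch: keep roughly 30% of 1,000 samples given precomputed scores.
sas, sds = torch.rand(1000), torch.rand(1000)
mask = select_subset(sas, sds, selection_ratio=0.3)
print(int(mask.sum()), "samples selected")
```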

5. Empirical Evaluation and Ablation Analysis

CLIP-SPM demonstrates consistent improvements over ten strong baselines—including EL2N, GraNd, MoSo, Glister, Herding, CG-Score, SSP, Moderate-DS, Forgetting, and random selection—across diverse image classification datasets.

Quantitative evidence includes:

  • Tiny-ImageNet (90% selection): 8.13% absolute accuracy lift over all prior methods.
  • ImageNet-1k (90% selection): Maintains near-lossless top-5 accuracy and shows a 4.4% improvement under 20% label noise.
  • CIFAR-100 (20% noisy labels): Achieves 45.6% accuracy compared to 35.4% for the next best, successfully filtering 99.8% of noise.
  • Corrupted images (20% severity): Outperforms noise-agnostic baselines by over 5%.
  • Cross-architecture transfer: Subsets selected with ResNet-50 generalize seamlessly to ViT-B, Swin-T, VGG-16, DenseNet-121 without loss.
  • Robustness benchmarks: On ImageNet-A/H/R, models trained on CLIP-SPM-selected subsets surpassed full-data training on robustness metrics.

Ablation results confirm that the adapters are essential for aligning domain shifts (–2.1% accuracy without them), that both SAS and SDS are necessary for optimal balance (–0.5% and –0.2% accuracy drops when omitted, respectively), and that the explicit budget penalty ($\mathcal{L}_s$) yields more effective sample group interactions than greedy thresholding (–1.9% drop without it).

6. Comparison with Prior Selection Approaches

CLIP-SPM is distinct from prior data selection and subset optimization methodologies that rely on unimodal representations or simple importance heuristics. In contrast to Selective Vision–Language Subspace Projection (SSP) (Zhu et al., 24 Jul 2024), which aligns features via SVD-based subspace projections but does not perform selection under hard sample budget constraints or explicit diversity/representativeness tradeoffs, CLIP-SPM is explicitly formulated as a multi-objective selection optimization problem. Its joint consideration of semantic fidelity and intra-class diversity directly penalizes both over-represented and noisy clusters, which has not been shown in SSP or earlier works.

SSP can serve as a robust feature alignment strategy for few-shot scenarios, whereas CLIP-SPM is designed to operate in the static subset selection regime for large-scale, potentially label-noisy datasets (Yang et al., 15 Oct 2024, Zhu et al., 24 Jul 2024).

7. Limitations, Extensions, and Significance

CLIP-SPM depends critically on (1) the efficacy of the adapters in bridging the domain gap, and (2) the reliability of the CLIP backbone's joint embedding space post-adaptation. While extensive experiments demonstrate robust elimination of label and feature noise, a plausible implication is that highly non-visual tasks or those with extreme class imbalance may require further prompt engineering or structural modifications.

The paradigm of grounding data selection in multimodal, contrastively pre-trained models marks a notable shift towards leveraging large-scale, generalist feature spaces for core data-centric decisions. Future extensions may include integration with continuous data curation pipelines, active learning couplings, or porting to richer multimodal backbones.

In conclusion, CLIP-SPM establishes a principled, scalable, and robust protocol for static supervised data subset selection, uniting the strengths of multimodal supervision, lightweight domain adaptation, and differentiable combinatorial optimization. It enables practitioners to increase model robustness, reduce computational overhead, and ultimately improve the quality and reliability of downstream supervised learning systems (Yang et al., 15 Oct 2024).
