CLIP-SPM: Multimodal Subset Selection
- CLIP-SPM is a framework that utilizes a frozen CLIP backbone with lightweight adapters to bridge domain gaps and perform robust data subset selection.
- It employs dual scoring strategies—semantic alignment measuring cosine similarity and diversity assessing intra-class variance—to filter noise and redundancy.
- The framework optimizes a multi-objective loss via SGD, achieving superior performance over traditional single-modality and data-centric selection baselines.
CLIP-SPM (CLIP-powered Sample Selection via Multi-objective optimization) is a framework designed to leverage pre-trained multimodal vision-language models—specifically, CLIP—for robust, generalizable, and efficient data selection in large-scale labeled datasets. Using a frozen CLIP backbone augmented with lightweight adapters and a differentiable multi-objective optimization protocol, CLIP-SPM identifies the most semantically representative and diverse data samples under a strict cardinality constraint. This approach addresses the redundancy, noise, and domain gaps prevalent in real-world supervised learning scenarios, and consistently outperforms prior single-modality and data-centric selection baselines on standard benchmarks (Yang et al., 15 Oct 2024).
1. Conceptual Overview and Motivation
CLIP-SPM addresses the challenge of curating reduced yet powerful training subsets from large, noisy, and redundant collections. Standard approaches to data selection often rely on single-modality features, which are susceptible to data corruption, mislabelings, and class imbalance, thereby yielding suboptimal or unstable selections. CLIP-SPM exploits CLIP’s joint vision-language embedding space, augmented with dataset-specific alignment through adapters, enabling multimodal reasoning over sample fidelity and diversity.
At a high level, the system sequentially performs (1) dataset adaptation to reduce domain mismatch, (2) per-sample scoring for semantic class fit and diversity, and (3) a global, continuous relaxation of the combinatorial subset selection problem, solved using stochastic gradient descent (SGD) under explicit multi-objective loss constraints.
2. Dataset Adaptation Module
Although the CLIP backbone is pretrained on extensive web-scale data, domain gaps between the pretraining distribution and real-world datasets (e.g., CIFAR's low-resolution images or bespoke industrial imagery) substantially degrade direct applicability. CLIP-SPM remedies this by inserting dimension-preserving, two-layer MLP adapters $A_I$ (image) and $A_T$ (text), with $A_I, A_T : \mathbb{R}^d \to \mathbb{R}^d$, on top of CLIP's frozen encoders. These adapters are fine-tuned for a fixed number of epochs using an InfoNCE-style contrastive loss of the standard symmetric form
$$\mathcal{L}_{\text{align}} = -\frac{1}{2B}\sum_{i=1}^{B}\left[\log\frac{\exp\big(\cos(\tilde{f}^I_i, \tilde{f}^T_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\cos(\tilde{f}^I_i, \tilde{f}^T_j)/\tau\big)} + \log\frac{\exp\big(\cos(\tilde{f}^I_i, \tilde{f}^T_i)/\tau\big)}{\sum_{j=1}^{B}\exp\big(\cos(\tilde{f}^I_j, \tilde{f}^T_i)/\tau\big)}\right],$$
where $\tilde{f}^I_i = A_I(f^I_i)$ and $\tilde{f}^T_i = A_T(f^T_i)$ are the adapted image and text embeddings of the $i$-th pair in a batch of size $B$, and $\tau$ is the temperature. The batch size, learning rate, and temperature $\tau$ are set to the standard values reported in the paper. This module ensures that target dataset images and their class-text prompts are well-aligned in the embedding space, without altering the original CLIP backbone (Yang et al., 15 Oct 2024).
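The following minimal PyTorch-style sketch illustrates the adapter structure and a symmetric InfoNCE objective of the form described above. The class name `Adapter`, the hidden width, the learning rate, and the training-loop details are illustrative assumptions, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Adapter(nn.Module):
    """Dimension-preserving two-layer MLP placed on top of a frozen CLIP encoder.
    Hidden width and activation are illustrative assumptions."""
    def __init__(self, dim: int = 512, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x):
        # L2-normalize so that dot products below act as cosine similarities.
        return F.normalize(self.net(x), dim=-1)

def info_nce_loss(img_emb, txt_emb, tau=0.07):
    """Symmetric InfoNCE over a batch of adapted image/text embedding pairs."""
    logits = img_emb @ txt_emb.t() / tau                       # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Only the adapters are trained; the CLIP backbone stays frozen, so its image and
# text features can be precomputed once and streamed from a feature loader.
adapter_i, adapter_t = Adapter(), Adapter()
optimizer = torch.optim.SGD(
    list(adapter_i.parameters()) + list(adapter_t.parameters()), lr=1e-3)
# for img_feat, txt_feat in feature_loader:
#     loss = info_nce_loss(adapter_i(img_feat), adapter_t(txt_feat))
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```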
3. Sample Scoring Strategies
After adaptation, each sample receives two complementary scores:
- Semantic Alignment Score (SAS): Quantifies the cosine similarity between the adapted embeddings of each image and its class-conditioned text prompt "A photo of a [class]." A high SAS signifies a canonical, clean class exemplar; noisy or mislabeled samples yield low SAS.
- Sample Diversity Score (SDS): Measures the average distance of the image embedding to its $k$ nearest neighbors within its class, with $k$ set relative to the class size. This captures local intra-class dispersion, so that highly redundant or non-diverse samples are downweighted.
Taken together, SAS and SDS balance representativeness and diversity, robustly filtering out both noise and redundancy.
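A compact sketch of how the two scores could be computed from adapted embeddings is given below. The function names and the rule for choosing the neighborhood size `k` are assumptions for illustration, not the paper's reference implementation.

```python
import torch
import torch.nn.functional as F

def semantic_alignment_score(img_emb, label_txt_emb):
    """SAS: cosine similarity between each adapted image embedding (N, d) and the
    adapted text embedding of its labeled class prompt (N, d)."""
    return F.cosine_similarity(img_emb, label_txt_emb, dim=-1)        # shape (N,)

def sample_diversity_score(class_img_emb, k=None):
    """SDS within one class: mean distance of each sample to its k nearest
    intra-class neighbors. Scaling k with the class size is an assumption."""
    n = class_img_emb.size(0)
    if n < 2:
        return torch.zeros(n)
    if k is None:
        k = max(1, n // 10)
    dists = torch.cdist(class_img_emb, class_img_emb)                 # (n, n) pairwise distances
    knn, _ = dists.topk(k + 1, dim=1, largest=False)                  # nearest k+1 (self included)
    return knn[:, 1:].mean(dim=1)                                     # drop self-distance, average
```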
4. Differentiable Subset Selection Optimization
The goal is to select exactly $m = \lceil \alpha N \rceil$ samples out of $N$ total, where $\alpha$ is the selection ratio. CLIP-SPM introduces a continuous selection vector $w \in \mathbb{R}^N$ and leverages a sigmoid relaxation $s = \sigma(w) \in (0,1)^N$, which post-optimization is binarized to discrete selections $\hat{s} \in \{0,1\}^N$. The optimization is via SGD with a three-term loss:
- Semantic loss (representativeness), rewarding the selection of high-SAS samples: $\mathcal{L}_{\text{sem}} = -\tfrac{1}{m}\sum_{i=1}^{N} \hat{s}_i \, \mathrm{SAS}_i$
- Diversity loss, rewarding intra-class dispersion among selected samples: $\mathcal{L}_{\text{div}} = -\tfrac{1}{m}\sum_{i=1}^{N} \hat{s}_i \, \mathrm{SDS}_i$
- Selection budget (exact size constraint): $\mathcal{L}_{\text{bud}} = \big(\sum_{i=1}^{N} \hat{s}_i - m\big)^2$
where $\hat{s}_i = \mathrm{STE}(\sigma(w_i))$ and $\mathrm{STE}$ is the straight-through estimator enabling gradient flow through the discrete threshold.
The joint loss is $\mathcal{L} = \mathcal{L}_{\text{sem}} + \lambda_1 \mathcal{L}_{\text{div}} + \lambda_2 \mathcal{L}_{\text{bud}}$, with coefficients $\lambda_1, \lambda_2$ weighting diversity and the budget penalty. This relaxation permits efficient, group-aware (i.e., non-myopic), end-to-end selection via standard gradient methods on CPUs/GPUs (Yang et al., 15 Oct 2024).
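The sketch below illustrates this relaxed selection procedure: a sigmoid-transformed selection vector, a straight-through binarization, and the three loss terms optimized with SGD. The loss weights, step count, learning rate, and normalizations are illustrative assumptions under the notation above.

```python
import torch

def straight_through_binarize(s):
    """Hard threshold in the forward pass, identity gradient in the backward pass."""
    hard = (s > 0.5).float()
    return hard + s - s.detach()

def select_subset(sas, sds, m, steps=2000, lr=0.1, lam_div=1.0, lam_bud=1.0):
    """Optimize a continuous selection vector w over N samples so that roughly m
    high-SAS, high-SDS samples are chosen; lam_div/lam_bud are placeholder weights."""
    w = torch.zeros_like(sas, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    for _ in range(steps):
        s_hat = straight_through_binarize(torch.sigmoid(w))
        loss_sem = -(s_hat * sas).sum() / m            # reward selecting clean class exemplars
        loss_div = -(s_hat * sds).sum() / m            # reward intra-class diversity
        loss_bud = (s_hat.sum() - m) ** 2 / m ** 2     # keep the subset size near exactly m
        loss = loss_sem + lam_div * loss_div + lam_bud * loss_bud
        opt.zero_grad(); loss.backward(); opt.step()
    # Take the m largest relaxed scores to guarantee the exact budget at the end.
    return torch.topk(w.detach(), m).indices
```

Returning the top-m entries of the relaxed vector is one simple way to enforce the hard cardinality constraint after optimization; the soft budget penalty alone only drives the subset size toward m.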
5. Empirical Evaluation and Ablation Analysis
CLIP-SPM demonstrates consistent improvements over ten strong baselines—including EL2N, GraNd, MoSo, Glister, Herding, CG-Score, SSP, Moderate-DS, Forgetting, and random selection—across diverse image classification datasets.
Quantitative evidence includes:
- Tiny-ImageNet (90% selection): 8.13% absolute accuracy lift over all prior methods.
- ImageNet-1k (90% selection): Maintains near-lossless top-5 accuracy and shows a 4.4% improvement under 20% label noise.
- CIFAR-100 (20% noisy labels): Achieves 45.6% accuracy compared to 35.4% for the next best, successfully filtering 99.8% of noise.
- Corrupted images (20% severity): Outperforms noise-agnostic baselines by over 5%.
- Cross-architecture transfer: Subsets selected with ResNet-50 transfer to ViT-B, Swin-T, VGG-16, and DenseNet-121 without accuracy loss.
- Robustness benchmarks: On ImageNet-A/H/R, models trained on CLIP-SPM-selected subsets surpassed full-data training in robustness metrics.
Ablation results confirm that adapters are essential for aligning domain shifts (–2.1% accuracy without adapters), both SAS and SDS are necessary for optimal balance (–0.5% and –0.2% accuracy drops if omitted), and the explicit budget penalty ($\mathcal{L}_{\text{bud}}$) yields more effective sample group interactions than greedy thresholding (–1.9% drop without).
6. Comparison to Related Methods
CLIP-SPM is distinct from prior data selection and subset optimization methodologies that rely on unimodal representations or simple importance heuristics. In contrast to Selective Vision–Language Subspace Projection (SSP) (Zhu et al., 24 Jul 2024), which aligns features via SVD-based subspace projections but does not conduct selection under hard sample budget constraints or explicit diversity/representativeness tradeoffs, CLIP-SPM is explicitly formulated as a multi-objective selection optimization problem. The joint consideration of semantic fidelity and intra-class diversity directly penalizes both over-represented and noisy clusters, which has not been shown in SSP or earlier works.
SSP can serve as a robust feature alignment strategy for few-shot scenarios, whereas CLIP-SPM is designed to operate in the static subset selection regime for large-scale, potentially label-noisy datasets (Yang et al., 15 Oct 2024, Zhu et al., 24 Jul 2024).
7. Limitations, Extensions, and Significance
CLIP-SPM depends critically on (1) the efficacy of the adapters in bridging the domain gap, and (2) the reliability of the CLIP backbone's joint embedding space post-adaptation. While extensive experiments demonstrate robust elimination of label and feature noise, a plausible implication is that highly non-visual tasks or those with extreme class imbalance may require further prompt engineering or structural modifications.
The paradigm of grounding data selection in multimodal, contrastively pre-trained models marks a notable shift towards leveraging large-scale, generalist feature spaces for core data-centric decisions. Future extensions may include integration with continuous data curation pipelines, active learning couplings, or porting to richer multimodal backbones.
In conclusion, CLIP-SPM establishes a principled, scalable, and robust protocol for static supervised data subset selection, uniting the strengths of multimodal supervision, lightweight domain adaptation, and differentiable combinatorial optimization. It enables practitioners to increase model robustness, reduce computational overhead, and ultimately improve the quality and reliability of downstream supervised learning systems (Yang et al., 15 Oct 2024).