Decomposed CLIPScore (dCS): Fine-Grained Evaluation
- The paper introduces a method that decomposes global CLIPScore into token-level and noun-level contributions using masking and conformal risk control.
- dCS identifies misaligned words and objects by analyzing the impact of masking tokens or patches, enabling precise error localization in caption evaluation.
- The approach extends traditional CLIPScore with robust uncertainty estimation and risk calibration, significantly improving hallucination detection and model curation.
Decomposed CLIPScore (dCS) is a class of CLIP-based evaluation metrics designed to provide fine-grained, reference-free semantic alignment assessment between images and candidate captions. By decomposing the standard global CLIPScore into word- or phrase-level (especially noun-level) contributions or by analyzing the impact of masking tokens/patches on compatibility scores, dCS enables granular error localization and supports robust uncertainty calibration in caption quality evaluation. Recent formulations extend this notion with conformal risk control and noun-centric granularity, enabling both token-level error detection and improved robustness for detecting object hallucination in vision-LLMs.
1. Foundations: CLIPScore and its Limitations
CLIPScore, as introduced by Hessel et al. (2021), measures the cosine similarity between CLIP's $\ell_2$-normalized image embedding $\mathbf{e}_v$ and caption embedding $\mathbf{e}_c$, clipped at zero and rescaled by a constant $w$:

$$\mathrm{CLIPScore}(c, v) = w \cdot \max\bigl(\cos(\mathbf{e}_c, \mathbf{e}_v),\, 0\bigr), \qquad w = 2.5.$$
CLIPScore correlates well with human judgments of caption quality, yet it operates solely at the global (sentence–image) level: it provides no insight into the contribution of individual words, offers no error localization, and cannot disambiguate object hallucination or partial misalignment in composite scenes. The original metric does not attempt per-dimension, per-layer, or per-word decompositions (Hessel et al., 2021).
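For concreteness, a minimal sketch of this global score using the Hugging Face `transformers` CLIP implementation; the ViT-B/32 checkpoint and the `clipscore` helper name are our assumptions, not the original CLIPScore release:

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Assumption: ViT-B/32 checkpoint, as used in the original CLIPScore paper.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clipscore(image: Image.Image, caption: str) -> float:
    """Global CLIPScore: rescaled, zero-clipped cosine similarity of CLIP embeddings."""
    inputs = processor(text=[caption], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)   # l2-normalize
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    cos = (img_emb * txt_emb).sum(dim=-1)                    # cosine similarity
    return (2.5 * torch.clamp(cos, min=0.0)).item()          # w = 2.5 rescaling
```

The sketches later in this article reuse this `clipscore` helper together with `model` and `processor`.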
2. Per-Word/Token Decomposition: Masking-Based dCS
Decomposed CLIPScore (dCS) as formulated by conformal risk-control approaches operationalizes token-level score attribution via randomized masking and difference analysis (Gomes et al., 1 Apr 2025). The method involves the following key steps:
- Input: Let $v$ be an image, $c$ its candidate caption, and $\mathbf{e}_v$, $\mathbf{e}_c$ the $\ell_2$-normalized CLIP encodings.
- Perturbed Masking: For subsets of tokens $m$ randomly masked in the caption (and potentially analogous patch masking in the image), compute the change in CLIPScore $\Delta_m = \mathrm{CLIPScore}(c_{\setminus m}, v) - \mathrm{CLIPScore}(c, v)$, where $c_{\setminus m}$ denotes $c$ with the tokens in $m$ masked.
- Token Attribution: Aggregate the changes $\Delta_m$ for each token $t$ over all masking trials in which $t \in m$: $s_t = \frac{1}{|\{m : t \in m\}|} \sum_{m \ni t} \Delta_m$.
- Sigmoid Mapping: Final per-token dCS scores are given by $\mathrm{dCS}(t) = \sigma(s_t)$, where higher values indicate misalignment or likely errors.
This dCS approach supports granular identification of erroneous words, as illustrated by masking “dog” in “A dog sits on a mat” paired with a cat image, yielding a high dCS for “dog” and highlighting its incompatibility (Gomes et al., 1 Apr 2025).
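As an illustration of this masking-and-difference procedure, the sketch below reuses the `clipscore` helper from Section 1; the whitespace token split, `[MASK]` string, trial count, and masking fraction are illustrative assumptions rather than the exact protocol of Gomes et al.:

```python
import random
import torch

@torch.no_grad()
def dcs_token_scores(image, caption, n_trials=100, mask_frac=0.3, mask_token="[MASK]"):
    """Per-token dCS via random masking; higher score suggests misalignment."""
    tokens = caption.split()           # word-level split, for illustration only
    base = clipscore(image, caption)   # score of the unmasked caption
    deltas = [[] for _ in tokens]      # collected score changes per position

    for _ in range(n_trials):
        # Randomly mask a subset of token positions.
        k = max(1, int(mask_frac * len(tokens)))
        masked_idx = random.sample(range(len(tokens)), k)
        masked = [mask_token if i in masked_idx else w for i, w in enumerate(tokens)]
        # Change in CLIPScore when this subset is masked out of the caption.
        delta = clipscore(image, " ".join(masked)) - base
        for i in masked_idx:
            deltas[i].append(delta)

    # Aggregate per token and map through a sigmoid: values above 0.5 mean the
    # score rose when the token was masked, i.e. the token hurt alignment.
    scores = []
    for word, ds in zip(tokens, deltas):
        mean_delta = sum(ds) / len(ds) if ds else 0.0
        scores.append((word, torch.sigmoid(torch.tensor(mean_delta)).item()))
    return scores
```

For the "dog"/cat example above, masked variants that drop "dog" tend to score higher than the original caption, pushing its sigmoid-mapped score above 0.5.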
3. Noun-Level and Phrase-Level dCS: Fine-Grained Variants
A complementary line of work operationalizes dCS at the noun or noun-phrase level, arguing that object-centric granularity is critical for hallucination detection (Oh et al., 27 Feb 2025). The procedure is:
- Noun Extraction: Using POS/chunking (e.g., spaCy), extract nouns from the caption.
- Noun Embedding: Embed each extracted noun $n_i$ with CLIP's text encoder to obtain $\mathbf{e}_{n_i}$.
- Noun–Image Similarity: Compute dCS as the mean cosine similarity between the noun embeddings and the image embedding: $\mathrm{dCS}(c, v) = \frac{1}{N} \sum_{i=1}^{N} \cos(\mathbf{e}_{n_i}, \mathbf{e}_v)$.
This noun-level dCS, and its combination with the global caption score in “Fine-grained CLIPScore” (F-CLIPScore), enables more sensitive detection of object hallucination, sharply discriminating between semantically proximate but divergent captions (Oh et al., 27 Feb 2025).
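A sketch of this noun-level score, reusing `model` and `processor` from the earlier sketch; the spaCy pipeline name is an assumption, and the exact F-CLIPScore combination with the global score is omitted because its weighting is specific to Oh et al.:

```python
import spacy
import torch

nlp = spacy.load("en_core_web_sm")  # assumed English POS-tagging pipeline

@torch.no_grad()
def noun_level_dcs(image, caption: str) -> float:
    """Mean cosine similarity between each extracted noun and the image."""
    nouns = [tok.text for tok in nlp(caption) if tok.pos_ == "NOUN"]
    if not nouns:
        return 0.0
    inputs = processor(text=nouns, images=image, return_tensors="pt", padding=True)
    img = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt = model.get_text_features(input_ids=inputs["input_ids"],
                                  attention_mask=inputs["attention_mask"])
    img = img / img.norm(dim=-1, keepdim=True)   # l2-normalize both modalities
    txt = txt / txt.norm(dim=-1, keepdim=True)
    return (txt @ img.T).mean().item()           # average noun-image cosine similarity
```

For the cat-image example, `noun_level_dcs(cat_image, "A dog sits on a mat")` scores lower than the same call with "cat", since the hallucinated noun drags down the mean similarity.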
4. Risk Calibration and Uncertainty Estimation
To address the inherent uncertainty in CLIPScore and its per-word decompositions, conformal risk control frameworks are introduced (Gomes et al., 1 Apr 2025). The protocol involves:
- Risk Definition: Define a set-valued predictor $\mathcal{C}_\lambda(c, v) = \{t : \mathrm{dCS}(t) \ge \lambda\}$ that returns the tokens flagged as errors at threshold $\lambda$.
- Risk Function Selection: Choose a risk $R(\lambda)$ that is monotone in $\lambda$ (e.g., FDR or FPR), a target risk level $\alpha$, and a failure probability $\delta$.
- Calibration: On a held-out calibration set, compute the empirical risk $\hat{R}(\lambda)$ and derive an upper confidence bound $\hat{R}^{+}(\lambda)$; then tune the threshold to satisfy $\hat{\lambda} = \inf\{\lambda : \hat{R}^{+}(\lambda) \le \alpha\}$, so that $R(\hat{\lambda}) \le \alpha$ holds with probability at least $1 - \delta$.
The result is a provably controlled risk on flagged error tokens, offering reliability improvements over raw thresholding. Combining this with dCS supports both scalable token-level annotation and confidence intervals for the global CLIPScore.
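The calibration step can be sketched as follows; the threshold grid, the false-discovery-proportion loss, and the Hoeffding-style upper confidence bound are illustrative choices standing in for the specific conformal risk-control construction of Gomes et al., and the function assumes the risk is monotone in the threshold as stated above:

```python
import numpy as np

def calibrate_threshold(token_scores, error_labels, alpha=0.2, delta=0.1,
                        grid=np.linspace(0.0, 1.0, 101)):
    """
    Pick the smallest threshold lambda whose upper-confidence-bounded empirical
    risk (here: false discovery proportion among flagged tokens) stays below alpha.

    token_scores : list of float arrays, per-caption dCS scores
    error_labels : list of bool arrays, per-token ground-truth error flags
    """
    n = len(token_scores)
    best_lambda = None
    for lam in sorted(grid, reverse=True):   # scan from most conservative (high lambda) down
        # Per-caption false discovery proportion of the flagged set C_lambda.
        losses = []
        for s, y in zip(token_scores, error_labels):
            flagged = s >= lam
            if flagged.sum() == 0:
                losses.append(0.0)           # nothing flagged: zero loss by convention
            else:
                losses.append((flagged & ~y).sum() / flagged.sum())
        r_hat = float(np.mean(losses))
        # Hoeffding-style upper confidence bound at level delta (illustrative choice).
        r_ucb = r_hat + np.sqrt(np.log(1.0 / delta) / (2.0 * n))
        if r_ucb <= alpha:
            best_lambda = lam                # a lower lambda flags more tokens; keep scanning
        else:
            break
    return best_lambda
```

Lower thresholds flag more tokens, so the scan stops at the smallest $\lambda$ whose bounded risk still satisfies the target level.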
5. Empirical Evaluation and Comparative Performance
Empirical studies demonstrate several advantages and applications of dCS:
| Task / Setting | Metric/Threshold | Performance Outcome |
|---|---|---|
| FOIL-it (multi-class) | FDR target 20% | Achieves FDR ≈ 20%, F1 ≈ 51.4% (test) |
| Rich-HF (multi-label) | FPR target 20% | F1 ≈ 38.0%, outperforming ALOHa and Rich-HF baselines |
| OHD-Caps (object halluc.) | dCS vs. CLIPScore | dCS/F-CLIPScore: 62.2% vs. 22.6% accuracy |
| POPE benchmark filtering | F-CLIPScore curation | Accuracy ↑ 4.9 pp after data filtering |
Noun-level dCS and its F-CLIPScore combination yield substantial improvements in both hallucination detection and downstream model curation, outperforming the global, sentence-only CLIPScore by large margins (up to +39.6 percentage points) (Oh et al., 27 Feb 2025, Gomes et al., 1 Apr 2025).
6. Applications, Limitations, and Future Directions
dCS enables multiple practical avenues:
- Token/word error detection in candidate captions for interpretability or error correction pipelines.
- Object hallucination mitigation via noun-level scoring for LVLM data curation and loss design.
- Risk-calibrated quality estimation for trustable semantic scoring and downstream filtering.
Key limitations include dependency on POS/parsing robustness (especially outside English or in informal domains), neglect of relational/spatial errors (as all nouns are treated equally), and limited utility for verbs/adjectives or regional attribution. Proposed extensions include integrating region-level image embeddings, using cross-attention weighting, and adopting multilingual parsing strategies (Oh et al., 27 Feb 2025).
7. Relation to Other Evaluation Paradigms
Unlike reference-based metrics (e.g., CIDEr, SPICE), dCS operates in a reference-free paradigm, utilizing only pretrained CLIP encoders and not relying on ground-truth captions. Its decomposition strategies enable granular semantic assessment, which is infeasible for global metrics. There is no mention in the cited literature of any decomposition at the 512-dimensional embedding level or learned per-dimension weighting; all granularity is introduced via masking or linguistic span selection rather than altering CLIP’s internal representations (Hessel et al., 2021, Gomes et al., 1 Apr 2025, Oh et al., 27 Feb 2025).
In summary, Decomposed CLIPScore constitutes a family of techniques for attributing semantic misalignment and estimating uncertainty at the token or object level in image-to-text assessment tasks. These methods utilize CLIP’s robust cross-modal alignment, require no special training or reference data, and have demonstrated empirical utility in both error localization and reducing vision-LLM hallucination.